Towards a Theoretical Understanding of Hashing-Based Neural Nets

12/26/2018 · Yibo Lin et al. · Princeton University, The University of Texas at Austin, Harvard University

Parameter reduction has been an important topic in deep learning due to the ever-increasing size of deep neural network models and the need to train and run them on resource-limited machines. Despite many efforts in this area, there have been no rigorous theoretical guarantees on why existing neural net compression methods should work. In this paper, we provide provable guarantees on some hashing-based parameter reduction methods in neural nets. First, we introduce a neural net compression scheme based on random linear sketching (which is usually implemented efficiently via hashing), and show that the sketched (smaller) network is able to approximate the original network on all input data coming from any smooth and well-conditioned low-dimensional manifold. The sketched network can also be trained directly via back-propagation. Next, we study the previously proposed HashedNets architecture and show that the optimization landscape of one-hidden-layer HashedNets has a local strong convexity property similar to a normal fully connected neural network. We complement our theoretical results with empirical verifications.


1 Introduction

In the past decade, deep neural networks have become the new standard for many machine learning applications, including computer vision Krizhevsky et al. (2012); He et al. (2016), natural language processing Zaremba et al. (2014); Gehring et al. (2017), speech recognition Graves et al. (2013); Amodei et al. (2016), robotics Lillicrap et al. (2015), game playing Silver et al. (2016, 2017), etc. Such models usually contain an enormous number of parameters, often much larger than the number of available training samples. Therefore, these networks are usually trained on modern computer clusters with huge amounts of memory and computation power. On the other hand, there is an increasing need to train and run personalized machine learning models on mobile and embedded devices instead of transferring mobile data to a remote computation center where all the computation is performed, because real-time processing of deep learning models on mobile devices brings better privacy and lower Internet bandwidth usage. However, mobile devices like smartphones do not have the memory or computation capability to train large neural networks or even to store these models.

These trends motivate the study of neural network compression, with the goal of reducing the memory overhead required to train, store and run neural networks. There is a recent line of research in this direction, for example Chen et al. (2015); Iandola et al. (2016); Han et al. (2016). Despite their empirical effectiveness, there is little theoretical understanding of why these methods perform well.

The goal of this paper is to bridge the gap between theory and practice in neural network compression. Our focus is on hashing-based methods, which have been studied empirically in e.g. Chen et al. (2015, 2016), with the hope that the randomness in hash functions helps preserve the properties of neural networks despite a reduction in the number of effective parameters. We make this intuition formal by giving theoretical guarantees on the approximation power and the parameter recovery of such networks.

First, we propose a neural net compression scheme based on random linear sketching, which can be efficiently implemented using a hash function. A similar idea was proposed in Kasiviswanathan et al. (2017) and demonstrated strong empirical performance, but no formal theoretical guarantee was known. We show that such compression has strong approximation power: the small network obtained after sketching can approximate the original network on all input data coming from any low-dimensional manifold with some regularity properties. The sketched network is also directly trainable via back-propagation. In fact, sketching is a principled technique for dimensionality reduction, which has been shown to be very powerful in solving various problems arising in statistics Raskutti and Mahoney (2016); Wang et al. (2017) and numerical linear algebra Woodruff (2014). Given its theoretical success, it is natural to ask whether sketching can be applied to neural net compression with theoretical guarantees. Our result makes partial progress on this question.

Next we study HashedNets, a simple method proposed in Chen et al. (2015) which appears to perform well in practice. HashedNets directly applies a random hash function to the connection weights in a neural net, enforcing all the weights mapped to the same hash bucket to take the same value. In this way the number of trainable parameters is reduced to the number of hash buckets, and training can still be performed via back-propagation while taking the weight sharing structure into account. From the perspective of optimization, we show that the training objective for a one-hidden-layer hashed neural net has a local strong convexity property, similar to that of a normal fully connected network Zhong et al. (2017b). Additionally, we can apply the initialization algorithm in Zhong et al. (2017b) to obtain a good initialization for training. This implies that the parameters in a one-hidden-layer HashedNets can be provably recovered under mild assumptions.

Below we describe our contributions in more detail.

Approximation Power

Our result on the approximation power of sketched nets is based on a classical concept, “subspace embedding”, which originally appeared in numerical linear algebra Sarlós (2006). Roughly speaking, it says that there exists a wide family of random matrices S ∈ R^{m×n} such that, for any k-dimensional subspace U ⊆ R^n, with probability at least 1 − δ we have ⟨Sx, Sy⟩ ≈ ⟨x, y⟩ for all x, y ∈ U, provided the number of rows m is large enough (polynomial in k, 1/ε and log(1/δ)). This result means that the inner product between every two points in a subspace can be approximated simultaneously after applying a random sketching matrix S, which is interesting when m ≪ n. There has been a line of work on constructing subspace embeddings from different sketching matrices (e.g. Nelson and Nguyên (2013); Cohen (2016)). Sparse matrices are of particular interest, since for a sparse matrix S one can compute Sx more efficiently. For example, Nelson and Nguyên (2013) showed that it is possible to construct S with only a small number of nonzero entries per column, which significantly improves the trivial upper bound of m nonzeros per column. Furthermore, many of these sketching matrices can be efficiently implemented by hash functions with limited independence, which take only a small amount of space to store, and multiplying S with a vector can be computed efficiently.
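To make the hashing implementation concrete, the following is a minimal NumPy sketch (an illustration, not the paper's code) of a CountSketch-style sparse sketching matrix applied via hashing; a fully random hash is used below as a stand-in for the limited-independence hash families discussed above.

```python
import numpy as np

def countsketch_apply(x, m, seed=0):
    """Apply a CountSketch-style sparse sketch S in R^{m x n} to x in R^n.

    Each coordinate i is hashed to one row h(i) and multiplied by a random
    sign s(i), so S has one nonzero per column and Sx is computed in O(n)
    time without materializing S.
    """
    n = x.shape[0]
    rng = np.random.default_rng(seed)     # stand-in for limited-independence hashes
    h = rng.integers(0, m, size=n)        # bucket hash h : [n] -> [m]
    s = rng.choice([-1.0, 1.0], size=n)   # sign hash s : [n] -> {-1, +1}
    y = np.zeros(m)
    np.add.at(y, h, s * x)                # y[h(i)] += s(i) * x[i]
    return y

# Example: inner products within a low-dimensional subspace are roughly preserved.
n, k, m = 10_000, 20, 2_000
U, _ = np.linalg.qr(np.random.randn(n, k))   # basis of a random k-dim subspace
x, y = U @ np.random.randn(k), U @ np.random.randn(k)
sx, sy = countsketch_apply(x, m), countsketch_apply(y, m)   # same seed => same S
print(np.dot(x, y), np.dot(sx, sy))          # close when m is large relative to k
```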

We extend the idea of subspace embedding to deep learning and show that a feed-forward fully connected network with Lipschitz-continuous activation functions can be approximated using random sketching on all input data coming from a low-dimensional subspace. Below we describe our result for one-hidden-layer neural nets, and this can be generalized to multiple layers.

Consider a one-hidden-layer neural net with input dimension d and h hidden nodes. It can be parameterized by a weight matrix W ∈ R^{h×d} and a weight vector a ∈ R^h, and the function this network computes is f(x) = a^⊤φ(Wx), where x ∈ R^d is the input, and φ should be viewed as a nonlinear activation function acting coordinate-wise on a vector. Our result says that under appropriate assumptions, one can choose a random sketching matrix S with s rows such that, for any low-dimensional subspace of inputs, the sketched network a^⊤φ(USx), for a suitable U ∈ R^{h×s}, is uniformly close to f over the subspace.

This result essentially says that the weight matrix W can be replaced by US, which has rank at most s. When s ≪ d, this means that the effective number of parameters can be reduced from h·d to h·s. As we mentioned, the sketching matrix S can be implemented by hash functions in small space, and multiplying it with a vector is efficient. The sketched network is also directly trainable, because we can train the matrix U, regarding the other factor S in the decomposition as a known layer.
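As an illustration (not the paper's exact construction), here is a small PyTorch-style sketch of a one-hidden-layer network in which the dense weight matrix is replaced by a trainable factor U composed with a fixed random sketch S; only U and the output weights are trained. A dense Gaussian sketch is assumed below for simplicity, whereas the construction analyzed in the paper uses a sparse, hashing-based sketch.

```python
import torch
import torch.nn as nn

class SketchedOneHiddenLayer(nn.Module):
    """f(x) = a^T sigma(U S x): S is a fixed random sketch; only U and a train."""
    def __init__(self, d_in, hidden, sketch_dim):
        super().__init__()
        # Fixed sketch layer (dense Gaussian here for simplicity).
        S = torch.randn(sketch_dim, d_in) / sketch_dim ** 0.5
        self.register_buffer("S", S)                       # not a trainable parameter
        self.U = nn.Parameter(torch.randn(hidden, sketch_dim) * 0.01)
        self.a = nn.Parameter(torch.randn(hidden) * 0.01)

    def forward(self, x):                                  # x: (batch, d_in)
        z = torch.relu(x @ self.S.t() @ self.U.t())        # sigma(U S x), batched
        return z @ self.a                                  # scalar output per example

net = SketchedOneHiddenLayer(d_in=784, hidden=1000, sketch_dim=64)
print(sum(p.numel() for p in net.parameters()))            # ~1000*64 + 1000 instead of 1000*784 + 1000
```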

This result can be generalized to multi-layer neural nets, and we present the details in Section 3. We also note that our result can be easily generalized to low-dimensional manifolds under some regularity condition (see Definition 2.3 in Baraniuk and Wakin (2009)), which is a much more realistic assumption on data.

Parameter Recovery.

It is known that training a neural net is NP-hard in the worst case, even if it has only three hidden nodes Blum and Rivest (1993). Recently, there has been theoretical progress on understanding the optimization landscapes of shallow neural nets under special input distributions. In particular, Zhong et al. (2017b) gave a recovery guarantee for one-hidden-layer neural nets. They showed that if the input distribution is Gaussian and the ground-truth weight vectors corresponding to the hidden nodes are linearly independent, then the true parameters can be recovered in polynomial time given finitely many samples. This was proved by showing that the training objective is locally strongly convex and smooth around the ground-truth point, together with an initialization method that outputs a point inside this locally “nice” region. In this work, we show that local strong convexity and smoothness continue to hold if we replace the fully connected network by HashedNets, which has a weight sharing structure enforced by a hash function. We present this result in Section 4.
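For intuition, the weight sharing that HashedNets enforces can be sketched in a few lines of NumPy (an illustration of the idea in Chen et al. (2015), not their implementation; a fixed pseudorandom map stands in for the k-wise independent hash family):

```python
import numpy as np

def hashednets_matrix(free_weights, shape, seed=0):
    """Expand a small vector of free weights into a virtual weight matrix
    W[i, j] = free_weights[h(i, j)], where h is a random hash of the position."""
    rng = np.random.default_rng(seed)
    buckets = rng.integers(0, free_weights.size, size=shape)   # h(i, j)
    return free_weights[buckets], buckets

def hashednets_grad(grad_W, buckets, n_buckets):
    """Back-propagation with weight sharing: gradients of all entries hashed
    to the same bucket are accumulated into that bucket's free weight."""
    g = np.zeros(n_buckets)
    np.add.at(g, buckets, grad_W)
    return g

# 784 x 1000 virtual weights tied into 12_250 free parameters (ratio ~64).
w = np.random.randn(12_250) * 0.01
W, buckets = hashednets_matrix(w, (1000, 784))
```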

1.1 Related Works

Parameter Reduction in Deep Learning.

There has been a series of empirical works on reducing the number of free parameters in deep neural networks: Denil et al. (2013) show how to learn low-rank decompositions of the weight matrices in each layer; Chen et al. (2015) propose an approach that uses a hash function to enforce parameter sharing; Cheng et al. (2015) adopt a circulant matrix structure for parameter reduction; and Sindhwani et al. (2015) study a more general class of structured matrices for parameter reduction.

Sketching and Neural Networks.

Daniely et al. (2016) show that any linear or sparse polynomial function on sparse binary data can be computed by a small single-layer neural net on a linear sketch of the data. Kasiviswanathan et al. (2017) apply random sketching to weight matrices/tensors, but they only prove that, for a fixed layer input, the output of the sketched layer is an unbiased estimator of the original output with bounded variance; this does not provide guarantees on the approximation power of the whole sketching-based deep net.

Subspace Embedding.

Subspace embedding Sarlós (2006) is a fundamental tool for solving numerical linear algebra problems, e.g. linear regression and matrix low-rank approximation Clarkson and Woodruff (2013); Nelson and Nguyên (2013); Razenshteyn et al. (2016); Song et al. (2017b), and tensor low-rank approximation Song et al. (2019). See also Woodruff (2014) for a survey on this topic.

Recovery Guarantee of Neural Networks.

Since learning a neural net is NP-hard in the worst case Blum and Rivest (1993), many attempts have been made to design algorithms that provably learn a neural net in polynomial time and sample complexity under additional assumptions, e.g., Sedghi and Anandkumar (2014); Zhang et al. (2015); Janzamin et al. (2015); Goel et al. (2017); Goel and Klivans (2017a, b). Another line of work focuses on analyzing (stochastic) gradient descent on shallow networks for Gaussian input distributions, e.g., Brutzkus and Globerson (2017); Zhong et al. (2017a, b); Tian (2017); Li and Yuan (2017); Du et al. (2017); Soltanolkotabi (2017).

Other Related Works

In contrast to the parameter reduction studied in our work, several results develop an over-parameterization theory of deep ReLU neural networks, e.g. Allen-Zhu et al. (2018a, b). Thirty years ago, Blum and Rivest proved that training a neural network is NP-hard Blum and Rivest (1993). Since then, neural networks have been shown to be hard to learn from several different perspectives in the worst-case regime Klivans and Sherstov (2009); Livni et al. (2014); Daniely (2016); Daniely and Shalev-Shwartz (2016); Goel et al. (2017); Song et al. (2017a); Katz et al. (2017); Weng et al. (2018); Manurangsi and Reichman (2018).

Arora et al. proved a stronger generalization bound for deep nets via a compression approach Arora et al. (2018). There is a long line of work explaining GANs from a theoretical perspective Arora and Zhang (2017); Arora et al. (2017b, a); Bora et al. (2017); Li et al. (2018); Santurkar et al. (2018); Van Veen et al. (2018); Xiao et al. (2018). There is also a long line of provable results on adversarial examples Madry et al. (2017); Bubeck et al. (2018b, a); Weng et al. (2018); Schmidt et al. (2018); Tran et al. (2018).

2 Preliminaries

For any positive integer n, we use [n] to denote the set {1, 2, …, n}. We write a = (1 ± ε)b to denote a number a in the interval [(1 − ε)b, (1 + ε)b]. For any vector x, we use ‖x‖₁, ‖x‖₂ and ‖x‖∞ to denote its ℓ₁, ℓ₂ and ℓ∞ norms, respectively. For x, y ∈ R^n, we use ⟨x, y⟩ to denote the standard Euclidean inner product ∑ᵢ xᵢyᵢ.

For a matrix A, let det(A) denote its determinant (if A is a square matrix), let A† denote the Moore–Penrose pseudoinverse of A, and let ‖A‖_F and ‖A‖ denote respectively the Frobenius norm and the spectral norm of A. Denote by σᵢ(A) the i-th largest singular value of A. We use nnz(A) to denote the number of non-zero entries in A.

For any function f, we define Õ(f) to be f · polylog(f). In addition to O(·) notation, for two functions f, g, we use the shorthand f ≲ g (resp. f ≳ g) to indicate that f ≤ Cg (resp. f ≥ Cg) for an absolute constant C. We use f ≂ g to mean that C₁g ≤ f ≤ C₂g for constants C₁, C₂.

We define the ℓ₂ and ℓ∞ balls in R^d as B₂(R) = {x ∈ R^d : ‖x‖₂ ≤ R} and B∞(R) = {x ∈ R^d : ‖x‖∞ ≤ R}. We also need the definitions of Lipschitz-continuous functions and k-wise independent hash families.

Definition 2.1.

A function φ : R → R is L-Lipschitz continuous if for all x, y ∈ R, |φ(x) − φ(y)| ≤ L·|x − y|.

Definition 2.2.

A family H of hash functions h : [n] → [m] is said to be k-wise independent if for any distinct i₁, …, i_k ∈ [n] and any j₁, …, j_k ∈ [m], we have Pr_{h ∼ H}[h(i₁) = j₁ ∧ ⋯ ∧ h(i_k) = j_k] = 1/m^k.
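One standard construction of such a family (stated here as a common textbook choice, not necessarily the one used in the paper) evaluates a random polynomial of degree k − 1 over a prime field; a minimal Python sketch:

```python
import random

def make_kwise_hash(k, n_buckets, prime=2_147_483_647):
    """k-wise independent hash: h(x) = ((a_{k-1} x^{k-1} + ... + a_0) mod p) mod m,
    with uniformly random coefficients over F_p. The final mod-m reduction makes
    the buckets only approximately uniform; exact uniformity needs m | p-style care."""
    coeffs = [random.randrange(prime) for _ in range(k)]
    def h(x):
        v = 0
        for c in coeffs:          # Horner's rule, all arithmetic mod p
            v = (v * x + c) % prime
        return v % n_buckets
    return h

h = make_kwise_hash(k=4, n_buckets=64)
print([h(i) for i in range(8)])
```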

3 The Approximation Power of Parameter-Reduced Neural Networks

In this section, we study the approximation power of parameter-reduced neural nets based on hashing. Any weight matrix W in a neural net acts on a vector x as Wx. We replace W by US, where S is a sketching matrix with far fewer rows than columns and U is a trainable factor; the new weight matrix then has many fewer parameters. We show that if S is chosen properly as a subspace embedding (formally defined later in this section), the sketched network can approximate the original network on all inputs coming from a low-dimensional subspace or manifold. Our sketching matrix is chosen as a Johnson–Lindenstrauss (JL) transform Johnson and Lindenstrauss (1984). In Section 3.1, we provide some preliminaries on subspace embedding. In Sections 3.2 and 3.3, we present our results on one-hidden-layer neural nets. Then in Section 3.4 we extend these results to multi-layer neural nets and show a similar approximation guarantee. This provides a theoretical guarantee for hashing-based parameter-reduced networks used in practice.

3.1 Subspace Embedding

We first present some basic definitions of sketching and subspace embedding. These mathematical tools are building blocks for us to understand parameter-reduced neural networks.

Definition 3.1 (Subspace Embedding).

A (1 ± ε) ℓ₂-subspace embedding for the column space of an n × d matrix A is a matrix S for which ‖SAx‖₂² = (1 ± ε)‖Ax‖₂² for all x ∈ R^d, or equivalently, ‖Sy‖₂² = (1 ± ε)‖y‖₂² for all y in the column space of A.

Constructions of subspace embeddings can be found in, e.g., Nelson and Nguyên (2013), which gives the following theorem.

Theorem 3.2 (Nelson and Nguyên (2013)).

There is an oblivious (1 ± ε) ℓ₂-subspace embedding S for n × d matrices (the construction of S is oblivious to the subspace), with a number of rows polynomial in d, 1/ε and log(1/δ), and error probability δ. Further, SA can be computed in time nearly linear in nnz(A). We call S a SparseEmbedding matrix.

There are also other subspace embedding matrices, e.g., CountSketch. We provide additional definitions and examples in Section B.1.

Remark 3.3.

We remark that the subspace embedding in Definition 3.1 naturally extends to low-dimensional manifolds. For example, for a low-dimensional Riemannian submanifold with bounded volume and geodesic covering regularity (see Definition 2.3 in Baraniuk and Wakin (2009)), Theorem 3.2 holds with the subspace dimension replaced by a quantity depending on the manifold dimension and these regularity parameters. For ease of presentation, we only present our results for subspaces. All our results can be extended to low-dimensional manifolds satisfying regularity conditions.

3.2 One Hidden Layer - Part I

We consider one-hidden-layer neural nets of the form f(x) = a^⊤φ(Wx) = ∑ᵢ aᵢ φ(⟨wᵢ, x⟩), where x is the input vector, wᵢ is a weight vector, aᵢ is a weight scalar, and φ is a nonlinear activation function. In this subsection, we show how to sketch the weights between the input layer and the hidden layer with guaranteed approximation power. The main result is Theorem 3.4, and its proof is in Appendix B.

Theorem 3.4.

Given parameters ε and δ, activation functions that are L-Lipschitz continuous, a fixed matrix whose column space contains the inputs, and a weight matrix with bounded column norms, choose a SparseEmbedding matrix S with sufficiently many rows. Then with probability at least 1 − δ, for all inputs in the column space, the sketched one-hidden-layer network is ε-close to the original network.

3.3 One Hidden Layer - Part II

In this section, we show the approximation power of the compressed network when the weight matrices of both the input layer and the output layer are sketched. One of the core ideas in the proof is a recursive ε-net argument, which plays a crucial role in extending the result to multiple hidden layers. The goal of this section is to prove the following theorem and present the recursive ε-net argument.

Theorem 3.5.

Given parameters ε and δ, activation functions that are L-Lipschitz and suitably normalized, a fixed matrix whose column space contains the inputs, and weight matrices with bounded norms, choose SparseEmbedding matrices for both layers with sufficiently many rows. Then with probability at least 1 − δ, for all inputs in the column space, the doubly sketched network is ε-close to the original network.

The high-level idea is as follows. First, we prove that for any fixed input, the theorem statement holds with high probability. Then we build a sufficiently fine ε-net over the input space and argue that the statement holds for every point in the ε-net. Conditioned on this event, the statement extends to all inputs by the Lipschitz continuity of the activation function. The detailed proof is presented in Appendix B.

3.4 Multiple Hidden Layers

In this section, we generalize our approximation power result to multi-layer neural networks; the proofs are deferred to Appendix B. Inspired by batch normalization Ioffe and Szegedy (2015), which has been widely used in practice (see, e.g., https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization), we make an additional assumption that the activations are normalized at each layer. To handle multiple hidden layers, we first recursively argue that an ε-net of the same size can be constructed for all the layers, and then use the triangle inequality to split the error into per-layer terms and bound them separately. The result is the following theorem.

Theorem 3.6.

Given parameters ε and δ. For each layer, let the activation functions be L-Lipschitz and suitably normalized. Given a fixed matrix whose column space contains the inputs, weight matrices with bounded column norms, and a bounded output weight vector, choose a SparseEmbedding matrix for each layer with sufficiently many rows. Then with probability at least 1 − δ, for all inputs in the column space, the sketched multi-layer network is ε-close to the original network, where the per-layer error terms are defined inductively from the first layer to the last.

Note that similar results also hold if the output layer is not sketched, i.e., if we only choose sketching matrices for the hidden layers.
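For intuition about the architecture Theorem 3.6 refers to, the following NumPy sketch (an illustrative reading under stated assumptions, not the paper's construction) runs a forward pass through several sketched layers, renormalizing the activations at each layer as a crude stand-in for the normalization assumption:

```python
import numpy as np

def sketched_forward(x, Us, Ss, a, act=np.tanh):
    """Forward pass of a sketched multi-layer net.

    Each layer computes z <- act(U_l S_l z) and renormalizes z to unit l2 norm
    (mimicking the normalization assumption inspired by batch norm).
    Us: trainable factors U_l; Ss: fixed sketch matrices S_l; a: output weights.
    """
    z = x
    for U, S in zip(Us, Ss):
        z = act(U @ (S @ z))
        z = z / (np.linalg.norm(z) + 1e-12)   # keep activations normalized
    return a @ z                              # linear output layer

# Illustrative dimensions: input 784, three hidden layers of width 1000, sketch size 64.
d, layers, s, h = 784, 3, 64, 1000
Ss = [np.random.randn(s, d) / np.sqrt(s)] + \
     [np.random.randn(s, h) / np.sqrt(s) for _ in range(layers - 1)]
Us = [np.random.randn(h, s) * 0.01 for _ in range(layers)]
out = sketched_forward(np.random.randn(d), Us, Ss, np.random.randn(h))
```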

4 Recovery Guarantee

In this section, we study the recovery guarantee of parameter-reduced neural nets. In particular, we study whether (stochastic) gradient descent can learn the true parameters of a one-hidden-layer HashedNets when starting from a sufficiently good initialization point, under appropriate assumptions. We show that even under the special weight sharing structure imposed by the hash function, the resulting neural net still has sufficiently nice properties, namely local strong convexity and smoothness around the minimizer. Our proof proceeds by reducing our setting to that of the fully connected network studied in Zhong et al. (2017b), after which the recovery guarantee follows similarly. We present our result here and give the detailed proof in Appendix C.

We consider the following regression problem: we are given a set of samples drawn i.i.d. from an underlying distribution whose labels are generated by a one-hidden-layer HashedNets with ground-truth parameters. Here the weight sharing is determined by a random hash function drawn from a k-wise independent hash family, and the nonlinearity is an activation function.

Note that the free parameter vector w has a corresponding matrix W defined by W_{i,j} = w_{h(i,j)}, which is the actual weight matrix in the HashedNets with a weight sharing structure.

Our goal is to recover the ground-truth parameters given the samples. How to recover the output-layer weights has been discussed in Zhong et al. (2017a, b), and their method also applies to our situation. Therefore we focus on recovering the hidden-layer (bucket) weights in this section, assuming the output-layer weights are known.

For a given weight vector w, we define its expected risk over the distribution and its empirical risk over the samples in the usual way.
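As one concrete (and hedged) reading of these definitions, the empirical risk of a one-hidden-layer HashedNets can be evaluated by expanding the free bucket weights into the shared weight matrix; squared loss and a ReLU activation are assumed here only for illustration.

```python
import numpy as np

def empirical_risk(w_free, buckets, a, X, y):
    """Empirical (squared) risk of a one-hidden-layer HashedNets.

    w_free : free parameters, one per hash bucket
    buckets: hash values h(i, j) for every virtual weight position
    a      : output-layer weights (assumed known, as in this section)
    """
    W = w_free[buckets]                  # expand to the shared weight matrix
    H = np.maximum(X @ W.T, 0.0)         # hidden-layer ReLU activations
    preds = H @ a
    return 0.5 * np.mean((preds - y) ** 2)

rng = np.random.default_rng(0)
d, hidden, n_buckets, n = 784, 1000, 12_250, 256
buckets = rng.integers(0, n_buckets, size=(hidden, d))
loss = empirical_risk(rng.standard_normal(n_buckets) * 0.01, buckets,
                      rng.standard_normal(hidden) * 0.01,
                      rng.standard_normal((n, d)), rng.standard_normal(n))
```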

We first show a structural result for k-wise independent hash families, which says that the pre-image of each bucket is fairly balanced.

Lemma 4.1 (Concentration of hash buckets, part of Lemma C.12).

Let h denote a k-wise independent hash function mapping positions to buckets. Then, if the number of positions is sufficiently large relative to the number of buckets, with probability at least 1 − δ, every bucket receives close to the average number of positions.
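A quick numerical sanity check of this balance property (using a pseudorandom hash as a stand-in for a k-wise independent one, with dimensions chosen to roughly match the MNIST experiments in Section 5):

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 784 * 1000, 12_250            # virtual weight positions, hash buckets
h = rng.integers(0, b, size=n)       # stand-in for a k-wise independent hash
counts = np.bincount(h, minlength=b)
print(counts.min(), n / b, counts.max())   # every bucket is close to the mean n/b
```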

The previous work Zhong et al. (2017b) showed that a fully connected network whose ground-truth weight matrix has full rank enjoys local strong convexity and smoothness of the loss function around the ground truth (see their Lemma D.3).

Using Lemma 4.1 as our core reduction tool, we can reduce HashedNets to a fully connected net and obtain the following result:

Theorem 4.2 (Local strong convexity and smoothness).

Suppose the ground-truth weight matrix has full rank. Then the expected risk is locally strongly convex and smooth around the ground truth,

with strong convexity and smoothness parameters that depend on the ground-truth weights and the activation function.

Remark 4.3.

A crucial assumption in Theorem 4.2 is that the ground-truth weight matrix has full rank. In Section 5, we use numerical experiments to verify this assumption on learned HashedNets.

For the empirical risk , we can show that its Hessian at the optimal point also satisfies similar properties given enough samples. See Theorem C.7 for details.

Using the tensor initialization method of Zhong et al. (2017b), we can find a point in the locally “nice” region around the ground truth, and then show that gradient descent on the empirical risk function converges linearly to the ground truth. The result is summarized as follows.

Theorem 4.4 (Recovery guarantee).

There exist parameters, depending on the ground-truth weights and the activation function, such that the following holds. Let the initial point be sufficiently close to the ground truth and let the samples be drawn i.i.d. from the distribution. If the number of samples is large enough and we perform gradient descent with an appropriate step size on the empirical risk, then with high probability each iterate contracts the distance to the ground truth by a constant factor.

The above theorem states that once a constant-accuracy initialization point is given, we can obtain a solution of any desired precision within a polynomial number of gradient descent iterations. This concludes the recovery guarantee. We give the formal statements and proofs in Section C.

5 Experiments

In this section, we perform some simple experiments on the MNIST dataset to evaluate the performance of HashedNets, as well as to empirically verify the full rank assumption (as in Theorem 4.2) on the weight matrices of HashedNets. Each image in the MNIST dataset has 784 dimensions. The HashedNets in the experiments have a single hidden layer, i.e., two fully connected layers. To validate the effectiveness of HashedNets, we construct two baselines.

  • SmallNets. A single-hidden-layer network constructed with the same number of effective weights as HashedNets. For example, for a HashedNets with 1000 hidden units and compression ratio 64, the corresponding SmallNets has proportionally fewer hidden units so that the total weight count matches.

  • ThinNets. A two-hidden-layer network constructed with the same number of effective weights as HashedNets. By replacing the first fully connected layer in HashedNets with a thin hidden layer, the same number of weights can be achieved. For example, for a HashedNets with 1000 hidden units and compression ratio 64, the corresponding ThinNets has a small number of hidden units in its first hidden layer and 1000 hidden units in its second hidden layer.

The accuracy of HashedNets, SmallNets, and ThinNets is compared under various compression ratios.
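To make the baseline construction concrete, here is a PyTorch-style sketch of the three architectures (the paper's experiments used Torch7; the layer sizes, initialization, and hashing scheme below are illustrative assumptions, with hidden-unit counts to be chosen so that parameter counts match):

```python
import torch
import torch.nn as nn

class HashedLinear(nn.Module):
    """Linear layer whose virtual weights are tied into n_buckets free weights
    via a fixed random hash (a sketch of the HashedNets layer, not the Torch7 code)."""
    def __init__(self, d_in, d_out, compression):
        super().__init__()
        n_buckets = max(1, (d_in * d_out) // compression)
        self.register_buffer("buckets", torch.randint(n_buckets, (d_out, d_in)))
        self.free = nn.Parameter(torch.randn(n_buckets) * 0.01)
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        W = self.free[self.buckets]        # expand shared weights on the fly
        return x @ W.t() + self.bias

def hashed_net(hidden=1000, ratio=64):
    return nn.Sequential(HashedLinear(784, hidden, ratio), nn.ReLU(),
                         HashedLinear(hidden, 10, ratio))

def small_net(hidden):                     # single hidden layer, fewer units
    return nn.Sequential(nn.Linear(784, hidden), nn.ReLU(), nn.Linear(hidden, 10))

def thin_net(thin, hidden=1000):           # thin first layer, then 1000 units
    return nn.Sequential(nn.Linear(784, thin), nn.ReLU(),
                         nn.Linear(thin, hidden), nn.ReLU(), nn.Linear(hidden, 10))
```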

Figure 1: (a) compression ratio 1; (b) ratio 8; (c) ratio 64; (d) ratio 128. Comparison of the accuracy distributions of the one-hidden-layer networks (HashedNets and SmallNets) on MNIST over 50 random seeds; HashedNets have 1000 hidden units in this case.
Figure 2: (a) compression ratio 1; (b) ratio 8; (c) ratio 64; (d) ratio 128. Testing error during training for the different networks with random seed 100.

The HashedNets were implemented in Torch7 Collobert et al. (2011); Chen et al. (2015) and validated on an NVIDIA GTX 1080 GPU. We used 32-bit floating point precision throughout the experiments. Stochastic gradient descent was adopted as the optimizer, with a dropout keep rate of 0.9, momentum of 0.9, and a batch size of 50. ReLU was used as the activation function. We ran 1000 epochs for each experiment, and conducted experiments on two single-hidden-layer HashedNets with 500 and 1000 hidden units, respectively. The numbers of units in SmallNets and ThinNets were adjusted to match the number of weights in HashedNets.

Figure 3: Input dimension is 784. Distributions of singular values, condition numbers, and stable ranks of the two weight matrices (first and second layer) in HashedNets with 1000 hidden units, over 50 random seeds.

For different compression ratios, we plot the distribution of testing errors over 50 runs of HashedNets, ThinNets, and SmallNets with 50 different random seeds, as shown in Figure 1. Due to random initialization, SmallNets also gives different results across independent runs. In Figure 1(a), when the compression ratio is 1, which indicates no compression, the distributions for HashedNets and SmallNets are very close, with means of 1.37% and 1.40% and standard deviations of 0.050% and 0.038%, respectively. ThinNets gives slightly better testing error, with a mean of 1.27% and a standard deviation of 0.057%. In Figure 1(b), when the compression ratio is 8, HashedNets gives smaller testing errors than SmallNets, with means of 1.43% vs. 1.76% and standard deviations of 0.052% vs. 0.070%, respectively. ThinNets gives slightly better testing errors than HashedNets, with means of 1.32% vs. 1.44% and standard deviations of 0.060% vs. 0.056%. In other words, both HashedNets and ThinNets achieve higher and more robust accuracy than SmallNets at this compression ratio, with improvements in both mean and standard deviation. When the compression ratio increases to 64, as shown in Figure 1(c), the benefit of HashedNets is more significant: the mean testing error of HashedNets degrades to 2.80%, while that of SmallNets increases to 6.09%, and the errors of SmallNets are also less stable due to larger standard deviations. Meanwhile, HashedNets also gives better accuracy than ThinNets, with means of 2.80% vs. 5.03% and standard deviations of 0.090% vs. 0.196%. When the compression ratio is 128, as shown in Figure 1(d), HashedNets achieves a mean testing error of 4.20% with a standard deviation of 0.116%, which is much better than ThinNets (mean error 11.09%, standard deviation 0.160%) and SmallNets (mean error 10.28%, standard deviation 0.810%). In summary, in terms of accuracy degradation, as the compression ratio increases from 1 to 128, HashedNets degrades on average by 2.83% in accuracy, while SmallNets degrades by 8.88% and ThinNets by 9.82%. ThinNets can achieve error comparable to HashedNets for small compression ratios (e.g., 1 and 8), while for large compression ratios HashedNets tends to be more stable.

Figure 2 plots the training curves of the different networks under different compression ratios with random seed 100. The testing errors align with the observations from Figure 1: HashedNets gives high and stable accuracy across compression ratios; ThinNets achieves good accuracy for small compression ratios (e.g., 1 and 8, where its accuracy is close to that of HashedNets) but degrades significantly as the compression ratio increases; and SmallNets is similarly sensitive to large compression ratios.

We further verify the full rank assumption on the weight matrices in HashedNets. Figure 3 plots the distributions of the minimum and maximum singular values, condition numbers, and stable ranks of the two weight matrices in HashedNets with 1000 hidden units. The first weight matrix has dimensions 784 × 1000 and the second 1000 × 10. The distributions are extracted from the aforementioned 50 runs. Figure 4 (in the Appendix) gives one example of all singular values sorted from large to small in one experiment. All the singular values and condition numbers lie in reasonable ranges, i.e., neither too close to zero nor too large. This experiment indicates that the full rank assumption holds in practice. The same set of figures is also provided for HashedNets with 500 hidden units in Figure 5 (in the Appendix), where the weight matrices have dimensions 784 × 500 and 500 × 10.
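The quantities plotted in Figure 3 can be computed directly from the learned weight matrices; a minimal NumPy sketch (the random matrix below is only a placeholder for a learned weight matrix):

```python
import numpy as np

def spectrum_stats(W):
    """Singular values, condition number, and stable rank of a weight matrix."""
    s = np.linalg.svd(W, compute_uv=False)       # sorted in decreasing order
    cond = s[0] / s[-1]                          # sigma_max / sigma_min
    stable_rank = (s ** 2).sum() / s[0] ** 2     # ||W||_F^2 / ||W||_2^2
    return s, cond, stable_rank

W1 = np.random.randn(784, 1000) * 0.05           # placeholder for a learned matrix
s, cond, sr = spectrum_stats(W1)
print(s.min(), s.max(), cond, sr)
```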

6 Conclusion

In this paper, we study the theoretical properties of hashing-based neural networks. We show that (i) parameter-reduced neural nets have uniform approximation power on inputs from any low-dimensional subspace or smooth, well-conditioned manifold; and (ii) one-hidden-layer HashedNets enjoy a recovery guarantee similar to that of fully connected neural nets. We also empirically explore an alternative compression scheme, ThinNets; studying its properties and theoretical underpinnings is an interesting direction for future work.

Acknowledgments

This project was done while Zhao was visiting Harvard, hosted by Prof. Jelani Nelson. In 2017, Prof. Jelani Nelson and Prof. Piotr Indyk co-taught a class, “Sketching Algorithms for Big Data”; one of the class projects, “Understanding the hashing trick of neural networks,” was the starting point of this work. The authors would like to thank Zhixian Lei for inspiring this project.

The authors are grateful to Wei Hu for his generous help and contribution to this project.

The authors would like to thank Rasmus Kyng, Eric Price, Zhengyu Wang and Peilin Zhong for useful discussions.

References

Appendix

Appendix A Notation

For any positive integer n, we use [n] to denote the set {1, 2, …, n}. For a random variable X, let E[X] denote the expectation of X (if this quantity exists). For any vector x, we use ‖x‖ to denote its ℓ₂ norm.

We provide several definitions related to a matrix A. Let det(A) denote the determinant of a square matrix A. Let A^⊤ denote the transpose of A. Let A† denote the Moore–Penrose pseudoinverse of A. Let A^{-1} denote the inverse of a full rank square matrix. Let ‖A‖_F denote the Frobenius norm of A. Let ‖A‖ denote the spectral norm of A. Let σᵢ(A) denote the i-th largest singular value of A.

For any function f, we define Õ(f) to be f · polylog(f). In addition to O(·) notation, for two functions f, g, we use the shorthand f ≲ g (resp. f ≳ g) to indicate that f ≤ Cg (resp. f ≥ Cg) for an absolute constant C. We use f ≂ g to mean that C₁g ≤ f ≤ C₂g for constants C₁, C₂.

We state a trivial fact relating vector norms.

Fact A.1.

For any vector , we have .

Appendix B Neural Subspace Embedding

B.1 Preliminaries

Definition B.1 (Johnson–Lindenstrauss Transform, Johnson and Lindenstrauss [1984]).

A random matrix S ∈ R^{m×n} forms a Johnson–Lindenstrauss transform with parameters (ε, δ, f), or JLT(ε, δ, f) for short, if with probability at least 1 − δ, for any f-element subset V ⊂ R^n and all x, y ∈ V, it holds that |⟨Sx, Sy⟩ − ⟨x, y⟩| ≤ ε‖x‖₂‖y‖₂.

The well-known Count-Sketch matrix from the data stream literature Charikar et al. [2002]; Thorup and Zhang [2012] is a subspace embedding and a JL matrix. The definition is as follows.

Definition B.2 (Count-Sketch).

Let S denote an m × n matrix. We choose a random hash function h : [n] → [m] and a random sign function σ : [n] → {−1, +1}, and set S_{h(i), i} = σ(i) for each i ∈ [n], with all other entries of S equal to zero.

The Count-Sketch matrix gives the following subspace embedding result.

Theorem B.3 (Clarkson and Woodruff [2013], Nelson and Nguyên [2013]).

For any ε, δ ∈ (0, 1) and a Count-Sketch matrix S with poly(d/(εδ)) rows, with probability 1 − δ, for any fixed n × d matrix A, S is a (1 ± ε) ℓ₂-subspace embedding for the column space of A. The matrix product SA can be computed in O(nnz(A)) time. Further, all of this holds if the hash function h defining S is only pairwise independent and the sign function σ defining S is only 4-wise independent.
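For concreteness, a small NumPy sketch that materializes a Count-Sketch matrix and checks the norm-preservation property on a fixed matrix (full randomness is used here in place of the pairwise / 4-wise independent hashes in the theorem, and the number of rows is an arbitrary illustrative choice):

```python
import numpy as np

def countsketch_matrix(m, n, seed=0):
    """Materialize an m x n Count-Sketch matrix: one nonzero (+/-1) per column,
    placed at row h(i) with sign sigma(i)."""
    rng = np.random.default_rng(seed)
    S = np.zeros((m, n))
    S[rng.integers(0, m, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)
    return S

n, d, m = 5000, 10, 1000
A = np.random.randn(n, d)
S = countsketch_matrix(m, n)
x = np.random.randn(d)
print(np.linalg.norm(S @ (A @ x)) / np.linalg.norm(A @ x))   # close to 1
```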

Definition B.4 (Oblivious Subspace Embedding(OSE), Definition 2.2 in Woodruff [2014]).

Suppose Π is a distribution over m × n matrices S, where m is a function of n, d, ε, and δ. Suppose that with probability at least 1 − δ, for any fixed n × d matrix A, a matrix S drawn from the distribution Π is a (1 ± ε) ℓ₂-subspace embedding for A. Then we call Π an oblivious (1 ± ε) ℓ₂-subspace embedding.

Nelson and Nguyên [2013] provide some other constructions of subspace embeddings.

Definition B.5 (Sparse-Embedding).

Let denote a matrix. For each , . For a random draw , let

be an indicator random variable for the event

, and write , where the are random signs. satisfies the following two properties, for each , ; for any set , .

Lemma B.6 (Lemma 2.2 in Woodruff [2014]).

Let , for any , there exists a -net of for which .

B.2 Proof of Theorem 3.4

Proof.

Using Theorem 3.2, we choose , with probability , we have : for fixed vectors and for all ,

Using the Lipschitz property of , we can show that

(1)

where the first step follows by Property of function , the last step follows by . It remains to bound

where the first step follows by triangle inequality, the third step follows by Eq. (B.2), the fourth step follows by , and the last step follows from . Therefore, it suffices to choose

B.3 Proof of Theorem 3.5

Proof.

The proof includes three steps. The first step is similar to the proof of Theorem 3.4. Using Theorem 3.2, we choose , with probability , we have : for fixed vectors and for all , Next, we consider the column space of , which we call , defined as follows

Let denote the -net of , by Lemma B.6, . By definition, for each , there exists a vector such that

Let and be defined as follows,

Then we want to show the following claim (its proof can be found in Appendix B.4).

Claim B.7.

(Recursive -net). Let , then is an -net to .

Now, we choose a sketching matrix with , with probability , we have : a vector and for all ,

Using triangle inequality, we can bound the error term,

where the first step follows from the triangle inequality, the second step follows from the property of the sketching matrix, the third step follows from Claims B.8 and B.9, and the last step follows from the choice of parameters. ∎

We state the two claims here and defer their proofs to Appendix B.4.

Claim B.8.

Claim B.9.

B.4 One Hidden Layer

In this section, we provide the proofs of some claims used in the one-hidden-layer case.

Claim B.7.
Proof.

For each point, there must exist a point such that

Since this point lies in the relevant set and we have an ε-net of it, there must exist a vector such that

According to the definition, there must exist a point such that the desired bound holds. Now, let us consider