1 Introduction
In the past decade, deep neural networks have become the new standard for many machine learning applications, including computer vision Krizhevsky et al. (2012); He et al. (2016), natural language processing Zaremba et al. (2014); Gehring et al. (2017), speech recognition Graves et al. (2013); Amodei et al. (2016), robotics Lillicrap et al. (2015), and game playing Silver et al. (2016, 2017). Such models usually contain an enormous number of parameters, often far more than the number of available training samples. Therefore, these networks are usually trained on modern computer clusters with large amounts of memory and computation power. On the other hand, there is an increasing need to train and run personalized machine learning models on mobile and embedded devices, instead of transferring mobile data to a remote computation center where all the computation is performed, because real-time on-device processing brings better privacy and uses less Internet bandwidth. However, mobile devices such as smartphones do not have the memory or computation capability to train large neural networks, or even to store them.
These trends motivate the study of neural network compression, with the goal of reducing the memory overhead required to train, store and run neural networks. There is a recent line of research in this direction, for example Chen et al. (2015); Iandola et al. (2016); Han et al. (2016). Despite the empirical effectiveness of these methods, there is little theoretical understanding of why they perform well.
The goal of this paper is to bridge the gap between theory and practice in neural network compression. Our focus is on hashing-based methods, which have been studied empirically in e.g. Chen et al. (2015, 2016), with the hope that the randomness in hash functions helps preserve the properties of neural networks despite a reduction in the number of effective parameters. We make this intuition formal by giving theoretical guarantees on the approximation power and the parameter recovery of such networks.
First, we propose a neural net compression scheme based on random linear sketching, which can be efficiently implemented using a hash function. A similar idea was proposed in Kasiviswanathan et al. (2017) and demonstrated high empirical performance, but no formal theoretical guarantee was known. We show that such compression has strong approximation power. Namely, the small network obtained after sketching can approximate the original network on all input data coming from any low-dimensional manifold with some regularity properties. The sketched network is also directly trainable via backpropagation. In fact, sketching is a principled technique for dimensionality reduction, which has been shown to be very powerful in solving various problems arising in statistics Raskutti and Mahoney (2016); Wang et al. (2017) and numerical linear algebra Woodruff (2014). Given its theoretical success, it is natural to ask whether sketching can be applied to neural net compression with theoretical guarantees. Our result makes partial progress on this question.
Next we study HashedNets, a simple method proposed in Chen et al. (2015) which appears to perform well in practice. HashedNets directly applies a random hash function to the connection weights in a neural net and forces all the weights mapped to the same hash bucket to take the same value. In this way the number of trainable parameters is reduced to the number of distinct hash buckets, and training can still be performed via backpropagation while taking the weight-sharing structure into account. From the perspective of optimization, we show that the training objective for a one-hidden-layer hashed neural net has a local strong convexity property, similar to that of a normal fully connected network Zhong et al. (2017b). Additionally, we can apply the initialization algorithm in Zhong et al. (2017b) to obtain a good initialization for training. This implies that the parameters of one-hidden-layer HashedNets can be provably recovered under mild assumptions.
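The weight-sharing mechanism described above can be illustrated with a minimal NumPy sketch. It is not the implementation of Chen et al. (2015): the helper names (`bucket_index`, `virtual_weights`, `shared_gradient`) are hypothetical, and a seeded pseudo-random table stands in for the hash function. The sketch shows the two facts the paragraph relies on: the full weight matrix is read off a short parameter vector, and the gradient of a shared parameter is the sum of the gradients of all weights in its bucket.

```python
import numpy as np

def bucket_index(n_out, n_in, n_buckets, seed=0):
    """Stand-in for the hash function: a fixed pseudo-random bucket id per (i, j)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_buckets, size=(n_out, n_in))

def virtual_weights(w, idx):
    """Expand the shared vector w into the full matrix W[i, j] = w[idx[i, j]]."""
    return w[idx]

def shared_gradient(grad_W, idx, n_buckets):
    """Gradient w.r.t. w: sum the gradients of all virtual weights in each bucket."""
    return np.bincount(idx.ravel(), weights=grad_W.ravel(), minlength=n_buckets)

n_out, n_in, n_buckets = 4, 6, 5
idx = bucket_index(n_out, n_in, n_buckets)
w = np.arange(n_buckets, dtype=float)      # the only trainable parameters
W = virtual_weights(w, idx)                # 4 x 6 matrix built from 5 values

x = np.ones(n_in)
# For the linear map y = W x with loss sum(y), dLoss/dW[i, j] = x[j].
grad_W = np.tile(x, (n_out, 1))
grad_w = shared_gradient(grad_W, idx, n_buckets)
```

With this toy loss, each entry of `grad_w` simply counts how many connections fall in the corresponding bucket, which makes the accumulation rule easy to check by hand.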
Below we describe our contributions in more detail.
Approximation Power.
Our result on the approximation power of sketched nets is based on a classical concept, “subspace embedding”, which originally appeared in numerical linear algebra Sarlós (2006). Roughly speaking, it says that there exists a wide family of random matrices , such that for any dimensional subspace , with probability we have for all , provided . This result means that the inner product between every two points in a subspace can be approximated simultaneously after applying a random sketching matrix , which is interesting if . There has been a line of work on subspace embedding using different sketching matrices (e.g. Nelson and Nguyên (2013); Cohen (2016)). Sparse matrices are of particular interest, since for a sparse matrix , one can compute more efficiently. For example, Nelson and Nguyên (2013) showed that it is possible to construct with only nonzero entries per column, which significantly improves the trivial upper bound . Furthermore, many of these sketching matrices can be efficiently implemented by wise independent hash functions where is very small, which take only a small amount of space to store, and multiplying with a vector can be computed efficiently.
We extend the idea of subspace embedding to deep learning and show that a feedforward fully connected network with Lipschitz-continuous activation functions can be approximated using random sketching on all input data coming from a low-dimensional subspace. Below we describe our result for one-hidden-layer neural nets; it can be generalized to multiple layers.
Consider a one-hidden-layer neural net with input dimension and hidden nodes. It can be parameterized by a weight matrix and a weight vector , and the function this network computes is , where is the input, and should be viewed as a nonlinear activation function acting coordinate-wise on a vector. Our result says that under appropriate assumptions, one can choose a random sketching matrix , such that for any dimensional subspace , we have
This result essentially says that the weight matrix can be replaced by , which has rank . When , this means that the effective number of parameters can be reduced from to . As we mentioned, the sketching matrix can be implemented by hash functions in small space and multiplying it with a vector is efficient. The sketched network is also directly trainable, because we can train the matrix , regarding another factor in the decomposition as a known layer.
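To make the replacement of the weight matrix concrete, the following NumPy sketch compares a one-hidden-layer network with its sketched counterpart on inputs from a low-dimensional subspace. Several choices here are assumptions made for illustration rather than the construction analyzed in this paper: the sketched weights are taken in the factored form `(W @ S.T) @ S`, `S` is a dense CountSketch-style matrix with one random sign per column, the activation is ReLU, and the weights and inputs are normalized so that the error can be read on an absolute scale.

```python
import numpy as np

def countsketch_matrix(m, n, rng):
    """Dense m x n CountSketch matrix: one random +/-1 entry per column."""
    S = np.zeros((m, n))
    rows = rng.integers(0, m, size=n)          # hash bucket of each coordinate
    signs = rng.choice([-1.0, 1.0], size=n)    # random sign of each coordinate
    S[rows, np.arange(n)] = signs
    return S

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
n, k, d, m = 2000, 30, 5, 1000   # input dim, hidden nodes, subspace dim, sketch rows

# Weights with unit-norm rows and ||v||_1 = 1, so errors are on an absolute scale.
W = rng.standard_normal((k, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
v = rng.standard_normal(k)
v /= np.abs(v).sum()

# Unit-norm inputs drawn from a random d-dimensional subspace.
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = U @ rng.standard_normal((d, 50))
X /= np.linalg.norm(X, axis=0, keepdims=True)

S = countsketch_matrix(m, n, rng)
W_small = W @ S.T                        # k x m factor that would be trained
f_full = v @ relu(W @ X)                 # original network
f_sketch = v @ relu(W_small @ (S @ X))   # sketched network, S kept as a fixed layer
max_err = np.max(np.abs(f_full - f_sketch))
print(f"parameters: {k * n} -> {k * m}, max abs error: {max_err:.4f}")
```

Only `W_small` would be trained in such a scheme; `S` is determined by the hash values and signs and can be regenerated from a seed instead of being stored.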
This result can be generalized to multilayer neural nets, and we present the details in Section 3. We also note that our result can be easily generalized to low-dimensional manifolds under some regularity conditions (see Definition 2.3 in Baraniuk and Wakin (2009)), which is a much more realistic assumption on data.
Parameter Recovery.
It is known that training a neural net is NP-hard in the worst case, even if it only has hidden nodes Blum and Rivest (1993). Recently, there has been some theoretical progress on understanding the optimization landscapes of shallow neural nets under special input distributions. In particular, Zhong et al. (2017b) gave a recovery guarantee for one-hidden-layer neural nets. They showed that if the input distribution is Gaussian and the ground-truth weight vectors corresponding to hidden nodes are linearly independent, then the true parameters can be recovered in polynomial time given finite samples. This was proved by showing that the training objective is locally strongly convex and smooth around the ground-truth point, together with an initialization method that can output a point inside the locally “nice” region. In this work, we show that local strong convexity and smoothness continue to hold if we replace the fully connected network by HashedNets, which has a weight-sharing structure enforced by a hash function. We present this result in Section 4.
1.1 Related Works
Parameter Reduction in Deep Learning.
There has been a series of empirical works on reducing the number of free parameters in deep neural networks: Denil et al. (2013) show a method to learn low-rank decompositions of weight matrices in each layer; Chen et al. (2015) propose using a hash function to enforce parameter sharing; Cheng et al. (2015) adopt a circulant matrix structure for parameter reduction; and Sindhwani et al. (2015) study a more general class of structured matrices for parameter reduction.
Sketching and Neural Networks.
Daniely et al. (2016) show that any linear or sparse polynomial function on sparse binary data can be computed by a small single-layer neural net on a linear sketch of the data. Kasiviswanathan et al. (2017) apply random sketching to weight matrices/tensors, but they only prove that, given a fixed layer input, the output of this layer using sketching matrices is an unbiased estimator of the original output of this layer and has bounded variance; this does not provide guarantees on the approximation power of the whole sketching-based deep net.
Subspace Embedding.
Subspace embedding Sarlós (2006) is a fundamental tool for solving numerical linear algebra problems, e.g. linear regression and matrix low-rank approximation Clarkson and Woodruff (2013); Nelson and Nguyên (2013); Razenshteyn et al. (2016); Song et al. (2017b), and tensor low-rank approximation Song et al. (2019). See also Woodruff (2014) for a survey on this topic.
Recovery Guarantee of Neural Networks.
Since learning a neural net is NP-hard in the worst case Blum and Rivest (1993), many attempts have been made to design algorithms that provably learn a neural net in polynomial time and sample complexity under additional assumptions, e.g., Sedghi and Anandkumar (2014); Zhang et al. (2015); Janzamin et al. (2015); Goel et al. (2017); Goel and Klivans (2017a, b). Another line of work focuses on analyzing (stochastic) gradient descent on shallow networks for Gaussian input distributions, e.g., Brutzkus and Globerson (2017); Zhong et al. (2017a, b); Tian (2017); Li and Yuan (2017); Du et al. (2017); Soltanolkotabi (2017).
Other Related Works.
In contrast to our focus on parameter reduction, several results develop an over-parameterization theory of deep ReLU neural networks, e.g. Allen-Zhu et al. (2018a, b). Thirty years ago, Blum and Rivest proved that training a neural network is NP-hard Blum and Rivest (1993). Later, learning neural networks was shown to be hard in the worst case from several different perspectives Klivans and Sherstov (2009); Livni et al. (2014); Daniely (2016); Daniely and Shalev-Shwartz (2016); Goel et al. (2017); Song et al. (2017a); Katz et al. (2017); Weng et al. (2018); Manurangsi and Reichman (2018). Arora et al. proved a stronger generalization bound for deep nets via a compression approach Arora et al. (2018). There is a long line of work on explaining GANs from a theoretical perspective Arora and Zhang (2017); Arora et al. (2017b, a); Bora et al. (2017); Li et al. (2018); Santurkar et al. (2018); Van Veen et al. (2018); Xiao et al. (2018). There is also a long line of provable results about adversarial examples Madry et al. (2017); Bubeck et al. (2018b, a); Weng et al. (2018); Schmidt et al. (2018); Tran et al. (2018).
2 Preliminaries
For any positive integer , we use to denote the set . Let represent any number in the interval . For any vector , we use , and to denote its , and norms, respectively. For , we use to denote the standard Euclidean inner product .
For a matrix , let denote its determinant (if is a square matrix), let denote the Moore-Penrose pseudoinverse of , and let and denote respectively the Frobenius norm and the spectral norm of . Denote by the th largest singular value of . We use to denote the number of nonzero entries in .
For any function , we define to be . In addition to notation, for two functions , we use the shorthand (resp. ) to indicate that (resp. ) for an absolute constant . We use to mean for constants .
We define the and balls in as:
We also need the definitions of Lipschitz-continuous functions and wise independent hash families.
Definition 2.1.
A function is Lipschitz continuous, if for all ,
Definition 2.2.
A family of hash functions is said to be wise independent if for any and any we have
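For readers who want a concrete instance, the standard textbook way to obtain a limited-independence family is to evaluate a random polynomial over a prime field. The sketch below is an illustration under stated assumptions, not a construction taken from this paper: the function name `make_kwise_hash`, the parameter names `k` and `num_buckets`, and the choice of the Mersenne prime 2^61 - 1 are ours. The values modulo the prime are exactly k-wise independent for keys smaller than the prime; the final reduction modulo `num_buckets` introduces a small non-uniformity of order `num_buckets / p`.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1  # field size p; keys must be smaller than p

def make_kwise_hash(k, num_buckets, seed=None):
    """Draw h(x) = (a_0 + a_1 x + ... + a_{k-1} x^{k-1} mod p) mod num_buckets."""
    rng = random.Random(seed)
    coeffs = [rng.randrange(MERSENNE_PRIME) for _ in range(k)]

    def h(x):
        acc = 0
        for a in reversed(coeffs):   # Horner's rule, highest degree first
            acc = (acc * x + a) % MERSENNE_PRIME
        return acc % num_buckets

    return h

h = make_kwise_hash(k=4, num_buckets=16, seed=1)
buckets = [h(x) for x in range(10)]
```

Storing such a function takes only `k` field elements, which is the sense in which these hash functions need a small amount of space.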
3 The Approximation Power of ParameterReduced Neural Networks
In this section, we study the approximation power of parameter-reduced neural nets based on hashing. Any weight matrix in a neural net acts on a vector as . We replace by for some sketching (rows columns) matrix defined in the following section. Then the new weight matrix has far fewer parameters. We show that if is chosen properly as a subspace embedding (formally defined later in this section), the sketched network can approximate the original network on all inputs coming from a low-dimensional subspace or manifold. Our sketching matrix is chosen as a Johnson-Lindenstrauss (JL) Johnson and Lindenstrauss (1984) transformation matrix. In Section 3.1, we provide some preliminaries on subspace embedding. In Sections 3.2 and 3.3, we present our result on one-hidden-layer neural nets. Then in Section 3.4 we extend this result to multilayer neural nets and show a similar approximation guarantee. This provides a theoretical guarantee for hashing-based parameter-reduced networks used in practice.
3.1 Subspace Embedding
We first present some basic definitions of sketching and subspace embedding. These mathematical tools are building blocks for us to understand parameter-reduced neural networks.
Definition 3.1 (Subspace Embedding).
A subspace embedding for the column space of an matrix is a matrix for which for all , or equivalently, for all ,
Constructions of subspace embeddings can be found in, e.g., Nelson and Nguyên (2013), which gives the following theorem.
Theorem 3.2 (Nelson and Nguyên (2013)).
There is an oblivious subspace embedding (the construction of the embedding matrix does not depend on the subspace) for matrix with rows and error probability . Further, can be computed in time . We call a SparseEmbedding matrix.
There are also other subspace embedding matrices, e.g., CountSketch. We provide additional definitions and examples in Section B.1.
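As a numerical illustration of the subspace embedding property, the following NumPy sketch applies a CountSketch matrix to an orthonormal basis of a random column space and inspects the singular values of the result. If all of them lie close to 1, then the sketch preserves the norm of every vector in the subspace up to a small relative error. The sizes, the seed, and the helper name `countsketch_apply` are arbitrary choices for this example, not parameters from Theorem 3.2.

```python
import numpy as np

def countsketch_apply(A, m, rng):
    """Return S @ A for an m-row CountSketch S without forming S explicitly."""
    n = A.shape[0]
    rows = rng.integers(0, m, size=n)          # bucket of each input coordinate
    signs = rng.choice([-1.0, 1.0], size=n)    # random sign of each coordinate
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, rows, signs[:, None] * A)    # row i of A is added to row rows[i]
    return SA

rng = np.random.default_rng(0)
n, d, m = 20000, 4, 4000
A = rng.standard_normal((n, d))
U, _ = np.linalg.qr(A)                         # orthonormal basis of col(A)
sv = np.linalg.svd(countsketch_apply(U, m, rng), compute_uv=False)
print("singular values of S U:", np.round(sv, 3))
```

Applying the sketch costs time proportional to the number of nonzero entries of the input, since each row of `A` is touched exactly once.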
Remark 3.3.
We remark that the subspace embedding in Definition 3.1 naturally extends to low-dimensional manifolds. For example, for a dimensional Riemannian submanifold of with volume and geodesic covering regularity (see Definition 2.3 in Baraniuk and Wakin (2009)), Theorem 3.2 holds by replacing with . For ease of presentation, we only present our results for subspaces. All our results can be extended to low-dimensional manifolds satisfying regularity conditions.
3.2 One Hidden Layer - Part I
We consider one-hidden-layer neural nets in the form , where is the input vector, is a weight vector, is a weight scalar, and is a nonlinear activation function. In this subsection, we show how to sketch the weights between the input layer and the hidden layer with guaranteed approximation power. The main result is Theorem 3.4, and its proof is in Appendix B.
Theorem 3.4.
Given parameters and . Given activation functions that are Lipschitz-continuous, a fixed matrix , weight matrix with , with . Choose a SparseEmbedding matrix with , then with probability , we have : for all ,
3.3 One Hidden Layer - Part II
In this section, we show the approximation power of the compressed network when the weight matrices of both the input layer and the output layer are sketched. One of the core ideas in the proof is a recursive net argument, which plays a crucial role in extending the result to multiple hidden layers. The goal of this section is to prove the following theorem and present the recursive net argument.
Theorem 3.5.
Given parameters and . Given activation functions with Lipschitz and normalized by , a fixed matrix , and weight matrix with , with . Choose a SparseEmbedding matrix and with , then with probability , we have : for all ,
The high-level idea is as follows. First, we prove that for any fixed input , the theorem statement holds with high probability. Then we build a sufficiently fine net over the input space of and argue that our statement holds for every input point from the net. Conditioned on this event, the statement holds for all inputs by applying the Lipschitz continuity of the activation function. The detailed proof is presented in Appendix B.
3.4 Multiple Hidden Layers
In this section, we generalize our approximation power result to multilayer neural networks and defer the proofs to Appendix B. Inspired by batch normalization Ioffe and Szegedy (2015), which is widely used in practice (see, e.g., https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization), we make an additional assumption by requiring the activations to be normalized by at each layer . To handle multiple hidden layers, we first argue recursively that an $\epsilon$-net of the same size can be constructed for all the layers. Then we use the triangle inequality to split the error into terms and bound them separately. The result is the following theorem.
Theorem 3.6.
Given parameters , and . For each , for each let denote an activation function with Lipschitz and normalized by . Given a fixed matrix , weight matrices with (the th column of ) , a weight vector with . For each , we choose a SparseEmbedding matrix with
Then with probability , we have : for all ,
where and are defined inductively. The base case is and the inductive case is
Note that similar results also hold for the case without using . In other words, we only choose matrices for hidden layers.
4 Recovery Guarantee
In this section, we study the recovery guarantee of parameter-reduced neural nets. In particular, we study whether (stochastic) gradient descent can learn the true parameters in a one-hidden-layer HashedNets when starting from a sufficiently good initialization point, under appropriate assumptions. We show that even under the special weight-sharing structure imposed by the hash function, the resulting neural net still has sufficiently nice properties, namely local strong convexity and smoothness around the minimizer. Our proof technique is to reduce our case to that of the fully connected network studied in Zhong et al. (2017b). After that, the recovery guarantee follows similarly. We present our result here and give the detailed proof in Appendix C.
We consider the following regression problem : given a set of samples Let be an underlying distribution over with parameters and , such that each sample is sampled i.i.d. from this distribution, with Here is a random hash function drawn from a wise independent hash family , where , and is an activation function.
Note that has a corresponding matrix defined as , which is the actual weight matrix in the HashedNets with a weight sharing structure.
Our goal is to recover the groundtruth parameters given the sample . Note that how to recover has been discussed in Zhong et al. (2017a, b); their method also applies to our situation. Therefore we focus on recovering in this section, assuming is known.
For a given weight vector , we define its expected risk and empirical risk as
We first show a structural result for wise independent hash families, which says that the preimages of the buckets are well balanced.
Lemma 4.1 (Concentration of hashing buckets, part of Lemma C.12).
Given integers and . Let denote a wise independent hash function such that Then, if , with probability at least , we have for all ,
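The balance property behind this lemma is easy to observe empirically. The following NumPy sketch assigns items to buckets and measures how far the bucket sizes deviate from their mean. It uses fully random assignments as a stand-in, which is a stronger assumption than the limited independence required by the lemma, and the numbers of items and buckets are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, num_buckets = 100_000, 100

# Fully random bucket assignment (stand-in for a limited-independence hash).
assignments = rng.integers(0, num_buckets, size=num_items)
sizes = np.bincount(assignments, minlength=num_buckets)

expected = num_items / num_buckets
max_rel_dev = np.max(np.abs(sizes - expected)) / expected
print(f"expected size {expected:.0f}, max relative deviation {max_rel_dev:.3f}")
```

When the number of items per bucket is large, every bucket size stays within a small relative distance of the average, which is what allows the hashed network to be compared with a fully connected one.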
The previous work Zhong et al. (2017b) showed that a fully connected network whose ground-truth weight matrix has rank has local strong convexity and smoothness around in its loss function (see their Lemma D.3).
Using Lemma 4.1 as our core reduction tool, we can reduce HashedNets to a fully connected net and obtain the following result:
Theorem 4.2 (Local strong convexity and smoothness).
Suppose . Then we have
where and are positive parameters that depend on and the activation function .
Remark 4.3.
For the empirical risk , we can show that its Hessian at the optimal point also satisfies similar properties given enough samples. See Theorem C.7 for details.
Using the tensor initialization method in Zhong et al. (2017b), we can find a point in the locally “nice region” around , and then we can show that gradient descent on the empirical risk function converges linearly to . The result is summarized as follows.
Theorem 4.4 (Recovery guarantee).
There exist parameters and that depend on and such that the following is true. Let be any point satisfying and let denote a set of i.i.d. samples from the distribution . Define where and are the same ones in Theorem 4.2. For any , if we choose and perform gradient descent with step size on and obtain the next iterate, then with probability at least , we have
The above theorem states that once an initialization point of constant accuracy is specified, we can obtain a solution up to precision in a polynomial number of gradient descent iterations. This concludes the recovery guarantee. We give the formal statements and proofs in Section C.
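The local behaviour described in this section can be tried out on a toy problem. The sketch below runs plain gradient descent on the empirical squared loss of a small one-hidden-layer network whose weight matrix is shared through bucket indices, starting from a perturbation of the ground truth. Everything specific in it is an assumption made for illustration: the tanh activation, the Gaussian inputs, the random index table standing in for the hash function, the known output weights `v`, the perturbed starting point (used instead of tensor initialization), and the step size.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, B, N = 10, 5, 10, 2000            # input dim, hidden nodes, buckets, samples
idx = rng.integers(0, B, size=(k, d))   # stand-in hash: bucket of each (i, j)
v = rng.choice([-1.0, 1.0], size=k)     # output weights, assumed known
w_true = rng.standard_normal(B) / np.sqrt(d)

def predict(w, X):
    W = w[idx]                           # k x d matrix with shared entries
    return np.tanh(X @ W.T) @ v

def loss_and_grad(w, X, y):
    W = w[idx]
    H = np.tanh(X @ W.T)                 # N x k hidden activations
    r = H @ v - y                        # residuals
    # dLoss/dW[i, j] = mean_n r_n * v_i * (1 - H[n, i]^2) * X[n, j]
    G = ((r[:, None] * (1.0 - H**2)) * v).T @ X / len(y)
    # Shared parameters collect the gradients of all entries in their bucket.
    grad_w = np.bincount(idx.ravel(), weights=G.ravel(), minlength=B)
    return 0.5 * np.mean(r**2), grad_w

X = rng.standard_normal((N, d))
y = predict(w_true, X)
w = w_true + 0.05 * rng.standard_normal(B)   # start near the ground truth
initial_loss, _ = loss_and_grad(w, X, y)
for _ in range(1000):
    _, g = loss_and_grad(w, X, y)
    w -= 0.005 * g
final_loss, _ = loss_and_grad(w, X, y)
print(f"loss: {initial_loss:.2e} -> {final_loss:.2e}")
```

The example only illustrates that the loss decreases from a nearby starting point; it does not verify the strong convexity constants or the sample complexity in Theorem 4.4.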
5 Experiments
In this section, we perform some simple experiments on the MNIST dataset to evaluate the performance of HashedNets, and to empirically verify the full-rank assumption (as in Theorem 4.2) on the weight matrices in HashedNets. Each image in the MNIST dataset has a dimensionality of $28 \times 28$. The HashedNets in the experiments have a single hidden layer, i.e., two fully connected layers. To validate the effectiveness of HashedNets, we construct two baselines.

- SmallNets. A single-hidden-layer network is constructed with the same number of effective weights as HashedNets. For example, for a HashedNets with hidden units in the hidden layer and compression ratio 64, the corresponding SmallNets has hidden units in the hidden layer.
- ThinNets. A two-hidden-layer network is constructed with the same number of effective weights as HashedNets. By replacing the first fully connected layer in HashedNets with a thin hidden layer, the same number of weights can be achieved. For example, for a HashedNets with hidden units in the hidden layer and compression ratio 64, the corresponding ThinNets has hidden units in the first hidden layer and 1000 hidden units in the second hidden layer.
The accuracy of HashedNets, SmallNets, and ThinNets is compared under various compression ratios.
The HashedNets were implemented in Torch7 Collobert et al. (2011); Chen et al. (2015) and validated on an NVIDIA GTX 1080 GPU. We used 32-bit floating-point precision throughout the experiments. Stochastic gradient descent was adopted as the numerical optimizer with a dropout keep rate of 0.9, momentum of 0.9, and a batch size of 50. ReLU was used as the activation function. We ran 1000 epochs for each experiment, and conducted experiments on two single-hidden-layer HashedNets with 500 and 1000 hidden units, respectively. The number of units in SmallNets and ThinNets is adjusted to match the number of weights in HashedNets.
For different compression ratios, we plot the distribution of testing errors over 50 runs of HashedNets, ThinNets, and SmallNets with 50 different random seeds, as shown in Figure 1. Because of random initialization, SmallNets also gives different results across independent runs. In Figure 1(a), when the compression ratio is 1, which indicates no compression, the distributions for HashedNets and SmallNets are very close, with means of 1.37% and 1.40% and standard deviations of 0.050% and 0.038%, respectively. ThinNets provides a slightly better testing error, with a mean of 1.27% and a standard deviation of 0.057%. In Figure 1(b), when the compression ratio is 8, HashedNets provides smaller testing errors than SmallNets, with means of 1.43% vs. 1.76% and standard deviations of 0.052% vs. 0.070%, respectively. ThinNets provides slightly better testing errors than HashedNets, with means of 1.32% vs. 1.44% and standard deviations of 0.060% vs. 0.056%. In other words, both HashedNets and ThinNets achieve higher and more robust accuracy than SmallNets at this compression ratio, with improvements in both mean and standard deviation. With the compression ratio increasing to 64, as shown in Figure 1(c), the benefit of HashedNets is more significant. The mean testing error for HashedNets rises to 2.80%, while that for SmallNets rises to 6.09%. The errors for SmallNets are also less stable, with a larger standard deviation. Meanwhile, HashedNets also provides better accuracy than ThinNets, with means of 2.80% vs. 5.03% and standard deviations of 0.090% vs. 0.196%. When the compression ratio is 128, as shown in Figure 1(d), HashedNets achieves a mean testing error of 4.20% with a standard deviation of 0.116%, which is much better than ThinNets (mean testing error 11.09%, standard deviation 0.160%) and SmallNets (mean testing error 10.28%, standard deviation 0.810%). In summary, when the compression ratio increases from 1 to 128, there is on average a 2.83% degradation in accuracy for HashedNets, while the accuracy of SmallNets degrades by 8.88% and that of ThinNets by 9.82%. ThinNets may achieve error comparable to HashedNets for small compression ratios (e.g., 1 and 8), while for large compression ratios HashedNets tends to be more stable.
Figure 2 plots the training curves of different networks under different compression ratios with random seed 100. The testing errors align with the observations from Figure 1. That is, HashedNets provides high and stable accuracy across various compression ratios; ThinNets achieves good accuracy for small compression ratios (e.g., for 1 and 8, the accuracy is close to that of HashedNets) but degrades significantly as the compression ratio increases; SmallNets, like ThinNets, is also very sensitive to large compression ratios.
We further verify the full-rank assumption on the weight matrices in HashedNets. Figure 3 plots the distributions of minimum and maximum singular values, condition numbers, and stable ranks of the two weight matrices and in HashedNets with 1000 hidden units. The dimensions of are and those of are . The distributions are extracted from the aforementioned 50 runs. Figure 4 (in the Appendix) gives one example of all singular values sorted from large to small in one experiment. All the singular values and condition numbers are distributed on reasonable scales, i.e., neither too close to zero nor too large. This experiment indicates that the full-rank assumption holds in practice. The same set of figures is also provided for HashedNets with 500 hidden units, as shown in Figure 5 (in the Appendix), where the dimensions of are and those of are .
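The quantities reported in this experiment can be computed from the singular values of a weight matrix. The following NumPy sketch is one way to do so; the helper name `spectrum_summary` is ours, and the matrix in the usage example is a randomly generated hashed matrix with arbitrary dimensions, not a trained one from the experiments. The stable rank is taken here as the squared Frobenius norm divided by the squared spectral norm.

```python
import numpy as np

def spectrum_summary(W):
    """Return min/max singular value, condition number and stable rank of W."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values, sorted descending
    return {
        "sigma_min": s[-1],
        "sigma_max": s[0],
        "condition_number": s[0] / s[-1],
        "stable_rank": np.sum(s**2) / s[0] ** 2,   # ||W||_F^2 / ||W||_2^2
    }

rng = np.random.default_rng(0)
idx = rng.integers(0, 500, size=(100, 784))  # hypothetical bucket assignment
w = rng.standard_normal(500)                 # hypothetical shared parameters
summary = spectrum_summary(w[idx])
print(summary)
```

A minimum singular value bounded away from zero, together with a moderate condition number, is the numerical signature of a full-rank matrix.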
6 Conclusion
In this paper, we study the theoretical properties of hashing-based neural networks. We show that (i) parameter-reduced neural nets have uniform approximation power on inputs from any low-dimensional subspace or smooth and well-conditioned manifold; (ii) one-hidden-layer HashedNets have a recovery guarantee similar to that of fully connected neural nets. We also empirically explore an alternative compression scheme, ThinNets, which is an interesting direction for further study; we plan to explore its properties and theoretical insights in the future.
Acknowledgments
This project was done while Zhao was visiting Harvard and hosted by Prof. Jelani Nelson. In 2017, Prof. Jelani Nelson and Prof. Piotr Indyk co-taught a class “Sketching Algorithms for Big Data”. One of the class projects was “Understanding hashing trick of neural network”, which was the starting point of this project. The authors would like to thank Zhixian Lei for inspiring this project.
The authors would like to thank Wei Hu for his generous help and contribution to this project.
The authors would like to thank Rasmus Kyng, Eric Price, Zhengyu Wang and Peilin Zhong for useful discussions.
References
 Allen-Zhu et al. [2018a] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In arXiv preprint. https://arxiv.org/pdf/1811.03962, 2018a.
 Allen-Zhu et al. [2018b] Z. Allen-Zhu, Y. Li, and Z. Song. On the convergence rate of training recurrent neural networks. In arXiv preprint. https://arxiv.org/pdf/1810.12065, 2018b.
 Amodei et al. [2016] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In International Conference on Machine Learning, pages 173–182, 2016.
 Arora and Zhang [2017] S. Arora and Y. Zhang. Do gans actually learn the distribution? an empirical study. arXiv preprint arXiv:1706.08224, 2017.
 Arora et al. [2017a] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang. Generalization and equilibrium in generative adversarial nets (gans). In ICML. arXiv preprint arXiv:1703.00573, 2017a.
 Arora et al. [2017b] S. Arora, A. Risteski, and Y. Zhang. Theoretical limitations of encoder-decoder gan architectures. arXiv preprint arXiv:1711.02651, 2017b.
 Arora et al. [2018] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. In ICML. arXiv preprint arXiv:1802.05296, 2018.
 Baraniuk and Wakin [2009] R. G. Baraniuk and M. B. Wakin. Random projections of smooth manifolds. Foundations of computational mathematics, 9(1):51–77, 2009.
 Bellare and Rompel [1994] M. Bellare and J. Rompel. Randomness-efficient oblivious sampling. In Foundations of Computer Science, 1994 Proceedings., 35th Annual Symposium on, pages 276–287. IEEE, 1994.
 Blum and Rivest [1993] A. L. Blum and R. L. Rivest. Training a 3-node neural network is np-complete. In Machine learning: From theory to applications, pages 9–28. Springer, 1993.
 Bora et al. [2017] A. Bora, A. Jalal, E. Price, and A. G. Dimakis. Compressed sensing using generative models. In ICML. arXiv preprint arXiv:1703.03208, 2017.
 Bourgain et al. [2010] J. Bourgain, V. H. Vu, and P. M. Wood. On the singularity probability of discrete random matrices. Journal of Functional Analysis, 258(2):559–603, 2010.
 Brutzkus and Globerson [2017] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In ICML, pages 605–614, 2017.
 Bubeck et al. [2018a] S. Bubeck, Y. T. Lee, E. Price, and I. Razenshteyn. Adversarial examples from cryptographic pseudorandom generators. arXiv preprint arXiv:1811.06418, 2018a.
 Bubeck et al. [2018b] S. Bubeck, E. Price, and I. Razenshteyn. Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204, 2018b.
 Charikar et al. [2002] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Automata, Languages and Programming, pages 693–703. Springer, 2002.
 Chen et al. [2015] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.

 Chen et al. [2016] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural networks in the frequency domain. In KDD, pages 1475–1484. ACM, 2016.
 Cheng et al. [2015] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang. An exploration of parameter redundancy in deep networks with circulant projections. In Proceedings of the IEEE International Conference on Computer Vision, pages 2857–2865, 2015.
 Clarkson and Woodruff [2013] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference, STOC’13, Palo Alto, CA, USA, June 1-4, 2013, pages 81–90. https://arxiv.org/pdf/1207.6365, 2013.
 Cohen [2016] M. B. Cohen. Nearly tight oblivious subspace embeddings by trace inequalities. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Arlington, VA, USA, January 10-12, 2016, pages 278–287, 2016.
 Collobert et al. [2011] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.

 Daniely [2016] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 105–117. ACM, 2016.
 Daniely and Shalev-Shwartz [2016] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning DNFs. In Conference on Learning Theory (COLT), pages 815–830, 2016.
 Daniely et al. [2016] A. Daniely, N. Lazic, Y. Singer, and K. Talwar. Sketching and neural networks. arXiv preprint arXiv:1604.05753, 2016.
 Denil et al. [2013] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, pages 2148–2156, 2013.
 Du et al. [2017] S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
 Gehring et al. [2017] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
 Goel and Klivans [2017a] S. Goel and A. Klivans. Eigenvalue decay implies polynomial-time learnability for neural networks. In NIPS, pages 2189–2199, 2017a.
 Goel and Klivans [2017b] S. Goel and A. Klivans. Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010, 2017b.
 Goel et al. [2017] S. Goel, V. Kanade, A. Klivans, and J. Thaler. Reliably learning the relu in polynomial time. In COLT, pages 1004–1042, 2017.

 Graves et al. [2013] A. Graves, A.-r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.
 Han et al. [2016] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR). arXiv preprint arXiv:1510.00149, 2016.

 He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Iandola et al. [2016] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. Squeezenet: AlexNet-level accuracy with 50x fewer parameters and < 0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
 Ioffe and Szegedy [2015] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 Janzamin et al. [2015] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of nonconvexity: Guaranteed training of neural networks using tensor methods. In arXiv preprint. https://arxiv.org/pdf/1506.08473, 2015.
 Johnson and Lindenstrauss [1984] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
 Kasiviswanathan et al. [2017] S. P. Kasiviswanathan, N. Narodytska, and H. Jin. Deep neural network approximation using tensor sketching. arXiv preprint arXiv:1710.07850, 2017.
 Katz et al. [2017] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification (CAV), pages 97–117. Springer, 2017.
 Klivans and Sherstov [2009] A. R. Klivans and A. A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. Journal of Computer and System Sciences, 75(1):2–12, 2009.
 Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 Li et al. [2018] J. Li, A. Madry, J. Peebles, and L. Schmidt. On the limitations of first order approximation in gan dynamics. In ICML, 2018.
 Li and Yuan [2017] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In NIPS, pages 597–607, 2017.
 Lillicrap et al. [2015] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Livni et al. [2014] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 855–863, 2014.
 Madry et al. [2017] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR. arXiv preprint arXiv:1706.06083, 2017.
 Manurangsi and Reichman [2018] P. Manurangsi and D. Reichman. The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207, 2018.
 Nelson and Nguyên [2013] J. Nelson and H. L. Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, https://arxiv.org/pdf/1211.1002, 2013.

 Raskutti and Mahoney [2016] G. Raskutti and M. Mahoney. A statistical perspective on randomized sketching for ordinary least-squares. JMLR, 2016.
 Razenshteyn et al. [2016] I. Razenshteyn, Z. Song, and D. P. Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), 2016.
 Salmond et al. [2014] D. Salmond, A. Grant, I. Grivell, and T. Chan. On the rank of random matrices over finite fields. arXiv preprint arXiv:1404.3250, 2014.
 Santurkar et al. [2018] S. Santurkar, L. Schmidt, and A. Madry. A classification-based study of covariate shift in gan distributions. In International Conference on Machine Learning (ICML), pages 4487–4496, 2018.
 Sarlós [2006] T. Sarlós. Improved approximation algorithms for large matrices via random projections. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 21–24 October 2006, Berkeley, California, USA, Proceedings, pages 143–152, 2006.
 Schmidt et al. [2018] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Mądry. Adversarially robust generalization requires more data. In NeurIPS. arXiv preprint arXiv:1804.11285, 2018.
 Sedghi and Anandkumar [2014] H. Sedghi and A. Anandkumar. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.
 Silver et al. [2016] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
 Silver et al. [2017] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Sindhwani et al. [2015] V. Sindhwani, T. Sainath, and S. Kumar. Structured transforms for small-footprint deep learning. In Advances in Neural Information Processing Systems, pages 3088–3096, 2015.
 Soltanolkotabi [2017] M. Soltanolkotabi. Learning relus via gradient descent. In NIPS, pages 2004–2014, 2017.
 Song et al. [2017a] L. Song, S. Vempala, J. Wilmes, and B. Xie. On the complexity of learning neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 5514–5522, 2017a.
 Song et al. [2017b] Z. Song, D. P. Woodruff, and P. Zhong. Low rank approximation with entrywise norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017b.
 Song et al. [2019] Z. Song, D. P. Woodruff, and P. Zhong. Relative error tensor low rank approximation. In SODA. https://arxiv.org/pdf/1704.08246, 2019.

 Thorup and Zhang [2012] M. Thorup and Y. Zhang. Tabulation-based 5-independent hashing with applications to linear probing and second moment estimation. SIAM Journal on Computing, 41(2):293–331, 2012.
 Tian [2017] Y. Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.
 Tran et al. [2018] B. Tran, J. Li, and A. Madry. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems (NeurIPS), pages 8010–8020, 2018.
 Van Veen et al. [2018] D. Van Veen, A. Jalal, E. Price, S. Vishwanath, and A. G. Dimakis. Compressed sensing with deep image prior and learned regularization. arXiv preprint arXiv:1806.06438, 2018.

 Wang et al. [2017] S. Wang, A. Gittens, and M. W. Mahoney. Sketched ridge regression: Optimization perspective, statistical perspective, and model averaging. In ICML, 2017.
 Weng et al. [2018] T.-W. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, D. Boning, I. S. Dhillon, and L. Daniel. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning (ICML). arXiv preprint arXiv:1804.09699, 2018.
 Woodruff [2014] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1–2):1–157, 2014.
 Xiao et al. [2018] C. Xiao, P. Zhong, and C. Zheng. Bourgan: Generative networks with metric embeddings. In NeurIPS. arXiv preprint arXiv:1805.07674, 2018.
 Zaremba et al. [2014] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Zhang et al. [2015] Y. Zhang, J. D. Lee, M. J. Wainwright, and M. I. Jordan. Learning halfspaces and neural networks with random initialization. arXiv preprint arXiv:1511.07948, 2015.
 Zhong et al. [2017a] K. Zhong, Z. Song, and I. S. Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.
 Zhong et al. [2017b] K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In ICML, 2017b.
Appendix
Appendix A Notation
For any positive integer $n$, we use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. For a random variable $X$, let $\mathbb{E}[X]$ denote the expectation of $X$ (if this quantity exists). For any vector $x \in \mathbb{R}^n$, we use $\|x\|$ to denote its $\ell_2$ norm.
We provide several definitions related to a matrix $A$. Let $\det(A)$ denote the determinant of a square matrix $A$. Let $A^\top$ denote the transpose of $A$. Let $A^\dagger$ denote the Moore-Penrose pseudoinverse of $A$. Let $A^{-1}$ denote the inverse of a full-rank square matrix. Let $\|A\|_F$ denote the Frobenius norm of $A$. Let $\|A\|$ denote the spectral norm of $A$. Let $\sigma_i(A)$ denote the $i$-th largest singular value of $A$.
For any function $f$, we define $\widetilde{O}(f)$ to be $f \cdot \log^{O(1)}(f)$. In addition to $O(\cdot)$ notation, for two functions $f, g$, we use the shorthand $f \lesssim g$ (resp. $f \gtrsim g$) to indicate that $f \leq C g$ (resp. $f \geq C g$) for an absolute constant $C$. We use $f \eqsim g$ to mean $c f \leq g \leq C f$ for constants $c, C$.
We state a trivial fact that connects norm with norm.
Fact A.1.
For any vector , we have .
Appendix B Neural Subspace Embedding
B.1 Preliminaries
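Definition B.1 (Johnson-Lindenstrauss Transform, Johnson and Lindenstrauss [1984]).
A random matrix $S \in \mathbb{R}^{k \times n}$ forms a Johnson-Lindenstrauss transform with parameters $\epsilon, \delta, f$, or JLT($\epsilon, \delta, f$) for short, if with probability at least $1 - \delta$, for any $f$-element subset $V \subset \mathbb{R}^n$, for all $v, v' \in V$ it holds that
$$ |\langle S v, S v' \rangle - \langle v, v' \rangle| \leq \epsilon \|v\|_2 \|v'\|_2 . $$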
The well-known CountSketch matrix from the data stream literature Charikar et al. [2002], Thorup and Zhang [2012] is both a subspace embedding and a JL matrix. Its definition is as follows.
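Definition B.2 (CountSketch).
Let $S \in \mathbb{R}^{m \times n}$ denote a matrix. We choose a random hash function $h : [n] \to [m]$, and choose a random hash function $\sigma : [n] \to \{-1, +1\}$. We set
$$ S_{h(i), i} = \sigma(i) \quad \text{for all } i \in [n], $$
and all other entries of $S$ to zero.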
The CountSketch matrix gives the following subspace embedding result.
Theorem B.3 (Clarkson and Woodruff [2013], Nelson and Nguyên [2013]).
For any , and for a CountSketch matrix rows, then with probability , for any fixed matrix , is a subspace embedding for . The matrix product can be computed in time. Further, all of this holds if the hash function defining is only pairwise independent, and the sign function defining is only 4-wise independent.
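The construction in Definition B.2 and the running-time remark in Theorem B.3 can be illustrated with a minimal sketch. It assumes the usual reading of the theorem's time bound, namely that the product of the sketch with a sparse matrix costs time proportional to the number of nonzeros. The names `make_countsketch`, `countsketch_dense`, and `apply_countsketch` are hypothetical, and the hash functions here are fully random rather than pairwise or 4-wise independent.

```python
import random


def make_countsketch(n, m, rng):
    """Draw the two hash functions that define an m x n CountSketch matrix.

    h[i] is the row (bucket) assigned to column i and sigma[i] is its sign.
    Assumption: both are fully random, which is stronger than the limited
    independence the theorem asks for.
    """
    h = [rng.randrange(m) for _ in range(n)]
    sigma = [rng.choice((-1, 1)) for _ in range(n)]
    return h, sigma


def countsketch_dense(h, sigma, m):
    """Materialize the matrix: entry (h[i], i) equals sigma[i], all others 0."""
    n = len(h)
    S = [[0] * n for _ in range(m)]
    for i in range(n):
        S[h[i]][i] = sigma[i]
    return S


def apply_countsketch(h, sigma, m, d, triples):
    """Multiply the sketch by a sparse n x d matrix without forming the sketch.

    The sparse matrix is given as (row, column, value) triples. Each nonzero
    is visited once, so the loop costs time proportional to the number of
    nonzeros, plus m * d to allocate the output.
    """
    SA = [[0.0] * d for _ in range(m)]
    for i, j, value in triples:
        SA[h[i]][j] += sigma[i] * value
    return SA
```

Every column of the materialized matrix has exactly one nonzero entry, which is why the scatter-add in `apply_countsketch` agrees with an explicit matrix product.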
Definition B.4 (Oblivious Subspace Embedding (OSE), Definition 2.2 in Woodruff [2014]).
Suppose $\Pi$ is a distribution on $r \times n$ matrices $S$, where $r$ is a function of $n, d, \epsilon$, and $\delta$. Suppose that with probability at least $1 - \delta$, for any fixed $n \times d$ matrix $A$, a matrix $S$ drawn from distribution $\Pi$ has the property that $S$ is a $(1 \pm \epsilon)$ $\ell_2$-subspace embedding for $A$. Then we call $\Pi$ an $(\epsilon, \delta)$ oblivious $\ell_2$-subspace embedding.
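The word "oblivious" in Definition B.4 means the sketch is drawn without looking at the matrix it will be applied to. The following sketch illustrates this with a CountSketch drawn first and a matrix generated afterwards. It assumes the usual meaning of a subspace embedding, that the sketch preserves the $\ell_2$ norm of every vector in the column space up to a factor close to 1. The function names are hypothetical, and sampling random directions only gives a lower bound on the true worst-case distortion.

```python
import math
import random


def draw_countsketch(n, m, rng):
    """Draw bucket and sign hashes; nothing about the data matrix is used."""
    h = [rng.randrange(m) for _ in range(n)]
    sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return h, sigma


def sketch_vector(h, sigma, m, y):
    """Apply the CountSketch defined by (h, sigma) to a dense vector y."""
    out = [0.0] * m
    for i, value in enumerate(y):
        out[h[i]] += sigma[i] * value
    return out


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(t * t for t in v))


def max_distortion(A, h, sigma, m, rng, trials=50):
    """Largest observed | ||S A x|| / ||A x|| - 1 | over random directions x.

    This is an empirical probe, not a certificate: the definition quantifies
    over all x, while only `trials` random directions are tested here.
    """
    n, d = len(A), len(A[0])
    worst = 0.0
    for _ in range(trials):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = [sum(A[i][j] * x[j] for j in range(d)) for i in range(n)]
        ratio = norm(sketch_vector(h, sigma, m, y)) / norm(y)
        worst = max(worst, abs(ratio - 1.0))
    return worst
```

A typical use draws `h, sigma = draw_countsketch(n, m, rng)` once and then reuses the same pair for any matrix with `n` rows.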
Nelson and Nguyên [2013] provide another construction of subspace embeddings.
Definition B.5 (SparseEmbedding).
Let denote a matrix. For each , . For a random draw , let be an indicator random variable for the event , and write , where the are random signs. satisfies the following two properties: for each , ; for any set , .
Lemma B.6 (Lemma 2.2 in Woodruff [2014]).
Let , for any , there exists a net of for which .
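To make the net in Lemma B.6 concrete, here is a small sketch for the simplest case, assuming the subspace is the plane $\mathbb{R}^2$ so that the set being covered is the unit circle. It also assumes the cardinality bound has the usual form $(1 + 4/\gamma)^d$ for a $\gamma$-net of the unit sphere of a $d$-dimensional subspace. The helper names are hypothetical.

```python
import math


def circle_net(gamma):
    """Return a gamma-net of the unit circle made of equally spaced points.

    With N points, the unit vector farthest from the net lies halfway between
    two neighbours, at chord distance 2 * sin(pi / (2 * N)). Requiring this
    to be at most gamma gives N >= pi / (2 * asin(gamma / 2)).
    """
    count = math.ceil(math.pi / (2.0 * math.asin(gamma / 2.0)))
    return [
        (math.cos(2.0 * math.pi * k / count), math.sin(2.0 * math.pi * k / count))
        for k in range(count)
    ]


def distance_to_net(point, net):
    """Euclidean distance from a point to its nearest net point."""
    return min(math.hypot(point[0] - p[0], point[1] - p[1]) for p in net)


def covers_circle(net, gamma, samples=10000):
    """Check on a fine grid of angles that every unit vector is within gamma."""
    for s in range(samples):
        theta = 2.0 * math.pi * s / samples
        point = (math.cos(theta), math.sin(theta))
        if distance_to_net(point, net) > gamma + 1e-12:
            return False
    return True
```

The equally spaced net is much smaller than the general bound requires; the bound matters because it holds in every dimension, where no such explicit construction is needed.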
B.2 Proof of Theorem 3.4
Proof.
Using Theorem 3.2, we choose , with probability , we have : for fixed vectors and for all ,
Using the Lipschitz property of , we can show that
(1) 
where the first step follows by Property of function , the last step follows by . It remains to bound
where the first step follows by triangle inequality, the third step follows by Eq. (B.2), the fourth step follows by , and the last step follows from . Therefore, it suffices to choose
∎
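The two ingredients of the proof above, the Lipschitz property of the activation and the triangle inequality, can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the network to be $f(x) = \sum_i v_i \phi(\langle w_i, x \rangle)$ with ReLU as a 1-Lipschitz $\phi$, and the compressed network to replace each inner product by $\langle S w_i, S x \rangle$ for a CountSketch $S$. Neither form is taken from the statement of Theorem 3.4, and all function names are hypothetical.

```python
import random


def relu(t):
    """ReLU activation, which is 1-Lipschitz."""
    return max(t, 0.0)


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(s * t for s, t in zip(a, b))


def sketch(h, sigma, m, y):
    """Apply the CountSketch defined by bucket hash h and sign hash sigma."""
    out = [0.0] * m
    for i, value in enumerate(y):
        out[h[i]] += sigma[i] * value
    return out


def network(weights, v, x):
    """Assumed one-hidden-layer model: sum_i v_i * relu(<w_i, x>)."""
    return sum(v_i * relu(dot(w_i, x)) for w_i, v_i in zip(weights, v))


def sketched_network(weights, v, x, h, sigma, m):
    """Same model with every inner product taken in the sketched space."""
    Sx = sketch(h, sigma, m, x)
    return sum(
        v_i * relu(dot(sketch(h, sigma, m, w_i), Sx))
        for w_i, v_i in zip(weights, v)
    )


def lipschitz_bound(weights, v, x, h, sigma, m, lipschitz=1.0):
    """Upper bound on the output gap from Lipschitzness and triangle inequality.

    |f_sketched(x) - f(x)| <= L * sum_i |v_i| * |<S w_i, S x> - <w_i, x>|.
    """
    Sx = sketch(h, sigma, m, x)
    return lipschitz * sum(
        abs(v_i) * abs(dot(sketch(h, sigma, m, w_i), Sx) - dot(w_i, x))
        for w_i, v_i in zip(weights, v)
    )
```

The bound holds for every draw of the sketch; the randomness only enters when one asks how small the inner-product errors on the right-hand side are.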
B.3 Proof of Theorem 3.5
Proof.
The proof includes three steps. The first step is similar to the proof of Theorem 3.4. Using Theorem 3.2, we choose , with probability , we have : for fixed vectors and for all ,
Next, we consider the column space of , which we call , defined as follows
Let denote the net of , by Lemma B.6, . By definition, for each , there exists a vector such that
Let and be defined as follows,
Then we want to show the following claim. (The proof can be found in Appendix B.4)
Claim B.7.
(Recursive net). Let , then is an net to .
Now, we choose a sketching matrix with , with probability , we have : a vector and for all ,
Using triangle inequality, we can bound the error term,
where the first step follows from triangle inequality, the second step follows from the Property of , the third step follows from Claim B.8 and Claim B.9, and the last step follows from . ∎
We list the two claims here and defer their proofs to Appendix B.4.
Claim B.8.
Claim B.9.
B.4 One hidden layer
In this section, we provide the proofs of the claims used in the one-hidden-layer case.
Proof.
For each point , there must exist a point such that
Since and is the net of , there must exist a vector such that
According to the definition , there must exist a point such that
Now, let us consider the ,