Stationary Points of Shallow Neural Networks with Quadratic Activation Function
We consider the problem of learning shallow neural networks with quadratic activation function and planted weights W^* ∈ R^{m×d}, where m is the width of the hidden layer and d ≤ m is the data dimension. We establish that the landscape of the population risk L(W) admits an energy barrier separating rank-deficient solutions: if W ∈ R^{m×d} with rank(W) < d, then L(W) ≥ 2σ_min(W^*)^4, where σ_min(W^*) is the smallest singular value of W^*. We then establish that all full-rank stationary points of L(·) are necessarily global optima. These two results suggest a simple explanation for the success of gradient descent in training such networks when properly initialized: within the set of full-rank matrices there are no spurious stationary points, so gradient descent finds a global optimum. We then show that if the planted weight matrix W^* ∈ R^{m×d} has i.i.d. Gaussian entries and the network is sufficiently wide, that is, m > Cd^2 for a large constant C, then it is easy to construct a full-rank matrix W with population risk below the energy barrier, starting from which gradient descent is guaranteed to converge to a global optimum. Our final focus is on sample complexity: we identify a simple necessary and sufficient geometric condition on the training data under which any minimizer of the empirical loss necessarily has small generalization error. We show that as soon as n ≥ n^* = d(d+1)/2, random data enjoys this geometric condition almost surely, and in fact the generalization error is zero. At the same time, we show that if n < n^*, then for i.i.d. Gaussian data there always exists a matrix W with zero empirical risk but with population risk bounded away from zero by the same amount as for rank-deficient matrices, namely by 2σ_min(W^*)^4.
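As a rough numerical illustration of these claims (not code from the paper), the sketch below assumes the standard planted setup in which f_W(x) = Σ_i (w_i^T x)^2 = x^T W^T W x, the inputs are standard Gaussian, and the population risk is L(W) = E[(f_W(x) − f_{W^*}(x))^2]; all of these modeling choices are assumptions made for the sketch. It estimates L by Monte Carlo at a rank-deficient W to compare against the energy barrier 2σ_min(W^*)^4, and checks that n^* = d(d+1)/2 random samples already span the d(d+1)/2-dimensional space of symmetric quadratic features x x^T, which is the geometric condition behind the zero-generalization-error statement.

```python
import numpy as np

# Hedged numerical sketch (assumptions: f_W(x) = x^T W^T W x, standard Gaussian
# inputs, squared population risk L(W) = E[(f_W(x) - f_{W*}(x))^2]).
rng = np.random.default_rng(0)
m, d, n_mc = 8, 4, 200_000

W_star = rng.standard_normal((m, d))              # planted weights
sigma_min = np.linalg.svd(W_star, compute_uv=False).min()

def population_risk(W, n_samples=n_mc):
    """Monte Carlo estimate of E[(x^T A x)^2] with A = W^T W - W*^T W*."""
    X = rng.standard_normal((n_samples, d))
    A = W.T @ W - W_star.T @ W_star
    quad = np.einsum("ni,ij,nj->n", X, A, X)      # x^T A x for each sample
    return np.mean(quad ** 2)

# A rank-deficient W: zero out the smallest singular value of a random matrix.
U, s, Vt = np.linalg.svd(rng.standard_normal((m, d)), full_matrices=False)
s[-1] = 0.0
W_deficient = U @ np.diag(s) @ Vt                 # rank(W_deficient) = d - 1 < d

print("energy barrier 2*sigma_min(W*)^4  :", 2 * sigma_min**4)
print("estimated risk at rank-deficient W:", population_risk(W_deficient))
print("estimated risk at the planted W*  :", population_risk(W_star))  # ~0

# Sample-complexity side: the labels depend on x only through the symmetric
# matrix x x^T, which lives in a d(d+1)/2-dimensional space; n* Gaussian samples
# span it almost surely.
n_star = d * (d + 1) // 2
X = rng.standard_normal((n_star, d))
features = np.stack([np.outer(x, x)[np.triu_indices(d)] for x in X])
print("n* = d(d+1)/2                     :", n_star)
print("rank of the quadratic feature set :", np.linalg.matrix_rank(features))
```

Under these assumptions the Monte Carlo estimate at the rank-deficient W should land at or above the printed barrier 2σ_min(W^*)^4, the planted W^* gives (numerically) zero risk, and the feature matrix attains full rank n^* almost surely.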