The Multilinear Structure of ReLU Networks
We study the loss surface of neural networks equipped with a hinge loss criterion and ReLU or leaky ReLU nonlinearities. Any such network defines a piecewise multilinear form in parameter space, and as a consequence, optima of such networks generically occur in non-differentiable regions of parameter space. Any understanding of such networks must therefore carefully take their non-smooth nature into account. We show how to use techniques from nonsmooth analysis to study these non-differentiable loss surfaces. Our analysis focuses on three scenarios: (1) a deep linear network with hinge loss and arbitrary data, (2) a one-hidden-layer network with leaky ReLUs and linearly separable data, and (3) a one-hidden-layer network with ReLU nonlinearities and linearly separable data. We show that all local minima are global minima in the first two scenarios. A bifurcation occurs when passing from the second to the third scenario, in that ReLU networks do have non-optimal local minima. We provide a complete description of such sub-optimal solutions. We conclude by investigating the extent to which these phenomena do, or do not, persist when passing to the multiclass context.
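To make the multilinearity claim concrete, here is a minimal sketch of the kind of objective the abstract refers to, written in our own notation; the network $x \mapsto v^{\top}\sigma(Wx)$, the data $(x_i, y_i)$ with $y_i = \pm 1$, and the leak parameter $\alpha$ are illustrative assumptions matching scenario (2), not notation taken from the paper.

% Hinge loss of a one-hidden-layer leaky-ReLU network with weights W (hidden layer) and v (output layer).
% sigma acts coordinatewise; alpha = 0 recovers the plain ReLU of scenario (3).
\[
  \mathcal{L}(W, v)
    = \frac{1}{N} \sum_{i=1}^{N}
      \max\bigl\{ 0,\; 1 - y_i \, v^{\top} \sigma(W x_i) \bigr\},
  \qquad
  \sigma(z) = \max\{z, \alpha z\}, \quad 0 \le \alpha < 1.
\]

On any region of parameter space where the activation pattern of the hidden units and the active or inactive status of each hinge term are fixed, the expression above is affine in the products $v_j W_{jk}$ (one factor from each layer), which is the piecewise multilinear structure described in the abstract; the boundaries between such regions are the non-differentiable sets in which optima generically occur.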