
Decoupling Gating from Linearity
ReLU neural networks have been the focus of many recent theoretical works that try to explain their empirical success. Nonetheless, there is still a gap between current theoretical results and empirical observations, even in the case of shallow (one hidden-layer) networks. For example, in the task of memorizing a random sample of size m and dimension d, the best theoretical result requires the size of the network to be Ω̃(m^2/d), while empirically a network of size slightly larger than m/d is sufficient. To bridge this gap, we study a simplified model for ReLU networks. We observe that a ReLU neuron is a product of a linear function with a gate (the latter determines whether the neuron is active or not), where both share a jointly trained weight vector. In this spirit, we introduce the Gated Linear Unit (GaLU), which simply decouples the linearity from the gating by assigning a different vector to each role. We show that GaLU networks allow us to obtain optimization and generalization results that are much stronger than those available for ReLU networks. Specifically, we show a memorization result for networks of size Ω̃(m/d), as well as improved generalization bounds. Finally, we show that in some scenarios GaLU networks behave similarly to ReLU networks, which makes them a good choice of simplified model.
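The decoupling described in the abstract can be illustrated with a minimal sketch. It assumes the gate is the indicator of a positive inner product; the abstract does not state how ties at zero are handled or how the gating vectors are chosen or trained. The function names `relu_neuron`, `galu_neuron`, and `galu_network` are hypothetical and are not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def relu_neuron(w, x):
    """ReLU written as gate * linear part, both using the same vector w."""
    gate = 1.0 if dot(w, x) > 0 else 0.0  # assumption: strict threshold at 0
    return gate * dot(w, x)


def galu_neuron(w, u, x):
    """GaLU sketch: the gate uses u, the linear part uses a separate vector w."""
    gate = 1.0 if dot(u, x) > 0 else 0.0  # assumption: strict threshold at 0
    return gate * dot(w, x)


def galu_network(W, U, a, x):
    """One-hidden-layer GaLU network: weighted sum of GaLU neurons.

    W and U are lists of linear and gating vectors; a holds the output weights.
    """
    return sum(a_j * galu_neuron(w_j, u_j, x) for w_j, u_j, a_j in zip(W, U, a))


if __name__ == "__main__":
    x = [1.0, 1.0]
    w = [1.0, -2.0]
    u = [1.0, 0.0]
    print(relu_neuron(w, x))     # shared vector: gate and linear part agree
    print(galu_neuron(w, u, x))  # separate vectors: gate open, linear part negative
```

With a shared vector (`u` equal to `w`) the sketch reduces to an ordinary ReLU neuron. With distinct vectors the gate can be open while the linear part is negative, so a single GaLU neuron can output negative values, which a ReLU neuron cannot.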