Representation mitosis in wide neural networks
Deep neural networks (DNNs) defy the classical bias-variance tradeoff: adding parameters to a DNN that exactly interpolates its training data will typically improve its generalisation performance. Explaining the mechanism behind the benefit of such overparameterisation is an outstanding challenge for deep learning theory. Here, we study the last-layer representation of various deep architectures, such as WideResNets for image classification, and find evidence for an underlying mechanism that we call *representation mitosis*: if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information and differ from each other only by statistically independent noise. As in mitosis, the number of such groups, or “clones”, increases linearly with the width of the layer, but only once the width exceeds a critical value. We show that a key ingredient for activating mitosis is continuing the training process until the training error reaches zero. Finally, we show that in one of the learning tasks we considered, a wide model with several automatically developed clones performs significantly better than a deep ensemble of architectures whose last layer has the same size as the clones.
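The clone-splitting mechanism described in the abstract can be illustrated with a small simulation: build a wide representation out of replicated signal groups plus neuron-specific independent noise, then recover the clone structure by clustering neurons on their activation correlations. The layout, noise level, and correlation threshold below are illustrative assumptions for this toy sketch, not values or methods taken from the paper.

```python
import numpy as np

# Toy illustration of "representation mitosis": neurons split into clone groups
# that carry identical information and differ only by independent noise.
# All sizes and thresholds here are assumptions for the sketch.

rng = np.random.default_rng(0)

n_samples, n_clones, group_size = 500, 4, 8
width = n_clones * group_size  # a "wide" last layer tiled by clones

# Each clone replicates the same shared signal; each neuron then gets its own
# independent noise, as in the abstract's description.
signal = rng.normal(size=(n_samples, group_size))   # shared information
reps = np.tile(signal, (1, n_clones))               # replicate across clones
reps += 0.3 * rng.normal(size=(n_samples, width))   # independent per-neuron noise

# Neurons at corresponding positions in different clones are highly correlated.
corr = np.corrcoef(reps.T)                          # (width, width) correlation

# Greedy grouping: an unlabelled neuron starts a cluster and absorbs every
# unlabelled neuron strongly correlated with it (0.7 is an assumed threshold).
labels = -np.ones(width, dtype=int)
next_label = 0
for i in range(width):
    if labels[i] == -1:
        labels[i] = next_label
        labels[(corr[i] > 0.7) & (labels == -1)] = next_label
        next_label += 1

recovered_groups = next_label  # expect one cluster per shared-signal position
```

With this layout (`np.tile` places clones side by side), neuron `i` carries the shared signal at position `i % group_size`, so correlation clustering recovers `group_size` groups of `n_clones` neurons each.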