Training Over-parameterized Deep ResNet Is almost as Easy as Training a Two-layer Network

03/17/2019

∙

It has been proved that gradient descent converges linearly to the global minima for training deep neural network in the over-parameterized regime. However, according to allen2018convergence, the width of each layer should grow at least with the polynomial of the depth (the number of layers) for residual network (ResNet) in order to guarantee the linear convergence of gradient descent, which shows no obvious advantage over feedforward network. In this paper, we successfully remove the dependence of the width on the depth of the network for ResNet and reach a conclusion that training deep residual network can be as easy as training a two-layer network. This theoretically justifies the benefit of skip connection in terms of facilitating the convergence of gradient descent. Our experiments also justify that the width of ResNet to guarantee successful training is much smaller than that of deep feedforward neural network.

READ FULL TEXT

Training Over-parameterized Deep ResNet Is almost as Easy as Training a Two-layer Network

Sign in with Google

Consider DeepAI Pro