On Residual Networks Learning a Perturbation from Identity
This work tests and studies the hypothesis that residual networks learn a perturbation from identity. Residual networks are enormously important deep learning models, and many theories attempt to explain how they function; learning a perturbation from identity is one such theory. To test the hypothesis, the magnitudes of the perturbations are measured both in an absolute sense and in a scaled sense, each form having its own benefits and drawbacks. Additionally, a stopping rule is developed that can be used to decide the depth of a residual network: the depth is considered sufficient once the average perturbation magnitude falls below a given epsilon. This analysis gives a better understanding of how residual networks process and transform data from input to output. Parallel experiments are conducted on MNIST and CIFAR10 for residual networks of various sizes, with between 6 and 300 residual blocks. It is found that, in this setting, the average scaled perturbation magnitude is roughly inversely proportional to the number of residual blocks, and from this it follows that sufficiently large residual networks are learning a perturbation from identity.
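The measurement and the stopping rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: it assumes each block has the form x + F(x), takes the absolute perturbation magnitude to be the L2 norm of F(x) and the scaled magnitude to be that norm divided by the L2 norm of x, and uses hypothetical helper names (perturbation_magnitudes, choose_depth, toy_blocks). The toy blocks are constructed so that the scaled magnitude is 1/depth by design, so the numbers show how the quantities are computed and are not evidence for the paper's finding.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def mean(values):
    return sum(values) / len(values)


def perturbation_magnitudes(x, residual_fns):
    """Pass x through blocks of the assumed form x <- x + F(x).

    Returns two lists with one entry per block:
    absolute magnitudes ||F(x)|| and scaled magnitudes ||F(x)|| / ||x||.
    """
    absolute, scaled = [], []
    for f in residual_fns:
        delta = f(x)  # the perturbation added to the identity path
        a = l2_norm(delta)
        n = l2_norm(x)
        absolute.append(a)
        scaled.append(a / n if n > 0 else float("inf"))
        x = [xi + di for xi, di in zip(x, delta)]
    return absolute, scaled


def choose_depth(avg_scaled_by_depth, epsilon):
    """Assumed form of the stopping rule: return the smallest depth whose
    average scaled perturbation magnitude is below epsilon, else None."""
    for depth in sorted(avg_scaled_by_depth):
        if avg_scaled_by_depth[depth] < epsilon:
            return depth
    return None


def toy_blocks(depth, c=1.0):
    """Hypothetical stand-in for trained blocks: F(x) = (c / depth) * x."""
    return [lambda x, s=c / depth: [s * xi for xi in x] for _ in range(depth)]


# Average scaled magnitude for a few depths within the 6 to 300 range.
avg_scaled = {
    d: mean(perturbation_magnitudes([1.0, 2.0, 3.0], toy_blocks(d))[1])
    for d in (6, 50, 300)
}
selected = choose_depth(avg_scaled, epsilon=0.05)
```

With these toy blocks, selected is 50: the average scaled magnitude at depth 6 is about 0.167, which is above epsilon, while at depth 50 it is about 0.02. For a real network, residual_fns would be replaced by the trained residual branches and the averages taken over a dataset.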