Invariance reduces Variance: Understanding Data Augmentation in Deep Learning and Beyond
Many complex deep learning models have found success by exploiting symmetries in data. Convolutional neural networks (CNNs), for example, are ubiquitous in image classification because they exploit translation symmetry: image identity is roughly invariant to translations. Many other forms of symmetry, such as rotation, scale, and color shift, are commonly exploited through data augmentation, in which transformed images are added to the training set. However, a clear framework for understanding data augmentation is not available. One may even say that it is somewhat mysterious: how can we increase performance simply by adding transformed copies of our data to the training set? Is that information-theoretically possible? In this paper, we develop a theoretical framework to start to shed light on some of these problems. We explain data augmentation as averaging over the orbits of the group that keeps the data distribution invariant, and show that it leads to variance reduction. We study finite-sample and asymptotic empirical risk minimization (using results from stochastic convex optimization, Rademacher complexity, and asymptotic statistical theory). We work out as examples the variance reduction in exponential families, linear regression, and certain two-layer neural networks under shift invariance (using discrete Fourier analysis). We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM).
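The orbit-averaging idea can be illustrated with a small simulation. The sketch below is a toy example and is not taken from the paper: it assumes i.i.d. Gaussian coordinates (so the distribution is invariant under cyclic shifts), and the function names `plain_estimate`, `orbit_averaged_estimate`, and `compare_variance` are hypothetical. It estimates E[x_0] once from the first coordinate alone and once after averaging over all cyclic shifts of each sample. Both estimators are unbiased, but under this assumed model the orbit-averaged one should have roughly 1/d of the variance.

```python
import numpy as np


def plain_estimate(X):
    """Estimate E[f(x)] for f(x) = x[0] using the raw samples only."""
    return X[:, 0].mean()


def orbit_averaged_estimate(X):
    """Average f over the full cyclic-shift orbit of each sample, then over samples."""
    d = X.shape[1]
    # Shifting every row by k = 0..d-1 and reading position 0 visits
    # each coordinate of the row exactly once.
    orbit_values = np.stack(
        [np.roll(X, k, axis=1)[:, 0] for k in range(d)], axis=1
    )
    return orbit_values.mean(axis=1).mean()


def compare_variance(n=20, d=8, mu=1.0, trials=2000, seed=0):
    """Monte Carlo variance of both estimators under an assumed Gaussian model."""
    rng = np.random.default_rng(seed)
    plain, augmented = [], []
    for _ in range(trials):
        # Assumption: i.i.d. N(mu, 1) coordinates, hence shift-invariant.
        X = rng.normal(loc=mu, scale=1.0, size=(n, d))
        plain.append(plain_estimate(X))
        augmented.append(orbit_averaged_estimate(X))
    return np.var(plain), np.var(augmented)


if __name__ == "__main__":
    v_plain, v_aug = compare_variance()
    print(f"variance without augmentation: {v_plain:.5f}")
    print(f"variance with orbit averaging: {v_aug:.5f}")
```

In this toy setting the theoretical variances are 1/n and 1/(nd), so no new information is created: averaging over the orbit only removes noise that the symmetry says is irrelevant.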