Blessing of Class Diversity in Pre-training
This paper presents a new statistical analysis aiming to explain the recent superior performance of pre-training techniques in natural language processing (NLP). We prove that when the classes of the pre-training task (e.g., different words in the masked language model task) are sufficiently diverse, in the sense that the least singular value of the last linear layer in pre-training (denoted ν̃) is large, then pre-training can significantly improve the sample efficiency of downstream tasks. Specifically, we show that the transfer-learning excess risk enjoys an O(1/(ν̃√n)) rate, in contrast to the O(1/√m) rate of standard supervised learning. Here, n is the number of pre-training examples and m is the number of examples in the downstream task, and typically n ≫ m. Our proof relies on a vector-form Rademacher complexity chain rule for disassembling composite function classes and a modified self-concordance condition. These techniques may be of independent interest.
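To make the quantities in the abstract concrete, the following is a minimal sketch (not the paper's code) of how ν̃ could be measured and how the two rates compare: it treats a hypothetical pre-training head weight matrix W as the "last linear layer", takes its least singular value as ν̃, and plugs illustrative sample sizes n and m into the rates 1/(ν̃√n) and 1/√m. All sizes and the random matrix are assumptions for illustration only.

```python
import numpy as np

# Hypothetical last linear layer of the pre-training head:
# rows index pre-training classes (e.g., vocabulary words),
# columns index representation dimensions. Sizes are illustrative.
rng = np.random.default_rng(0)
num_classes, rep_dim = 5000, 256
W = rng.standard_normal((num_classes, rep_dim)) / np.sqrt(rep_dim)

# Class-diversity proxy: least singular value of W (ν̃ in the abstract).
nu_tilde = np.linalg.svd(W, compute_uv=False).min()

# Compare the two excess-risk rates (constants and log factors ignored).
n = 10_000_000   # number of pre-training examples (hypothetical)
m = 10_000       # number of downstream examples (hypothetical), n >> m
transfer_rate = 1.0 / (nu_tilde * np.sqrt(n))   # O(1/(ν̃ √n))
supervised_rate = 1.0 / np.sqrt(m)              # O(1/√m)

print(f"nu_tilde ≈ {nu_tilde:.3f}")
print(f"transfer rate ∝ {transfer_rate:.2e}")
print(f"supervised rate ∝ {supervised_rate:.2e}")
```

With n ≫ m and a reasonably large ν̃, the transfer-learning rate printed by this sketch is much smaller than the supervised rate, which is the sample-efficiency gap the abstract describes.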