Exploring the Limits of Large Scale Pre-training

10/05/2021
by Samira Abnar, et al.

Recent developments in large-scale machine learning suggest that by scaling up data, model size, and training time properly, improvements in pre-training will transfer favorably to most downstream tasks. In this work, we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance on downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with numbers of parameters ranging from ten million to ten billion, trained on the largest scales of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomenon and captures the nonlinear relationship between the performance of upstream and downstream tasks. Delving deeper to understand the reasons that give rise to these phenomena, we show that the saturation behavior we observe is closely related to the way that representations evolve through the layers of the models. We showcase an even more extreme scenario where performance on upstream and downstream tasks is at odds: that is, to obtain better downstream performance, we need to hurt upstream accuracy.
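To make the saturation claim concrete, below is a minimal sketch of fitting a saturating, nonlinear map from upstream to downstream accuracy. The functional form, the parameter names (d_max, a, k), and the accuracy pairs are illustrative assumptions, not the exact model or measurements reported in the paper.

import numpy as np
from scipy.optimize import curve_fit

def downstream_curve(u, d_max, a, k):
    """Hypothetical saturating form: downstream accuracy approaches d_max as upstream accuracy u -> 1."""
    return d_max - a * (1.0 - u) ** k

# Hypothetical (upstream, downstream) accuracy pairs from a sweep of models.
upstream = np.array([0.55, 0.62, 0.70, 0.76, 0.81, 0.85, 0.88, 0.90])
downstream = np.array([0.40, 0.52, 0.61, 0.67, 0.70, 0.72, 0.73, 0.735])

# Fit the saturating curve to the sweep.
params, _ = curve_fit(downstream_curve, upstream, downstream, p0=[0.75, 0.5, 1.0], maxfev=10000)
d_max, a, k = params
print(f"fitted saturation level d_max={d_max:.3f}, a={a:.3f}, k={k:.3f}")

# A fitted d_max well below 1 illustrates the saturation claim: further gains
# in upstream accuracy yield diminishing returns downstream.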


Related research

08/07/2022 - How Adversarial Robustness Transfers from Pre-training to Downstream Tasks
Given the rise of large-scale training regimes, adapting pre-trained mod...

03/10/2022 - Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability
Large-scale pre-training has been proven to be crucial for various compu...

06/21/2023 - Task-Robust Pre-Training for Worst-Case Downstream Adaptation
Pre-training has achieved remarkable success when transferred to downstr...

05/31/2023 - Diffused Redundancy in Pre-trained Representations
Representations learned by pre-training a neural network on a large data...

05/24/2023 - Delving Deeper into Data Scaling in Masked Image Modeling
Understanding whether self-supervised learning methods can scale with un...

12/01/2022 - Scaling Language-Image Pre-training via Masking
We present Fast Language-Image Pre-training (FLIP), a simple and more ef...

11/18/2022 - Improved Cross-view Completion Pre-training for Stereo Matching
Despite impressive performance for high-level downstream tasks, self-sup...
