The Lottery Tickets Hypothesis for Supervised and Self-supervised
Pre-training in Computer Vision Models
The computer vision world has been regaining enthusiasm for various
pre-trained models, including both classical ImageNet supervised pre-training
and recently emerged self-supervised pre-training such as SimCLR and MoCo.
Pre-trained weights often boost a wide range of downstream tasks, including
classification, detection, and segmentation. The latest studies suggest that
pre-training benefits from gigantic model capacity. We are hence curious and
ask: after pre-training, does a pre-trained model indeed have to stay large for
its universal downstream transferability?
In this paper, we examine supervised and self-supervised pre-trained
models through the lens of the lottery ticket hypothesis (LTH). LTH identifies
highly sparse matching subnetworks that can be trained in isolation from
(nearly) scratch, yet still reach the full models' performance. We extend the
scope of LTH and ask whether matching subnetworks still exist in pre-trained
models, i.e., subnetworks that enjoy the same downstream transfer performance.
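To make the question concrete, the sketch below illustrates one standard way
such matching subnetworks are located: iterative magnitude pruning (IMP) with
rewinding to the pre-trained weights. This is a minimal PyTorch sketch under
stated assumptions rather than the released pipeline; the commented-out
train_one_round helper, the choice to prune only convolutional weights, and
the ResNet-50 backbone are illustrative assumptions.

    # Hypothetical sketch of IMP with rewinding to the pre-trained weights.
    import copy
    import torch
    import torch.nn.utils.prune as prune
    import torchvision.models as models

    model = models.resnet50(weights="IMAGENET1K_V1")   # pre-trained weights (torchvision >= 0.13)
    theta_pre = copy.deepcopy(model.state_dict())      # kept as the rewinding point

    # Prune only convolutional weights (an assumption, common in LTH-style studies).
    to_prune = [(name, m) for name, m in model.named_modules()
                if isinstance(m, torch.nn.Conv2d)]

    for imp_round in range(10):                        # 0.8^10, i.e. ~11% of weights survive
        # train_one_round(model)                       # assumption: an ordinary training loop
        prune.global_unstructured([(m, "weight") for _, m in to_prune],
                                  pruning_method=prune.L1Unstructured,
                                  amount=0.2)          # drop 20% of the still-unpruned weights
        # Rewind surviving weights to their pre-trained values; the binary masks
        # (m.weight_mask) then define the candidate matching subnetwork.
        for name, m in to_prune:
            m.weight_orig.data.copy_(theta_pre[name + ".weight"])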
Our extensive experiments convey an overall positive message: from all
pre-trained weights obtained by ImageNet classification, SimCLR, and MoCo, we
are consistently able to locate such matching subnetworks at 59.04% to 96.87%
sparsity that transfer universally to multiple downstream tasks, with no
performance degradation compared to using the full pre-trained weights.
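As a complement, the following sketch shows how a found subnetwork can be
transferred: the binary mask is applied to the pre-trained backbone and kept
fixed, while only the surviving weights plus a freshly initialized task head
are fine-tuned on the downstream data. The masks dictionary, the 10-class
head, and downstream_loader are illustrative placeholders, not artifacts of
the released code.

    # Hypothetical sketch of downstream transfer with a fixed sparsity mask.
    import torch
    import torchvision.models as models

    model = models.resnet50(weights="IMAGENET1K_V1")       # full pre-trained backbone
    model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new head for a 10-class downstream task

    # `masks` maps parameter names to 0/1 tensors produced by IMP; all-ones
    # placeholders are used here so the snippet runs stand-alone.
    masks = {name: torch.ones_like(p) for name, p in model.named_parameters()
             if "conv" in name}

    def apply_masks():
        # Zero out pruned weights; called after every optimizer step so that
        # momentum and weight decay cannot revive them.
        with torch.no_grad():
            for name, p in model.named_parameters():
                if name in masks:
                    p.mul_(masks[name])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    apply_masks()
    # for images, labels in downstream_loader:             # assumption: downstream data loader
    #     loss = torch.nn.functional.cross_entropy(model(images), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    #     apply_masks()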
Further analyses reveal that subnetworks found from different pre-training
schemes tend to yield diverse mask structures and perturbation sensitivities.
We conclude that the core LTH observations remain generally relevant in the
pre-training paradigm of computer vision, but more delicate discussion is
needed in some cases. Code and pre-trained models will be made available at:
https://github.com/VITA-Group/CV_LTH_Pre-training.