Could Giant Pretrained Image Models Extract Universal Representations?

11/03/2022
by Yutong Lin, et al.

Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and in the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation, and video action recognition. From this empirical analysis, our work answers the questions of which pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and what effect larger model sizes have. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
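
A minimal sketch of the frozen setting described above: the pretrained backbone is kept fixed and only a lightweight task-specific head is trained. The backbone (a torchvision ResNet-50), the linear head, and the hyperparameters below are illustrative placeholders, not the paper's SwinV2-G model or its actual detection, segmentation, and video heads.

# Sketch of the frozen-backbone setting: freeze a pretrained image model
# and train only a small downstream head. All concrete choices here
# (ResNet-50, linear head, AdamW, learning rate) are illustrative only.
import torch
import torch.nn as nn
import torchvision

# Load a pretrained backbone and expose its pooled features.
weights = torchvision.models.ResNet50_Weights.DEFAULT
backbone = torchvision.models.resnet50(weights=weights)
backbone.fc = nn.Identity()  # forward now returns 2048-d features

# Freeze every backbone parameter: no gradients, no optimizer state.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()  # also keeps batch-norm statistics fixed

# Only this head adapts to the downstream task.
num_classes = 10  # size of the downstream label space (task-dependent)
head = nn.Linear(2048, num_classes)

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def training_step(images, labels):
    # Backbone features are computed without gradients; only the head learns.
    with torch.no_grad():
        feats = backbone(images)
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Because the backbone never receives gradients, its features can also be precomputed once and reused across tasks, which is part of the appeal of sharing a single frozen base network.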


