Foundation Model is Efficient Multimodal Multitask Model Selector

08/11/2023
by   Fanqing Meng, et al.

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on multi-modal tasks such as image recognition, referring, captioning, visual question answering, and text question answering, without fine-tuning them. A brute-force approach is to fine-tune all models on all target datasets, which incurs high computational costs. Although recent advanced approaches employ lightweight metrics to measure models' transferability, they often depend heavily on the prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform the diverse label formats of different downstream tasks, such as categories, texts, and bounding boxes, into a unified noisy label embedding. EMMS can estimate a model's transferability through a simple weighted linear regression, which can be efficiently solved by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method for the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves 9.0%, 26.3%, 20.1%, 54.8%, and 12.2% performance gains on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedups in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.
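The abstract describes the core computation as a weighted linear regression onto foundation-model label embeddings, solved by alternating minimization. The sketch below is a minimal illustration of that general idea in Python, under assumed names and shapes (emms_style_score, features, label_embeddings are hypothetical); it is not the authors' released implementation, which is available in the linked repository.

import numpy as np

def emms_style_score(features, label_embeddings, n_iters=10):
    """Hypothetical sketch of a weighted-linear-regression transferability score.

    features:         (N, D) array of features extracted by the candidate
                      pre-trained model on the target dataset.
    label_embeddings: list of K (N, C) arrays, each produced by a different
                      foundation model from the raw labels (categories, texts,
                      bounding boxes, ...).  Shapes and names are assumptions,
                      not the paper's actual interface.
    """
    F = features
    Z = np.stack(label_embeddings, axis=-1)          # (N, C, K)
    K = Z.shape[-1]
    t = np.full(K, 1.0 / K)                          # weights over the K label "oracles"

    for _ in range(n_iters):
        # Step 1: fix t, solve ordinary least squares for the regression weights w.
        y = Z @ t                                    # (N, C) weighted label embedding
        w, *_ = np.linalg.lstsq(F, y, rcond=None)    # (D, C)

        # Step 2: fix w, re-fit the oracle weights t by least squares, then apply
        # a crude non-negative normalization (not an exact simplex projection).
        pred = F @ w                                 # (N, C)
        A = Z.reshape(-1, K)                         # (N*C, K)
        t, *_ = np.linalg.lstsq(A, pred.reshape(-1), rcond=None)
        t = np.clip(t, 0, None)
        t = t / t.sum() if t.sum() > 0 else np.full(K, 1.0 / K)

    # Transferability score: negative regression error (higher is better).
    return -np.mean((F @ w - Z @ t) ** 2)

Under this sketch, each candidate pre-trained model would be scored on the target dataset and the models ranked by the returned value, with higher (less negative) scores indicating better expected transfer.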

