Cross-Modal Fine-Tuning: Align then Refine

02/11/2023
by Junhong Shen, et al.

Fine-tuning large-scale pretrained models has led to tremendous progress in well-studied modalities such as vision and NLP. However, similar gains have not been observed in many other modalities due to a lack of relevant pretrained models. In this work, we propose ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. ORCA adapts to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities. Through extensive experiments, we show that ORCA obtains state-of-the-art results on 3 benchmarks containing over 60 datasets from 12 modalities, outperforming a wide range of hand-designed, AutoML, general-purpose, and task-specific methods. We highlight the importance of data alignment via a series of ablation studies and demonstrate ORCA's utility in data-limited regimes.
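
The two-stage workflow described above can be pictured with a minimal sketch. This is not the authors' implementation: `Embedder`, `pretrained_body`, `head`, `target_loader`, and `source_feats` (a cached batch of pretraining-modality features) are assumed names, and a simple moment-matching term stands in for the paper's actual distribution-alignment objective.

```python
# Minimal align-then-refine sketch (illustrative only, not ORCA's code).
import torch
import torch.nn as nn


class Embedder(nn.Module):
    """Maps raw target-modality inputs into the pretrained model's feature space."""

    def __init__(self, in_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)

    def forward(self, x):
        return self.proj(x)


def moment_match(a, b):
    # Stand-in alignment loss: match the mean and std of the two
    # embedded feature distributions.
    return ((a.mean(0) - b.mean(0)) ** 2).sum() + ((a.std(0) - b.std(0)) ** 2).sum()


def align_then_refine(embedder, pretrained_body, head, target_loader,
                      source_feats, align_epochs=1, refine_epochs=1):
    # Stage 1 (align): train only the embedder so the embedded target
    # distribution resembles the pretraining-modality distribution.
    opt = torch.optim.Adam(embedder.parameters(), lr=1e-3)
    for _ in range(align_epochs):
        for x, _ in target_loader:
            loss = moment_match(embedder(x), source_feats)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2 (refine): fine-tune embedder + pretrained body + task head
    # on the target task with a standard supervised objective.
    params = (list(embedder.parameters())
              + list(pretrained_body.parameters())
              + list(head.parameters()))
    opt = torch.optim.Adam(params, lr=1e-4)
    criterion = nn.CrossEntropyLoss()
    for _ in range(refine_epochs):
        for x, y in target_loader:
            logits = head(pretrained_body(embedder(x)))
            loss = criterion(logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

The key design choice the abstract emphasizes is the first stage: aligning the embedded target distribution with the pretraining modality before any fine-tuning. The moment-matching term above is only an illustrative placeholder for the dataset-level alignment metric used in the full paper.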

