An Empirical Study of Multimodal Model Merging

04/28/2023
by   Yi-Lin Sung, et al.

Model merging (e.g., via interpolation or task arithmetic) fuses multiple models trained on different tasks to generate a multi-task solution. Previous studies have shown the technique to be successful when the models are trained on similar tasks and from the same initialization. In this paper, we extend this concept to a multimodal setup by merging transformers trained on different modalities. Furthermore, we conduct our study with a novel goal: merging the vision, language, and cross-modal transformers of a modality-specific architecture to create a parameter-efficient modality-agnostic architecture. Through comprehensive experiments, we systematically investigate the key factors that affect model performance after merging, including initialization, merging mechanisms, and model architectures. Our analysis leads to an effective training recipe for matching the performance of the modality-agnostic baseline (i.e., pre-trained from scratch) via model merging. Our code is available at https://github.com/ylsung/vl-merging
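
The two merging mechanisms named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions: interpolation computes alpha * a + (1 - alpha) * b for each parameter, and task arithmetic adds the scaled sum of each model's difference from a shared initialization back onto that initialization. The function names, the `Weights` type, and the toy values below are hypothetical and chosen for illustration; they are not taken from the authors' repository, and the paper's exact configuration may differ.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def interpolate(a: Weights, b: Weights, alpha: float = 0.5) -> Weights:
    """Linear interpolation: alpha * a + (1 - alpha) * b, per parameter."""
    return {
        name: [alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])]
        for name in a
    }


def task_arithmetic(init: Weights, models: List[Weights], lam: float = 1.0) -> Weights:
    """Task arithmetic: init + lam * sum_i (model_i - init), per parameter."""
    merged: Weights = {}
    for name, base in init.items():
        merged[name] = [
            w0 + lam * sum(m[name][i] - w0 for m in models)
            for i, w0 in enumerate(base)
        ]
    return merged


# Toy weights standing in for a shared initialization and two
# modality-specific models fine-tuned from it (values are made up).
init = {"layer.weight": [0.0, 1.0]}
vision = {"layer.weight": [2.0, 1.0]}
language = {"layer.weight": [0.0, 3.0]}

avg = interpolate(vision, language, alpha=0.5)
ta = task_arithmetic(init, [vision, language], lam=0.5)
```

With two models and `lam=0.5`, task arithmetic reduces algebraically to plain averaging, so `avg` and `ta` coincide here. Other values of `lam` make the result depend on the shared initialization, which is one reason initialization is a natural factor to study.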


