Composing Ensembles of Pre-trained Models via Iterative Consensus

10/20/2022
by Shuang Li, et al.

Large pre-trained models exhibit distinct and complementary capabilities that depend on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic images but fail to understand complex language descriptions. In this work, we propose a unified framework for composing ensembles of different pre-trained models, combining the strengths of each individual model to solve a variety of multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization: the generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g., improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that the consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general-purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation. Project page: https://energy-based-model.github.io/composing-pretrained-models.
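The generator-scorer loop described above can be sketched in a few lines of Python. The sketch below is illustrative only and is not the paper's implementation: `iterative_consensus`, `generate`, and `scorers` are hypothetical placeholders for pre-trained models (e.g., a language model as the generator and expert models as scorers), and the consensus rule is assumed here to be a simple average of scorer feedback.

```python
import random
from typing import Callable, List

Generator = Callable[[str], List[str]]   # proposes candidate solutions
Scorer = Callable[[str, str], float]     # rates a candidate for a prompt

def iterative_consensus(generate: Generator,
                        scorers: List[Scorer],
                        prompt: str,
                        rounds: int = 5) -> str:
    """Closed-loop refinement: the generator proposes candidates, an
    ensemble of scorers provides feedback, and the best proposal is fed
    back to the generator for the next round."""
    best, best_score = "", float("-inf")
    context = prompt
    for _ in range(rounds):
        for candidate in generate(context):
            # Consensus feedback: average the scores of all expert scorers.
            score = sum(s(prompt, candidate) for s in scorers) / len(scorers)
            if score > best_score:
                best, best_score = candidate, score
        # Close the loop by conditioning the generator on the current best.
        context = prompt + "\nBest proposal so far: " + best
    return best

# Toy demonstration: two "scorers" prefer values near 40 and 60, so the
# consensus settles near 50; a random "generator" plays the proposal role.
if __name__ == "__main__":
    generate = lambda ctx: [str(random.uniform(0.0, 100.0)) for _ in range(16)]
    scorers = [lambda p, c: -abs(float(c) - 40.0),
               lambda p, c: -abs(float(c) - 60.0)]
    print(iterative_consensus(generate, scorers, "propose a number"))
```

In the toy run, neither scorer's preferred value wins outright; averaging their feedback drives the loop toward a value that satisfies the ensemble as a whole, which is the intuition behind composing multiple expert scorers rather than relying on a single one.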


Related research:

06/30/2023  Zero-shot Nuclei Detection via Visual-Language Pre-trained Models
Large-scale visual-language pre-trained models (VLPM) have proven their ...

10/13/2022  MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting
Large pre-trained models have proved to be remarkable zero- and (prompt-...

05/23/2023  i-Code Studio: A Configurable and Composable Framework for Integrative AI
Artificial General Intelligence (AGI) requires comprehensive understandi...

05/27/2023  Modularized Zero-shot VQA with Pre-trained Models
Large-scale pre-trained models (PTMs) show great zero-shot capabilities....

09/02/2023  Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging
We present a method for zero-shot recommendation of multimodal non-stati...

07/08/2023  Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation
Significant progress has recently been made in creative applications of ...

06/26/2023  ViNT: A Foundation Model for Visual Navigation
General-purpose pre-trained models ("foundation models") have enabled pr...
