SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

07/15/2023
by Yi-Syuan Chen, et al.

Large pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in their inputs. Recent works bring this ability to the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods may inherit issues from the language domain, such as template sensitivity and hallucination. Moreover, the scale of these language models demands significant computation, making them resource-intensive to train and operate. This raises a question: "How can we enable in-context learning for general models without relying on large language models?" To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), which introduces a meta-model trained on self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks to make in-context predictions on the fly. Extensive experiments show that SINC outperforms gradient-based methods on various vision-language tasks under few-shot settings. Furthermore, the design of SINC lets us investigate the benefits of in-context learning across different tasks, and our analysis reveals the essential components for the emergence of in-context learning in the vision-language domain.
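To make the core idea concrete, the sketch below shows how an in-context prompt is typically assembled: demonstration input/output pairs are concatenated ahead of a query, and a frozen model is asked to predict the answer from context alone, with no gradient updates. This is a minimal, hypothetical illustration of the general prompting pattern, not the SINC framework itself; the function name and example strings are invented for illustration.

```python
# Minimal sketch of in-context prompt assembly (hypothetical helper,
# not SINC's actual pipeline): demonstrations are serialized into the
# prompt so a frozen model can infer the task from examples alone.

def build_icl_prompt(demonstrations, query):
    """Concatenate demonstration (input, output) pairs and a query
    into a single prompt string for a frozen predictor."""
    lines = []
    for inp, out in demonstrations:
        lines.append(f"Input: {inp}\nOutput: {out}")
    # The query is appended with an empty Output slot for the model to fill.
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

demos = [
    ("a photo of a cat on a sofa -> What animal is shown?", "cat"),
    ("a photo of a dog in a park -> What animal is shown?", "dog"),
]
prompt = build_icl_prompt(
    demos, "a photo of a horse in a field -> What animal is shown?"
)
print(prompt)
```

In a vision-language setting, the textual inputs above would be replaced by visual features interleaved with text, but the structure of the prompt (demonstrations followed by an open query) is the same.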

