
Task Bias in Vision-Language Models

by Sachit Menon, et al.

Incidental supervision from language has become a popular approach for learning generic visual representations that can be prompted to perform many recognition tasks in computer vision. We conduct an in-depth exploration of the CLIP model and show that its visual representation is often strongly biased towards solving some tasks more than others. Moreover, which task the representation will be biased towards is unpredictable, with little consistency across images. To resolve this task bias, we show how to learn a visual prompt that guides the representation towards features relevant to the task of interest. Our results show that these visual prompts can be independent of the input image and still effectively provide a conditioning mechanism that steers visual representations towards the desired task.
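The core idea of an input-independent visual prompt can be sketched in a few lines: a single learnable perturbation, shared across all images, is optimized so that the encoded image-plus-prompt aligns with a chosen task direction in embedding space. The sketch below is a toy illustration under loud assumptions, not the paper's actual method: the frozen "image encoder" is a random linear map, the "task embedding" is a random unit vector, and the objective is plain cosine-alignment gradient ascent rather than CLIP's contrastive loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's model): a frozen linear
# "image encoder" W and a fixed unit "task direction" embedding.
D_IMG, D_EMB = 64, 16
W = rng.normal(size=(D_EMB, D_IMG))     # frozen encoder weights
task = rng.normal(size=D_EMB)           # e.g. a "classify object" direction
task /= np.linalg.norm(task)

def encode(x, prompt):
    """Encode image plus input-independent visual prompt, L2-normalized."""
    z = W @ (x + prompt)
    return z / np.linalg.norm(z)

# Learn ONE prompt shared by every image: gradient ascent on the mean
# cosine similarity between encode(image + prompt) and the task direction.
images = rng.normal(size=(32, D_IMG))
prompt = np.zeros(D_IMG)
lr = 0.05
for _ in range(200):
    grad = np.zeros(D_IMG)
    for x in images:
        z = W @ (x + prompt)
        n = np.linalg.norm(z)
        zn = z / n
        # d/dp cosine(W(x+p), task) = W^T (task - (task.zn) zn) / ||z||
        grad += W.T @ ((task - np.dot(task, zn) * zn) / n)
    prompt += lr * grad / len(images)

before = np.mean([np.dot(task, encode(x, np.zeros(D_IMG))) for x in images])
after = np.mean([np.dot(task, encode(x, prompt)) for x in images])
print(f"mean task alignment: {before:.3f} -> {after:.3f}")
```

Because the prompt is optimized once and then added to every input, it conditions the representation on the task without being recomputed per image, which is the property the abstract highlights.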



