Grounding Visual Representations with Texts for Domain Generalization

07/21/2022
by Seonwoo Min, et al.

Reducing the representational discrepancy between source and target domains is key to maximizing model generalization. In this work, we advocate leveraging natural-language supervision for the domain generalization task. We introduce two modules that ground visual representations with texts containing typical human reasoning: (1) a Visual and Textual Joint Embedder and (2) a Textual Explanation Generator. The former learns an image-text joint embedding space in which high-level class-discriminative information can be grounded into the model. The latter leverages an explainable model to generate explanations justifying the rationale behind its decisions. To the best of our knowledge, this is the first work to leverage a vision-and-language cross-modality approach for the domain generalization task. Our experiments on the newly created CUB-DG benchmark dataset demonstrate that cross-modality supervision can successfully ground domain-invariant visual representations and improve model generalization. Furthermore, on the large-scale DomainBed benchmark, our method achieves state-of-the-art results and ranks 1st in average performance across five multi-domain datasets. The dataset and code are available at https://github.com/mswzeus/GVRT.
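To make the joint-embedding idea concrete, below is a minimal PyTorch sketch of an image-text joint embedder with a symmetric contrastive alignment loss. This is not the paper's exact architecture: the module names, dimension sizes, and the choice of a CLIP-style contrastive objective are illustrative assumptions; see the repository linked above for the authors' implementation.

```python
# Minimal sketch (assumptions, not the paper's exact method): visual and
# textual features are projected into a shared embedding space and aligned
# so that class-discriminative information from text can ground the visual
# representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedder(nn.Module):
    def __init__(self, visual_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)  # projects image-encoder features
        self.text_proj = nn.Linear(text_dim, embed_dim)      # projects text-encoder features

    def forward(self, visual_feats, text_feats):
        # L2-normalize both modalities so alignment reduces to cosine similarity
        v = F.normalize(self.visual_proj(visual_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def alignment_loss(v, t, temperature=0.07):
    # Symmetric contrastive loss over matched image-text pairs in a batch:
    # each image is pulled toward its own description and pushed from others.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

Under this kind of objective, the visual encoder is encouraged to keep only information that is also expressible in the class descriptions, which is one plausible mechanism for the domain invariance the abstract describes. The Textual Explanation Generator would sit on top of the visual features as a text decoder; it is omitted here for brevity.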

Related research

07/21/2023  Advancing Visual Grounding with Scene Knowledge: Benchmark and Method
Visual grounding (VG) aims to establish fine-grained alignment between v...

09/29/2022  Domain-Unified Prompt Representations for Source-Free Domain Generalization
Domain generalization (DG), aiming to make models work on unseen domains...

05/10/2023  Combo of Thinking and Observing for Outside-Knowledge VQA
Outside-knowledge visual question answering is a challenging task that r...

11/28/2022  G^3: Geolocation via Guidebook Grounding
We demonstrate how language can improve geolocation: the task of predict...

09/01/2022  Universal Multi-Modality Retrieval with One Unified Embedding Space
This paper presents Vision-Language Universal Search (VL-UnivSearch), wh...

05/12/2020  Cross-Modality Relevance for Reasoning on Language and Vision
This work deals with the challenge of learning and reasoning over langua...

05/01/2020  Probing Text Models for Common Ground with Visual Representations
Vision, as a central component of human perception, plays a fundamental ...
