Towards Language-guided Visual Recognition via Dynamic Convolutions

10/17/2021
by   Gen Luo, et al.

In this paper, we aim to establish a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated from natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in a single forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets across two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental results not only show the performance gains of LaConv over existing multi-modal modules, but also demonstrate the merits of LaConvNet as a unified network, including its compact design, high generalization ability, and excellent performance, e.g., +4.7
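To make the core idea concrete, here is a minimal, hypothetical sketch of a language-dependent convolution: a depthwise 3x3 kernel for each channel is generated by projecting a sentence-level language vector, so the same image features are filtered differently for different queries. The function name, shapes, and the single linear projection `proj` are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def laconv_sketch(feat, lang_vec, proj):
    """Hypothetical language-dependent depthwise convolution.

    feat:     (C, H, W) visual feature map
    lang_vec: (D,) sentence-level language embedding
    proj:     (C * 9, D) projection that generates one 3x3 kernel
              per channel from the language vector (an assumption
              standing in for the paper's kernel-generation module)
    """
    C, H, W = feat.shape
    # Generate the convolution kernels dynamically from the language input.
    kernels = (proj @ lang_vec).reshape(C, 3, 3)
    # Depthwise 3x3 convolution with zero padding (stride 1).
    padded = np.pad(feat, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(feat)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(padded[c, i:i + 3, j:j + 3] * kernels[c])
    return out
```

The key property this illustrates is that changing only the language vector changes the effective filters, and hence the extracted visual features, per multi-modal example.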

Related research

- 08/08/2018 — Question-Guided Hybrid Convolution for Visual Question Answering
  In this paper, we propose a novel Question-Guided Hybrid Convolution (QG...

- 04/21/2022 — Referring Expression Comprehension via Cross-Level Multi-Modal Fusion
  As an important and challenging problem in vision-language tasks, referr...

- 04/17/2022 — What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study
  Most of the existing work in one-stage referring expression comprehensio...

- 09/21/2022 — Exploring Modulated Detection Transformer as a Tool for Action Recognition in Videos
  During recent years transformers architectures have been growing in popu...

- 03/19/2020 — Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation
  Referring expression comprehension (REC) and segmentation (RES) are two ...

- 06/01/2023 — Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
  Pre-trained language models (PLMs) have played an increasing role in mul...

- 06/01/2023 — Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data
  The impressive advances and applications of large language and joint lan...
