X-Linear Attention Networks for Image Captioning

03/31/2020
by   Yingwei Pan, et al.

Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2nd-order interactions across multi-modal inputs. Nevertheless, there has been little evidence in support of building such interactions concurrently with an attention mechanism for image captioning. In this paper, we introduce a unified attention block – the X-Linear attention block – that fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning. Technically, the X-Linear attention block simultaneously exploits both spatial and channel-wise bilinear attention distributions to capture the 2nd-order interactions between the input single-modal or multi-modal features. Higher-order and even infinite-order feature interactions are readily modeled by stacking multiple X-Linear attention blocks and by equipping the block with the Exponential Linear Unit (ELU) in a parameter-free fashion, respectively. Furthermore, we present X-Linear Attention Networks (dubbed X-LAN), which integrate X-Linear attention block(s) into the image encoder and sentence decoder of an image captioning model to leverage higher-order intra- and inter-modal interactions. Experiments on the COCO benchmark demonstrate that our X-LAN obtains the best published CIDEr performance to date, 132.0. When a Transformer is further endowed with X-Linear attention blocks, CIDEr is boosted up to 132.8. Code: <https://github.com/Panda-Peter/image-captioning>.
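The core idea – bilinear (element-wise product) joint features of a query and region features, from which both a spatial attention over regions and a channel-wise attention over feature channels are derived – can be sketched as below. This is a rough, untrained NumPy illustration of the mechanism described in the abstract, not the paper's released code: the projection matrices are random stand-ins for learned weights, and all names and dimensions (`Wq`, `Wk`, `d_b`, etc.) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def x_linear_attention(q, K, V, d_b=32, seed=0):
    """Simplified single X-Linear attention step (illustrative sketch).

    q: (d,) query vector; K, V: (n, d) region keys/values.
    2nd-order joint features are formed by the element-wise product
    of the embedded query and keys; from these we derive a spatial
    attention over the n regions and a channel-wise gate.
    """
    rng = np.random.default_rng(seed)
    d = q.shape[0]
    # random projections stand in for learned weight matrices
    Wq = rng.standard_normal((d, d_b)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_b)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d_b)) / np.sqrt(d)
    w_s = rng.standard_normal(d_b) / np.sqrt(d_b)        # spatial head
    Wc = rng.standard_normal((d_b, d_b)) / np.sqrt(d_b)  # channel head

    # bilinear (2nd-order) joint features: element-wise products
    Bk = (K @ Wk) * (q @ Wq)          # (n, d_b)
    Bv = (V @ Wv) * (q @ Wq)          # (n, d_b)

    # spatial attention distribution over the n regions
    beta_s = softmax(Bk @ w_s)        # (n,)

    # channel-wise attention from the mean-pooled bilinear features
    b_bar = Bk.mean(axis=0)           # (d_b,)
    beta_c = sigmoid(b_bar @ Wc)      # (d_b,)

    # spatially attended value, then channel-wise gating
    v_hat = beta_s @ Bv               # (d_b,)
    return beta_c * v_hat             # (d_b,)
```

Stacking several such blocks raises the order of interaction (each block multiplies in another embedded query), while the abstract's parameter-free ELU trick pushes the effective order toward infinity via the exponential's Taylor expansion.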


Related research:

- Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering (08/04/2017)
- Multimodal Transformer with Multi-View Visual Representation for Image Captioning (05/20/2019)
- SCA-CNN: Spatial and Channel-wise Attention in Convolutional Networks for Image Captioning (11/17/2016)
- A Synchronized Multi-Modal Attention-Caption Dataset and Analysis (03/06/2019)
- An Order Preserving Bilinear Model for Person Detection in Multi-Modal Data (12/20/2017)
- Towards Joint Intent Detection and Slot Filling via Higher-order Attention (09/18/2021)
- Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network (12/13/2020)
