Geometry-Entangled Visual Semantic Transformer for Image Captioning

09/29/2021
by   Ling Cheng, et al.

Recent advances in image captioning have featured Visual-Semantic Fusion or Geometry-Aid attention refinement. However, fusion-based models are still criticized for lacking geometry information for inter- and intra-attention refinement, while models based on Geometry-Aid attention still suffer from the modality gap between visual and semantic information. In this paper, we introduce a novel Geometry-Entangled Visual Semantic Transformer (GEVST) network that realizes the complementary advantages of Visual-Semantic Fusion and Geometry-Aid attention refinement. Concretely, a Dense-Cap model first proposes dense captions with corresponding geometry information. Then, to enable GEVST to bridge the modality gap between visual and semantic information, we build four parallel transformer encoders for final caption generation: VV (Pure Visual), VS (Semantic fused to Visual), SV (Visual fused to Semantic), and SS (Pure Semantic). Both visual and semantic geometry features are used in the Fusion module as well as in the Self-Attention module for better attention measurement. To validate our model, we conduct extensive experiments on the MS-COCO dataset; the results show that GEVST obtains promising performance gains.
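The abstract describes the geometry-entangled attention only in prose. As an illustration, the minimal PyTorch sketch below shows one plausible way pairwise geometry features could bias self-attention over region or dense-caption features, which is roughly what the Self-Attention module above is said to do. The class name GeometrySelfAttention, the geo_proj projection, and the pairwise geometry input geo are hypothetical placeholders, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometrySelfAttention(nn.Module):
    """Hypothetical sketch: self-attention whose logits are biased by
    pairwise geometry features (exact GEVST formulation not given in the abstract)."""
    def __init__(self, d_model=512, n_heads=8, d_geo=64):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Maps pairwise geometry features (e.g. relative box offsets) to per-head biases.
        self.geo_proj = nn.Sequential(nn.Linear(d_geo, n_heads), nn.ReLU())

    def forward(self, x, geo):
        # x:   (B, N, d_model)   region or dense-caption features
        # geo: (B, N, N, d_geo)  pairwise geometry features between regions
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, N, self.n_heads, self.d_head).transpose(1, 2)

        logits = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5   # (B, H, N, N)
        geo_bias = self.geo_proj(geo).permute(0, 3, 1, 2)         # (B, H, N, N)
        # Log-space geometry gating, as in relation-aware attention variants.
        logits = logits + torch.log(geo_bias.clamp(min=1e-6))

        attn = F.softmax(logits, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.out(y)
```

In GEVST, blocks of this kind would presumably sit inside each of the four parallel encoders (VV, VS, SV, SS), with the geometry input drawn from visual region boxes, dense-caption boxes, or both, depending on the branch; the exact fusion and attention formulas appear in the full paper, not in this abstract.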


Related research

Towards Local Visual Modeling for Image Captioning (02/13/2023)
In this paper, we study the local visual modeling with grid features for...

Dual-Level Collaborative Transformer for Image Captioning (01/16/2021)
Descriptive region features extracted by object detection networks have ...

Geometry Attention Transformer with Position-aware LSTMs for Image Captioning (10/01/2021)
In recent years, transformer structures have been widely applied in imag...

Normalized and Geometry-Aware Self-Attention Network for Image Captioning (03/19/2020)
Self-attention (SA) network has shown profound value in image captioning...

Aligning Linguistic Words and Visual Semantic Units for Image Captioning (08/06/2019)
Image captioning attempts to generate a sentence composed of several lin...

RefineCap: Concept-Aware Refinement for Image Captioning (09/08/2021)
Automatically translating images to texts involves image scene understan...

VS-TransGRU: A Novel Transformer-GRU-based Framework Enhanced by Visual-Semantic Fusion for Egocentric Action Anticipation (07/08/2023)
Egocentric action anticipation is a challenging task that aims to make a...
