Progressive Tree-Structured Prototype Network for End-to-End Image Captioning

11/17/2022
by Pengpeng Zeng, et al.

Image captioning research is moving toward a fully end-to-end paradigm that leverages powerful pre-trained visual models and transformer-based generation architectures for more flexible model training and faster inference. State-of-the-art approaches simply extract isolated concepts or attributes to assist description generation. However, such approaches do not consider the hierarchical semantic structure of the textual domain, which leads to an unpredictable mapping between visual representations and concept words. To this end, we propose a novel Progressive Tree-Structured Prototype Network (dubbed PTSN), the first attempt to narrow the scope of predicted words to those with appropriate semantics by modeling hierarchical textual semantics. Specifically, we design a novel embedding method called the tree-structured prototype, which produces a set of hierarchical representative embeddings that capture the hierarchical semantic structure of the textual space. To incorporate these tree-structured prototypes into visual cognition, we also propose a progressive aggregation module that exploits semantic relationships within the image and the prototypes. Applying PTSN to an end-to-end captioning framework, extensive experiments on the MSCOCO dataset show that our method achieves new state-of-the-art performance, scoring 144.2 on the 'Karpathy' split and 141.4 on the online test server. Trained models and source code have been released at: https://github.com/NovaMind-Z/PTSN.
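As a rough illustration of what "a set of hierarchical representative embeddings" could look like, the sketch below builds a small prototype tree by recursively clustering vectors with plain k-means. This is an assumption made for illustration only: the abstract does not say how PTSN constructs its prototypes, and the function names (`kmeans`, `build_prototype_tree`), the `branching` parameter, and the toy 2-D points standing in for word embeddings are all hypothetical and are not taken from the released code.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def assign(points, centers) -> List[List[Vector]]:
    """Group each point under its nearest center."""
    groups: List[List[Vector]] = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        groups[nearest].append(list(p))
    return groups


def kmeans(points, k: int, iters: int = 20) -> Tuple[List[Vector], List[List[Vector]]]:
    """Lloyd's k-means with deterministic farthest-point initialization."""
    k = min(k, len(points))
    centers = [list(points[0])]
    while len(centers) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        groups = assign(points, centers)
        # Keep the old center if a cluster ends up empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, assign(points, centers)


def build_prototype_tree(points, branching: Sequence[int]) -> List[dict]:
    """Recursively cluster points; each node stores one prototype (a centroid).

    branching[0] is the number of clusters at this level; the remaining
    entries are used for the levels below. Illustrative only.
    """
    if not branching or not points:
        return []
    centers, groups = kmeans(points, branching[0])
    return [
        {"prototype": c, "children": build_prototype_tree(g, branching[1:])}
        for c, g in zip(centers, groups)
        if g
    ]


# Toy "embeddings": two coarse groups (x near 0 or 10), each with two sub-groups.
points = [[0, 0], [0, 0.2], [0, 1], [0, 1.2],
          [10, 0], [10, 0.2], [10, 1], [10, 1.2]]
tree = build_prototype_tree(points, branching=(2, 2))
```

In this toy setup the top level holds coarse prototypes and each node's `children` hold finer ones, so a lookup can descend from general to specific. That mirrors the coarse-to-fine idea described in the abstract, without claiming to reproduce the paper's actual procedure.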

research
12/09/2021

Injecting Semantic Concepts into End-to-End Image Captioning

Tremendous progress has been made in recent years in developing better i...
research
06/14/2022

Comprehending and Ordering Semantics for Image Captioning

Comprehending the rich semantics in an image and ordering them in lingui...
research
12/17/2019

M^2: Meshed-Memory Transformer for Image Captioning

Transformer-based architectures represent the state of the art in sequen...
research
10/09/2019

Semantic-aware Image Deblurring

Image deblurring has achieved exciting progress in recent years. However...
research
02/21/2022

CaMEL: Mean Teacher Learning for Image Captioning

Describing images in natural language is a fundamental step towards the ...
research
07/27/2019

Learnable Parameter Similarity

Most of the existing approaches focus on specific visual tasks while ign...
research
05/10/2023

Towards L-System Captioning for Tree Reconstruction

This work proposes a novel concept for tree and plant reconstruction by ...