ParaCNN: Visual Paragraph Generation via Adversarial Twin Contextual CNNs

04/21/2020
by Shiyang Yan, et al.

Image description generation plays an important role in many real-world applications, such as image retrieval, automatic navigation, and support for disabled people. A well-studied form of image description generation is image captioning, which usually produces a single short sentence and thus neglects many fine-grained properties, e.g., subtle objects and their relationships. In this paper, we study visual paragraph generation, which describes an image with a long paragraph containing rich details. Previous research often generates the paragraph via a hierarchical Recurrent Neural Network (RNN)-like model, which relies on complex memorising, forgetting and coupling mechanisms. Instead, we propose ParaCNN, a novel pure CNN model that generates visual paragraphs using a hierarchical CNN architecture with contextual information shared between the sentences of a paragraph. ParaCNN can generate paragraphs of arbitrary length, which makes it more applicable in many real-world settings. Furthermore, to enable ParaCNN to model paragraphs comprehensively, we also propose an adversarial twin-net training scheme. During training, we force the forward network's hidden features to be close to those of the backward network by using adversarial training. During testing, we use only the forward network, which already incorporates the knowledge of the backward network, to generate a paragraph. We conduct extensive experiments on the Stanford Visual Paragraph dataset and achieve state-of-the-art performance.
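The adversarial twin-net idea above can be sketched as a feature-matching game: a discriminator tries to tell the forward network's hidden features from the backward network's, while the forward network is trained to fool it. The following is a minimal NumPy sketch of that objective only, with randomly generated stand-in features and a linear discriminator; the feature shapes, the discriminator, and the function names are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stand-in hidden features from the forward and backward paragraph
# networks (a batch of 4 sentence states, 8-dim each). In the paper these
# would come from the two hierarchical CNNs; here they are random.
h_fwd = rng.normal(size=(4, 8))
h_bwd = rng.normal(size=(4, 8))

# A linear discriminator that scores whether a feature vector came from
# the backward network (label 1) or the forward network (label 0).
w = rng.normal(size=8)

def adversarial_losses(h_fwd, h_bwd, w):
    p_bwd = sigmoid(h_bwd @ w)  # D's belief that backward feats are "real"
    p_fwd = sigmoid(h_fwd @ w)  # D's belief that forward feats are "real"
    # Discriminator loss: classify backward features as 1, forward as 0.
    d_loss = -np.mean(np.log(p_bwd)) - np.mean(np.log(1.0 - p_fwd))
    # Forward-network (generator-side) loss: fool the discriminator,
    # pulling the forward hidden features toward the backward ones.
    g_loss = -np.mean(np.log(p_fwd))
    return d_loss, g_loss

d_loss, g_loss = adversarial_losses(h_fwd, h_bwd, w)
```

In an actual training loop these two losses would be minimised alternately with respect to the discriminator and the forward network; at test time the discriminator and backward network are discarded, matching the abstract's description of using only the forward network.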

Related research

06/06/2019 - Context-Aware Visual Policy Network for Fine-Grained Image Captioning
With the maturity of visual detection techniques, we are more ambitious ...

08/17/2017 - Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects
Image captioning often requires a large set of training image-sentence p...

08/21/2023 - Explore and Tell: Embodied Visual Captioning in 3D Environments
While current visual captioning models have achieved impressive performa...

05/21/2021 - Visual Representation of Negation: Real World Data Analysis on Comic Image Design
There has been a widely held view that visual representations (e.g., pho...

01/16/2021 - Dual-Level Collaborative Transformer for Image Captioning
Descriptive region features extracted by object detection networks have ...

09/02/2020 - Structure-Aware Generation Network for Recipe Generation from Images
Sharing food has become very popular with the development of social medi...

08/28/2019 - Image Captioning with Sparse Recurrent Neural Network
Recurrent Neural Network (RNN) has been deployed as the de facto model t...
