Deep Interactive Region Segmentation and Captioning

by Ali Sharifi Boroujerdi, et al.

With recent innovations in dense image captioning, every object in a scene can now be described with a caption, where objects are localized by bounding boxes. However, interpreting such output is not trivial because many of the bounding boxes overlap. Furthermore, current captioning frameworks do not let the user incorporate personal preferences to exclude regions of no interest. In this paper, we propose a novel hybrid deep learning architecture for interactive region segmentation and captioning in which the user can specify an arbitrary region of the image to be processed. To this end, a dedicated Fully Convolutional Network (FCN), named Lyncean FCN (LFCN), is trained on our special training data to isolate the User Intention Region (UIR) as the output of an efficient segmentation. In parallel, a dense image captioning model provides a wide variety of captions for that region. The UIR is then described with the caption of the best-matching bounding box. To the best of our knowledge, this is the first work to provide such a comprehensive output. Our experiments show the superiority of the proposed approach over state-of-the-art interactive segmentation methods on several well-known datasets. In addition, replacing the bounding boxes with the result of the interactive segmentation leads to a better understanding of the dense image captioning output, as well as improved object detection accuracy in terms of Intersection over Union (IoU).
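The abstract describes matching the UIR against the dense captioner's candidate boxes and keeping the caption of the best match. A minimal sketch of that matching step follows, assuming the match is scored by IoU (the metric the abstract itself uses) and that the UIR is summarized by its tight bounding box; all function names and the box format are illustrative, not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def best_match_caption(uir_box, candidates):
    """Pick the caption whose predicted box overlaps the UIR box most.

    candidates: list of (box, caption) pairs from a dense captioner.
    """
    return max(candidates, key=lambda c: iou(uir_box, c[0]))[1]

# Hypothetical usage: the second candidate box nearly coincides with the UIR.
uir = (10, 10, 50, 50)
cands = [((0, 0, 20, 20), "a wooden table"),
         ((12, 8, 48, 52), "a brown dog")]
print(best_match_caption(uir, cands))
```

In the full pipeline the `uir_box` would come from the LFCN segmentation mask and `candidates` from a dense captioning model such as DenseCap; this sketch only illustrates the selection rule.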
