Using Deep Object Features for Image Descriptions

02/25/2019
by Ashutosh Mishra, et al.

Inspired by recent advances in leveraging multiple modalities in machine translation, we introduce an encoder-decoder pipeline that uses (1) the specific objects detected within an image together with their object labels, and (2) a language model that decodes a joint embedding of the object features and the object labels. Our pipeline merges the previously detected objects and their labels, and then learns to generate the sequence of caption tokens describing the image. The decoder learns to produce descriptions from scratch by decoding the joint representation of the objects' visual features and their classes, conditioned on the encoder output. The central idea is to concentrate only on the specific objects in the image and their labels when generating a description, rather than on the visual features of the entire image. The model still requires further calibration of its parameters and settings to achieve better accuracy and performance.
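The joint-embedding step the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: all dimensions, parameter names, and the tanh merge are assumptions chosen for clarity. Each detected object contributes a visual feature vector and a class label; both are projected into a shared space and merged, and a decoder language model would then condition on the resulting per-object vectors to emit caption tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): per-object visual feature
# size, label-embedding size, joint-embedding size, and label vocabulary.
n_objects, feat_dim, vocab_size, emb_dim, joint_dim = 3, 2048, 1000, 300, 512

# (1) Visual features of the detected objects (e.g. region-based CNN outputs)
obj_feats = rng.standard_normal((n_objects, feat_dim))
# Integer class labels for the same objects (illustrative values)
obj_labels = np.array([17, 4, 251])

# Learned parameters, here random stand-ins
label_embedding = rng.standard_normal((vocab_size, emb_dim))  # label lookup table
W_feat = rng.standard_normal((feat_dim, joint_dim)) * 0.01    # visual projection
W_label = rng.standard_normal((emb_dim, joint_dim)) * 0.01    # label projection

# (2) Joint embedding: project both modalities into a shared space and merge.
# One joint vector per detected object; the decoder would condition on these
# (rather than on a whole-image feature) when generating the caption.
joint = np.tanh(obj_feats @ W_feat + label_embedding[obj_labels] @ W_label)
print(joint.shape)  # (3, 512)
```

In a trained system the projections and the label table would be learned jointly with the decoder; the point of the sketch is only the shape of the data flow, with object-level features and labels fused before decoding.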

research
11/10/2014

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Inspired by recent advances in multimodal learning and machine translati...
research
11/19/2018

Intention Oriented Image Captions with Guiding Objects

Although existing image caption models can produce promising results usi...
research
07/10/2023

Leveraging Multiple Descriptive Features for Robust Few-shot Image Learning

Modern image classification is based upon directly predicting model clas...
research
11/18/2014

From Captions to Visual Concepts and Back

This paper presents a novel approach for automatically generating image ...
research
09/16/2021

Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning

Automatic transcription of scene understanding in images and videos is a...
research
11/21/2019

Empirical Autopsy of Deep Video Captioning Frameworks

Contemporary deep learning based video captioning follows encoder-decode...
research
10/11/2016

From phonemes to images: levels of representation in a recurrent neural model of visually-grounded language learning

We present a model of visually-grounded language learning based on stack...
