Show, Translate and Tell

03/14/2019
by   Dheeraj Peri, et al.
0

Humans have an incredible ability to process and understand information from multiple sources such as images, video, text, and speech. Recent success of deep neural networks has enabled us to develop algorithms which give machines the ability to understand and interpret this information. There is a need to both broaden their applicability and develop methods which correlate visual information along with semantic content. We propose a unified model which jointly trains on images and captions, and learns to generate new captions given either an image or a caption query. We evaluate our model on three different tasks namely cross-modal retrieval, image captioning, and sentence paraphrasing. Our model gains insight into cross-modal vector embeddings, generalizes well on multiple tasks and is competitive to state of the art methods on retrieval.

READ FULL TEXT
research
03/03/2020

XGPT: Cross-modal Generative Pre-Training for Image Captioning

While many BERT-based cross-modal pre-trained models produce excellent r...
research
01/13/2021

Probabilistic Embeddings for Cross-Modal Retrieval

Cross-modal retrieval methods build a common representation space for sa...
research
10/22/2021

Exploiting Cross-Modal Prediction and Relation Consistency for Semi-Supervised Image Captioning

The task of image captioning aims to generate captions directly from ima...
research
04/20/2022

Uncertainty-based Cross-Modal Retrieval with Probabilistic Representations

Probabilistic embeddings have proven useful for capturing polysemous wor...
research
11/19/2021

DVCFlow: Modeling Information Flow Towards Human-like Video Captioning

Dense video captioning (DVC) aims to generate multi-sentence description...
research
12/18/2022

Efficient Image Captioning for Edge Devices

Recent years have witnessed the rapid progress of image captioning. Howe...
research
10/04/2018

Image-to-Video Person Re-Identification by Reusing Cross-modal Embeddings

Image-to-video person re-identification identifies a target person by a ...

Please sign up or login with your details

Forgot password? Click here to reset