Joint Learning of Distributed Representations for Images and Texts

04/13/2015
by Xiaodong He, et al.

This technical report provides additional details of the deep multimodal similarity model (DMSM) proposed in Fang et al. (2015, arXiv:1411.4952). The model is trained by maximizing the global semantic similarity between images and their natural-language captions on the public Microsoft COCO dataset, which consists of a large set of images paired with corresponding captions. The learned representations attempt to capture the combination of various visual concepts and cues.
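To make the training objective concrete, the sketch below shows a DSSM/DMSM-style similarity loss: images and captions are mapped into a shared space, similarity is measured by cosine distance, and the model maximizes the smoothed posterior of the matching caption against in-batch negatives. This is only a minimal illustration under stated assumptions; the encoder architectures (`image_net`, `text_net`), the input feature dimensions, the smoothing factor `gamma`, and the use of in-batch negatives are stand-ins and are not taken from the report itself.

```python
# Minimal sketch of a DMSM-style training objective (assumption: encoders,
# feature dimensions, gamma, and in-batch negatives are illustrative choices,
# not the exact configuration described in the report).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityModel(nn.Module):
    def __init__(self, image_dim=4096, text_dim=10000, hidden=300):
        super().__init__()
        # Hypothetical encoders mapping image features and caption features
        # into a shared low-dimensional semantic space.
        self.image_net = nn.Sequential(nn.Linear(image_dim, hidden), nn.Tanh())
        self.text_net = nn.Sequential(nn.Linear(text_dim, hidden), nn.Tanh())

    def forward(self, image_feats, text_feats):
        # L2-normalize so that dot products equal cosine similarities.
        img = F.normalize(self.image_net(image_feats), dim=-1)
        txt = F.normalize(self.text_net(text_feats), dim=-1)
        return img, txt

def dmsm_style_loss(img, txt, gamma=10.0):
    """Maximize the smoothed posterior of the matching caption given each image,
    with the other captions in the batch serving as negatives."""
    sim = img @ txt.t()                  # [batch, batch] cosine similarity matrix
    targets = torch.arange(sim.size(0))  # caption i is the match for image i
    return F.cross_entropy(gamma * sim, targets)

# Usage with random stand-in features:
model = SimilarityModel()
images = torch.randn(8, 4096)     # e.g. CNN activations for 8 images
captions = torch.randn(8, 10000)  # e.g. bag-of-words / letter-trigram caption vectors
img_vec, txt_vec = model(images, captions)
loss = dmsm_style_loss(img_vec, txt_vec)
loss.backward()
```

In this formulation the global image-caption similarity is learned end to end: gradients flow through both encoders, pulling matching image-caption pairs together and pushing mismatched pairs apart in the shared space.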

