From Captions to Visual Concepts and Back

11/18/2014
by Hao Fang et al.

This paper presents a novel approach for automatically generating image descriptions: visual detectors, language models, and multimodal similarity models learnt directly from a dataset of image captions. We use multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. Our system is state-of-the-art on the official Microsoft COCO benchmark, producing a BLEU-4 score of 29.1%. When human judges compare the system captions to ones written by other people on our held-out test set, the system captions have equal or better quality 34% of the time.
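The multiple-instance-learning step described above treats an image as a bag of regions: a word detector only needs some region to fire for the word to be predicted for the whole image. A standard way to pool per-region probabilities into an image-level probability is noisy-OR, which this line of work uses; the sketch below is illustrative, and the function name is our own.

```python
def noisy_or(region_probs):
    """Noisy-OR MIL pooling: probability that a word is present
    somewhere in the image, given per-region detection probabilities.
    P(word | image) = 1 - prod_r (1 - P(word | region r))."""
    prod = 1.0
    for p in region_probs:
        prod *= 1.0 - p
    return 1.0 - prod

# Three regions with weak individual evidence still yield a
# confident image-level prediction:
score = noisy_or([0.2, 0.5, 0.1])
print(round(score, 2))  # 0.64
```

Because the pooled score grows with any strong regional response, gradients during training flow mainly through the regions most responsible for the word, which is what lets detectors for nouns, verbs, and adjectives be learnt from caption-level labels alone.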


Related research

- 06/20/2023 · Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion
- 04/13/2015 · Joint Learning of Distributed Representations for Images and Texts
- 05/18/2018 · SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text
- 11/24/2021 · Universal Captioner: Inducing Content-Style Separation in Vision-and-Language Model Training
- 08/12/2018 · Multimodal Differential Network for Visual Question Generation
- 02/25/2019 · Using Deep Object Features for Image Descriptions
- 05/03/2017 · FOIL it! Find One mismatch between Image and Language caption
