Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

11/10/2014
by   Ryan Kiros, et al.
0

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence to its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.

READ FULL TEXT

page 2

page 12

research
02/25/2019

Using Deep Object Features for Image Descriptions

Inspired by recent advances in leveraging multiple modalities in machine...
research
09/03/2023

Representations Matter: Embedding Modes of Large Language Models using Dynamic Mode Decomposition

Existing large language models (LLMs) are known for generating "hallucin...
research
08/22/2023

SONAR: Sentence-Level Multimodal and Language-Agnostic Representations

We introduce SONAR, a new multilingual and multimodal fixed-size sentenc...
research
08/25/2019

Towards Unsupervised Image Captioning with Shared Multimodal Embeddings

Understanding images without explicit supervision has become an importan...
research
07/10/2019

Can Unconditional Language Models Recover Arbitrary Sentences?

Neural network-based generative language models like ELMo and BERT can w...
research
06/03/2017

Order embeddings and character-level convolutions for multimodal alignment

With the novel and fast advances in the area of deep neural networks, se...
research
04/30/2020

memeBot: Towards Automatic Image Meme Generation

Image memes have become a widespread tool used by people for interacting...

Please sign up or login with your details

Forgot password? Click here to reset