IMAGINATOR: Pre-Trained Image+Text Joint Embeddings using Word-Level Grounding of Images

05/12/2023
by Varuna Krishna, et al.

Word embeddings, i.e., semantically meaningful vector representations of words, are largely shaped by the distributional hypothesis "You shall know a word by the company it keeps" (Harris, 1954), whereas modern prediction-based neural network embeddings depend heavily on design choices and hyperparameter optimization. Word embeddings such as Word2Vec and GloVe capture contextuality and real-world analogies well, but contemporary convolution-based image embeddings such as VGGNet and AlexNet do not capture such contextual knowledge: the popular king-queen analogy does not hold for most commonly used vision embeddings. In this paper, we introduce a pre-trained joint embedding (JE), named IMAGINATOR, trained at the object level on 21K distinct image objects drawn from 1M image+text pairs. A JE encodes multimodal data into a vector space in which the text modality serves as the grounding key to which the complementary modality (in this case, the image) is anchored. IMAGINATOR encapsulates three individual representations: (i) object-object co-location, (ii) word-object co-location, and (iii) word-object correlation. These capture complementary aspects of the two modalities and are combined to obtain the final JEs. The resulting JEs are evaluated intrinsically to assess how well they capture contextuality and real-world analogies. We also evaluate the pre-trained IMAGINATOR JEs on three downstream tasks: (i) image captioning, (ii) Image2Tweet, and (iii) text-based image retrieval. IMAGINATOR sets a new standard by outperforming the current SoTA on all three tasks. IMAGINATOR will be made publicly available; the code is available at https://github.com/varunakk/IMAGINATOR.
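To make the idea of co-location-based joint embeddings concrete, the sketch below builds a toy word-object co-location matrix from a handful of image+text pairs and factorizes it into dense vectors. This is only a minimal illustration under our own assumptions (toy captions, made-up object labels, and a PPMI+SVD factorization), not the actual IMAGINATOR training recipe described in the paper.

```python
# Illustrative sketch (not the paper's pipeline): count-based word-object
# co-location turned into dense joint embeddings. All data and the
# PPMI + truncated-SVD recipe are assumptions made for demonstration only.
import numpy as np

# Toy corpus: each sample pairs caption tokens with detected object labels.
samples = [
    {"caption": "a man riding a horse", "objects": ["person", "horse"]},
    {"caption": "a woman holding a dog", "objects": ["person", "dog"]},
    {"caption": "a dog chasing a horse", "objects": ["dog", "horse"]},
]

words = sorted({w for s in samples for w in s["caption"].split()})
objects = sorted({o for s in samples for o in s["objects"]})
w_idx = {w: i for i, w in enumerate(words)}
o_idx = {o: i for i, o in enumerate(objects)}

# Word-object co-location counts: a word and an object co-occur whenever
# they appear in the same image+text pair.
counts = np.zeros((len(words), len(objects)))
for s in samples:
    for w in s["caption"].split():
        for o in s["objects"]:
            counts[w_idx[w], o_idx[o]] += 1

# PPMI weighting followed by truncated SVD: the classic count-based recipe
# for turning a co-occurrence matrix into dense vectors.
total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_o = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts / total) / (p_w * p_o))
ppmi = np.nan_to_num(np.maximum(pmi, 0.0))

U, S, Vt = np.linalg.svd(ppmi, full_matrices=False)
k = 2                              # embedding dimensionality for this toy example
word_vecs = U[:, :k] * S[:k]       # word vectors grounded in objects
object_vecs = Vt[:k].T * S[:k]     # object vectors grounded in words

# Both modalities now live in the same space, with text as the grounding key.
print(word_vecs[w_idx["horse"]], object_vecs[o_idx["horse"]])
```

In this toy setup the word "horse" and the detected object "horse" end up in the same vector space, which is the kind of word-level grounding the abstract describes; the paper's actual method additionally combines object-object co-location and word-object correlation representations.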


