Unsupervised Discovery of Multimodal Links in Multi-Image, Multi-Sentence Documents

04/16/2019
by Jack Hessel et al.

Images and text co-occur everywhere on the web, but explicit links between images and sentences (or other intra-document textual units) are often not annotated by users. We present algorithms that successfully discover image-sentence relationships without relying on any explicit multimodal annotation. We explore several variants of our approach on seven datasets of varying difficulty, ranging from images that were captioned post hoc by crowd-workers to naturally-occurring user-generated multimodal documents, wherein correspondences between illustrations and individual textual units may not be one-to-one. We find that a structured training objective based on identifying whether sets of images and sentences co-occur in documents can be sufficient to predict links between specific sentences and specific images within the same document at test time.
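The abstract describes the training signal only at a high level. As an illustration, here is a minimal PyTorch-style sketch of one way a document-level co-occurrence objective can also yield within-document links at test time. The mean-over-max pooling, the hinge margin, and all names (doc_score, hinge_loss, predict_links) are assumptions made for illustration, not necessarily the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def doc_score(img_emb, sent_emb):
    """Document-level score: match each image to its best sentence,
    then average over images (mean-over-max pooling).
    img_emb:  (n_images, d) L2-normalized image embeddings
    sent_emb: (n_sents, d)  L2-normalized sentence embeddings
    """
    sim = img_emb @ sent_emb.T           # (n_images, n_sents) cosine similarities
    return sim.max(dim=1).values.mean()  # pool to one score per document

def hinge_loss(img_emb, sent_emb, neg_sent_emb, margin=0.2):
    """Push a true document's score above a mismatched pairing
    (sentences sampled from some other document) by at least `margin`.
    Note: only document-level co-occurrence is supervised here."""
    pos = doc_score(img_emb, sent_emb)
    neg = doc_score(img_emb, neg_sent_emb)
    return F.relu(margin - pos + neg)

def predict_links(img_emb, sent_emb):
    """At test time, the same similarity matrix exposes fine-grained
    links: assign each image its argmax sentence within the document."""
    sim = img_emb @ sent_emb.T
    return sim.argmax(dim=1)  # best-sentence index per image
```

The point the abstract makes is captured by the split between the two stages: training only asks whether a set of images and a set of sentences belong to the same document, while the learned similarity matrix is what supplies the specific image-sentence links.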

Related research

03/18/2021
Learning Multimodal Affinities for Textual Editing in Images
Nowadays, as cameras are rapidly adopted in our daily routine, images of...

09/11/2018
Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset
In this paper we introduce vSTS, a new dataset for measuring textual sim...

06/09/2023
DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents
Vision-language pretraining models have achieved great success in suppor...

06/10/2019
Multimodal Logical Inference System for Visual-Textual Entailment
A large amount of research about multimodal inference across text and vi...

03/25/2022
Multimodal Pre-training Based on Graph Attention Network for Document Understanding
Document intelligence as a relatively new research topic supports many b...

04/18/2018
Quantifying the visual concreteness of words and topics in multimodal datasets
Multimodal machine learning algorithms aim to learn visual-textual corre...

10/30/2020
Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents
Images can give us insights into the contextual meanings of words, but c...
