NewsStories: Illustrating articles with visual summaries

07/26/2022
by Reuben Tan, et al.

Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume a one-to-one correspondence between images and their (short) captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visual summaries. Thus, we explore a novel setting where the goal is to learn a self-supervised visual-language representation that is robust to varying text length and number of images. In addition, unlike prior work, which assumed captions have a literal relation to the image, we assume images have only a loose, illustrative correspondence with the text. To explore this problem, we introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos. We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images. Finally, we introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10

research
11/29/2021

Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Recent text-to-image matching models apply contrastive learning to large...
research
01/05/2023

ANNA: Abstractive Text-to-Image Synthesis with Filtered News Captions

Advancements in Text-to-Image synthesis over recent years have focused m...
research
11/14/2022

ContextCLIP: Contextual Alignment of Image-Text pairs on CLIP visual representations

State-of-the-art empirical work has shown that visual representations le...
research
03/23/2016

BreakingNews: Article Annotation by Image and Text Processing

Building upon recent Deep Neural Network architectures, current approach...
research
11/08/2021

Cascaded Multilingual Audio-Visual Learning from Videos

In this paper, we explore self-supervised audio-visual models that learn...
research
01/11/2023

EXIF as Language: Learning Cross-Modal Associations Between Images and Camera Metadata

We learn a visual representation that captures information about the cam...
research
01/15/2021

Catching Out-of-Context Misinformation with Self-supervised Learning

Despite the recent attention to DeepFakes and other forms of image manip...
