CLIP-Event: Connecting Text and Images with Event Structures

01/13/2022
by Manling Li, et al.

Vision-language (V+L) pretraining models have achieved great success in supporting multimedia applications by understanding the alignments between images and text. While existing vision-language pretraining models primarily focus on understanding objects in images or entities in text, they often ignore the alignment at the level of events and their argument structures. In this work, we propose a contrastive learning framework to enforce vision-language pretraining models to comprehend events and associated argument (participant) roles. To achieve this, we take advantage of text information extraction technologies to obtain event structural knowledge, and utilize multiple prompt functions to contrast difficult negative descriptions by manipulating event structures. We also design an event graph alignment loss based on optimal transport to capture event argument structures. In addition, we collect a large event-rich dataset (106,875 images) for pretraining, which provides a more challenging image retrieval benchmark to assess the understanding of complicated lengthy sentences. Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction on Multimedia Event Extraction, achieving more than 5% absolute F-score gain in event extraction, as well as significant improvements on a variety of downstream tasks under zero-shot settings.
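
The idea of contrasting "difficult negative descriptions by manipulating event structures" can be illustrated with a small sketch. The code below is not the authors' prompt functions: the template wording, the function names (`describe`, `role_swapped_negatives`, `type_swapped_negatives`), and the example event are all assumptions made for illustration. It only shows how a structured event (a type plus role-to-participant arguments) could be rendered as a positive description, and how negatives could be derived by swapping the event type or reassigning participants to the wrong roles.

```python
from typing import Dict, List


def describe(event_type: str, arguments: Dict[str, str]) -> str:
    """Render an event as a template sentence (hypothetical template)."""
    parts = [f"{role}: {entity}" for role, entity in arguments.items()]
    return f"An image of {event_type}. " + "; ".join(parts) + "."


def role_swapped_negatives(event_type: str, arguments: Dict[str, str]) -> List[str]:
    """Build negatives by rotating participants across the argument roles."""
    roles = list(arguments.keys())
    entities = list(arguments.values())
    negatives = []
    for shift in range(1, len(entities)):
        rotated = entities[shift:] + entities[:shift]
        candidate = dict(zip(roles, rotated))
        # Skip rotations that reproduce the original assignment
        # (possible when the same participant fills several roles).
        if candidate != arguments:
            negatives.append(describe(event_type, candidate))
    return negatives


def type_swapped_negatives(
    event_type: str, arguments: Dict[str, str], other_types: List[str]
) -> List[str]:
    """Build negatives by replacing the event type and keeping the arguments."""
    return [describe(t, arguments) for t in other_types if t != event_type]


if __name__ == "__main__":
    # Hypothetical event, as might come from a text information extraction system.
    args = {"agent": "police", "detainee": "protester"}
    print(describe("arrest", args))
    print(role_swapped_negatives("arrest", args))
    print(type_swapped_negatives("arrest", args, ["attack", "arrest"]))
```

In a contrastive setup, the positive description would be pulled toward the image embedding while the role-swapped and type-swapped descriptions would be pushed away. The negatives differ from the positive only in event structure, which is what makes them hard.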


research
06/09/2023

DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents

Vision-language pretraining models have achieved great success in suppor...
research
01/06/2023

In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

Understanding event relationships in videos requires a model to understa...
research
10/23/2022

Code4Struct: Code Generation for Few-Shot Structured Prediction from Natural Language

Large Language Model (LLM) trained on the mixture of text and code has d...
research
04/06/2022

Improving Zero-Shot Event Extraction via Sentence Simplification

The success of sites such as ACLED and Our World in Data have demonstrat...
research
01/28/2023

ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts

Recent success of large-scale Contrastive Language-Image Pre-training (C...
research
05/05/2020

Cross-media Structured Common Space for Multimedia Event Extraction

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims ...
research
09/22/2021

Salience-Aware Event Chain Modeling for Narrative Understanding

Storytelling, whether via fables, news reports, documentaries, or memoir...
