Joint Multimedia Event Extraction from Video and Article

09/27/2021
by Brian Chen, et al.

Visual and textual modalities contribute complementary information about events described in multimedia documents. Videos capture rich dynamics and show in detail how events unfold, while text describes higher-level, more abstract concepts. However, existing event extraction methods either do not handle video or target video alone and ignore other modalities. In contrast, we propose the first approach to jointly extract events from videos and text articles. We introduce the new task of Video MultiMedia Event Extraction (Video M2E2) and propose two novel components to build the first system for this task. First, we propose the first self-supervised multimodal event coreference model, which determines coreference between video events and text events without any manually annotated pairs. Second, we introduce the first multimodal transformer that extracts structured event information jointly from videos and text documents. We also construct and will publicly release a new benchmark of 860 video-article pairs with extensive annotations for evaluating methods on this task. Our experimental results demonstrate the effectiveness of the proposed method on this benchmark: we achieve a gain of 6.0 on multimodal event coreference resolution and multimedia event extraction.
