Traditional event extraction methods target a single modality, such as text wadden2019entity, images yatskar2016situation or videos ye2015eventnet; caba2015activitynet; soomro2012ucf101. However, the practice of contemporary journalism (Mitchell1998) distributes news via multimedia. By randomly sampling 100 multimedia news articles from the Voice of America (VOA), we find that 33% of images in the articles contain visual objects that serve as event arguments and are not mentioned in the text. Taking Figure 1 as an example, we can extract the Agent and Person arguments of the Movement.Transport event from text, but can extract the Vehicle argument only from the image. Nevertheless, event extraction is independently studied in Computer Vision (CV) and Natural Language Processing (NLP), with major differences in task definition, data domain, methodology, and terminology. Motivated by the complementary and holistic nature of multimedia data, we propose MultiMedia Event Extraction (M2E2), a new task that aims to jointly extract events and arguments from multiple modalities. We construct the first benchmark and evaluation dataset for this task, which consists of 245 fully annotated news articles.
We propose the first method, Weakly Aligned Structured Embedding (WASE), for extracting events and arguments from multiple modalities. Complex event structures have not been covered by existing multimedia representation methods Wu2019UniVSERV; faghri2018vse++; karpathy2015deep, so we propose to learn a structured multimedia embedding space. More specifically, given a multimedia document, we represent each image or sentence as a graph, where each node represents an event or entity and each edge represents an argument role. The node and edge embeddings are represented in a multimedia common semantic space, as they are trained to resolve event co-reference across modalities and to match images with relevant sentences. This enables us to jointly classify events and argument roles from both modalities. A major challenge is the lack of multimedia event argument annotations, which are costly to obtain due to the annotation complexity. Therefore, we propose a weakly supervised framework, which takes advantage of annotated uni-modal corpora to separately learn visual and textual event extraction, and uses an image-caption dataset to align the modalities.
We evaluate WASE on the new task of M2E2. Compared to the state-of-the-art uni-modal methods and multimedia flat representations, our method achieves significantly better performance on both event extraction and argument role labeling in all settings. Moreover, it extracts 21.4% more event mentions than text-only baselines. The training and evaluation are done on heterogeneous data sets from multiple sources, domains and data modalities, demonstrating the scalability and transferability of the proposed model. In summary, this paper makes the following contributions:
We propose a new task, MultiMedia Event Extraction, and construct the first annotated news dataset as a benchmark to support deep analysis of cross-media events.
We develop a weakly supervised training framework, which utilizes existing single-modal annotated corpora, and enables joint inference without cross-modal annotation.
Our proposed method, WASE, is the first to leverage structured representations and graph-based neural networks for multimedia common space embedding.
2 Task Definition
2.1 Problem Formulation
Each input document consists of a set of images and a set of sentences. Each sentence can be represented as a sequence of tokens, where each token comes from the document vocabulary. The input also includes a set of entities extracted from the document text. An entity is an individually unique object in the real world, such as a person, an organization, a facility, a location, a geopolitical entity, a weapon, or a vehicle. The objective of M2E2 is twofold:
Event Extraction: Given a multimedia document, extract a set of event mentions, where each event mention has a type and is grounded on a text trigger word, an image, or both.
Note that for an event, the textual and visual groundings can both exist, which means the visual event mention and the textual event mention refer to the same event. For example, in Figure 1, deploy indicates the same Movement.Transport event as the image. We consider an event a text-only event if it has only a textual mention, an image-only event if it has only a visual mention, and a multimedia event if both exist.
Argument Extraction: The second task is to extract a set of arguments for each event mention. Each argument has an argument role type, and is grounded on a text entity, an image object (represented as a bounding box), or both.
The arguments of visual and textual event mentions are merged if they refer to the same real-world event, as shown in Figure 1.
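To make the two outputs concrete, the following is a minimal sketch of how event mentions and arguments could be stored. The class names, field names, the image identifier, and the bounding box values are all illustrative assumptions, not part of the dataset release.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max); the coordinate convention is an assumption
BoundingBox = Tuple[float, float, float, float]


@dataclass
class Argument:
    role: str                                 # argument role type, e.g. "Vehicle"
    text_entity: Optional[str] = None         # grounding in text, if any
    image_box: Optional[BoundingBox] = None   # grounding in an image, if any


@dataclass
class EventMention:
    event_type: str                           # e.g. "Movement.Transport"
    trigger: Optional[str] = None             # text trigger word, if any
    image_id: Optional[str] = None            # grounding image, if any
    arguments: List[Argument] = field(default_factory=list)

    def modality(self) -> str:
        """Return 'text-only', 'image-only', or 'multimedia'."""
        if self.trigger is not None and self.image_id is not None:
            return "multimedia"
        if self.trigger is not None:
            return "text-only"
        if self.image_id is not None:
            return "image-only"
        raise ValueError("an event mention needs a trigger, an image, or both")


# Loosely modelled on the Figure 1 example: the trigger comes from text,
# the Vehicle argument only from the image (identifier and box are made up).
event = EventMention(
    event_type="Movement.Transport",
    trigger="deploy",
    image_id="image_1",
    arguments=[Argument(role="Vehicle", image_box=(10.0, 20.0, 200.0, 150.0))],
)
print(event.modality())
```

Keeping both groundings optional on the same object mirrors the task definition, where a single event may be mentioned in text, in an image, or in both.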
2.2 The M2E2 Dataset
We define multimedia newsworthy event types by exhaustively mapping between the event ontology in the NLP community for the news domain (ACE, https://catalog.ldc.upenn.edu/ldc2006T06) and the event ontology in the CV community for the general domain (imSitu yatskar2016situation). They cover the largest event training resources in each community. Table 1 shows the selected complete intersection, which contains 8 ACE types (i.e., 24% of all ACE types), mapped to 98 imSitu types (i.e., 20% of all imSitu types). We expand the ACE event role set by adding visual arguments from imSitu, such as instrument, bolded in Table 1. This set encompasses 52% of ACE events in a news corpus, which indicates that the selected eight types are salient in the news domain. We reuse these existing ontologies because they enable us to train event and argument classifiers for both modalities without requiring joint multimedia event annotation as training data.
|Event Type (text/image counts)||Argument Role (text/image counts)|
|Movement.Transport (223/53)||Agent (46/64), Artifact (179/103), Vehicle (24/51), Destination (120/0), Origin (66/0)|
|Conflict.Attack (326/27)||Attacker (192/12), Target (207/19), Instrument (37/15), Place (121/0)|
|Conflict.Demonstrate (151/69)||Entity (102/184), Police (3/26), Instrument (0/118), Place (86/25)|
|Justice.ArrestJail (160/56)||Agent (64/119), Person (147/99), Instrument (0/11), Place (43/0)|
|Contact.PhoneWrite (33/37)||Entity (33/46), Instrument (0/43), Place (8/0)|
|Contact.Meet (127/79)||Participant (119/321), Place (68/0)|
|Life.Die (244/64)||Agent (39/0), Instrument (4/2), Victim (165/155), Place (54/0)|
|Transaction.TransferMoney (33/6)||Giver (19/3), Recipient (19/5), Money (0/8)|
We collect 108,693 multimedia news articles from the Voice of America (VOA) website (https://www.voanews.com/) from 2006 to 2017, covering a wide range of newsworthy topics such as military, economy and health. We select 245 documents as the annotation set based on three criteria: (1) Informativeness: articles with more event mentions; (2) Illustration: articles with more images; (3) Diversity: articles that balance the event type distribution regardless of true frequency. The data statistics are shown in Table 2. Among all of these events, 192 textual event mentions and 203 visual event mentions can be aligned as 309 cross-media event mention pairs. The dataset can be divided into 1,105 text-only event mentions, 188 image-only event mentions, and 395 multimedia event mentions.
|Source||Event Mention||Argument Role|
We follow the ACE event annotation guidelines walker2006ace for textual event and argument annotation, and design an annotation guideline (http://blender.cs.illinois.edu/software/m2e2/ACL2020_M2E2_annotation.pdf) for multimedia event annotation.
One unique challenge in multimedia event annotation is to localize visual arguments in complex scenarios, where images include a crowd of people or a group of objects. It is hard to delineate each of them using a bounding box. To solve this problem, we define two types of bounding boxes: (1) union bounding box: for each role, we annotate the smallest bounding box covering all constituents; and (2) instance bounding box: for each role, we annotate a set of bounding boxes, where each box is the smallest region that covers an individual participant (e.g., one person in the crowd), following the VOC2011 Annotation Guidelines (http://host.robots.ox.ac.uk/pascal/VOC/voc2011/guidelines.html). Figure 2 shows an example. Eight NLP and CV researchers complete the annotation work with two independent passes and reach an Inter-Annotator Agreement (IAA) of 81.2%. Two expert annotators perform adjudication.
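The relation between the two box types can be illustrated with a short sketch: the union bounding box of a role is the smallest box that covers all of its instance bounding boxes. The `(x_min, y_min, x_max, y_max)` convention and the example coordinates are assumptions made for illustration only.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed convention


def union_box(instance_boxes: List[Box]) -> Box:
    """Smallest box that covers every instance box of one role."""
    if not instance_boxes:
        raise ValueError("need at least one instance box")
    x_mins, y_mins, x_maxs, y_maxs = zip(*instance_boxes)
    return (min(x_mins), min(y_mins), max(x_maxs), max(y_maxs))


# Three made-up instance boxes for individual people in a crowd.
crowd = [(10, 40, 60, 120), (55, 35, 110, 125), (100, 50, 150, 118)]
print(union_box(crowd))  # (10, 35, 150, 125)
```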
3.1 Approach Overview
As shown in Figure 3, the training phase contains three tasks: text event extraction (Section 3.2), visual situation recognition (Section 3.3), and cross-media alignment (Section 3.4). We learn a cross-media shared encoder, a shared event classifier, and a shared argument classifier. In the testing phase (Section 3.5), given a multimedia news article, we encode the sentences and images into the structured common space, and jointly extract textual and visual events and arguments, followed by cross-modal coreference resolution.
3.2 Text Event Extraction
Text Structured Representation: As shown in Figure 4, we choose Abstract Meaning Representation (AMR) banarescu2013abstract to represent text because it includes a rich set of 150 fine-grained semantic roles. To encode each text sentence, we run the CAMR parser (wang-xue-pradhan:2015:NAACL-HLT; wang-xue-pradhan:2015:ACL-IJCNLP; wang-EtAl:2016:SemEval) to generate an AMR graph, based on the named entity recognition and part-of-speech (POS) tagging results from Stanford CoreNLP (manning-EtAl:2014:P14-5). To represent each word in a sentence, we concatenate its pre-trained GloVe word embedding (pennington2014glove), POS embedding, entity type embedding and position embedding. We then input the word sequence to a bi-directional long short-term memory (Bi-LSTM) graves2013speech network to encode the word order and get the representation of each word. Given the AMR graph, we apply a Graph Convolutional Network (GCN) (kipf2016semi) to encode the graph contextual information following (liu2018jointly). Each GCN layer updates the representation of a word from its neighbour nodes in the AMR graph, using parameters that depend on the edge type between the two nodes and a gate following (liu2018jointly), with the Sigmoid function as the nonlinearity. We take the hidden states of the last GCN layer for each word as the common-space representation, where the common space is the shared (multimedia) embedding space. For each entity, we obtain its representation by averaging the embeddings of its tokens.
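As an illustration of the graph encoding step, the sketch below implements one gated, edge-typed graph convolution layer in plain Python. It is one plausible reading of the description above, not the authors' implementation: the scalar form of the gate, the parameter layout (`weights`, `biases`, `gate_weights`), and the toy values are all assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gcn_layer(
    nodes: List[Vector],
    edges: List[Tuple[int, int, str]],   # (target node i, neighbour j, edge type)
    weights: Dict[str, Matrix],          # one weight matrix per edge type (assumed)
    biases: Dict[str, Vector],           # one bias vector per edge type (assumed)
    gate_weights: Dict[str, Vector],     # one gate vector per edge type (assumed)
) -> List[Vector]:
    """Update every node from its gated, edge-typed neighbour messages."""
    dim = len(nodes[0])
    sums = [[0.0] * dim for _ in nodes]
    for i, j, etype in edges:
        message = [m + b for m, b in zip(matvec(weights[etype], nodes[j]), biases[etype])]
        # Scalar gate in (0, 1) controlling how much of neighbour j reaches node i.
        gate = sigmoid(sum(g * x for g, x in zip(gate_weights[etype], nodes[j])))
        for d in range(dim):
            sums[i][d] += gate * message[d]
    # Sigmoid nonlinearity applied to the aggregated messages.
    return [[sigmoid(x) for x in row] for row in sums]


# Toy graph: node 0 receives a message from node 1 over an "ARG0" edge.
toy_nodes = [[1.0, 0.0], [0.0, 1.0]]
toy_edges = [(0, 1, "ARG0")]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = gcn_layer(
    toy_nodes,
    toy_edges,
    weights={"ARG0": identity},
    biases={"ARG0": [0.0, 0.0]},
    gate_weights={"ARG0": [0.0, 0.0]},
)
print(updated)
```

Stacking several such layers, each with its own parameters, would let information travel over longer paths in the AMR graph.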
Event and Argument Classifier: We classify each word into event types and classify each entity into an argument role. We use the BIO tag schema to decide trigger word boundaries, i.e., adding the prefix B- to the type label to mark the beginning of a trigger, I- for inside, and O for none.
We take ground truth text entity mentions as input following ji2008refining during training, and obtain testing entity mentions using a named entity extractor (LinACL2019).
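The BIO schema used for trigger boundaries can be sketched as a small decoding routine that turns per-word tags into trigger spans. The function name and the handling of stray I- tags (they are ignored) are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def decode_bio(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, event_type) for each trigger span."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    current: Optional[str] = None
    for idx, tag in enumerate(tags):
        # An I- tag continues the open span only if the type matches.
        if tag.startswith("I-") and current == tag[2:]:
            continue
        if current is not None:
            spans.append((start, idx, current))  # close the open span
            start, current = None, None
        if tag.startswith("B-"):
            start, current = idx, tag[2:]
        # "O" tags and stray "I-" tags open nothing.
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


example_tags = ["O", "B-Movement.Transport", "O", "B-Conflict.Attack", "I-Conflict.Attack"]
print(decode_bio(example_tags))
```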
3.3 Image Event Extraction
Image Structured Representation: To obtain image structures similar to AMR graphs, and inspired by situation recognition yatskar2016situation, we represent each image with a situation graph, that is, a star-shaped graph as shown in Figure 4, where the central node is labeled as a verb (e.g., destroying), and the neighbor nodes are arguments labeled as noun-role pairs, where the noun (e.g., ship) is derived from WordNet synsets miller1995wordnet to indicate the entity type, and the role (e.g., item) indicates the part played by the entity in the event, based on FrameNet fillmore2003background. We develop two methods to construct situation graphs from images and train them using the imSitu dataset yatskar2016situation as follows.
(1) Object-based Graph: Similar to extracting entities to get candidate arguments, we employ the most similar task in CV, object detection, and obtain the object bounding boxes detected by a Faster R-CNN ren2015faster model trained on Open Images kuznetsova2018open with 600 object types (classes). We employ a VGG-16 CNN (simonyan2014very) to extract visual features of an image and another VGG-16 to encode the bounding boxes. Then we apply a Multi-Layer Perceptron (MLP) to predict a verb embedding from the image features and another MLP to predict a noun embedding for each bounding box.
We compare the predicted verb embedding to all verbs in the imSitu taxonomy in order to classify the verb, and similarly compare each predicted noun embedding to all imSitu nouns, which results in probability distributions over verbs and nouns. The verb and noun embeddings used for comparison are word embeddings initialized with GloVe pennington2014glove. We use another MLP with one hidden layer followed by Softmax to classify the role for each object.
Given verb and role-noun annotations for an image (from the imSitu corpus), we define the situation loss functions on these predictions.
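The comparison between a predicted embedding and a vocabulary of word embeddings can be sketched as follows. This is a minimal illustration that assumes dot-product similarity followed by Softmax and a negative log-likelihood loss. The two-dimensional toy vectors stand in for GloVe embeddings, and the function names are hypothetical.

```python
import math
from typing import Dict, List


def softmax_over_vocab(pred: List[float], vocab: Dict[str, List[float]]) -> Dict[str, float]:
    """Score a predicted embedding against every vocabulary embedding."""
    scores = {w: sum(p * e for p, e in zip(pred, emb)) for w, emb in vocab.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: v / total for w, v in exps.items()}


def nll(probs: Dict[str, float], gold: str) -> float:
    """Negative log-likelihood of the annotated label."""
    return -math.log(probs[gold])


# Toy verb vocabulary; real embeddings would be initialized with GloVe.
verbs = {"destroying": [1.0, 0.0], "marching": [0.0, 1.0]}
predicted_verb_embedding = [2.0, 0.0]  # stand-in for the MLP output

verb_probs = softmax_over_vocab(predicted_verb_embedding, verbs)
verb_loss = nll(verb_probs, "destroying")
print(verb_probs, verb_loss)
```

The same routine applies to nouns by swapping in the noun vocabulary and the per-box predicted noun embedding.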
(2) Attention-based Graph: State-of-the-art object detection methods only cover a limited set of object types, such as the 600 types defined in Open Images. Many salient objects such as bomb, stone and stretcher are not covered in these ontologies. Hence, we propose an open-vocabulary alternative to the object-based graph construction model. To this end, we construct a role-driven attention graph, where each argument node is derived by a spatially distributed attention (heatmap) conditioned on a role. More specifically, we use a VGG-16 CNN to extract a convolutional feature map for each image, which can be regarded as attention keys for local regions. Next, for each role defined in the situation recognition ontology (e.g., agent), we build an attention query vector by concatenating the role embedding with the image feature as context and applying a fully connected layer.
Then, we compute the dot product of each query with all keys, followed by Softmax, which forms a heatmap on the image.
We use the heatmap to obtain a weighted average of the feature map to represent the argument of each role in the visual space.
Similar to the object-based model, we embed this argument representation, compare it to the imSitu noun embeddings to define a distribution, and define a classification loss function. The verb embedding and the verb prediction probability and loss are defined in the same way as in the object-based method.
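The three attention steps described above (query construction, dot product with Softmax, heatmap-weighted averaging) can be sketched as below. The feature map is flattened into a list of local-region vectors, and all dimensions, parameter names, and toy values are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]


def role_attention(
    role_emb: Vector,
    image_feat: Vector,
    keys: List[Vector],        # flattened feature map, one vector per local region
    W: List[Vector],           # fully connected layer weights (hypothetical)
    b: Vector,                 # fully connected layer bias (hypothetical)
) -> Tuple[Vector, Vector]:
    """Return (heatmap over regions, attended argument representation)."""
    context = role_emb + image_feat  # list concatenation: [role ; image feature]
    query = [sum(w * x for w, x in zip(row, context)) + b_i for row, b_i in zip(W, b)]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    heatmap = [e / total for e in exps]  # Softmax over regions, sums to 1
    dim = len(keys[0])
    argument = [sum(h * key[d] for h, key in zip(heatmap, keys)) for d in range(dim)]
    return heatmap, argument


# Two regions with two-dimensional features; the query ends up favouring region 0.
heatmap, argument = role_attention(
    role_emb=[1.0],
    image_feat=[0.0],
    keys=[[2.0, 0.0], [0.0, 2.0]],
    W=[[1.0, 0.0], [0.0, 1.0]],
    b=[0.0, 0.0],
)
print(heatmap, argument)
```

Because the heatmap is defined over all regions instead of a fixed detector vocabulary, this formulation is not tied to a predefined list of object classes.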
Event and Argument Classifier: We use either the object-based or attention-based formulation and pre-train it on the imSitu dataset yatskar2016situation. Then we apply a GCN to obtain the structured embedding of each node in the common space, similar to Equation 1. This yields common-space embeddings for the image and for each of its argument nodes. We use the same classifiers as defined in Equation 2 to classify each visual event and argument using the common space embedding.
3.4 Cross-Media Joint Training
In order to make the event and argument classifiers shared across modalities, the image and text graphs should be encoded into the same space. However, it is extremely costly to obtain parallel text and image event annotations. Hence, we use event and argument annotations in separate modalities (i.e., ACE and imSitu datasets) to train classifiers, and simultaneously use VOA news image and caption pairs to align the two modalities. To this end, we learn to embed the nodes of each image graph close to the nodes of the corresponding caption graph, and far from those in irrelevant caption graphs. Since there is no ground truth alignment between the image nodes and caption nodes, we use image and caption pairs for weakly supervised training, to learn a soft alignment from each word to image objects and vice versa.
The soft alignment is defined between each word of the caption sentence and each object of the image. Then, we compute a weighted average of softly aligned nodes for each node in the other modality.
We define the alignment cost of the image-caption pair as the Euclidean distance between each node and its aligned representation.
We use a triplet loss to pull relevant image-caption pairs close while pushing irrelevant ones apart, contrasting each pair with a randomly sampled negative image that does not match the caption. Note that in order to learn the alignment between the image and the trigger word, we treat the image as a special object when learning cross-media alignment.
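The alignment procedure can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the authors' exact formulation: the soft alignment weights are taken to be a Softmax over dot products, the alignment cost sums plain Euclidean distances over all nodes, and the margin of 1.0 is a hypothetical value.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aligned(sources: List[Vector], targets: List[Vector]) -> List[Vector]:
    """For each source node, the softly weighted average of the target nodes."""
    out = []
    for s in sources:
        weights = softmax([dot(s, t) for t in targets])
        out.append([sum(w * t[d] for w, t in zip(weights, targets)) for d in range(len(s))])
    return out


def alignment_cost(words: List[Vector], objects: List[Vector]) -> float:
    """Sum of distances between each node and its aligned representation."""
    cost = sum(math.dist(w, a) for w, a in zip(words, aligned(words, objects)))
    cost += sum(math.dist(o, a) for o, a in zip(objects, aligned(objects, words)))
    return cost


def triplet_loss(words, pos_objects, neg_objects, margin: float = 1.0) -> float:
    """Hinge loss: the matching image should cost less than the negative one."""
    return max(0.0, margin + alignment_cost(words, pos_objects) - alignment_cost(words, neg_objects))


# Toy caption nodes, a matching image, and a mismatched negative image.
caption = [[1.0, 0.0], [0.0, 1.0]]
matching_image = [[1.0, 0.0], [0.0, 1.0]]
negative_image = [[-1.0, 0.0], [0.0, -1.0]]
print(alignment_cost(caption, matching_image), alignment_cost(caption, negative_image))
print(triplet_loss(caption, matching_image, negative_image))
```

Only image-caption pairing is needed as supervision here, which is what makes the training weakly supervised with respect to node-level alignment.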
The common space enables the event and argument classifiers to share weights across modalities, and be trained jointly on the ACE and imSitu datasets, by minimizing the following objective functions:
All tasks are jointly optimized:
[Table 3: Event Mention and Argument Role results for each training setting and model under Text-Only Evaluation, Image-Only Evaluation, and Multimedia Evaluation.]
3.5 Cross-Media Joint Inference
In the test phase, our method takes a multimedia document with sentences and images as input. We first generate the structured common embedding for each sentence and each image, and then compute pairwise similarities between them. We pair each sentence with the closest image, and aggregate the features of each word in the sentence with its aligned representation from the image by weighted averaging.
Here the aligned representation is derived from the image using Equation 4. We use the aggregated word features to classify each word into an event type, and the aggregated entity features, defined analogously, to classify each entity into a role, with the multimedia classifiers in Equation 2. Similarly, for each image we find the closest sentence, compute the aggregated multimedia features of the image and its objects, and feed them into the shared classifiers (Equation 3) to predict visual events and argument roles. Finally, we corefer the cross-media events of the same event type if the similarity is higher than a threshold.
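The inference steps can be sketched as three small helpers: nearest-image pairing, weighted feature aggregation, and threshold-based coreference. The mixing weight `gamma`, the threshold value, and the data layout are hypothetical, since the text above does not fix them.

```python
from typing import List, Tuple


def closest(similarities: List[List[float]]) -> List[int]:
    """For each sentence (row), the index of the most similar image (column)."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarities]


def aggregate(feature: List[float], aligned: List[float], gamma: float) -> List[float]:
    """Weighted average of a node's own feature and its aligned representation."""
    return [(1.0 - gamma) * f + gamma * a for f, a in zip(feature, aligned)]


def corefer(
    text_events: List[Tuple[str, int]],    # (event type, sentence index)
    image_events: List[Tuple[str, int]],   # (event type, image index)
    similarities: List[List[float]],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Pair text and image events of the same type whose similarity is high enough."""
    pairs = []
    for i, (t_type, s_idx) in enumerate(text_events):
        for j, (v_type, m_idx) in enumerate(image_events):
            if t_type == v_type and similarities[s_idx][m_idx] > threshold:
                pairs.append((i, j))
    return pairs


# Two sentences and two images with made-up similarity scores.
sims = [[0.2, 0.9], [0.7, 0.1]]
pairing = closest(sims)
mixed = aggregate([1.0, 0.0], [0.0, 1.0], gamma=0.25)
coref_pairs = corefer(
    [("Conflict.Attack", 0), ("Life.Die", 1)],
    [("Conflict.Attack", 1), ("Life.Die", 1)],
    sims,
    threshold=0.5,
)
print(pairing, mixed, coref_pairs)
```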
4.1 Evaluation Setting
Evaluation Metrics We conduct evaluation on text-only, image-only, and multimedia event mentions in the M2E2 dataset (Section 2.2). We adopt the traditional event extraction measures, i.e., Precision, Recall and F1. For text-only event mentions, we follow ji2008refining; li2013joint: a textual event mention is correct if its event type and trigger offsets match a reference trigger; and a textual event argument is correct if its event type, offsets, and role label match a reference argument. We make a similar definition for image-only event mentions: a visual event mention is correct if its event type and image match a reference visual event mention; and a visual event argument is correct if its event type, localization, and role label match a reference argument. A visual argument is correctly localized if the Intersection over Union (IoU) of the predicted bounding box with the ground truth bounding box is over 0.5. Finally, we define a multimedia event mention to be correct if its event type and trigger offsets (or the image) match the reference trigger (or the reference image). The arguments of multimedia events are either textual or visual arguments, and are evaluated accordingly. To generate bounding boxes for the attention-based model, we threshold the heatmap using an adaptive value defined relative to the peak value of the heatmap. Then we compute the tightest bounding box that encloses all of the thresholded region. Examples are shown in Figure 7 and Figure 8.
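The localization criterion and the heatmap-to-box conversion can be sketched as follows. The IoU threshold of 0.5 comes from the text. The `ratio` parameter that scales the peak value is a hypothetical placeholder, and the grid-cell box convention is an assumption.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def heatmap_box(heatmap: List[List[float]], ratio: float) -> Tuple[int, int, int, int]:
    """Tightest box (in grid cells) around cells at or above ratio * peak."""
    peak = max(max(row) for row in heatmap)
    cells = [
        (x, y)
        for y, row in enumerate(heatmap)
        for x, value in enumerate(row)
        if value >= ratio * peak
    ]
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def correctly_localized(pred: Box, gold: Box) -> bool:
    return iou(pred, gold) > 0.5


toy_heatmap = [
    [0.0, 0.1, 0.0],
    [0.1, 0.8, 0.6],
    [0.0, 0.2, 0.1],
]
print(heatmap_box(toy_heatmap, ratio=0.5))  # (1, 1, 3, 2)
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```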
Baselines The baselines include: (1) Text-only models: We use the state-of-the-art models JMEE (liu2018jointly) and GAIL Zhang2019 for comparison. We also evaluate the effectiveness of cross-media joint training by including a version of our model trained only on ACE, denoted as WASE. (2) Image-only models: Since we are the first to extract newsworthy events, and the most similar work, situation recognition, cannot localize arguments in images, we use our model trained only on the image corpus as baselines. Our visual branch has two versions, object-based and attention-based, denoted as WASEobj and WASEatt. (3) Multimedia models: To show the effectiveness of structured embedding, we include a baseline by removing the text and image GCNs from our model, which is denoted as Flat. The Flat baseline ignores edges and treats images and sentences as sets of vectors. We also compare to the state-of-the-art cross-media common representation model, Contrastive Visual Semantic Embedding VSE-C shi2018learning, by training it the same way as WASE.
Parameter Settings The common space dimension is . The dimension is for image position embedding and feature map, and for word position embedding, entity type embedding, and POS tag embedding. The layer of GCN is .
4.2 Quantitative Performance
As shown in Table 3, our complete methods (WASEatt and WASEobj) outperform all baselines in the three evaluation settings in terms of F1. The comparison with other multimedia models demonstrates the effectiveness of our model architecture and training strategy. The advantage of structured embedding is shown by the better performance over the flat baseline. Our model outperforms its text-only and image-only variants on multimedia events, showing the inadequacy of single-modal information for complex news understanding. Furthermore, our model achieves better performance on text-only and image-only events, which demonstrates the effectiveness of the multimedia training framework in knowledge transfer between modalities.
WASEobj and WASEatt are both superior to the state of the art, and each has its own advantages. WASEobj predicts more accurate bounding boxes since it is based on a Faster R-CNN pretrained on bounding box annotations, resulting in a higher argument precision, while WASEatt achieves a higher argument recall as it is not limited by the predefined object classes of the Faster R-CNN.
Furthermore, to evaluate the cross-media event coreference performance, we pair textual and visual event mentions in the same document, and calculate Precision, Recall and F1 to compare with ground truth event mention pairs. (We do not use coreference clustering metrics because we only focus on mention-level cross-media event coreference instead of the full coreference in all documents.) As shown in Table 4, WASEobj outperforms all multimedia embedding models, as well as the rule-based baseline using event type matching. This demonstrates the effectiveness of our cross-media soft alignment.
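The pair-level scoring described above can be sketched as a comparison of two sets of (textual mention, visual mention) pairs. The mention identifiers below are made up for illustration.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (textual mention id, visual mention id)


def pair_prf(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Precision, Recall and F1 over cross-media event mention pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


predicted_pairs = {("t1", "v1"), ("t2", "v3"), ("t4", "v2")}
gold_pairs = {("t1", "v1"), ("t2", "v2")}
precision, recall, f1 = pair_prf(predicted_pairs, gold_pairs)
print(precision, recall, f1)
```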
4.3 Qualitative Analysis
Our cross-media joint training approach successfully boosts both event extraction and argument role labeling performance. For example, in Figure 5 (a), the text-only model cannot extract the Justice.ArrestJail event, but the joint model can use the image as background to detect the event type. In Figure 5 (b), the image-only model detects the image as Conflict.Demonstrate, but the sentences in the same document help our model not to label it as Conflict.Demonstrate. Compared with multimedia flat embedding in Figure 6, WASE can learn structures such as Artifact is on top of Vehicle, and the person in the middle of Justice.ArrestJail is Entity instead of Agent.
4.4 Remaining Challenges
One of the biggest challenges in M2E2 is localizing arguments in images. Object-based models suffer from the limited object types. The attention-based method is not able to precisely localize the objects for each argument, since there is no supervision on attention extraction during training. For example, in Figure 7, the Entity argument in the Conflict.Demonstrate event is correctly predicted as troops, but its localization is incorrect because the Place argument shares similar attention. When one argument targets too many instances, attention heatmaps tend to lose focus and cover the whole image, as shown in Figure 8.
5 Related Work
Text Event Extraction Text event extraction has been extensively studied for the general news domain ji2008refining; liao2011acquiring; huang2012bootstrapped; li2013joint; chen2015event; nguyen2016joint; P18-1048; D18-1156; D18-1158; Zhang2019; liu2018jointly; wang2019open; yang2019exploring; wadden2019entity. Multimedia features have been proven to effectively improve text event extraction (zhang2017improving).
Visual Event Extraction “Events” in NLP usually refer to complex events that involve multiple entities in a large span of time (e.g. protest), while in CV chang2016bi; zhang2007semantic; ma2017joint events are less complex single-entity activities (e.g. washing dishes) or actions (e.g. jumping). Visual event ontologies focus on daily life domains, such as “dogshow” and “wedding ceremony” (perera2012trecvid). Moreover, most efforts ignore the structure of events including arguments. There are a few methods that aim to localize the agent gu2018ava; li2018recurrent; duarte2018videocapsulenet, or classify the recipient sigurdsson2016hollywood; kato2018compositional; wu2019long of events, but none of them detects the complete set of arguments for an event. The most similar to our work is Situation Recognition (SR) (yatskar2016situation; mallya2017recurrent), which predicts an event and multiple arguments from an input image, but does not localize the arguments. We use SR as an auxiliary task for training our visual branch, but exploit object detection and attention to enable localization of arguments. silberer2018grounding redefine the problem of visual argument role labeling with event types and bounding boxes as input. Different from their work, we extend the problem scope to include event identification and coreference, and further advance argument localization by proposing an attention framework which does not require bounding boxes for training or testing.
Multimedia Representation Multimedia common representation has attracted much attention recently toselli2007viterbi; weegar2015linking; hewitt2018learning; chen2019uniter; liu2019focus; Su_2019_ICCV; Sarafianos_2019_ICCV; sun2019videobert; tan2019lxmert; li2019unicoder; li2019visualbert; lu2019vilbert; sun2019contrastive; rahman2019m; su2019vl. However, previous methods focus on aligning images with their captions, or regions with words and entities, but ignore structure and semantic roles. UniVSE Wu2019UniVSERV incorporates entity attributes and relations into cross-media alignment, but does not capture graph-level structures of images or text.
6 Conclusions and Future Work
In this paper we propose a new task of multimedia event extraction and set up a new benchmark. We also develop a novel multimedia structured common space construction method to take advantage of the existing image-caption pairs and single-modal annotated data for weakly supervised training. Experiments demonstrate its effectiveness as a new step towards semantic understanding of events in multimedia data. In the future, we aim to extend our framework to extract events from videos, and make it scalable to new event types. We plan to expand our annotations by including event types from other text event ontologies, as well as new event types not in existing text ontologies. We will also apply our extraction results to downstream applications including cross-media event inference, timeline generation, etc.
This research is based upon work supported in part by U.S. DARPA AIDA Program No. FA8750-18-2-0014 and U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.