1 Introduction
Financial markets are sensitive to breaking news on economic events. Specifically for crude oil markets, it is observed in [brandt2019macro] that news about macroeconomic fundamentals and geopolitical events affects the price of the commodity. Apart from fundamental market factors, such as supply, demand, and inventory, oil price fluctuation is strongly influenced by economic development, conflicts, wars, and breaking news [wu2021effective]. Therefore, accurate and timely automatic identification of events in news items is crucial for making timely trading decisions. Commodity news typically contains a few key pieces of information: (i) analysis of recent commodity price movements (up, down, or flat), (ii) a retrospective view of notable event(s) that led to such a movement, and (iii) forecasts or forward-looking analysis of the supply-demand situation as well as projected commodity price targets. Here is a snippet taken from a piece of crude oil news: "U.S. crude stockpiles soared by 1.350 million barrels in December from a mere 200 million barrels to 438.9 million barrels; due to this oversupply, crude oil prices plunged more than 50% as of Tuesday."
There is a small number of corpora in the Finance and Economics domain, such as SENTiVENT [jacobs2021sentivent], but all are focused on company-specific events and are used mainly for the purpose of stock price prediction. As acknowledged by [jacobs2021sentivent], due to the lack of annotated datasets in the Finance and Economics domain, only a handful of supervised approaches exist for Financial Information Extraction. To the best of our knowledge, there is no available annotated corpus for crude oil or any other commodity. We aim to contribute towards resource building and facilitate future research in this area by introducing the CrudeOilNews corpus, an event extraction dataset focusing on Macro-economic, Geo-political, and crude oil supply and demand events. The annotation schema used is aligned with the ACE (Automatic Content Extraction) and ERE (Entities, Relations, and Events) standards, so event extraction systems developed for ACE/ERE can be used readily on this corpus.
The contributions of this work are as follows:
-
Introduced the CrudeOilNews corpus, the first annotated corpus for crude oil, consisting of 425 crude oil news articles. It is an ACE/ERE-like corpus with the following annotated: (i) Entity mentions, (ii) Events (triggers and argument roles), and (iii) Event Properties (Polarity, Modality, and Intensity);
-
Introduced a new event property to capture a more complete representation of events. The new property, INTENSITY, captures whether an existing event further intensifies or eases;
-
Addressed the obvious class imbalance in event properties by over-sampling minority classes and adding them to the corpus through data augmentation;
-
Used Human-in-the-Loop Active Learning to expand the corpus with model inference while optimizing human annotation effort to focus on just the less confident (and likely less accurate) predictions.
2 Related Work
2.1 Annotation Methodologies
The annotation methodologies presented here are conceived based on the annotation standards of ACE and ERE. An extensive comparison has been made in [aguilar2014comparison], where the authors analyzed and summarized the different annotation approaches. Subsequently, a number of works expanded these earlier annotation standards; for example, [ogorman-etal-2016-richer] introduced the Richer Event Description (RED) corpus and a methodology that annotates entities, events, times, entity relations (co-reference and partial co-reference), and event relations (temporal, causal, and sub-event). We have strived to align with the ACE/ERE programs as closely as possible, but have made minor adaptations to cater to special characteristics found in crude oil news. Tense and Genericity, defined in ACE2005, are dropped from our annotation scope, while the new property, Intensity, is introduced.
2.2 Finance and Economic Domain
In the domain of Finance and Economics, the majority of available datasets cover company-related events and are used mainly for company stock price prediction. A variety of methods are used in economic event detection, such as hand-crafted rule sets, ontology knowledge bases, and techniques like distant, weak, or semi-supervised training. For a more targeted discussion, we focus only on manually annotated datasets suitable for supervised training.
In [jacobs2018economic] and [lefever2016classification], the authors introduced a dataset focused on annotating continuous trigger spans for 10 types and 64 subtypes of company-economic events in a corpus of English and Dutch economic text; some examples of event types are Buy ratings, Debt, Dividend, Merger & acquisition, Profit, and Quarterly results. As a continuation of that work, the authors introduced SENTiVENT, a fine-grained ACE/ERE-like dataset, in [jacobs2021sentivent]. Just like the earlier work, their focus is mainly on company-related and financial events. Among the defined event typology, the only category that overlaps with our work is “Macroeconomics”, an event category that captures a broad range of events that are not company-specific, such as economy-wide phenomena and governmental policy in news. While they choose to remain at a broad level, our work complements theirs by defining key events in the Macro-economic and Geo-political categories at a detailed level.
As part of the search for commodity-news-related resources, we came across RavenPack’s event taxonomy. (RavenPack is an analytics provider for financial services; among their services is finance and economic news sentiment analysis. More information can be found on their website.)
3 Dataset Collection
First, we crawled crude oil news articles from investing.com (https://www.investing.com/commodities/crude-oil-news), a financial platform and financial/business news aggregator that is considered one of the top three global financial websites in the world.
We crawled news articles dating from Dec 2015 to Jan 2020 (50 months). From this pool of crude oil news, we uniformly sampled 175 news articles throughout the 50-month period to ensure events are evenly represented and not skewed towards the topics of a particular time window. These 175 news articles were duly annotated by two annotators and form the gold-standard annotation. For the purposes of assessing inter-annotator agreement and evaluating the annotation guidelines, 25 articles were selected out of the gold-standard dataset as the adjudicated set (ADJ).

4 Annotation Setup
The dataset is annotated using Brat rapid annotation tool [stenetorp2012brat], a web-based tool for text annotation. An example of a sentence annotated using Brat is shown in Figure 1.
The annotation process is designed to achieve high inter-annotator agreement (IAA). One of the criteria is that the annotators should possess domain knowledge in business, finance, and economics. It is imperative for the annotators to understand financial and macro-economic terms and concepts in order to interpret the text accurately and annotate events accordingly. For instance, sentences containing macro-economic terms such as contango, quantitative easing, and backwardation require annotators to have finance and economics domain knowledge. To meet these criteria, we recruited two annotators from a pool of undergraduate students from the School of Business of a local university. Annotators were then given annotation training and provided with clear annotation schemas and examples. Every piece of text was duly annotated by two annotators independently.
The annotation was done sentence by sentence in the following layers:
-
Layer 1: Identify and annotate entity mentions.
-
Layer 2: Annotate events by identifying event triggers.
-
Layer 3: Using event triggers as anchors, identify and link surrounding entity mentions to their respective events. Annotate the argument roles each entity mention plays with respect to the events identified.
-
Layer 4: Annotate event properties: modality, polarity and intensity.
After each layer, an adjudicator assessed the annotation and evaluated inter-annotator agreement before finalizing the annotation. For cases with annotation discrepancies, the adjudicator acted as the tie-breaker to decide on the final annotation. Once a layer was finalized, annotators proceeded with the next layer. This was done to ensure that errors from one layer did not accumulate in subsequent layers of annotation.
4.1 Annotation Guidelines
This section describes our definition of events and principles for annotation of entity mentions and events. Annotation of events is further divided into (i) annotating entity mentions, (ii) annotating event triggers, (iii) linking entity mentions to their respective events and identifying the argument roles each entity plays, (iv) assigning the right property labels to Polarity, Modality and Intensity.
4.2 Entity Mention
An entity mention is a reference to an object or a set of objects in the world, including named entities, nominal entities, and pronouns. For simplicity and convenience, values and temporal expressions are also treated as entity mentions in this work. There are 21 entity types identified and annotated in the dataset; see Appendix C for the full list. Nominal entities relating to Finance and Economics are annotated. Apart from crude oil-related terms, below are some examples of nominal entities found in the corpus that were duly annotated:
-
attributes : price, futures, contract, imports, exports, consumption, inventory, supply, production
-
economic entity : economic growth, economy, market(s), economic outlook, growth, dollar
4.3 Events
Events are defined as ‘specific occurrences’ involving ‘specific participants’. The occurrence of an event is marked by the presence of an event trigger. In addition to identifying triggers, all of the participants of each event are also identified. An event’s participants are entities that play a role in that event. Details and rules for identifying event triggers and event arguments are covered below:
Event Triggers
Our annotation of event triggers is aligned with ERE, where an event trigger (known as an event nugget in the shared task in [mitamura2015event]) can be either a single word (main verb, noun, adjective, adverb) or a continuous multi-word phrase. Here are some examples found in the dataset:
-
Verb: Houti rebels attacked Saudi Arabia.
-
Noun: The government slapped sanctions against its petroleum….
-
Adjective: A fast growing economy has…
-
Multi-verb: The market bounced back….
An event trigger is the minimal span of text that most succinctly expresses the occurrence of an event. Annotators are instructed to keep the trigger as small as possible while maintaining the core lexical semantics of the event. For example, for the phrase “Oil price edged lower”, only the trigger word “lower” is annotated.
Event Arguments
After event triggers and entity mentions are annotated, entities need to be linked up to form events. An event contains an event trigger and a set of event arguments. Referring to Figure 1, the event trigger soared is linked to seven entity mentions via arcs. The argument role of each entity mention is labeled on the respective arc, while entity types are labeled in various colours above each entity span. This information is also summarized in tabular format in Table 1.
Entity | Argument Role |
---|---|
U.S. | SUPPLIER |
crude | ITEM |
stockpiles | ATTRIBUTE |
1,350 million barrels | DIFFERENCE |
December | REFERENCE_POINT_TIME |
200 million barrels | INITIAL_VALUE |
438.9 million barrels | FINAL_VALUE |
4.3.1 Event Typology
According to [brandt2019macro], who analyzed RavenPack’s sentiment scores for each event type against the oil price, events that move commodity prices are geo-political, macro-economic, and commodity supply-and-demand in nature. Based on RavenPack’s event taxonomy, we have defined a set of 18 oil-related event types as our annotation scope. Event types with the corresponding example trigger words and example key arguments are listed in Table 2. See Appendix D for the event schemas of all 18 event types.
Event Type | Example Trigger Word(s) | Example key arguments |
---|---|---|
1. CAUSED-MOVEMENT-DOWN-LOSS | cut, trim, reduce, disrupt, curb, squeeze, choked off | oil production, oil supplies, interest rate, growth forecast |
2. CAUSED-MOVEMENT-UP-GAIN | boost, revive, ramp up, prop up, raise | oil production, oil supplies, growth forecast |
3. CIVIL-UNREST | violence, turmoil, fighting, civil war, conflicts | Libya, Iraq |
4. CRISIS | crisis, crises | debt, financial |
5. EMBARGO | embargo, sanction | Iraq, Russia |
6. GEOPOLITICAL-TENSION | war, tensions, deteriorating relationship | Iraq-Iran |
7. GROW-STRONG | grow, picking up, boom, recover, expand, strong, rosy, improve, solid | oil production, economic growth, U.S. dollar, crude oil demand |
8. MOVEMENT-DOWN-LOSS | fell, down, less, drop, tumble, collapse, plunge, downturn, slump, slide, decline | crude oil price, U.S. dollar, gross domestic product (GDP) growth |
9. MOVEMENT-FLAT | unchanged, flat, hold, maintained | oil price |
10. MOVEMENT-UP-GAIN | up, gain, rise, surge, soar, swell, increase, rebound | oil price, U.S. employment data, gross domestic product (GDP) growth |
11. NEGATIVE-SENTIMENT | worries, concern, fears | |
12. OVERSUPPLY | glut, bulging stock level, excess supplies | |
13. POSITION-HIGH | high, highest, peak, highs | |
14. POSITION-LOW | low, lowest, lows, trough | |
15. PROHIBITION | ban, bar, prohibit | exports, imports |
16. SHORTAGE | shortfall, shortage, under-supplied | oil supply |
17. SLOW-WEAK | slow, weak, tight, lackluster, falter, weaken, bearish, slowdown, crumbles | global economy, regional economy, economic outlook, crude oil demand |
18. TRADE-TENSIONS | price war, trade war, trade dispute | U.S.-China |
4.3.2 Event Property: Event Polarity, Modality and Intensity
After events are identified, each event is also assigned a label for each of the following three properties.
Polarity
(POSITIVE and NEGATIVE)
An event has the value POSITIVE unless there is an explicit indication that the event did not take place, in which case NEGATIVE is assigned.
Modality
(ASSERTED and OTHER)
Event modality determines whether the event represents a “real” occurrence. ASSERTED is assigned if the author or speaker refers to it as though it were a real occurrence, and OTHER otherwise. OTHER covers believed events, hypothetical events, commanded and requested events, threats, proposed events, discussed events, desired events, promised events, and other unclear constructs.
Intensity
(NEUTRAL, INTENSIFIED, and EASED)
Event intensity is a new event property, specifically created for this work to better represent events found in this corpus. Oftentimes, events reported in Crude Oil News are about the intensity of an existing event, whether the event is further intensified or eased.
Examples of events where one is INTENSIFIED and the other one EASED:
…could hit Iraq’s output and deepen a supply shortfall. [INTENSIFIED]
Libya’s civil strife has been eased by potential peace talks. [EASED]
The event strife (civil unrest) in sentence (2) is not an event with negative polarity, because the event has actually taken place, just with reduced intensity. The INTENSITY label is used to capture this interpretation accurately, showing that the civil unrest event has indeed taken place but now with updated ‘intensity’.
With these three event properties, we can annotate and capture all essential information about an event. To further illustrate this point, consider the list of examples of complex events below:
OPEC cancelled a planned easing of output cuts. [NEGATIVE, OTHER, EASED]
In order to end the global crisis, OPEC may hesitate to implement a planned loosening of output curbs. [NEGATIVE, OTHER, EASED]
Oil prices rose to $110 a barrel on rumours of a renewed strife. [POSITIVE, OTHER, INTENSIFIED]
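To make the combined annotation concrete, a possible in-memory representation of one fully annotated event (trigger, arguments, and the three properties) is sketched below. The field names and class structure are ours for illustration only and do not reflect the corpus’s actual serialization format; the example instantiates the “soared” event from Figure 1 / Table 1 with its default property values.

```python
# Minimal sketch of how one fully annotated event could be represented in code.
# Field names are illustrative and do not reflect the corpus's actual file format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Argument:
    text: str
    entity_type: str
    role: str

@dataclass
class Event:
    trigger: str
    event_type: str
    arguments: List[Argument] = field(default_factory=list)
    polarity: str = "POSITIVE"      # POSITIVE / NEGATIVE
    modality: str = "ASSERTED"      # ASSERTED / OTHER
    intensity: str = "NEUTRAL"      # NEUTRAL / INTENSIFIED / EASED

# The "soared" event from Figure 1 / Table 1, with default property values.
soared = Event(
    trigger="soared",
    event_type="MOVEMENT-UP-GAIN",
    arguments=[
        Argument("U.S.", "Country", "SUPPLIER"),
        Argument("crude", "Commodity", "ITEM"),
        Argument("stockpiles", "Financial attribute", "ATTRIBUTE"),
        Argument("1,350 million barrels", "Quantity", "DIFFERENCE"),
        Argument("December", "Date", "REFERENCE_POINT_TIME"),
    ],
)
print(soared.intensity)  # -> "NEUTRAL"
```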
4.4 Inter-Annotator Agreement
Inter-annotator agreement (IAA) is a good indicator of how clear our annotation guidelines are, how uniformly annotators understand them, how robust the event typology is, and, overall, how feasible the annotation task is. We evaluate IAA on each annotated category separately (see Table 3 for the list) using the most common measurement, Cohen’s Kappa, with the exception of entity spans and trigger spans. These two annotations are made at the token level, forming spans of a single token or multiple continuous tokens. For the sub-tasks of entity mention detection and trigger detection, the token-level span annotations were unitized to compute IAA; this approach is similar to unitizing and measuring agreement in Named Entity Recognition [mathet2015unified]. According to [hripcsak2005agreement], Cohen’s Kappa is not the most appropriate measurement of IAA for Named Entity Recognition. In [deleger2012building], the authors provided an in-depth analysis of why this is the case and proposed the use of pairwise F1 score as the measurement. Hence, for the evaluation of entity spans and trigger spans, we report both F1 and “token-level” Kappa. Both scores were measured without taking into account un-annotated tokens (labelled “O”). For the remaining annotation categories, we report only Cohen’s Kappa, as this is the standard measure of IAA for classification tasks. We calculate agreement by comparing the annotation outcomes of the two annotators with each other, arbitrarily treating one as the ‘gold’ reference. We also scored each annotator separately against the adjudicated (ADJ) set, which consists of 25 documents produced by the adjudicator correcting and combining the manual annotations of those documents. The final scores are calculated by averaging the results across all comparisons. Table 3 shows the average agreement scores for all annotation categories.
The event nugget scoring method introduced in [liu2015evaluation] was not used here because its assessment is rolled up into “Span”, “Type”, and “Realis”, which is too coarse to show IAA for each annotation category.
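To make the span-level IAA computation concrete, the sketch below shows one way to compute token-level Cohen’s Kappa and a pairwise F1 score while ignoring tokens that both annotators left un-annotated (“O”). It assumes the two annotators’ per-token label sequences are already aligned and uses scikit-learn; the exact filtering and averaging choices here are our simplification, not the evaluation script behind Table 3.

```python
# Minimal sketch (not the exact evaluation script): token-level IAA for span
# annotations, ignoring tokens both annotators left un-annotated ("O").
from sklearn.metrics import cohen_kappa_score, f1_score

def span_iaa(tokens_a, tokens_b):
    """tokens_a, tokens_b: aligned lists of per-token labels from two annotators."""
    # Keep only positions where at least one annotator assigned a label.
    pairs = [(a, b) for a, b in zip(tokens_a, tokens_b) if not (a == "O" and b == "O")]
    labels_a = [a for a, _ in pairs]
    labels_b = [b for _, b in pairs]
    kappa = cohen_kappa_score(labels_a, labels_b)
    # Pairwise F1: arbitrarily treat annotator A as the 'gold' reference.
    f1 = f1_score(labels_a, labels_b, average="micro")
    return kappa, f1

# Toy example with BIO-style trigger labels
a = ["O", "B-TRIGGER", "O", "O", "B-TRIGGER", "I-TRIGGER"]
b = ["O", "B-TRIGGER", "O", "B-TRIGGER", "B-TRIGGER", "I-TRIGGER"]
print(span_iaa(a, b))
```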
Task | Cohen’s Kappa | F1 Score |
---|---|---|
Entity spans | 0.82 | 0.91 |
Trigger spans | 0.68 | 0.75 |
Entity Type | 0.89 | - |
Event Type | 0.79 | - |
Argument Role | 0.78 | - |
Event Polarity | 0.70 | - |
Event Modality | 0.63 | - |
Event Intensity | 0.59 | - |
Analysis
We benchmark these IAA scores against the ‘strength of agreement’ Kappa ranges set out by [landis1977measurement]. Most annotation categories achieved substantial agreement, with the exception of Intensity classification. This is because classifying Intensity is more challenging: some of the cue words for determining event intensity are themselves trigger words. For example: “Oversupply could rise next year when Iraq starts to export more oil.”
The word rise here is a cue word indicating that the oversupply might be further INTENSIFIED, but it could also be misinterpreted as a separate event. On the other hand, we achieve very high agreement on identifying entity spans. This is because the majority of entities in the news articles are Named Entities with very clear span boundaries, and classifying entities into the correct entity type is also rather straightforward. Even for nominal entities such as crude oil and oil markets, the span boundaries are clear.
The common mistake in trigger span detection and classification is differing interpretations of the minimal span of an event trigger. Examples of common annotation errors are: (i) the trigger word for “crude oil inched higher” should be just “higher”, and (ii) the trigger for “Oil pursued an upward trend” should be just “upward trend”.
Analyzing the cases where annotators disagreed, we found that most stem from differences in interpreting special concepts, for example:
-
Should the word outlook be interpreted as a forecast, or should it be considered a cue word for event modality?
-
If events surrounding US employment data are annotated, then what about unemployment? Should this be treated as employment data but negated using negative polarity?
-
How should double negation be treated? For example, in ‘failed attempt to prevent a steep drop in oil prices’, both failed and prevent are considered negative-polarity cue words, creating a double negation.
Such non-straightforward cases were handled on a case-by-case basis: the adjudicator discussed each situation with the annotators to seek consensus before finalizing an agreed annotation.
5 Expanding the Dataset
Manual annotation is labour-intensive and time-consuming, as seen in our gold-standard annotation, which consists of only 175 documents (news articles). In order to produce a dataset large enough to be useful for supervised event extraction, we utilize (1) Data Augmentation and (2) Human-in-the-Loop Active Learning.
Event Property | Before: Ratio | Before: # Events | Before: F1 | Augmentation: # Events Added | After: Ratio | After: # Events | After: F1
---|---|---|---|---|---|---|---
Polarity: POSITIVE | 97.01% | 2,855 | 0.76 | 965 | 95.40% | 3,820 | 0.76 |
Polarity: NEGATIVE | 2.99% | 88 | 0.24 | 96 | 4.60% | 184 | 0.39 |
Modality: ASSERTED | 82.94% | 2,441 | 0.71 | 771 | 80.22% | 3,212 | 0.74 |
Modality: OTHER | 17.06% | 502 | 0.35 | 290 | 19.78% | 792 | 0.42 |
Intensity: NEUTRAL | 93.78% | 2,760 | 0.76 | 745 | 87.54% | 3,505 | 0.85 |
Intensity: EASED | 3.64% | 107 | 0.36 | 196 | 7.57% | 303 | 0.49 |
Intensity: INTENSIFIED | 2.58% | 77 | 0.25 | 120 | 4.90% | 196 | 0.37 |
5.1 Data Augmentation
The main purpose of introducing augmented data is to address the issue of serious class imbalance in Event Properties in the dataset. Table 4 shows event property classification results. The “Before” columns show the model’s F1-scores when trained on the gold-standard dev set; F1-scores for minority classes are rather low. As a strategy to overcome class imbalance, we manually over-sample the minority classes through data augmentation and introduce them into the dataset. To this end, we carried out data augmentation through (i) trigger word replacement and (ii) event argument replacement.
Trigger word replacement: FrameNet (https://framenet.icsi.berkeley.edu) was utilized to augment available data and to generate both diverse and valid examples. The authors of [aguilar2014comparison] pointed out that all events, relations, and attributes represented in ACE/ERE can be mapped to FrameNet representations through some adjustments. In the selected sentences, we replaced the event trigger words with words (known as lexical units in FrameNet) of the same frame. The idea is to replace the existing trigger word with another valid trigger word while maintaining the same semantic meaning (in FrameNet terms, remaining within the same frame). Through this exercise, we also introduced richer lexical variance into the dataset. For example:
The benchmark for oil prices advanced 1.29% to $74.71.
Candidates: [surged, rose, appreciated, climbed]
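As a rough illustration of this step, the sketch below uses NLTK’s FrameNet interface to collect sibling lexical units of a frame as candidate replacement triggers. The frame name, POS filter, and download step are illustrative assumptions on our part, and candidates would still need manual vetting, since a frame can also contain antonymous lexical units.

```python
# Minimal sketch (assumptions: NLTK FrameNet data is installed, and the trigger's
# frame is known or looked up beforehand; candidates still need manual vetting).
import nltk
from nltk.corpus import framenet as fn

# nltk.download("framenet_v17")  # one-time download of the FrameNet 1.7 data

def trigger_candidates(frame_name, original_trigger, pos="v"):
    """Return lexical units of the same frame (same POS) as replacement candidates."""
    frame = fn.frame(frame_name)
    candidates = []
    for lu_name in frame.lexUnit:            # keys look like 'rise.v', 'climb.v'
        lemma, lu_pos = lu_name.rsplit(".", 1)
        if lu_pos == pos and lemma != original_trigger:
            candidates.append(lemma)
    # Note: the frame may mix upward and downward movement verbs, so the
    # returned list must be filtered manually before use as augmentation.
    return sorted(set(candidates))

# Hypothetical usage for "advanced" in "The benchmark for oil prices advanced 1.29%":
print(trigger_candidates("Change_position_on_a_scale", "advance"))
```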
Event argument replacement: Replacement candidates were chosen from the pool of existing annotations, restricted to entity mentions of the same entity type playing the same argument role, as illustrated below:
…..after civil-unrest in Libya…
Candidates: [Iraq, Nigeria, Ukraine]
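A possible implementation of this replacement, sketched below, pools argument fillers from the existing annotations keyed by (entity type, argument role) and samples a different filler with the same key; the data structures are hypothetical and do not reflect the corpus’s actual annotation format.

```python
# Minimal sketch: pool argument fillers by (entity_type, argument_role) and
# sample a replacement with the same key. Annotation format here is hypothetical.
import random
from collections import defaultdict

def build_argument_pool(annotated_events):
    """annotated_events: iterable of events, each holding
    (entity_text, entity_type, argument_role) triples under 'arguments'."""
    pool = defaultdict(set)
    for event in annotated_events:
        for text, etype, role in event["arguments"]:
            pool[(etype, role)].add(text)
    return pool

def replace_argument(argument, pool):
    text, etype, role = argument
    candidates = [c for c in pool[(etype, role)] if c != text]
    return (random.choice(candidates), etype, role) if candidates else argument

events = [
    {"arguments": [("Libya", "COUNTRY", "PLACE")]},
    {"arguments": [("Iraq", "COUNTRY", "PLACE"), ("Nigeria", "COUNTRY", "PLACE")]},
]
pool = build_argument_pool(events)
print(replace_argument(("Libya", "COUNTRY", "PLACE"), pool))
```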
After adding the augmented data into training, the “After” columns in Table 4 show improved F1 scores for minority classes across all three event properties. We add the augmented data to the gold-standard dev set to form the new development set and use it to train the baseline models for human-in-the-loop active learning.
5.2 Human-in-the-loop Active Learning

Active learning is well-motivated in many modern machine learning problems where data may be abundant but labels are scarce or expensive to acquire [settles2009active]. Human-in-the-loop Active Learning is a strategy for utilizing human annotation expertise more efficiently. It is a process of training a model with available labeled data and then using the model to predict on unlabeled data. Predictions that are ‘uncertain’ (of low confidence) are then given to human experts for verification, and the verified labels are added to the pool of labeled data for training. These predictions are chosen based on uncertainty sampling, a sampling strategy that filters out predictions in which the model is least confident. This way, we narrow down the scope and have human experts work specifically on these instances. Rather than blindly adding more training data, incurring more cost and time, we target instances near the model’s decision boundary; when labeled correctly and added to the training data, they are particularly valuable for improving model performance. The whole active learning process is shown in Figure 2.
Least Confidence score: The least confidence score captures how un-confident (or uncertain) a model prediction is. For a probability distribution over a set of labels $y$ for the input $x$, the normalized least confidence score is given by

$$\phi_{LC}(x) = \big(1 - P(y^{*} \mid x)\big) \times \frac{n}{n-1} \tag{1}$$

where $P(y^{*} \mid x)$ is the highest confidence softmax score and $n$ is the number of classes for $y$. The equation normalizes the Least Confidence (LC) score to a 0-1 range, where 1 is the most uncertain and 0 the most confident; the raw uncertainty is normalized for the number of classes by multiplying by $n$ and dividing by $n-1$, so the score can be used in binary as well as multi-class classification. Any model prediction with a score above the threshold is sampled, as it is most likely to be classified wrongly and needs to be relabeled by a human annotator.
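A minimal sketch of this scoring and sampling step is given below, assuming the model exposes a softmax distribution per prediction; the toy batch and the use of the Intensity threshold (0.45, reported in Section 5.2.2) are purely illustrative.

```python
# Minimal sketch of normalized least-confidence scoring (Eq. 1) and threshold-based
# uncertainty sampling; `probs` is assumed to be a softmax distribution per prediction.
import numpy as np

def least_confidence(probs):
    """probs: array of shape (n_classes,) summing to 1. Returns LC in [0, 1]."""
    n = probs.shape[0]
    return (1.0 - probs.max()) * n / (n - 1)

def uncertainty_sample(batch_probs, threshold):
    """Return indices of predictions whose LC score exceeds the threshold."""
    scores = np.array([least_confidence(p) for p in batch_probs])
    return np.where(scores > threshold)[0]

# Toy example: three 3-class predictions, e.g. for Intensity (threshold 0.45)
batch = np.array([[0.90, 0.05, 0.05],   # confident        -> LC 0.15, not sampled
                  [0.40, 0.35, 0.25],   # uncertain        -> LC 0.90, sampled
                  [0.75, 0.15, 0.10]])  # fairly confident -> LC 0.375, not sampled
print(uncertainty_sample(batch, threshold=0.45))   # -> [1]
```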
5.2.1 Baseline models
As baselines for the first round of Active Learning, we trained a number of basic or ‘vanilla’ machine learning models, one for each sub-task, using the new development set (described in Section 5.1) as training data and the ADJ set as test data (see Table 5 for key statistics). These ‘vanilla’ models also act as a pilot study demonstrating the use of this dataset for event extraction. The following section describes how these models are trained.
Entity Mention Detection Model: We formalize the Entity Mention Detection task as multi-class token classification. Similar to the approach used in [nguyen2016joint], we employ the BIO annotation scheme to assign entity type labels to each token in a sentence. For the model architecture, we use Huggingface’s BertForTokenClassification and fine-tune it on this task.
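The sketch below shows the general shape of such a setup with the Hugging Face transformers library; the label set, checkpoint, and the inference-only forward pass are illustrative placeholders (an untrained classification head gives arbitrary labels), since the exact training configuration is not specified here.

```python
# Minimal sketch of BIO token classification with Hugging Face transformers.
# Label set, checkpoint and training details are illustrative placeholders.
from transformers import BertTokenizerFast, BertForTokenClassification
import torch

labels = ["O", "B-COMMODITY", "I-COMMODITY", "B-FINANCIAL_ATTRIBUTE", "I-FINANCIAL_ATTRIBUTE"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(labels))

sentence = "U.S. crude stockpiles soared by 1.350 million barrels in December".split()
enc = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():                      # inference only; fine-tuning would add a loss/optimizer loop
    logits = model(**enc).logits           # shape: (1, seq_len, num_labels)
pred_ids = logits.argmax(-1)[0].tolist()
word_ids = enc.word_ids()                  # map word-piece positions back to input words
print([(sentence[w], labels[p]) for p, w in zip(pred_ids, word_ids) if w is not None])
```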
Event Extraction Model: We jointly train Event Detection together with Argument Role Prediction using JMEE (Joint Multiple Event Extraction), an event extraction solution proposed by [liu-etal-2018-jointly]. The original version of JMEE uses GloVe word embeddings; for this work we used a modified version of JMEE that replaces GloVe with BERT [devlin-etal-2019-bert] contextualized word embeddings, and the modified code is publicly available.
Event Properties Classification: We use the BertForSequenceClassification model and fine-tune it on this task. For every event identified by the earlier models, we extract the event ‘scope’ as input for training. This ‘scope’ is made up of the trigger word(s), which act as the anchor, plus a window of surrounding tokens; for training, we use a window of 8 tokens. Using the example sentence presented in Figure 1, the ‘scope’ for the second event is “oversupply crude oil prices plunged more than 50% on”. This sequence of text is fed into the model for event property classification.
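The scope-extraction step can be sketched as follows; the symmetric split of the window around the trigger and the label ordering are our assumptions, since the text only specifies a window of 8 surrounding tokens, and the forward pass is inference-only for illustration.

```python
# Minimal sketch: build the event 'scope' around a trigger and classify an event
# property with BertForSequenceClassification. The 8-token window follows the
# paper; the symmetric split, labels, and checkpoint are illustrative assumptions.
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch

def event_scope(tokens, trigger_index, window=8):
    """Return the trigger token plus up to `window` tokens of surrounding context."""
    start = max(0, trigger_index - window // 2)
    end = min(len(tokens), trigger_index + window // 2 + 1)
    return " ".join(tokens[start:end])

intensity_labels = ["NEUTRAL", "INTENSIFIED", "EASED"]
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased",
                                                      num_labels=len(intensity_labels))

tokens = "due to this oversupply crude oil prices plunged more than 50% as of Tuesday".split()
scope = event_scope(tokens, trigger_index=tokens.index("plunged"))  # approximates the scope quoted above
enc = tokenizer(scope, return_tensors="pt")
with torch.no_grad():                                   # inference only in this sketch
    pred = model(**enc).logits.argmax(-1).item()
print(scope, "->", intensity_labels[pred])
```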
5.2.2 Experiments & Analysis
Least Confidence (LC) threshold: In order to find the optimum sample size for human relabeling, we need to determine a suitable LC threshold. We frame the uncertainty sampling exercise as a binary classification task with two outcomes: sampled and not-sampled. Apart from being used in the IAA study, the adjudicated (ADJ) set is also used here as the hold-out set to determine the best LC threshold. We checked the sampled and not-sampled instances against the ground truth in ADJ to construct a confusion matrix and obtain Precision, Recall, and F1 scores. Ideally, we want a high Recall score (sample as many erroneous cases as possible for human relabeling) as well as a high Precision score (identify only relevant instances for correction, keeping correct ones from being sampled). We experimented with LC threshold values ranging from 0 to 1 to find the threshold that produces the sampled/not-sampled split with the best F1 score (the best precision-recall pair). We carried out all iterations of active learning (described next) using the following LC thresholds: Entity Mention Detection - 0.60, Trigger Detection - 0.55, Argument Role Prediction - 0.50, Event Polarity - 0.40, Modality - 0.30, and Intensity - 0.45.
Experiments:

We carried out 5 iterations of active learning; each iteration involved 50 unlabeled crude oil news articles being labeled through model prediction. We then ran uncertainty sampling and arranged for two annotators to validate the samples and relabel them where needed. Sentences that were not sampled were deemed ‘confident’ and were therefore validated/checked by just a single annotator.
Analysis: Overall, we see improvements in model performance across all sub-tasks. As shown in Figure 3, model performance progressively improved after each iteration. As more annotated training data are added, the more ‘confident’ the model becomes and the fewer instances are sampled under uncertainty sampling in each iteration. This inverse relationship is also shown in Figure 3: as model performance (micro F1) improves, the percentage of sampled data decreases.
The least confidence sampling approach is very effective in identifying data points near the model’s decision boundary. In the case of event types, these are typically types that can easily be confused with one another. For example, the model erroneously classified a trade-tension event as Geopolitical-Tension when the right class should be Trade-Tensions; as the word ‘tension’ exists in both event types, it is understandable why the model makes such a mistake. Least confidence sampling is also able to pick up instances of minority classes: because the model has significantly less data to learn from for minority classes, it generates predictions that are less ‘confident’.
6 Corpus Statistics and Analysis
In total, we produced a final dataset consisting of 425 documents, comprising 7,059 sentences, 10,578 events, and 22,267 arguments. The breakdown is shown in Table 5.
 | Gold-standard Dev | Gold-standard Test/ADJ | Augmented | 5-Iter Active Learning |
---|---|---|---|---|
# documents | 150 | 25 | - | 250 |
# sentences | 2,557 | 377 | 372 | 3,753 |
# tokens | 68,219 | 9,754 | 12,695 | 99,884 |
# Entities | 7,120 | 1,970 | 1,838 | 19,417 |
# Events | 2,943 | 577 | 1,061 | 5,997 |
# Arguments | 5,716 | 1,276 | 1,693 | 13,582 |
6.1 Key Characteristics
We observe a few key characteristics of this corpus that are distinct from ACE2005 and ERE datasets. These need to be taken into consideration when adapting existing event extraction systems or building a new one for this corpus:
-
Obvious class imbalance in event properties distribution where the majority class outnumbers the minority classes by a large margin (see Table 4). We have attempted to minimize this margin by oversampling minority classes through data augmentation but the margin is still quite substantial;
-
Homogeneous entity types that play different argument roles (e.g., prices, indistinguishable from the entity types MONEY or UNIT-PRICE, may play different roles such as opening price, closing price, or price difference);
-
Number intensity: numbers (e.g., prices, differences, percentages of change) and dates (including dates of opening and closing prices) are abundant.
7 Conclusion and Future Work
Event extraction in the domain of finance and economics is currently limited to company-related events. To contribute to resource building in this domain, we have presented the CrudeOilNews corpus, an ACE/ERE-like corpus. The corpus contains 425 documents with more than 10,500 annotated events. We have also described the methodology used to annotate this information. Inter-annotator agreement is generally substantial and annotator performance is adequate, indicating that the annotation scheme produces consistent event annotations of high quality.
There are a number of avenues for future work. The main area to explore is expanding the annotation scope to cover more event types. Next, this work can also be expanded to cover event co-reference and event-event relations such as causal relations, main-sub-event relations, event sequences, and contradictory event relations. Besides that, the current sentence-level annotation can be extended to cater for event relations spanning multiple sentences, so that event extraction and relation extraction can be done at the document level.
8 Bibliographical References
References
Appendix A Detailed Corpus Statistics
Event type | Gold Dev | Gold Test/ADJ | Augmented | 5-Iter AL | Final Count | Ratio
---|---|---|---|---|---|---
1. Cause-movement-down-loss | 359 | 37 | 179 | 426 | 1,001 | 9.46% |
2. Cause-movement-up-gain | 72 | 7 | 16 | 83 | 178 | 1.68% |
3. Civil-unrest | 57 | 3 | 43 | 47 | 150 | 1.42% |
4. Crisis | 19 | 4 | 11 | 34 | 68 | 0.64% |
5. Embargo | 115 | 7 | 44 | 44 | 210 | 1.99% |
6. Geopolitical-tension | 42 | 10 | 25 | 125 | 202 | 1.91% |
7. Grow-strong | 167 | 16 | 71 | 280 | 534 | 5.05% |
8. Movement-down-loss | 697 | 149 | 213 | 1,881 | 2,940 | 27.70% |
9. Movement-flat | 47 | 2 | 14 | 42 | 105 | 0.99% |
10. Movement-up-gain | 683 | 178 | 203 | 1,637 | 2,701 | 25.53% |
11. Negative-sentiment | 116 | 43 | 73 | 307 | 539 | 5.10%
12. Oversupply | 65 | 9 | 45 | 112 | 231 | 2.18% |
13. Position-high | 132 | 33 | 19 | 377 | 561 | 5.30%
14. Position-low | 99 | 47 | 24 | 323 | 493 | 4.66% |
15. Prohibition | 39 | 1 | 3 | 6 | 49 | 0.46% |
16. Shortage | 31 | 1 | 10 | 5 | 47 | 0.44% |
17. Slow-weak | 164 | 27 | 52 | 262 | 505 | 4.77% |
18. Trade-tensions | 39 | 3 | 16 | 6 | 64 | 0.61% |
Total | 2,943 | 577 | 1,061 | 5,997 | 10,578 |
Appendix B Active Learning details
 | Entity | Trigger | Arguments | Polarity | Modality | Intensity
---|---|---|---|---|---|---
Threshold | 0.60 | 0.55 | 0.50 | 0.40 | 0.30 | 0.45
Iter. | % of # tokens | % of # tokens | % of Trigger-Entity Pairs | % of events | % of events | % of events
1 | 72 | 68 | 75 | 73 | 69 | 79 |
2 | 65 | 63 | 71 | 69 | 53 | 65 |
3 | 61 | 61 | 65 | 63 | 49 | 61 |
4 | 53 | 59 | 62 | 51 | 41 | 58 |
5 | 42 | 49 | 51 | 49 | 39 | 49 |
Iter. | Training Set | Entity | Trigger | Argument | Polarity | Modality | Intensity
---|---|---|---|---|---|---|---
- | Gold Dev | 0.71 | 0.74 | 0.56 | 0.74 | 0.71 | 0.75 |
- | Gold Dev + Augmented (New Dev) | 0.72 | 0.75 | 0.57 | 0.75 | 0.73 | 0.75 |
1 | New Dev + 50 docs | 0.72 | 0.75 | 0.59 | 0.75 | 0.76 | 0.73 |
2 | New Dev + 100 docs | 0.78 | 0.79 | 0.62 | 0.79 | 0.81 | 0.77 |
3 | New Dev + 150 docs | 0.83 | 0.81 | 0.64 | 0.81 | 0.83 | 0.81 |
4 | New Dev + 200 docs | 0.85 | 0.83 | 0.65 | 0.83 | 0.85 | 0.82 |
5 | New Dev + 250 docs | 0.86 | 0.85 | 0.69 | 0.84 | 0.89 | 0.83 |
Note: The new development set is used to train the baseline models described in Section 5.2.1.
Appendix C Entity Mention Types
Entity Type | Examples |
---|---|
1. Commodity | oil, crude oil, Brent, West Texas Intermediate (WTI), fuel, U.S Shale, light sweet crude, natural gas |
2. Country** | Libya, China, U.S, Venezuela, Greece |
3. Date** | 1998, Wednesday, Jan. 30, the final quarter of 1991, the end of this year |
4. Duration** | two years, three-week, 5-1/2-year, multiyear, another six months |
5. Economic Item | economy, economic growth, market, economic outlook, employment data, currency, commodity-oil |
6. Financial attribute | supply, demand, output, production, price, import, export |
7. Forecast target | forecast, target, estimate, projection, bets |
8. Group | global producers, oil producers, hedge funds, non-OECD, Gulf oil producers |
9. Location** | global, world, domestic, Middle East, Europe |
10. Money** | $60, USD 50 |
11. Nationality** | Chinese, Russian, European, African |
12. Number** | (any numerical value that does not have a currency sign) |
13. Organization** | OPEC, Organization of Petroleum Exporting Countries, European Union, U.S. Energy Information Administration, EIA |
14. Other activities | (free text) |
15. Percent** | 25%, 1.4 percent |
16. Person** | Trump, Putin (and other political figures) |
17. Phenomenon | (free text) |
18. Price unit | $100-a-barrel, $40 per barrel, USD58 per barrel |
19. Production Unit | 170,000 bpd, 400,000 barrels per day, 29 million barrels per day |
20. Quantity | 1.3500 million barrels, 1.8 million gallons, 18 million tonnes |
21. State or province** | Washington, Moscow, Cushing, North America |
Appendix D Event Schema
D.1 Movement-down-loss, Movement-up-gain, Movement-flat
Example sentence: [Globally] [crude oil] [futures] surged [$2.50] to [$59 per barrel] on [Tuesday].
Role | Entity Type | Argument Text |
---|---|---|
Type | Nationality, Location | globally |
Place | Country, Group, Organization, Location, State or province, Nationality | |
Supplier_consumer | Organization, Country, State_or_province, Group, Location | |
Reference_point_time | Date | Tuesday |
Initial_reference_point | Date | |
Final_value | Percentage, Number, Money, Price_unit, Production_unit, Quantity | $59 per barrel |
Initial_value | Percentage, Number, Money, Price_unit, Production_unit, Quantity | |
Item | Commodity, Economic_item | crude oil |
Attribute | Financial_attribute | futures |
Difference | Percentage, Number, Money, Production_unit, Quantity | $2.50 |
Forecast | Forecast_target | |
Duration | Duration | |
Forecaster | Organization |
D.2 Caused-movement-down-loss, Caused-movement-up-gain
Example sentence: The [IMF] earlier said it reduced its [2018] [global] [economic growth] [forecast] to [3.30%] from a [July] forecast of [4.10%].
Role | Entity Type | Argument Text |
---|---|---|
Type | Nationality, Location | global |
Place | Country, Group, Organization, Location, State or province, Nationality | |
Supplier_consumer | Organization, Country, State_or_province, Group, Location | |
Reference_point_time | Date | 2018 |
Initial_reference_point | Date | July |
Final_value | Percentage, Number, Money, Price_unit, Production_unit, Quantity | 3.30% |
Initial_value | Percentage, Number, Money, Price_unit, Production_unit, Quantity | 4.10% |
Item | Commodity, Economic_item | economic growth |
Attribute | Financial_attribute | |
Difference | Percentage, Number, Money, Production_unit, Quantity | |
Forecast | Forecast_target | forecast |
Duration | Duration | |
Forecaster | Organization | IMF |
D.3 Position-high, Position-low
Example sentence: The IEA estimates that U.S. crude oil is expected to seek higher ground until reaching a [5-year] peak in [late April] of about [17 million bpd].
Role | Entity Type | Argument Text |
---|---|---|
Reference_point_time | Date | late April |
Initial_reference_point | Date | |
Final_value | Percentage, Number, Money, Price_unit, Production_unit, Quantity | 17 million bpd |
Initial_value | Percentage, Number, Money, Price_unit, Production_unit, Quantity | |
Item | Commodity, Economic_item | |
Attribute | Financial_attribute | |
Difference | Percentage, Number, Money, Production_unit, Quantity | |
Duration | Duration | 5-year |
D.4 Slow-weak, Grow-strong
Example sentence: [U.S.] [employment data] strengthens with the euro zone.
Role | Entity Type | Argument Text |
---|---|---|
Type | Nationality, Location | |
Place | Country, Group, Organization, Location, State or province, Nationality | U.S. |
Supplier_consumer | Organization, Country, State_or_province, Group, Location | |
Reference_point_time | Date | |
Initial_reference_point | Date | |
Final_value | Percentage, Number, Money, Price_unit, Production_unit, Quantity | |
Initial_value | Percentage, Number, Money, Price_unit, Production_unit, Quantity | |
Item | Commodity, Economic_item | employment data |
Attribute | Financial_attribute | |
Difference | Percentage, Number, Money, Production_unit, Quantity | |
Forecast | Forecast_target | |
Duration | Duration | |
Forecaster | Organization |
D.5 Prohibition
Example sentence: [Congress] banned most [U.S.] [crude oil] [exports] on [Friday] after price shocks from the 1973 Arab oil embargo.
Role | Entity Type | Argument Text |
---|---|---|
Imposer | Organization, Country, Nationality, State or province, Person, Group, Location | Congress |
Imposee | Organization, Country, Nationality, State or province, Group | U.S. |
Item | Commodity, Economic_item | crude oil |
Attribute | Financial_attribute | exports |
Reference_point_time | Date | Friday |
Activity | Other_activities |
D.6 Oversupply
Example sentence: [Forecasts] for a [crude] oversupply in [West African] and [European] [markets] [early June] helped to push the Brent benchmark down more than 20% in January.
Role | Entity Type | Argument Text |
---|---|---|
Place | Country, Group, Organization, Location, State or province, Nationality | West African, European |
Reference_point_time | Date | early June |
Item | Commodity | crude |
Attribute | Financial_attribute | markets |
Difference | Production_unit | |
Forecast | Forecast_target | forecasts |
D.7 Shortage
Example Sentence: Oil reserves are within “acceptable” range in most oil consuming countries and there is no shortage in [oil] [supply] [globally], the minister added.
Role | Entity Type | Argument Text |
---|---|---|
Place | Country, State or province, Location, Nationality | |
Item | Commodity | oil |
Attribute | Financial_attribute | supply |
Type | Location | globally |
Reference_point_time | Date |
D.8 Civil Unrest
Example sentence: The drop in oil prices to their lowest in two years has caught many observers off guard, coming against a backdrop of the worst violence in [Iraq] [this decade].
Role | Entity Type | Argument Text |
---|---|---|
Place | Country, State or province, Location, Nationality | Iraq |
Reference_point_time | Date | this decade |
D.9 Embargo
Example sentence: The [Trump administration] imposed a “strong and swift” economic sanctions on [Venezuela] on [Thursday].
Role | Entity Type | Argument Text |
---|---|---|
Imposer | Organization, Country, Nationality, State or province, Person, Group, Location | Trump administration |
Imposee | Organization, Country, Nationality, State or province, Group | Venezuela |
Reference_point_time | Date | Thursday |
Note: ‘Imposee’ is not formally a word, but is used here as a shorter version of “the party on whom the action was imposed”.
D.10 Geo-political Tension
Example sentence: Deteriorating relations between [Iraq] and [Russia] [first half of 2016] ignited new fears of supply restrictions in the market.
Role | Entity Type | Argument Text |
---|---|---|
Participating_countries | Country, Group, Organization, Location, State or province, Nationality | Iraq, Russia |
Reference_point_time | Date | first half of 2016 |
D.11 Crisis
Example Sentence: Asia ’s diesel consumption is expected to recover this year at the second weakest level rate since the [2014] [Asian] [financial] crisis.
Role | Entity Type | Argument Text |
---|---|---|
Place | Country, State or province, Location, Nationality | Asian |
Reference_point_time | Date | this year |
Item | Commodity, Economic_item | financial |
D.12 Negative Sentiment
Example sentence: Oil futures have dropped due to concern about softening demand growth and awash in crude.
Note: Negative Sentiment is a special type of event; the majority of the time it contains just trigger words such as concerns, worries, and fears, with zero event arguments.