Paraphrasing vs Coreferring: Two Sides of the Same Coin

04/30/2020 ∙ by Yehudit Meged, et al. ∙ Bar-Ilan University, Allen Institute for Artificial Intelligence

We study the potential synergy between two different NLP tasks, both confronting lexical variability: identifying predicate paraphrases and event coreference resolution. First, we used annotations from an event coreference dataset as distant supervision to re-score heuristically-extracted predicate paraphrases. The new scoring improved average precision by more than 18 points over the ranking produced by the original scoring method. Then, we used the same re-ranking features as additional inputs to a state-of-the-art event coreference resolution model, which yielded modest but consistent improvements to the model's performance. The results suggest a promising direction to leverage data and models for each of the tasks to the benefit of the other.

1 Introduction

Recognizing that lexically-divergent predicates discuss the same event is a challenging task in NLP Barhom et al. (2019). Lexical resources such as WordNet Miller (1995) capture synonyms (say, tell), hypernyms (whisper, talk), and antonyms, which can be used with caution when the arguments are reversed ([a0] beat [a1], [a1] lose to [a0]). But WordNet's coverage is insufficient, in particular missing context-specific paraphrases (hide, launder in the context of money). Distributional methods, on the other hand, enjoy broader coverage, but their precision is limited because distributionally similar terms are often mutually exclusive (born, die) or temporally or causally related (sentenced, convicted).

Two prominent lines of work pertaining to identifying predicates whose meaning or referents can be matched are cross-document (CD) event coreference resolution and recognizing predicate paraphrases. The former identifies and clusters event mentions across multiple documents that refer to the same event within their respective contexts, while the latter collects pairs of event expressions that may refer to the same events in certain contexts. Table 1 illustrates this difference with examples of predicate paraphrases that are not always co-referring.

Event coreference resolution systems are typically supervised, using the ECB+ dataset Cybulska and Vossen (2014) which contains collections of news articles (documents) on different topics. Approaches for extracting predicate paraphrases are typically unsupervised, based on the similarity between the distribution of arguments Lin and Pantel (2001); Berant (2012), general paraphrasing approaches such as backtranslation Barzilay and McKeown (2001); Ganitkevitch et al. (2013); Mallinson et al. (2017), or leveraging redundant news reports of the same event Shinyama et al. (2002); Shinyama and Sekine (2006); Barzilay and Lee (2003); Zhang and Weld (2013); Xu et al. (2014); Shwartz et al. (2017).

Tara Reid has checked into Promises Treatment Center.
Actress Tara Reid entered well-known Malibu rehab center.
Lindsay Lohan checked into rehab in Malibu, California.
Director Chris Weitz is expected to direct New Moon.
Chris Weitz will take on the sequel to “Twilight”.
Gary Ross is still in negotiations to direct the sequel.
Table 1: Examples from ECB+ that illustrate the context-sensitive nature of event coreference. The event mentions are co-referable in certain contexts but are not always co-referring in practice.

In this paper, we study the potential synergy between predicate paraphrases and event coreference resolution. We show that the data and models for one task can benefit the other. In one direction, we use event coreference annotations from the ECB+ dataset as distant supervision to learn an improved scoring of predicate paraphrases in the unsupervised Chirps resource Shwartz et al. (2017). The distantly supervised scorer significantly improves upon ranking by the original Chirps scores, adding 18 points to average precision over a test sample.

In the other direction, we incorporate unlabeled data and features used for the re-scored Chirps method into a state-of-the-art (SOTA) event coreference system Barhom et al. (2019). We first establish that Chirps has substantial coverage of ECB+ coreferring mention pairs. Incorporating the Chirps source data and features then eliminates 15% of the coreference merging errors on ECB+ and yields a modest but consistent improvement across the various coreference metrics.

2 Background

2.1 Event Coreference Resolution

Event coreference resolution aims to identify and cluster event mentions that, within their respective contexts, refer to the same event. The task has two variants: one in which coreferring mentions are within the same document (within-document) and another in which coreferring mentions can be in different documents (cross-document, CD), which is the variant we focus on in this paper.

The standard dataset for CD event coreference is ECB+ Cybulska and Vossen (2014), which succeeded EECB Lee et al. (2012) and ECB Bejan and Harabagiu (2010). ECB+ contains a set of topics, each containing documents describing the same global event. Both event and entity coreference are annotated in ECB+, for within- and cross-document coreference resolution.

Models for CD event coreference utilize a range of features from lexical overlap among mention pairs and semantic knowledge from WordNet Bejan and Harabagiu (2010, 2014); Yang et al. (2015), to distributional Choubey and Huang (2017) and contextual representations Kenyon-Dean et al. (2018); Barhom et al. (2019).

The current SOTA model of Barhom et al. (2019) iteratively and intermittently learns to cluster events and entities. The mention representation consists of several components corresponding to the span representation and the surrounding context. The interdependence between the two tasks is encoded into the mention representation such that an event mention representation contains a component reflecting the current entity clustering, and vice versa. The model trains a pairwise mention scoring function which predicts the probability that two mentions refer to the same event.

2.2 Predicate Paraphrase Identification

There are 3 main approaches for identifying and collecting predicate paraphrases. The first approach considers a pair of predicate templates as paraphrases if the distributions of their argument instantiations are similar. For instance, in “[a0] quit from [a1]”, [a0] contains people names while [a1] contains job titles. A semantically-similar template like “[a0] resign from [a1]” is expected to have similar argument distributions Lin and Pantel (2001); Berant (2012). The second approach, backtranslation, is applied in a general paraphrasing setup and not specifically to predicates. The idea is that if two English phrases translate to the same term in a foreign language, across multiple foreign languages, they are likely paraphrases. This approach was first suggested by Barzilay and McKeown (2001), later adapted to the large PPDB resource Ganitkevitch et al. (2013), and also works well with neural machine translation Mallinson et al. (2017).

Finally, the third approach, on which we focus in this paper, leverages multiple news documents discussing the same event. The underlying assumption is that such redundant texts may refer to the same entities or events using lexically-divergent mentions, and such co-referring mentions are considered paraphrases. Most work used this approach to extract sentential paraphrases. When long documents are used, the first step is to align each pair of documents by sentences, either by finding sentences with shared named entities Shinyama et al. (2002) or lexical overlap Barzilay and Lee (2003); Shinyama and Sekine (2006), or by aligning pairs of predicates or arguments Zhang and Weld (2013); Recasens et al. (2013).

More recent work uses news headlines from Twitter. Such texts are concise due to Twitter's 280-character limit. Xu et al. (2014) extracted sentential paraphrases by finding pairs of tweets with a shared anchor word that discuss the same “trending topic”. Lan et al. (2017) extracted pairs of tweets that link to the same URL.

Finally, Chirps Shwartz et al. (2017), which we use in this paper, focuses on predicate paraphrases and has collected more than 5 million distinct pairs over the last 3 years. During its collection, Chirps retrieves tweets linking to news websites and aims to match tweets that refer to the same event. It extracts predicate-argument tuples from the tweets and considers as paraphrases pairs of predicates that appeared in tweets posted on the same day and whose arguments are heuristically matched.

3 Leveraging Coreference Annotations to Improve Paraphrasing

Chirps aims to recognize coreferring events by relying on the redundancy of news headlines posted on Twitter on the same day. It extracts binary predicate-argument tuples from each tweet and aligns pairs of predicates whose arguments match (supporting pairs, e.g. [Chuck Berry] died at [90], [Chuck Berry] lived until [90]). The predicate paraphrases, i.e. pairs of predicate templates (e.g. [a0] died at [a1], [a0] lived until [a1]), are then aggregated and ranked with an unsupervised heuristic scoring function. This score is proportional to the number of supporting pairs in which the two templates were paired and to the number of distinct days on which such pairings were found, relative to the total number of days over which the resource has been collected. The Chirps resource provides the scored predicate paraphrases as well as the supporting pairs for each paraphrase.
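To make the scoring concrete, here is a minimal Python sketch of a scoring function consistent with the description above (proportional to the number of supporting pairs and to the number of alignment days, relative to the collection period); the exact functional form used by Chirps may differ.

```python
def chirps_score(num_supporting_pairs: int, num_alignment_days: int,
                 total_collection_days: int) -> float:
    """Heuristic paraphrase score: grows with the number of supporting tweet
    pairs and with the number of distinct days on which the two templates
    were aligned, relative to the total collection period."""
    return num_supporting_pairs * (num_alignment_days / total_collection_days)

# e.g. 12 supporting pairs observed on 5 distinct days of a 1000-day collection
print(chirps_score(12, 5, 1000))  # 0.06
```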

Human evaluation showed that this scoring is effective, and that the percentage of correct paraphrases is higher among highly scored paraphrases. At the same time, due to the heuristic collection and scoring of predicate paraphrases in Chirps, entries in the resource may suffer from two types of errors: (1) type 1 errors, i.e., the heuristic recognized pairs of non-paraphrase predicates as paraphrases; this happens when the same arguments participate in multiple events, as in “[Police] arrest [man]” and “[Police] shoot [man]”; and (2) type 2 errors, i.e., the scoring function assigned a low score to a rare but correct paraphrase pair, as in “[a0] outgun [a1]” and “[a0] outperform [a1]”, which has a single supporting pair.

To improve the scoring of Chirps paraphrase-pairs, we train a new scorer using distant supervision. We first describe the features we extract to represent a paraphrase pair (Section 3.1). We then describe the distant supervision that we derived semi-automatically from the ECB+ training set (Section 3.2). Finally, we provide the implementation details (Section 3.3).

3.1 Features

Each paraphrase pair consists of two predicate templates, p1 and p2, accompanied by the set S of supporting pairs associated with this paraphrase pair. Each tweet included in Chirps links to a news article, whose content we retrieve. We extract the following features for a predicate paraphrase pair (see Appendix A for the full list of features):

Named Entity Coverage:

While the original method did not utilize the linked news article, we find it useful for retrieving more information about the event. Specifically, it might help mitigate errors in Chirps' argument matching mechanism, which relies on argument alignment considering only the text of the two tweets. We found that the original mechanism worked particularly well for named entities while being more error-prone for common nouns, which might require additional context.

Given a pair of tweets (t1, t2), we use spaCy Honnibal and Montani (2017) to extract their sets of named entities, E1 and E2, respectively. Ei contains the named entities mentioned in the tweet and in the first paragraph of its corresponding news article. We define the Named Entity Coverage score, NEC(t1, t2), as the maximum ratio of named entity coverage of one article by the other:

NEC(t1, t2) = max(|E1 ∩ E2| / |E1|, |E1 ∩ E2| / |E2|)
We manually annotated a small balanced training set of 121 tweet pairs and used it to tune a score threshold θ, such that pairs of tweets whose NEC score is at least θ are considered coreferring. Finally, we include the following features: the number of coreferring tweet pairs (those whose NEC score exceeds θ) and the average NEC score of these pairs.
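A minimal sketch of the NEC computation with spaCy is shown below; the helper names and the threshold value are illustrative (the actual threshold θ is tuned on the 121 annotated tweet pairs).

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with an NER component

def named_entities(text: str) -> set:
    """Named-entity strings (lowercased) found by spaCy's NER; in our setting,
    `text` is the tweet concatenated with the first paragraph of its article."""
    return {ent.text.lower() for ent in nlp(text)}

def nec_score(text1: str, text2: str) -> float:
    """Named Entity Coverage: maximum ratio of the named entities of one
    tweet+article covered by the other."""
    e1, e2 = named_entities(text1), named_entities(text2)
    if not e1 or not e2:
        return 0.0
    shared = e1 & e2
    return max(len(shared) / len(e1), len(shared) / len(e2))

THETA = 0.5  # illustrative value only; the paper tunes this on annotated pairs

def coreferring(text1: str, text2: str) -> bool:
    """Tweet pairs whose NEC score reaches the threshold are treated as coreferring."""
    return nec_score(text1, text2) >= THETA
```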

Cross-document Coreference Resolution:

We apply the state-of-the-art cross-document coreference model of Barhom et al. (2019) to data constructed such that each tweet constitutes a document and each pair of tweets (t1, t2) in S forms a topic. As input to the model, in each tweet we mark the predicate span as an event mention and the two argument spans as entity mentions. The model outputs whether the two event mentions corefer (yielding a single event coreference cluster for the two mentions) or not (yielding two singleton clusters). Similarly, it clusters the four arguments into entity coreference clusters.

Unlike Chirps, this model makes its event clustering decision based on the predicate, arguments, and context, as opposed to the arguments alone. Thus, we expect it not to cluster predicates whose arguments match lexically if their contexts or predicates do not match (first example in Table 2). In addition, the model's mention representations might help identify lexically-divergent yet semantically-similar arguments (second example in Table 2).

[Police] [arrest] [two men] in incident at Westboro Beach.
[Police] [kill] [man] in Vegas hospital who grabbed gun.
[Police] [arrest] [man] in incident at Westboro Beach.
[Officers] [seize] [guy] in incident at Westboro Beach.
Table 2: Examples of coreference errors made by Chirps and corrected by Barhom et al. (2019): 1) false positive: wrong man / two men alignment (disregarding location modifiers). 2) (hypothetical) false negative: lexically-divergent yet semantically-similar arguments.

For a given pair of tweets, we extract the following binary features with respect to the predicate mentions: perfect match when the two predicates are assigned to the same cluster, and no match when each predicate forms a singleton cluster. For the argument mentions, we extract the following features: perfect match if the two a0 arguments belong to one cluster and the two a1 arguments belong to another cluster; reversed match if at least one of the a0 arguments is clustered as coreferring with the a1 argument of the other tweet; and no match otherwise.
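The sketch below shows one way to turn the predicted cluster assignments into these binary features; the dictionary keys naming the predicate and argument mentions are hypothetical.

```python
def predicate_features(cluster_of: dict) -> dict:
    """Binary event features from the predicted cluster ids of the two predicates."""
    same = cluster_of["pred_tweet1"] == cluster_of["pred_tweet2"]
    return {"event_perfect_match": int(same), "event_no_match": int(not same)}

def argument_features(cluster_of: dict) -> dict:
    """Binary entity features from the predicted cluster ids of the four arguments."""
    a0_1, a1_1 = cluster_of["a0_tweet1"], cluster_of["a1_tweet1"]
    a0_2, a1_2 = cluster_of["a0_tweet2"], cluster_of["a1_tweet2"]
    # perfect: the two a0 arguments share one cluster, the two a1 arguments another
    perfect = (a0_1 == a0_2) and (a1_1 == a1_2) and (a0_1 != a1_1)
    # reversed: an a0 argument is clustered with the a1 argument of the other tweet
    reversed_ = (a0_1 == a1_2) or (a1_1 == a0_2)
    return {"entity_perfect": int(perfect),
            "entity_reversed": int(reversed_ and not perfect),
            "entity_no_match": int(not perfect and not reversed_)}
```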

Connected Components:

The original Chirps score of a predicate paraphrase pair is proportional to two quantities: (1) the number of supporting pairs; and (2) the ratio of the number of days on which supporting pairs were published to the entire collection period. The latter lowers the score of wrong paraphrase pairs that were mistakenly aligned on relatively few days (e.g., due to misleading argument alignments in particular events). The number of days on which the predicates were aligned is taken as a proxy for the number of global events in which the predicates co-refer. Here, we aim for a more accurate split of tweets into global events by constructing a graph with tweets as nodes and looking for connected components.

To that end, we define a bipartite graph G = (V, E), where V contains all the tweets in which p1 or p2 appeared and E contains an edge for each supporting pair in S. We compute the number of connected components in G and use it as a feature. A larger number of connected components indicates that the two predicates were aligned across a large number of global events.
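A minimal networkx sketch of this feature, assuming the tweets and supporting pairs for a candidate paraphrase pair are already collected:

```python
import networkx as nx

def connected_component_features(tweets_with_p1, tweets_with_p2, supporting_pairs):
    """Build the bipartite tweet graph of one paraphrase pair and summarize its
    connected components, each taken as a proxy for one global event."""
    g = nx.Graph()
    g.add_nodes_from(tweets_with_p1)    # tweets containing predicate p1
    g.add_nodes_from(tweets_with_p2)    # tweets containing predicate p2
    g.add_edges_from(supporting_pairs)  # one edge per supporting tweet pair
    components = list(nx.connected_components(g))
    return {
        "num_connected_components": len(components),
        "avg_component_size": sum(map(len, components)) / max(len(components), 1),
    }
```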

Clique:

We similarly build a global tweet graph for all the predicate pairs, G' = (V', E'), where V' contains all the tweets in the resource and E' contains all supporting pairs. We compute the set of cliques in G'. We assume that a pair of tweets is more likely to be coreferring if it is part of a bigger clique, whereas tweets that were paired by mistake would not share many neighbors. We extract a clique-coverage feature for a candidate paraphrase pair: the number of its supporting pairs whose two tweets appear together in a clique.
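A corresponding sketch for the clique-coverage feature over the global tweet graph follows; counting only cliques larger than a single edge is our assumption about which cliques are informative, not a detail given in the paper.

```python
import networkx as nx

def clique_coverage(all_supporting_pairs, candidate_pairs):
    """Fraction of a candidate pair's supporting tweet pairs whose two tweets
    appear together in a (non-trivial) maximal clique of the global tweet graph."""
    g = nx.Graph()
    g.add_edges_from(all_supporting_pairs)  # every supporting pair in the resource
    # assumption: ignore trivial two-node cliques, which every edge forms
    cliques = [set(c) for c in nx.find_cliques(g) if len(c) > 2]
    in_clique = sum(
        any(t1 in c and t2 in c for c in cliques) for t1, t2 in candidate_pairs
    )
    return in_clique / len(candidate_pairs) if candidate_pairs else 0.0
```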

                      Pre Annotation   Post Annotation
Train   # positive          266                803
        # negative         2040               1056
Dev     # positive           93                222
        # negative          539                318
Test    # positive          131                352
        # negative          758                411
Table 3: Statistics of the paraphrase scorer dataset. The difference in size before and after the annotation is due to omitting examples with less than 3 supporting pairs.

3.2 Distantly Supervised Labels

In order to learn to score the paraphrases, we need gold standard labels, i.e., labels indicating whether a pair of predicate templates collected by Chirps are indeed paraphrases. Instead of collecting manual annotations, we chose a low-budget distant supervision approach. To that end, we leverage the similarity between the predicate paraphrase extraction and the event coreference resolution tasks, and use the annotations from the ECB+ dataset.

As positive examples we consider all pairs of predicates from Chirps that appear in the same event cluster in ECB+, e.g., from {talk, say, tell, accord to, statement, confirm} we extract (talk, say), (talk, tell), …, (statement, confirm).

Obtaining negative examples is a bit trickier. We consider negative examples to be pairs of predicates from Chirps, which are under the same topic, but in different event clusters in ECB+, e.g., given the clusters {specify, reveal, say}, and {get}, we extract (specify, get), (reveal, get), and (say, get).
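A minimal sketch of this label extraction, assuming ECB+ event clusters are available as sets of predicate lemmas grouped by topic and that only predicates covered by Chirps are kept:

```python
from itertools import combinations

def distant_labels(topic_clusters, chirps_predicates):
    """Derive distantly supervised labels for predicate pairs from ECB+ clusters.

    topic_clusters: event clusters (sets of predicate lemmas) of one ECB+ topic.
    chirps_predicates: the set of predicates that occur in Chirps.
    """
    positives, negatives = set(), set()
    # positive: both predicates appear in the same gold event cluster
    for cluster in topic_clusters:
        positives.update(combinations(sorted(cluster), 2))
    # negative: same topic, but different gold event clusters
    for c1, c2 in combinations(topic_clusters, 2):
        negatives.update(tuple(sorted((p1, p2))) for p1 in c1 for p2 in c2)
    in_chirps = lambda pair: pair[0] in chirps_predicates and pair[1] in chirps_predicates
    return ({p for p in positives if in_chirps(p)},
            {p for p in negatives if in_chirps(p) and p not in positives})
```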

Note that the ECB+ annotations are context-dependent; thus a pair of predicates that is in principle coreferable may be annotated as non-coreferring in a given context. To reduce the rate of such false-negative examples, we validated all the negative examples and a sample of the positive examples using Amazon Mechanical Turk. Following Shwartz et al. (2017), we annotated the templates with 3 argument instantiations from their original tweets; consequently, we only included in the final data predicate pairs with at least 3 supporting pairs. We required that workers have a 99% approval rate on at least 1,000 prior tasks and pass a qualification test.

                  Accuracy   Precision   Recall     F1
GloVe Baseline        46.1        46.1      1.0   63.1
Chirps Baseline       53.7        49.9     81.3   61.8
This Work             73.8        74.1     66.5   70.1
Table 4: Test set results of the classifier and the scorer.

Each example was annotated by 3 workers. We aggregated the per-instantiation annotations using majority vote and considered a pair as positive if at least one instantiation was judged as positive. The data statistics are given in Table 3. The validation phase balanced the positive-negative proportion of instances in the data from approximately 1:7 to approximately 4:5.

3.3 Model

We trained a random forest classifier Breiman (2001) implemented in the scikit-learn framework Pedregosa et al. (2011). To tune the hyper-parameters, we ran a randomized search with 3-fold cross-validation, yielding the following values: 157 estimators, max depth of 8, minimum samples per leaf of 1, and minimum samples per split of 10. We chose a random forest over a neural model because of the small size of the training set and because it yielded the best performance on the validation set.
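A sketch of this setup in scikit-learn follows; the feature matrix is a stand-in, and the search ranges are our assumptions rather than values reported in the paper.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data: in the paper, X holds the features of Section 3.1 and y the
# distantly supervised labels of Section 3.2.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 17)), rng.integers(0, 2, size=200)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 16),
        "min_samples_leaf": randint(1, 10),
        "min_samples_split": randint(2, 20),
    },
    n_iter=50,
    cv=3,          # 3-fold cross-validation, as reported
    random_state=0,
)
search.fit(X, y)
scorer = search.best_estimator_
# The paper's selected values: n_estimators=157, max_depth=8,
# min_samples_leaf=1, min_samples_split=10.
paraphrase_scores = scorer.predict_proba(X)[:, 1]  # positive-class scores used for ranking
```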

Rank   Original            This Work
1      announce / unveil   launch / unveil
2      hit / strike        introduce / launch
3      acquire / buy       release / unveil
4      reveal / unveil     launch / start
5      accuse / say        add / bring
6      threaten / warn     attack / hit
7      announce / reveal   hit / rattle
8      say / warn          hit / rock
9      announce / launch   buy / snap up
10     kill / murder       begin / launch
AP (500)       51.4        59.5
AP (all)       62.5        80.0
Table 5: Average precision on 500 random pairs and on the entire test set, along with the top 10 ranked test set pairs by Chirps and by our method. Pairs labeled as positive are highlighted in purple.

3.4 Evaluation

We used the model for two purposes: (1) classification: determining if a pair of predicate templates are paraphrases or not; and (2) ranking the pairs based on the predicted positive class score.

Classifier Results.

In Table 4 we report the precision, recall, and accuracy scores on the distantly supervised test set from Section 3.2. We compare our classifier with two baselines: one based on the original Chirps scores and another based on the cosine similarity between the two mentions using GloVe embeddings Pennington et al. (2014), where multi-word predicates are represented by the average of their word vectors. For both baselines we used the train set to learn a threshold above which predicate pairs are considered paraphrases. Our classifier substantially improved upon the baselines in both accuracy and F1, by decreasing the false-positive error rate.
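A sketch of the GloVe baseline, assuming the vectors have been loaded into a dict; tuning the threshold for accuracy is our assumption about the selection criterion.

```python
import numpy as np

def predicate_vector(predicate: str, glove: dict, dim: int = 300) -> np.ndarray:
    """Multi-word predicates are represented by the average of their word vectors."""
    vectors = [glove[w] for w in predicate.split() if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def tune_threshold(train_pairs, train_labels, glove) -> float:
    """Pick the similarity threshold above which a pair is called a paraphrase."""
    sims = np.array([cosine(predicate_vector(p1, glove), predicate_vector(p2, glove))
                     for p1, p2 in train_pairs])
    labels = np.array(train_labels)
    candidates = np.unique(sims)
    accuracies = [((sims >= t) == labels).mean() for t in candidates]
    return float(candidates[int(np.argmax(accuracies))])
```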

Ranking Results.

Table 5 exemplifies the top-ranked predicate pairs by our scorer and by the original Chirps scorer. We report average precision on the entire test set, “AP (all)”, and on a random subset of 500 predicate pairs from the annotated data with at least 6 supporting pairs each, “AP (500)”, on which our scorer outperforms the Chirps scorer by 8 points. The differences are statistically significant according to bootstrap and permutation tests Dror et al. (2018).
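The average precision numbers can be computed with scikit-learn's ranking metric; the arrays below are placeholders standing in for the annotated test pairs and either ranker's scores.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Placeholders: gold labels of annotated test pairs and the scores assigned by
# a ranker (the original Chirps score or the random forest's probability).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.90, 0.40, 0.80, 0.35, 0.60, 0.20, 0.70, 0.10])

print(average_precision_score(y_true, scores))  # "AP (all)"-style evaluation
```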

4 Leveraging a Paraphrasing Resource to Improve Coreference

In Section 3 we showed that CD event coreference annotations can be used to improve predicate paraphrase ranking. In this section, we show that this co-dependence can be exploited in both directions, and that leveraging the improved predicate paraphrase resource can benefit CD coreference. This also serves as an extrinsic evaluation of the improved Chirps resource.

                      MUC                   B³                    CEAF-e              CoNLL
Model               R      P      F1      R      P      F1      R      P      F1      F1
Baseline           77.6   84.5   80.9    76.1   85.1   80.3    81.0   73.8   77.3    79.5
Baseline + Chirps  78.7   84.67  81.61   75.87  85.91  80.5    81.09  74.77  77.8    80.0
Table 6: Event mention coreference resolution results on the ECB+ test set.

                Train                     Dev                      Test
         Covered  Total    %       Covered  Total    %       Covered  Total    %
verbal     371      715   51.9       124      231   53.7       196      354   55.4
all        385     1195   32.2       132      390   33.8       199      702   28.3
Table 7: Chirps coverage of co-referring mention pairs in ECB+.

As a preliminary analysis, we computed Chirps' coverage of pairs of co-referring events in ECB+, and found approximately 30% coverage overall and above 50% coverage for verbal mentions only, as detailed in Table 7. (Non-verbal mentions in ECB+ include nominalizations (investigation), names (Oscars), acronyms (DUI), and more.)
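The coverage numbers in Table 7 can be estimated with a check of this kind; how mentions are lemmatized and how Chirps entries are keyed are assumptions here.

```python
from itertools import combinations

def chirps_coverage(gold_event_clusters, chirps_entries):
    """Fraction of co-referring ECB+ mention pairs (by lemma) that have a
    paraphrase entry in Chirps, in either predicate order."""
    covered = total = 0
    for cluster in gold_event_clusters:           # one cluster = one gold event
        for m1, m2 in combinations(sorted(cluster), 2):
            total += 1
            if (m1, m2) in chirps_entries or (m2, m1) in chirps_entries:
                covered += 1
    return covered, total, (100.0 * covered / total if total else 0.0)
```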

4.1 Integration Method

Barhom et al. (2019) trained a pairwise mention scoring function, s(i, j), which predicts the probability that two mentions m_i and m_j refer to the same event. The input to s(i, j) is v_{i,j} = [v(m_i); v(m_j); v(m_i) ∘ v(m_j); f(i,j)], where v(m) is the mention representation, ∘ denotes element-wise multiplication, and f(i,j) consists of various binary features. We extend the model by changing the input to the pairwise event mention scoring function to v'_{i,j} = [v(m_i); v(m_j); v(m_i) ∘ v(m_j); f(i,j); c_{i,j}], where c_{i,j} denotes the Chirps component. We compute c_{i,j} = NN(f_c(i,j)), where f_c(i,j) is the feature vector representing a pair of predicates for which there is an entry in Chirps (Section 3.1), and NN is an MLP with a single hidden layer of size 50, as illustrated in Figure 1. The rest of the model remains the same, including the model architecture, training, and inference.
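A sketch of the extended mention-pair input in PyTorch; the output size of the MLP and the naming of the inputs are assumptions (the paper specifies only a single hidden layer of size 50).

```python
import torch
import torch.nn as nn

class ChirpsComponent(nn.Module):
    """Maps the Chirps feature vector of a mention pair to a dense component
    c_{i,j} that is appended to the pairwise input."""
    def __init__(self, n_chirps_features: int, hidden: int = 50, out_dim: int = 50):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(n_chirps_features, hidden),
                                 nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, chirps_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(chirps_features)

def pairwise_input(v_i, v_j, binary_feats, chirps_component):
    """Extended input to the pairwise mention scorer: the two mention vectors,
    their element-wise product, the original binary features, and the Chirps
    component."""
    return torch.cat([v_i, v_j, v_i * v_j, binary_feats, chirps_component], dim=-1)
```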

4.2 Evaluation

Figure 1: New mention pair representation.

We evaluate event coreference performance on ECB+ using the official CoNLL scorer Pradhan et al. (2014). The reported metrics are MUC Vilain et al. (1995), B³ Bagga and Baldwin (1998), CEAF-e Luo (2005), and CoNLL F1 (the average of the MUC, B³, and CEAF-e F1 scores).

The results in Table 6 show that the Chirps-enhanced model provides a small improvement over the baseline across the metrics, most of all in the link-based MUC. The performance difference of 0.5 CoNLL F1 points is statistically significant according to bootstrap and permutation tests.

False Positive Examples
S1: […] a jury’s decision to execute Scott Peterson for the murder of his pregnant wife.  (0.237)
S2: found guilty of first-degree murder, […] he deserves to die for his crimes.
S1: […] Polygamist leader Warren Jeffs sentenced to life in prison.  (0.882)
S2: Warren Jeffs, […], convicted of sexually assaulting two girls […].
S1: […] days after a powerful quake leveled buildings and killed one person […].  (0.112)
S2: A series of powerful earthquakes […] injuring dozens and destroying hundreds of buildings.
False Negative Examples
S1: Indonesia’s West Papua province was hit by a magnitude 6.1 earthquake today.  (0.272)
S2: the latest powerful tremor to shake the region where five people were killed […].
S1: Apple Inc. […] unveiled a 17-inch MacBook Pro on Tuesday.  (0.575)
S2: The firm announced a widely expected refresh of its 17in MacBook Pro […].
S1: T-Mobile BlackBerry Q10 Pre-Order Begins April 29th for Business Customers.  (0.469)
S2: T-Mobile has announced […] Q10, with pre-registration starting on April 29th.
Table 8: Examples of false positive and false negative errors on ECB+ recovered by Chirps, together with the cosine similarity scores between the predicates, using GloVe Pennington et al. (2014).

4.3 Errors Recovered by Chirps

We analyze the cases in which incorporating knowledge from Chirps helped the model overcome the two types of error:

  1. False Positive: the original model clustered a non-coreferring mention pair together, and our model didn’t. We found 314/1,322 pairs (25.75%), exemplified in the top part of Table 8.

  2. False Negative: coreferring mention pairs that were assigned different clusters in the original model and the same cluster in ours. We found 299/2,823 pairs (10%), exemplified in the bottom part of Table 8.

Although the gap between our model and the original model of Barhom et al. (2019) is statistically significant, it is rather small. We attribute this partly to the coverage of Chirps over ECB+ (around 30%), which entails that most event mentions have the same representation as in the original model. We also note that ECB+ suffers from annotation errors, as observed by Barhom et al. (2019) and others.

5 Conclusion and Future Work

We studied the synergy between the tasks of identifying predicate paraphrases and event coreference resolution, both pertaining to consolidating lexically-divergent mentions, and showed that they can benefit each other. Using event coreference annotations as distant supervision, we learned to re-rank predicate paraphrases that were initially ranked heuristically, and managed to increase their average precision substantially. In the other direction, we incorporated knowledge from our re-ranked predicate paraphrases resource into a model for event coreference resolution, yielding a small improvement upon previous state-of-the-art results. We hope that our study will encourage future research to make progress on both tasks jointly.

References

  • A. Bagga and B. Baldwin (1998) Algorithms for scoring coreference chains. In The first international conference on language resources and evaluation workshop on linguistics coreference, Vol. 1, pp. 563–566. Cited by: §4.2.
  • S. Barhom, V. Shwartz, A. Eirew, M. Bugert, N. Reimers, and I. Dagan (2019) Revisiting joint modeling of cross-document entity and event coreference resolution. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4179–4189. External Links: Link, Document Cited by: §1, §1, §2.1.
  • R. Barzilay and L. Lee (2003) Learning to paraphrase: an unsupervised approach using multiple-sequence alignment. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 16–23. External Links: Link Cited by: §1, §2.2.
  • R. Barzilay and K. R. McKeown (2001) Extracting paraphrases from a parallel corpus. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France, pp. 50–57. External Links: Link, Document Cited by: §1.
  • C. A. Bejan and S. Harabagiu (2014) Unsupervised event coreference resolution. Computational Linguistics 40 (2), pp. 311–347. Cited by: §2.1.
  • C. Bejan and S. Harabagiu (2010) Unsupervised event coreference resolution with rich linguistic features. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 1412–1422. External Links: Link Cited by: §2.1, §2.1.
  • J. Berant (2012) Global learning of textual entailment graphs. Ph.D. Thesis, Tel Aviv University. Cited by: §1, §2.2.
  • L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §3.3.
  • P. K. Choubey and R. Huang (2017) Event coreference resolution by iteratively unfolding inter-dependencies among events. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2124–2133. External Links: Link, Document Cited by: §2.1.
  • A. Cybulska and P. T. J. M. Vossen (2014) Using a sledgehammer to crack a nut? lexical diversity and event coreference resolution. In LREC, Cited by: §1, §2.1.
  • R. Dror, G. Baumer, S. Shlomov, and R. Reichart (2018) The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1383–1392. External Links: Link, Document Cited by: §3.4.
  • J. Ganitkevitch, B. Van Durme, and C. Callison-Burch (2013) PPDB: the paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 758–764. Cited by: §1, §2.2.
  • M. Honnibal and I. Montani (2017) spaCy: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. Cited by: §3.1.
  • K. Kenyon-Dean, J. C. K. Cheung, and D. Precup (2018) Resolving event coreference with supervised representation learning and clustering-oriented regularization. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, New Orleans, Louisiana, pp. 1–10. External Links: Link, Document Cited by: §2.1.
  • H. Lee, M. Recasens, A. Chang, M. Surdeanu, and D. Jurafsky (2012) Joint entity and event coreference resolution across documents. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 489–500. External Links: Link Cited by: §2.1.
  • D. Lin and P. Pantel (2001) DIRT - Discovery of inference rules from text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 323–328. Cited by: §1, §2.2.
  • X. Luo (2005) On coreference resolution performance metrics. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, British Columbia, Canada, pp. 25–32. External Links: Link Cited by: §4.2.
  • J. Mallinson, R. Sennrich, and M. Lapata (2017) Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 881–893. External Links: Link Cited by: §1, §2.2.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §1.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §3.3.
  • J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. External Links: Link, Document Cited by: §3.4, Table 8.
  • S. Pradhan, X. Luo, M. Recasens, E. Hovy, V. Ng, and M. Strube (2014) Scoring coreference partitions of predicted mentions: a reference implementation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, Maryland, pp. 30–35. External Links: Link, Document Cited by: §4.2.
  • M. Recasens, M. Can, and D. Jurafsky (2013) Same referent, different words: unsupervised mining of opaque coreferent mentions. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 897–906. External Links: Link Cited by: §2.2.
  • Y. Shinyama, S. Sekine, and K. Sudo (2002) Automatic paraphrase acquisition from news articles. In Proceedings of the second international conference on Human Language Technology Research, pp. 313–318. Cited by: §1, §2.2.
  • Y. Shinyama and S. Sekine (2006) Preemptive information extraction using unrestricted relation discovery. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, New York City, USA, pp. 304–311. External Links: Link Cited by: §1, §2.2.
  • V. Shwartz, G. Stanovsky, and I. Dagan (2017) Acquiring predicate paraphrases from news tweets. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), Vancouver, Canada, pp. 155–160. External Links: Link, Document Cited by: §1, §1, §2.2.
  • M. Vilain, J. Burger, J. Aberdeen, D. Connolly, and L. Hirschman (1995) A model-theoretic coreference scoring scheme. In Proceedings of the 6th conference on Message understanding, pp. 45–52. Cited by: §4.2.
  • W. Xu, A. Ritter, C. Callison-Burch, W. B. Dolan, and Y. Ji (2014) Extracting lexically divergent paraphrases from twitter. Transactions of the Association for Computational Linguistics 2, pp. 435–448. Cited by: §1.
  • B. Yang, C. Cardie, and P. Frazier (2015) A hierarchical distance-dependent bayesian model for event coreference resolution. Transactions of the Association for Computational Linguistics 3, pp. 517–528. Cited by: §2.1.
  • C. Zhang and D. S. Weld (2013) Harvesting parallel news streams to generate paraphrases of event relations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1776–1786. Cited by: §1, §2.2.

Appendix A Features List

The full feature list of a given predicate paraphrase pair (Section 3.1) includes features from several categories:

A.1 Features from Chirps

  1. Number of templates: the number of different predicate paraphrase template pairs with p1 and p2 as predicates, regardless of argument ordering, e.g. { [a0] release [a1] / [a0] reveal [a1], release [a0] [a1] / reveal [a0] [a1] } yields 2.

  2. Number of supporting pairs: the total number of supporting pairs of p1 and p2 across the template variants.

  3. Number of days: the total number of days on which p1 and p2 were matched in Chirps across the template variants.

  4. Score: the maximal Chirps score across the template variants.

  5. Number of available supporting pairs: the number of supporting pairs of p1 and p2 across the template variants that were still available to download.

  6. Days of available supporting pairs: the total number of days in which the support pairs above occurred in the available tweets.

A.2 Named Entity Features

As described in Section 3.1:

  1. Above Threshold: the number of pairs with an NEC score of at least θ.

  2. Average Above Threshold: the average NEC score of pairs with a score of at least θ.

  3. Perfectly Clustered with NE Coverage: the number of pairs with an NEC score of at least θ and perfect clustering for event coreference resolution (Section A.4).

A.3 Graph Features

As described in Section 3.1:

  1. Number of connected components: the number of connected components in the tweet graph G of the paraphrase pair.

  2. Average connected component: the average size of the connected components in G.

  3. In Clique: the number of supporting pairs in S that are in a clique of the global tweet graph G'.

A.4 Cross-document Coreference Features

As described in Section 3.1:

  1. Event Perfect: number of event pairs with perfect match.

  2. Event No Match: number of event pairs with no match.

  3. Entity Perfect: number of entity pairs with perfect match.

  4. Entity Reverse: number of entity pairs with reverse match.

  5. Entity No Match: number of entity pairs with no match.
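
As a summary, here is a sketch that assembles the feature groups above into the 17-dimensional vector fed to the random forest scorer of Section 3.3; the attribute names are hypothetical.

```python
def paraphrase_feature_vector(pair) -> list:
    """Collect the per-pair features of Appendix A into a single vector; `pair`
    is assumed to expose the quantities computed in Section 3.1 as attributes."""
    return [
        # A.1 features from Chirps
        pair.num_templates, pair.num_supporting_pairs, pair.num_days,
        pair.max_chirps_score, pair.num_available_pairs, pair.days_available_pairs,
        # A.2 named entity features
        pair.nec_above_threshold, pair.avg_nec_above_threshold,
        pair.perfectly_clustered_with_nec,
        # A.3 graph features
        pair.num_connected_components, pair.avg_component_size, pair.in_clique,
        # A.4 cross-document coreference features
        pair.event_perfect, pair.event_no_match,
        pair.entity_perfect, pair.entity_reverse, pair.entity_no_match,
    ]
```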