Semantic representation has received growing attention in NLP in the past few years, and many proposals for semantic schemes have recently been put forth. Examples include Abstract Meaning Representation (AMR; Banarescu et al., 2013), Broad-coverage Semantic Dependencies (SDP; Oepen et al., 2016), Universal Decompositional Semantics (UDS; White et al., 2016), the Parallel Meaning Bank (Abzianidze et al., 2017), and Universal Conceptual Cognitive Annotation (UCCA; Abend and Rappoport, 2013). These advances in semantic representation, along with corresponding advances in semantic parsing, hold promise to benefit essentially all text understanding tasks, and have already demonstrated applicability to summarization (Liu et al., 2015; Dohare and Karnick, 2017), paraphrase detection (Issa et al., 2018), and semantic evaluation (using UCCA; see below).
In addition to their potential applicative value, work on semantic parsing poses interesting algorithmic and modeling challenges, which are often different from those tackled in syntactic parsing, including reentrancy (e.g., for sharing arguments across predicates) and the modeling of the interface with lexical semantics. Semantic parsing into such schemes has been much advanced by recent SemEval workshops, including two tasks on Broad-coverage Semantic Dependency Parsing (Oepen et al., 2014, 2015) and two tasks on AMR parsing (May, 2016; May and Priyadarshi, 2017). We expect a SemEval task on UCCA parsing to have a similar effect. Moreover, given the conceptual similarity between the different semantic representations (Abend and Rappoport, 2017), it is likely that work on UCCA parsing will directly contribute to the development of other semantic parsing technology. Furthermore, conversion scripts are available between UCCA and the SDP and AMR formats (https://github.com/huji-nlp/semstr/tree/master/semstr/convert.py). We encourage teams that participated in past shared tasks on AMR and SDP to participate using similar systems and a conversion-based protocol.
UCCA is a cross-linguistically applicable semantic representation scheme, building on the established Basic Linguistic Theory typological framework (Dixon, 2010a,b, 2012). It has demonstrated applicability to multiple languages, including English, French and German (with pilot annotation projects on Czech, Russian and Hebrew), and stability under translation (Sulem et al., 2015). It has proven useful for defining semantic evaluation measures for text-to-text generation tasks, including machine translation (Birch et al., 2016), text simplification (Sulem et al., 2018) and grammatical error correction (Choshen and Abend, 2018) (see §3).
UCCA supports rapid annotation by non-experts, assisted by an accessible annotation interface (Abend et al., 2017). The interface is powered by an open-source, flexible web application for syntactic and semantic phrase-based annotation in general, and for UCCA annotation in particular (https://github.com/omriabnd/UCCA-App).
2 Task Definition
UCCA represents the semantics of linguistic utterances as directed acyclic graphs (DAGs), where terminal (childless) nodes correspond to the text tokens, and non-terminal nodes to semantic units that participate in some super-ordinate relation. Edges are labeled, indicating the role of a child in the relation the parent represents. Nodes and edges belong to one of several layers, each corresponding to a “module” of semantic distinctions.
UCCA’s foundational layer
covers the predicate-argument structure evoked by predicates of all grammatical categories (verbal, nominal, adjectival and others), the inter-relations between them, and other major linguistic phenomena such as semantic heads and multi-word expressions. It is the only layer for which annotated corpora exist at the moment, and will thus be the target of this shared task. The layer’s basic notion is the Scene, describing a state, action, movement or some other relation that evolves in time. Each Scene contains one main relation (marked as either a Process or a State), as well as one or more Participants. For example, the sentence “After graduation, John moved to Paris” (Figure 1) contains two Scenes, whose main relations are “graduation” and “moved”. “John” is a Participant in both Scenes, while “Paris” is a Participant only in the latter. Further categories account for inter-Scene relations and the internal structure of complex arguments and relations (e.g., coordination, multi-word expressions and modification).
UCCA distinguishes primary edges, corresponding to explicit relations, from remote edges (which appear dashed in Figure 1) that allow a unit to participate in several super-ordinate relations. Primary edges form a tree in each layer, whereas remote edges enable reentrancy, forming a DAG.
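To make the graph structure concrete, the Figure 1 annotation can be sketched as a small labeled DAG in plain Python. This is an illustrative encoding only, not the UCCA toolkit's API; the node names and the simplified label subset (H, P, A, L) are assumptions for the example:

```python
# A minimal sketch of a UCCA graph as a labeled DAG.
# Edges are (parent, child, label, remote) tuples; labels are a simplified
# subset of the foundational layer: H=Parallel Scene, P=Process,
# A=Participant, L=Linker.
edges = [
    ("root", "scene1", "H", False),   # "After graduation"
    ("root", "scene2", "H", False),   # "John moved to Paris"
    ("root", "After", "L", False),    # linker relating the two Scenes
    ("scene1", "graduation", "P", False),
    ("scene1", "John", "A", True),    # remote edge: Participant shared across Scenes
    ("scene2", "John", "A", False),
    ("scene2", "moved", "P", False),
    ("scene2", "to_Paris", "A", False),
]

# Primary edges alone form a tree: every node has at most one primary parent.
primary_parents = {}
for parent, child, label, remote in edges:
    if not remote:
        assert child not in primary_parents, "primary edges must form a tree"
        primary_parents[child] = parent

# The remote edge gives "John" a second parent, making the graph a DAG.
parents_of_john = [p for p, c, _, _ in edges if c == "John"]
print(parents_of_john)  # ['scene1', 'scene2']
```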
UCCA graphs may contain implicit units with no correspondent in the text. Figure 2 shows the annotation for the sentence “A similar technique is almost impossible to apply to other crops, such as cotton, soybeans and rice.”. The sentence was used by Oepen et al. (2015) to compare different semantic dependency schemes. It includes a single Scene, whose main relation is “apply”, a secondary relation “almost impossible”, as well as two complex arguments: “a similar technique” and the coordinated argument “such as cotton, soybeans, and rice.” In addition, the Scene includes an implicit argument, which represents the agent of the “apply” relation.
While parsing technology is well-established for syntactic parsing, UCCA has several distinct properties that distinguish it from syntactic representations, most notably UCCA’s tendency to abstract away from syntactic details that do not affect argument structure. For instance, consider the following examples, where the concept of a Scene has a different rationale from the syntactic concept of a clause. First, non-verbal predicates in UCCA are represented like verbal ones, such as when they appear in copula clauses or noun phrases. Indeed, in Figure 1, “graduation” and “moved” are considered separate Scenes, despite appearing in the same clause. Second, in the same example, “John” is marked as a (remote) Participant in the graduation Scene, despite not being explicitly mentioned there. Third, consider the possessive construction in “John’s trip home”. While in UCCA “trip” evokes a Scene in which “John” is a Participant, a syntactic scheme would analyze this phrase similarly to “John’s shoes”.
The differences between the challenges posed by syntactic parsing and those posed by UCCA parsing, and more generally by semantic parsing, motivate the development of targeted parsing technology to tackle them.
The only existing parser for UCCA is TUPA (Hershcovich et al., 2017), a neural transition-based parser (see §5), which will serve as a baseline for the proposed task. Hershcovich et al. (2017) received considerable attention when presented at ACL last year, including an Outstanding Paper Award, underscoring the timeliness of this proposal.
3 Existing Application: Text-to-text Generation Evaluation
UCCA has shown applicability to text-to-text generation evaluation in three recently published works. The HUME (Human UCCA-based MT Evaluation) measure (Birch et al., 2016) is a human evaluation measure that uses UCCA to decompose the source sentence into semantic units, which are then individually evaluated manually. It thus addresses the diminishing inter-annotator agreement rates of human ranking measures with sentence length, and provides an indication of which parts of the source are incorrectly translated. UCCA is an appealing analysis for such ends, due to its cross-linguistic applicability. HUME was evaluated on four language pairs, demonstrating good inter-annotator agreement rates and correlation with human adequacy scores.
The SAMSA measure (Simplification Automatic evaluation Measure through Semantic Annotation; Sulem et al., 2018) is the first measure to address structural aspects of text simplification. It uses UCCA on the source side to compare the input and the output of a simplification system. The use of UCCA allows measuring the extent to which the meaning of the input is preserved in the output, and whether the input sentence was split into semantic units of the right granularity. Experiments show that SAMSA correlates with human system-level ranking, unlike existing measures for text simplification, which seem to focus on lexical simplification.
For grammatical error correction, USim (UCCA similarity; Choshen and Abend, 2018) is the first reference-less measure of meaning preservation for corrections; combined with a grammaticality measure, it yields a fully reference-less evaluation of both the meaning preservation and the grammaticality of corrections. Similarly to SAMSA, the use of UCCA provides a means to measure the extent to which the meaning of the input is preserved in the output. It was shown that UCCA structures are hardly affected by human corrections, but are greatly affected by corrections from systems that score low on reference-based measures.
4 Data & Resources
Competitions will be carried out in three languages: English, German and French. For English, we will use the Wikipedia UCCA corpus (henceforth Wiki) and the UCCA Twenty Thousand Leagues Under the Sea English-French-German parallel corpus (henceforth 20K Leagues), which includes manual UCCA annotation for the entire book on the German side, and for the first five chapters on the French and English sides. The statistics for the corpora are given in Table 4.
The corpora were manually annotated and reviewed by a second annotator; additionally, they passed automatic validation and normalization scripts. All UCCA corpora are freely available (https://github.com/UniversalConceptualCognitiveAnnotation) under the Creative Commons Attribution-ShareAlike 3.0 Unported license (http://creativecommons.org/licenses/by-sa/3.0).
In addition to the in-domain test set, the English part of the 20K Leagues corpus (12K tokens; Sulem et al., 2015) will serve as an out-of-domain test set. In German, the train, development and test sets will be taken from the 20K Leagues corpus, and will jointly consist of 6004 sentences, corresponding to about 136K tokens. Given the small amount of annotated data available for French, we will only provide development and test sets for this setting. We expect systems for French to use semi-supervised approaches, such as cross-lingual learning or structure projection using the parallel corpus, or to rely on datasets annotated with related formalisms, such as Universal Dependencies (Nivre et al., 2016). We will also release the full unannotated 20K Leagues corpus, tokenized and aligned (automatically) in the three languages, in order to facilitate cross-lingual approaches by participants.
We provide a validation script that can be used on system outputs (https://github.com/huji-nlp/ucca/blob/master/scripts/validate.py). Its goal is to rule out cases that are inconsistent with the UCCA annotation guidelines. For example, a Scene, defined by the presence of a Process or a State, should include at least one Participant. We will provide full documentation of the validation rules, to allow participants to integrate them as constraints into their parsers.
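The kind of constraint the script enforces can be sketched as follows; this is a simplified illustration of the Scene/Participant rule, not the actual validation code, and the graph encoding is invented for the example:

```python
# Simplified sketch of one UCCA validation rule: a Scene, identified by
# a Process (P) or State (S) edge, must also contain a Participant (A).
# Graphs are represented here as {node: [(child, label), ...]} -- an
# illustrative encoding, not the toolkit's data structure.

def scene_has_participant(graph):
    """Return the list of Scene nodes that violate the rule."""
    violations = []
    for node, children in graph.items():
        labels = {label for _, label in children}
        if labels & {"P", "S"} and "A" not in labels:
            violations.append(node)
    return violations

# "John moved to Paris": a well-formed Scene with a Process and Participants.
ok_scene = {"scene": [("John", "A"), ("moved", "P"), ("to Paris", "A")]}
# A Scene with a Process but no Participant violates the rule.
bad_scene = {"scene": [("moved", "P")]}

print(scene_has_participant(ok_scene))   # []
print(scene_has_participant(bad_scene))  # ['scene']
```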
Data sets are released in XML format, including tokenized text automatically pre-processed using spaCy (see §6), and gold-standard UCCA annotation for the train and development sets (format documentation: https://github.com/UniversalConceptualCognitiveAnnotation/docs/blob/master/FORMAT.md). An example XML file is given in Figure 3 (https://github.com/UniversalConceptualCognitiveAnnotation/docs/blob/master/toy.xml). These representations can be read and manipulated using the UCCA toolkit (https://github.com/huji-nlp/ucca). Test sets will follow the same format, but without layer 1 (i.e., without the element <layer layerID="1">).
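The layered layout can be illustrated with a toy passage and the standard library XML parser. Note that the element and attribute names below are simplified assumptions for illustration; the actual schema is specified in the FORMAT.md document linked above:

```python
import xml.etree.ElementTree as ET

# A toy passage in a *simplified* stand-in for the release format.
TOY = """
<root>
  <layer layerID="0">
    <node ID="0.1" type="Word" text="John"/>
    <node ID="0.2" type="Word" text="moved"/>
  </layer>
  <layer layerID="1">
    <node ID="1.1" type="FN">
      <edge toID="0.1" type="A"/>
      <edge toID="0.2" type="P"/>
    </node>
  </layer>
</root>
"""

root = ET.fromstring(TOY)
# Layer 0 holds the tokens; layer 1 holds the foundational-layer structure
# (and is the layer that test sets omit).
tokens = [n.get("text") for n in root.find("layer[@layerID='0']")]
edges = [(e.get("toID"), e.get("type"))
         for n in root.find("layer[@layerID='1']")
         for e in n]
print(tokens)  # ['John', 'moved']
print(edges)   # [('0.1', 'A'), ('0.2', 'P')]
```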
5 Pilot Task
A pilot task on UCCA parsing was carried out by Hershcovich et al. (2017), presenting TUPA, a transition-based DAG parser based on a BiLSTM classifier (https://github.com/huji-nlp/tupa).
Several baselines have been proposed, using different classifiers (sparse perceptron or feedforward neural network), and using conversion-based approaches that use existing parsers for other formalisms to parse UCCA by constructing a two-way conversion protocol between the formalisms. TUPA has shown superior performance over all such approaches, and will thus serve as a strong baseline for system submissions to the shared task.
Experiments were done in several settings:
English in-domain, training on the Wiki training set and testing on the Wiki test set,
English out-of-domain, training on the Wiki training set but testing on the 20K Leagues test set,
German in domain, training on the 20K Leagues training set and testing on its test set,
French in domain, training on the 20K Leagues training set and testing on its test set.
The pilot task results are summarized in Table 1. The pilot task was carried out on version 1.2 of the Wiki dataset and version 1.0 of the English 20K Leagues dataset. An unpublished experiment using a perceptron with randomly-initialized embedding inputs, instead of sparse features, yielded poor results (TUPADense). Another unpublished experiment with an ensemble of three BiLSTM models (TUPABiLSTM PoE) yielded the best results on primary edges in the in-domain setting (75% F1), but no improvement on remote edges (48.7% F1). An improvement was also observed in the out-of-domain setting, obtaining 69.6% F1 on primary edges and 28% F1 on remote edges. (For ensembling, during inference we used Product of Experts (PoE; Hinton, 2002) to combine the predictions of three models trained in the same setting, but with different random seeds.)
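The Product-of-Experts combination used in the ensembling experiment can be sketched generically as follows. This is a minimal illustration of PoE over softmax outputs, not TUPA's actual implementation, and the toy distributions are invented:

```python
import math

def product_of_experts(distributions):
    """Combine per-model action distributions by a normalized product
    (equivalently, a renormalized geometric mean), following Hinton's
    (2002) Product of Experts."""
    n_actions = len(distributions[0])
    # Multiply probabilities action-wise across the experts...
    product = [math.prod(d[a] for d in distributions) for a in range(n_actions)]
    # ...and renormalize so the result is again a distribution.
    total = sum(product)
    return [p / total for p in product]

# Three hypothetical models scoring three transition actions:
models = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
]
combined = product_of_experts(models)
print(max(range(3), key=lambda a: combined[a]))  # 0
```

Because the product sharpens agreement between experts, actions all models assign low probability are suppressed more strongly than under simple averaging.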
Taking a conversion-based approach using existing parsers yielded varying performance. The best performing approach used a conversion to bilexical trees, followed by a stack-LSTM dependency tree parser (Dyer et al., 2015), which after conversion back reached 69.9% F1 on primary edges in the in-domain setting. However, since the parser produces trees rather than DAGs, it was unable to produce any remote edges.
Table 1: Pilot task results on the Wiki (in-domain) and 20K Leagues (out-of-domain) test sets, for TUPA and for conversion-based baselines: bilexical approximation (dependency DAG parsers), tree approximation (constituency tree parser), and bilexical tree approximation (dependency tree parsers).
Finally, Hershcovich et al. (2018) showed improvements from multitask learning for UCCA parsing, using AMR (Banarescu et al., 2013), SDP (Oepen et al., 2016) and UD (Nivre et al., 2016) as auxiliary tasks (TUPABiLSTM MTL). They also carried out experiments on version 1.0 of the French 20K Leagues dataset (splitting it into training, development and test sets, despite its small size), and on version 0.9 (pre-release) of the German 20K Leagues dataset. For French and German, only UD was used as an auxiliary task in multitask learning.
TUPA will be used as a baseline for the French and German settings as well. Since the task will include no training data for French, we will use a version trained on English and German, with a delexicalized feature set (not included in the pilot task).
6 Evaluation
Participants in the task will be evaluated in four settings:
English in-domain setting, using the Wiki corpus.
English out-of-domain setting, using the Wiki corpus as training and development data, and 20K Leagues as test data.
German in-domain setting, using the 20K Leagues corpus.
French setting with no training data, using the 20K Leagues as development and test data.
In order to allow both a fair comparison between systems and the use of hitherto untried resources, we will hold both an open and a closed track for submissions in the English and German settings. Closed track submissions will only be allowed to use the gold-standard UCCA annotation distributed for the task in the target language, and will be limited in their use of additional resources. Concretely, the only additional data they will be allowed to use is that used by TUPA, namely the automatic annotations provided by spaCy (Honnibal and Montani, 2018; http://spacy.io): POS tags, syntactic dependency relations, and named entity types and spans. In addition, the closed track will allow the use of word embeddings provided by fastText (Bojanowski et al., 2017; http://fasttext.cc) for all languages.
Systems in the open track, on the other hand, will be allowed to use any additional resource, such as UCCA annotation in other languages, dictionaries or datasets for other tasks, provided that they make sure not to use any additional gold-standard annotation over the same texts used in the UCCA corpora. (We are not aware of any such annotation, but include this restriction for completeness.) In both tracks, we will require that submitted systems not be trained on the development data. Due to the absence of an established pilot study for French, we will only hold an open track for this setting.
The four settings and two tracks result in a total of 7 competitions, where a team may participate in anywhere between 1 and 7 of them. We will encourage submissions in each track to use their systems to produce results in all settings. In addition, we will encourage closed-track submissions to also submit to the open track.
In order to evaluate how similar an output graph is to a gold one, we use DAG F1. Formally, consider two UCCA annotations G1 and G2 that share their set of leaves (tokens). For a node v in G1 or G2, define its yield y(v) as its set of leaf descendants. Define a pair of edges (v1, u1) ∈ G1 and (v2, u2) ∈ G2 to be matching if y(u1) = y(u2) and they have the same label. Labeled precision and recall are defined by dividing the number of matching edges in G1 and G2 by the number of edges in G2 and G1, respectively. DAG F1, their harmonic mean, collapses to the common parsing F1 if G1 and G2 are trees.
This measure disregards implicit nodes. We aim to extend it to include them, defining implicit units u1 in G1 and u2 in G2 to be matching if and only if they have the same label and y(p(u1)) = y(p(u2)), where p(u) denotes the parent of a unit u.
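The labeled DAG F1 measure can be implemented directly from its definition. In this sketch, each edge is represented by the yield of its child node together with its label; this is an illustrative implementation, not the official evaluation script:

```python
from collections import Counter

def dag_f1(gold_edges, pred_edges):
    """Labeled DAG F1 over UCCA-style edges.

    Each edge is given as (yield, label), where yield is a frozenset of
    the child node's leaf tokens.  Two edges match iff their yields and
    labels are equal; precision and recall divide the match count by the
    number of predicted and gold edges, respectively.
    """
    gold, pred = Counter(gold_edges), Counter(pred_edges)
    matched = sum((gold & pred).values())  # multiset intersection
    precision = matched / sum(pred.values()) if pred else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example over "John moved": the predicted graph mislabels one edge.
gold = [(frozenset({"John"}), "A"), (frozenset({"moved"}), "P")]
pred = [(frozenset({"John"}), "A"), (frozenset({"moved"}), "A")]
print(dag_f1(gold, pred))  # 0.5
```

On trees, where each node has exactly one parent, this coincides with the usual labeled parsing F1, as stated above.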
For a more fine-grained evaluation, Precision, Recall and F1 of specific category (edge label) sets will also be reported. UCCA labels can be divided into categories that correspond to Scene elements (States, Processes, Participants, Adverbials), non-Scene elements (Elaborators, Connectors, Centers), and inter-Scene linkage (Parallel Scenes, Linkage, Ground). We will report performance for each of these sets separately, leaving out Function and Relator units, which do not belong to any of these categories.
An official evaluation script providing both coarse-grained and fine-grained scores is available (https://github.com/huji-nlp/ucca/blob/master/scripts/evaluate_standard.py).
7 Task Organizers
Daniel Hershcovich. PhD candidate at the Hebrew University of Jerusalem. firstname.lastname@example.org. Research interests: semantic parsing, structure prediction, transition-based parsing. Relevant experience: first author on the TUPA papers Hershcovich et al. (2017, 2018). Development and maintenance of the UCCA toolkit Python codebase.
Leshem Choshen. PhD candidate at the Hebrew University of Jerusalem. email@example.com. Research interests: multi-lingual text-to-text generation, multi-modal semantics. Relevant experience: first author on the GEC UCCA-based evaluation measure USim Choshen and Abend (2018).
Elior Sulem. PhD candidate at the Hebrew University of Jerusalem. firstname.lastname@example.org. Research interests: sentence-level semantics, cross-linguistic divergences, text simplification, machine translation. Relevant experience: First author on a cross-linguistic divergence study with UCCA annotation Sulem et al. (2015) and on the paper presenting the UCCA-based structural semantic evaluation for Text Simplification SAMSA Sulem et al. (2018).
Zohar Aizenbud. MSc student at the Hebrew University of Jerusalem. email@example.com. Research interests: machine translation, computational semantics.
Ari Rappoport. Associate Professor of Computer Science at The Hebrew University of Jerusalem. firstname.lastname@example.org. Research interests: computational semantics, semantic parsing, language in the brain. Relevant experience: co-developer of the UCCA scheme. Published about 60 papers in NLP conferences (ACL, NAACL, EMNLP etc.) in the last 11 years. Supervised 25 graduate student theses in NLP.
Omri Abend. Senior Lecturer (Assistant Professor) of Computer Science and Cognitive Science in the Hebrew University of Jerusalem. email@example.com. Research interests: computational semantics and specifically, cross-linguistically applicable semantic and grammatical representation, semantic parsing, corpus annotation, evaluation. Relevant experience: co-developer of the UCCA scheme, partner in all annotation, parsing and application efforts related to UCCA, publishes regularly in NLP conferences (ACL, NAACL, EMNLP etc.).
- Abend and Rappoport (2013) Omri Abend and Ari Rappoport. 2013. Universal Conceptual Cognitive Annotation (UCCA). In Proc. of ACL, pages 228–238.
- Abend and Rappoport (2017) Omri Abend and Ari Rappoport. 2017. The state of the art in semantic representation. In Proc. of ACL, pages 77–89.
- Abend et al. (2017) Omri Abend, Shai Yerushalmi, and Ari Rappoport. 2017. UCCAApp: Web-application for syntactic and semantic phrase-based annotation. Proc. of ACL System Demonstrations, pages 109–114.
- Abzianidze et al. (2017) Lasha Abzianidze, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann, Duc-Duy Nguyen, and Johan Bos. 2017. The parallel meaning bank: Towards a multilingual corpus of translations annotated with compositional meaning representations. CoRR, abs/1702.03964.
- Almeida and Martins (2015) Mariana S. C. Almeida and André F. T. Martins. 2015. Lisbon: Evaluating TurboSemanticParser on multiple languages and out-of-domain data. In Proc. of SemEval, pages 970–973.
- Banarescu et al. (2013) Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proc. of the Linguistic Annotation Workshop.
- Birch et al. (2016) Alexandra Birch, Omri Abend, Ondřej Bojar, and Barry Haddow. 2016. HUME: Human UCCA-based evaluation of machine translation. In Proc. of EMNLP, pages 1264–1274.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
- Choshen and Abend (2018) Leshem Choshen and Omri Abend. 2018. Reference-less measure of faithfulness for grammatical error correction. In Proc. of NAACL (to appear).
- Dixon (2010a) Robert M. W. Dixon. 2010a. Basic Linguistic Theory: Grammatical Topics, volume 2. Oxford University Press.
- Dixon (2010b) Robert M. W. Dixon. 2010b. Basic Linguistic Theory: Methodology, volume 1. Oxford University Press.
- Dixon (2012) Robert M. W. Dixon. 2012. Basic Linguistic Theory: Further Grammatical Topics, volume 3. Oxford University Press.
- Dohare and Karnick (2017) Shibhansh Dohare and Harish Karnick. 2017. Text summarization using abstract meaning representation. CoRR, abs/1706.01678.
- Dyer et al. (2015) Chris Dyer, Miguel Ballesteros, Wang Ling, Austin Matthews, and Noah A. Smith. 2015. Transition-based dependency parsing with stack long short-term memory. In Proc. of ACL, pages 334–343.
- Hershcovich et al. (2017) Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. In Proc. of ACL, pages 1127–1138.
- Hershcovich et al. (2018) Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2018. Multitask parsing across semantic representations. In Proc. of ACL. To appear.
- Hinton (2002) Geoffrey E. Hinton. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800.
- Honnibal and Montani (2018) Matthew Honnibal and Ines Montani. 2018. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
- Issa et al. (2018) Fuad Issa, Marco Damonte, Shay B. Cohen, Xiaohui Yan, and Yi Chang. 2018. Abstract meaning representation for paraphrase detection. In Proc. of NAACL (to appear).
- Kiperwasser and Goldberg (2016) Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. TACL, 4:313–327.
- Liu et al. (2015) Fei Liu, Jeffrey Flanigan, Sam Thomson, Norman Sadeh, and Noah A. Smith. 2015. Toward abstractive summarization using semantic representations. In Proc. of NAACL-HLT, pages 1077–1086, Denver, Colorado.
- Maier and Lichte (2016) Wolfgang Maier and Timm Lichte. 2016. Discontinuous parsing with continuous trees. In Proc. of Workshop on Discontinuous Structures in NLP, pages 47–57.
- May (2016) Jonathan May. 2016. SemEval-2016 task 8: Meaning representation parsing. In Proc. of SemEval, pages 1063–1073.
- May and Priyadarshi (2017) Jonathan May and Jay Priyadarshi. 2017. SemEval-2017 task 9: Abstract Meaning Representation parsing and generation. In Proc. of SemEval, pages 536–545.
- Nivre et al. (2007) Joakim Nivre, Johan Hall, Jens Nilsson, Atanas Chanev, Gülsen Eryigit, Sandra Kübler, Svetoslav Marinov, and Erwin Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(02):95–135.
- Nivre et al. (2016) Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In Proc. of LREC.
- Oepen et al. (2016) Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinkova, Dan Flickinger, Jan Hajic, Angelina Ivanova, and Zdenka Uresova. 2016. Towards comparability of linguistic graph banks for semantic parsing. In Proc. of LREC.
- Oepen et al. (2015) Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Silvie Cinková, Dan Flickinger, Jan Hajič, and Zdeňka Urešová. 2015. SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In Proc. of SemEval, pages 915–926.
- Oepen et al. (2014) Stephan Oepen, Marco Kuhlmann, Yusuke Miyao, Daniel Zeman, Dan Flickinger, Jan Hajič, Angelina Ivanova, and Yi Zhang. 2014. SemEval 2014 task 8: Broad-coverage semantic dependency parsing. In Proc. of SemEval, pages 63–72.
- Ribeyre et al. (2014) Corentin Ribeyre, Eric Villemonte de la Clergerie, and Djamé Seddah. 2014. Alpage: Transition-based semantic graph parsing with syntactic features. In Proc. of SemEval, pages 97–103.
- Sulem et al. (2015) Elior Sulem, Omri Abend, and Ari Rappoport. 2015. Conceptual annotations preserve structure across translations: A French-English case study. In Proc. of S2MT, pages 11–22.
- Sulem et al. (2018) Elior Sulem, Omri Abend, and Ari Rappoport. 2018. Semantic structural evaluation for text simplification. In Proc. of NAACL (to appear).
- White et al. (2016) Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proc. of EMNLP, pages 1713–1723.