Named entity recognition (NER), entity linking (EL) and relation extraction (RE) are fundamental tasks in information extraction, and key components of numerous downstream applications, such as question answering (Yu et al., 2017) and knowledge base population (Ji and Grishman, 2011). Recent neural approaches based on pre-trained language models (e.g., BERT; Devlin et al., 2019) have shown impressive results for these tasks when fine-tuned on supervised datasets (Akbik et al., 2018; De Cao et al., 2021; Alt et al., 2019). However, annotated datasets for fine-tuning information extraction models are still scarce, even in a comparatively well-resourced language such as German (Benikova et al., 2014), and generally only contain annotations for a single task (e.g., for NER, CoNLL’03 German (Tjong Kim Sang and De Meulder, 2003) and GermEval 2014 (Benikova et al., 2014); for entity linking, GerNED (Ploch et al., 2012)). In addition, research in multi-task (Ruder, 2017) and joint learning (Sui et al., 2020) has shown that models can benefit from exploiting training signals of related tasks. To the best of our knowledge, the work of Schiersch et al. (2018) is the only dataset for German that includes two of the three tasks, namely NER and RE, in a single dataset.
In this work, we present MobIE, a German-language information extraction dataset which has been fully annotated for NER, EL, and n-ary RE. The dataset is based upon a subset of documents provided by Schiersch et al. (2018), but focuses on the domain of mobility-related events, such as traffic obstructions and public transport issues. Figure 1 displays an example traffic report with a Canceled Route event. All relations in our dataset are n-ary, i.e. consist of two or more arguments, some of which are optional. Our work expands the dataset of Schiersch et al. (2018) with the following contributions:
- We significantly extend the dataset with 1,686 annotated documents, more than doubling its size from 1,546 to 3,232 documents.
- We add entity linking annotations to geo-linkable entity types, with references to Open Street Map (https://www.openstreetmap.org/) identifiers, as well as geo-shapes.
- We implement an automatic labeling approach using the Snorkel framework (Ratner et al., 2017) to obtain additional high-quality, but weakly supervised, relation annotations.
The dataset setup allows for training and evaluating algorithms that aim for fine-grained typing of geo-locations, entity linking of these, as well as for n-ary relation extraction. The final dataset contains entity, linking, and relation annotations.
2 Data Collection and Annotation
2.1 Annotation Process
We collected German Twitter messages and RSS feeds based on a set of predefined search keywords and channels (radio stations, police and public transport providers) continuously from June 2015 to April 2019 using the crawlers and configurations provided by Schiersch et al. (2018), and randomly sampled documents from this set for annotation. The documents, including metadata, raw source texts, and annotations, are stored with a fixed document schema as AVRO (avro.apache.org) and JSONL files, but can be trivially converted to standard formats such as CoNLL. Each document was labeled iteratively, first for named entities and concepts, then for entity linking information, and finally for relations. For all manual annotations, documents are first annotated by a single trained annotator, and then the annotations are validated by a second expert. All annotations are labeled with their source, which makes it possible, for example, to distinguish manual from weakly supervised relation annotations (see Section 2.4).
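As a rough illustration of the conversion to a CoNLL-style format mentioned above, the following sketch turns character-offset entity spans into token-level BIO tags. The field names (`text`, `entities`, `start`, `end`, `type`), the whitespace tokenization, and the entity type labels in the example are assumptions made for this illustration; they are not taken from the actual MobIE document schema.

```python
import re
from typing import Dict, List, Tuple


def to_conll(doc: Dict) -> List[Tuple[str, str]]:
    """Convert character-offset entity spans to (token, BIO tag) pairs.

    Assumed (hypothetical) layout: doc["text"] is the raw text and
    doc["entities"] is a list of {"start", "end", "type"} dicts with
    end-exclusive character offsets.
    """
    rows = []
    for match in re.finditer(r"\S+", doc["text"]):  # naive whitespace tokenization
        tag = "O"
        for ent in doc["entities"]:
            # A token belongs to an entity if it lies fully inside the span.
            if ent["start"] <= match.start() and match.end() <= ent["end"]:
                prefix = "B" if match.start() == ent["start"] else "I"
                tag = f"{prefix}-{ent['type']}"
                break
        rows.append((match.group(), tag))
    return rows


# Toy document; offsets and type labels are illustrative only.
example = {
    "text": "A2 Dortmund Richtung Hannover 2 km Stau",
    "entities": [
        {"start": 0, "end": 2, "type": "location-street"},
        {"start": 3, "end": 11, "type": "location-city"},
        {"start": 35, "end": 39, "type": "trigger"},
    ],
}

for token, tag in to_conll(example):
    print(f"{token}\t{tag}")
```

A real conversion would need the tokenizer used for annotation and a policy for nested coarse- and fine-grained types, both of which this sketch leaves out.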
Table 3 lists entity types of the mobility domain that are annotated in our corpus. All entity types except for event_cause originate from the corpus of Schiersch et al. (2018). The main characteristics of the original annotation scheme are the usage of coarse- and fine-grained entity types (e.g., organization, organization-company, location, location-street), as well as trigger entities for phrases which indicate annotated relations, e.g., “Stau” (“traffic jam”). We introduce a minor change by adding a new entity type label event_cause, which serves as a label for concepts that do not explicitly trigger an event, but indicate its potential cause, e.g., “technische Störung” (“technical problem”) as a cause for a Delay event.
2.3 Entity Linking
In contrast to the original corpus, our dataset includes entity linking information. We use Open Street Map (OSM) as our main knowledge base (KB), since many of the geo-entities, such as streets and public transport routes, are not listed in standard KBs like Wikidata. We link all geo-locatable entities, i.e. organizations and locations, to their KB identifiers, and to external identifiers (Wikidata) where possible. We include geo-information as an additional source of ground truth whenever a location is not available in OSM. (This is mainly the case for location-route and location-stop entities, which are derived from proprietary KBs of Deutsche Bahn and Rhein-Main-Verkehrsverbund; standardized ids for these entity types, e.g. DLID/DHID, were not yet available at the time of creation of this dataset.) Geo-information is provided as points and polygons in WKB format (https://www.ogc.org/standards/sfa).
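To make the WKB encoding concrete, the sketch below decodes a 2D WKB Point using only the Python standard library. It follows the standard WKB layout (one byte-order byte, a 32-bit geometry type, two 64-bit floats); the coordinates are made up for illustration, and polygons or other geometry types are deliberately not handled here.

```python
import struct
from typing import Tuple


def parse_wkb_point(data: bytes) -> Tuple[float, float]:
    """Decode a 2D WKB Point into an (x, y) tuple."""
    # Byte 0 is the byte-order flag: 0 = big-endian, 1 = little-endian.
    endian = "<" if data[0] == 1 else ">"
    # Next come a uint32 geometry type (1 = Point) and two float64 coordinates.
    geom_type, x, y = struct.unpack(endian + "Idd", data[1:21])
    if geom_type != 1:
        raise ValueError(f"expected WKB Point (type 1), got type {geom_type}")
    return x, y


# Build a little-endian WKB point for illustration (coordinates are made up).
wkb = struct.pack("<BIdd", 1, 1, 13.5, 52.25)
print(parse_wkb_point(wkb))  # (13.5, 52.25)
```

In practice a geometry library would typically be used for polygons; the manual decoding above is only meant to show what the binary format contains.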
Figure 2 shows the annotation tool used for entity linking. The tool displays the document’s text, lists all annotated geo-location entities along with their types, and shows a list of retrieved KB candidates. The annotator first checks the quality of the entity type annotation, and may label the entity as incorrect if applicable. Then, for each valid entity, the annotator either labels one of the candidates shown on the map as correct, or selects missing if none of the candidates is correct.
|Relation|Arguments|
|---|---|
|Canceled Stop|default-args, route|
|Rail Repl. Serv.|default-args, delay|
|Traffic Jam|default-args, delay, jam-length|
Table 1 lists relation types and their arguments. The relation set focuses on events that may negatively impact traffic flow, such as Traffic Jams and Accidents. All relations have a set of required and optional arguments, and are labeled with their annotation source, i.e., human or weakly-supervised. Different relations may co-occur in a single sentence, e.g. Accidents may cause Traffic Jams, which are often reported together.
Human annotation. The annotation in Schiersch et al. (2018) is performed manually. Annotators labeled only explicitly expressed relations where all arguments occurred within a single sentence. The authors report an inter-annotator agreement of (Cohen’s κ) for relations.
Automatic annotation with Snorkel.
To reduce the amount of labor required for relation annotation, we explored an automatic, weakly supervised labeling approach. Our intuition is that due to the formulaic nature of texts in the traffic report domain, weak heuristics that exploit the combination of trigger key phrases and specific location types provide a good signal for relation labeling. For example, “A2 Dortmund Richtung Hannover 2 km Stau” is easily identified as a Traffic Jam relation mention due to the occurrence of the “Stau” trigger in combination with the road name “A2”.
We use the Snorkel weak labeling framework (Ratner et al., 2017). Snorkel unifies multiple weak supervision sources by modeling their correlations and dependencies, with the goal of reducing label noise (Ratner et al., 2016). Weak supervision sources are expressed as labeling functions (LFs), and a label model combines the votes of all LFs weighted by their estimated accuracies and outputs a set of probabilistic labels (see Figure 3).
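The idea of labeling functions whose votes are combined by accuracy can be sketched in plain Python as follows. This is a simplified stand-in and does not use the Snorkel API: the three LFs are invented examples, the accuracies are fixed by hand rather than estimated from unlabeled data, and the combination is a naive log-odds weighted vote, not Snorkel's label model.

```python
import math
from typing import Callable, Dict, List

ABSTAIN = -1
LABELS = ["Traffic Jam", "Accident"]  # illustrative subset of relation types


def lf_stau_keyword(text: str) -> int:
    """Vote Traffic Jam if the trigger phrase 'Stau' occurs."""
    return 0 if "stau" in text.lower() else ABSTAIN


def lf_unfall_keyword(text: str) -> int:
    """Vote Accident if the trigger phrase 'Unfall' occurs."""
    return 1 if "unfall" in text.lower() else ABSTAIN


def lf_length_mention(text: str) -> int:
    """Vote Traffic Jam if a length in km is mentioned (weak heuristic)."""
    return 0 if " km " in f" {text.lower()} " else ABSTAIN


# Hypothetical LF accuracies; a label model would estimate these instead.
LFS: Dict[Callable[[str], int], float] = {
    lf_stau_keyword: 0.9,
    lf_unfall_keyword: 0.85,
    lf_length_mention: 0.6,
}


def probabilistic_label(text: str) -> List[float]:
    """Combine LF votes into a distribution over LABELS via log-odds weights."""
    scores = [0.0] * len(LABELS)
    for lf, acc in LFS.items():
        vote = lf(text)
        if vote != ABSTAIN:
            scores[vote] += math.log(acc / (1.0 - acc))
    # A softmax turns the accumulated scores into probabilities.
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


probs = probabilistic_label("A2 Dortmund Richtung Hannover 2 km Stau")
print(dict(zip(LABELS, probs)))
```

When every LF abstains, the sketch returns a uniform distribution, which makes uncovered examples easy to filter out before training.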
We implement LFs for the relation classification of trigger concepts and role classification of trigger-argument concept pairs. The output is used to reconstruct n-ary relation annotations. Trigger classification LFs include keyword list checks as well as examining contextual entity types. Argument role classification LFs are inspired by Chen and Ji (2009), and include distance heuristics, entity type of the argument, event type output of the trigger labeling functions, context words of the argument candidate, and relative position of the entity to the trigger. We trained the Snorkel label model on all unlabeled documents in the dataset that contained at least one trigger entity (690 documents). The probabilistic relation type and argument role labels were then combined into n-ary relation annotations.
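One way the combination of probabilistic trigger and role labels into an n-ary relation could look is sketched below. The decision rule (a 0.5 threshold, keeping the most confident entity per role), the `no_role` label, and the role names `location` and `direction` are assumptions of this example; the paper does not specify how the combination is carried out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Relation:
    relation_type: str
    trigger: str
    arguments: Dict[str, str] = field(default_factory=dict)


def build_relation(
    trigger: str,
    type_probs: Dict[str, float],
    role_probs: List[Tuple[str, Dict[str, float]]],
    threshold: float = 0.5,
) -> Optional[Relation]:
    """Assemble one n-ary relation from probabilistic trigger and role labels.

    type_probs maps relation types to probabilities for the trigger;
    role_probs holds (entity, {role: probability}) per trigger-argument pair.
    """
    rel_type, p_type = max(type_probs.items(), key=lambda kv: kv[1])
    if p_type < threshold:
        return None  # trigger label too uncertain to form a relation
    relation = Relation(rel_type, trigger)
    best: Dict[str, float] = {}
    for entity, probs in role_probs:
        role, p_role = max(probs.items(), key=lambda kv: kv[1])
        if role == "no_role" or p_role < threshold:
            continue
        # Keep only the most confident entity for each role.
        if p_role > best.get(role, 0.0):
            best[role] = p_role
            relation.arguments[role] = entity
    return relation


# Toy probabilities for the example sentence (values are made up).
rel = build_relation(
    "Stau",
    {"Traffic Jam": 0.93, "Accident": 0.07},
    [
        ("A2", {"location": 0.8, "no_role": 0.2}),
        ("Dortmund", {"location": 0.6, "no_role": 0.4}),
        ("Hannover", {"direction": 0.7, "no_role": 0.3}),
        ("2 km", {"jam-length": 0.9, "no_role": 0.1}),
    ],
)
print(rel)
```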
We verified the performance of the Snorkel model using a randomly selected development subset of 55 documents with human-annotated relations. On this dev set, Snorkel-assigned trigger class labels achieved an F1-score of (Accuracy: ), and role labeling of trigger-argument pairs had an F1-score of (Accuracy: ). This confirms our intuition that for the traffic report domain, weak labeling functions can provide useful supervision signals.
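For readers who want to run a comparable check on their own label model output, the sketch below computes F1 and accuracy from gold and predicted labels. The averaging scheme (micro-averaged over all labels except a designated negative label `none`) is an assumption, since the text does not state which variant was used, and the labels in the example are toy data.

```python
from typing import List, Tuple


def f1_and_accuracy(
    gold: List[str], pred: List[str], negative: str = "none"
) -> Tuple[float, float]:
    """Micro-averaged F1 over non-negative labels, plus plain accuracy."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # True positives: prediction matches gold and is not the negative label.
    tp = sum(g == p != negative for g, p in zip(gold, pred))
    pred_pos = sum(p != negative for p in pred)
    gold_pos = sum(g != negative for g in gold)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold) if gold else 0.0
    return f1, accuracy


# Toy labels for illustration only.
gold_labels = ["Traffic Jam", "Accident", "none", "Traffic Jam"]
pred_labels = ["Traffic Jam", "none", "none", "Accident"]
f1, acc = f1_and_accuracy(gold_labels, pred_labels)
print(f"F1={f1:.2f} accuracy={acc:.2f}")
```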
3 Dataset Statistics
We report the statistics of the MobIE dataset in Table 2. The majority of documents originate from Twitter, but RSS messages are longer on average, and typically contain more annotations (e.g., entities/doc versus entities/doc for Twitter). The annotated corpus is provided with a standardized Train/Dev/Test split. To ensure a high data quality for evaluating event extraction, we include only documents with manually annotated events in the Test split.
Table 3 lists the distribution of entity annotations in the dataset, Table 4 the distribution of linked entities. Of the annotated entities covering 20 entity types, organization* and location* entities are linked, either to a KB reference id, or marked as NIL. The remaining entities are non-linkable types, such as time and date expressions. The fraction of NILs among linkable entities is % overall, but varies significantly with entity type. Locations that could not be assigned to a specific subtype are more often resolved as NIL. A large fraction of these are highway exits (e.g. “Pforzheim-Ost”) and non-German locations, which were not included in the subset of OSM integrated in our KB. In addition, candidate retrieval for organizations often returned no viable candidates, especially for non-canonical name variants used in tweets.
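The per-type NIL fraction discussed above can be recomputed from the linking annotations with a few lines of code. In this sketch each linked entity is represented as an `(entity_type, kb_id)` pair with `None` standing for NIL; that representation and the identifiers in the example are hypothetical and only serve to show the computation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def nil_fraction_by_type(
    links: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, float]:
    """Return the share of NIL links for each entity type."""
    totals: Dict[str, int] = defaultdict(int)
    nils: Dict[str, int] = defaultdict(int)
    for entity_type, kb_id in links:
        totals[entity_type] += 1
        if kb_id is None:  # None marks an entity resolved as NIL
            nils[entity_type] += 1
    return {t: nils[t] / totals[t] for t in totals}


# Toy linking annotations; identifiers are made up.
toy_links = [
    ("location-street", "osm:1"),
    ("location-street", None),
    ("location", None),
    ("organization", "osm:2"),
]
print(nil_fraction_by_type(toy_links))
```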
The dataset contains annotated traffic events, manually annotated and obtained via weak supervision. Table 5 shows the distribution of relation types. Canceled Stop and Rail Replacement Service relations occur less frequently in our data than the other relation types, and Obstruction is the most frequent class.
|# entities||# KB||# NIL|
|Rail Replacement Service||71||27||98|
4 Conclusion
We presented a dataset for named entity recognition, entity linking and relation extraction in German mobility-related social media texts and traffic reports. Although not as large as some popular task-specific German datasets, the dataset is, to the best of our knowledge, the first German-language dataset that combines annotations for NER, EL and RE, and thus can be used for joint and multi-task learning of these fundamental information extraction tasks. The dataset is freely available under a CC-BY 4.0 license at https://github.com/dfki-nlp/mobie.
Acknowledgments
We would like to thank Elif Kara, Ursula Strohriegel and Tatjana Zeen for the annotation of the dataset. This work has been supported by the German Federal Ministry of Transport and Digital Infrastructure as part of the project DAYSTREAM (01MD19003E), and by the German Federal Ministry of Education and Research as part of the project CORA4NLP (01IW20010).
References
- Akbik et al. (2018) Alan Akbik, Duncan Blythe, and Roland Vollgraf. 2018. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1638–1649, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Alt et al. (2019) Christoph Alt, Marc Hübner, and Leonhard Hennig. 2019. Improving Relation Extraction by Pre-trained Language Representations. In Proceedings of AKBC 2019, pages 1–18, Amherst, Massachusetts.
- Benikova et al. (2014) Darina Benikova, Chris Biemann, and Marc Reznicek. 2014. NoSta-D Named Entity Annotation for German: Guidelines and Dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 2524–2531, Reykjavik, Iceland. European Language Resources Association (ELRA). ACL Anthology Identifier: L14-1251.
- Chen and Ji (2009) Zheng Chen and Heng Ji. 2009. Language specific issue and feature exploration in Chinese event extraction. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 209–212, Boulder, Colorado. Association for Computational Linguistics.
- De Cao et al. (2021) Nicola De Cao, Gautier Izacard, Sebastian Riedel, and Fabio Petroni. 2021. Autoregressive Entity Retrieval. In Proceedings of ICLR 2021. ArXiv: 2010.00904.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Ji and Grishman (2011) Heng Ji and Ralph Grishman. 2011. Knowledge Base Population: Successful Approaches and Challenges. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1148–1158, Portland, Oregon, USA. Association for Computational Linguistics.
- Ploch et al. (2012) Danuta Ploch, Leonhard Hennig, Angelina Duka, Ernesto William De Luca, and Sahin Albayrak. 2012. GerNED: A German corpus for named entity disambiguation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 3886–3893, Istanbul, Turkey. European Language Resources Association (ELRA).
- Ratner et al. (2017) Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid Training Data Creation with Weak Supervision. Proceedings of the VLDB Endowment, 11(3):269–282. ArXiv: 1711.10160.
- Ratner et al. (2016) Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. 2016. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
- Ruder (2017) Sebastian Ruder. 2017. An Overview of Multi-Task Learning in Deep Neural Networks. arXiv:1706.05098 [cs, stat].
- Schiersch et al. (2018) Martin Schiersch, Veselina Mironova, Maximilian Schmitt, Philippe Thomas, Aleksandra Gabryszak, and Leonhard Hennig. 2018. A German corpus for fine-grained named entity recognition and relation extraction of traffic and industry events. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
- Sui et al. (2020) Dianbo Sui, Yubo Chen, Kang Liu, Jun Zhao, Xiangrong Zeng, and Shengping Liu. 2020. Joint Entity and Relation Extraction with Set Prediction Networks. arXiv:2011.01675 [cs], version 2.
- Tjong Kim Sang and De Meulder (2003) Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147.
- Yu et al. (2017) Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Improved Neural Relation Detection for Knowledge Base Question Answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 571–581, Vancouver, Canada. Association for Computational Linguistics.