Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora

07/26/2018 · by Sajawel Ahmed, et al. · IG Farben Haus

This study improves the performance of neural named entity recognition (NER) for German by a margin of up to 11% F-score, thereby outperforming existing baselines and establishing a new state-of-the-art on every open-source dataset. Rather than designing deeper and wider hybrid neural architectures, we gather all available resources and perform detailed optimization and grammar-dependent morphological processing, consisting of lemmatization and part-of-speech tagging, before exposing the raw data to any training process. We test our approach in a threefold monolingual experimental setup of a) single, b) joint, and c) optimized training, and shed light on the dependency of downstream tasks on the size of the corpora used to compute word embeddings.


I Introduction

Named Entity Recognition (NER) is a crucial part of various Natural Language Processing (NLP) tasks like entity linking, relation extraction, machine reading and, ultimately, Question Answering (QA). With the recent rise of neural networks, much emphasis has been put on high-resource languages like English or Chinese, leading to fast advancements on many foundational tasks, in particular NER, which in many areas reaches near-human performance for these languages [1, 2]. However, for other, lower-resource languages like German, neural NER did not attract similar attention from the deep learning community, leading to lower performance by a margin of up to 11% F-score.

In this paper, we look for the reasons and take steps towards addressing them. Taking German as an example, we bridge the current gap between the performance of neural NER for different languages and push the performance to a new state-of-the-art. We report evidence that the inferior quality of German text data and its small size are the major reasons for the observed lack of progress.

To tackle this problem, we use a larger corpus for training the foundational word embeddings, namely Leipzig40 [3] (including the whole German Wikipedia up to 2016) combined with the WMT 2010 German monolingual training data [4], and contrast its use with the COW corpus [5], the largest collection of German texts extracted from web documents, with over 617 million sentences. Moreover, we bring together all openly available annotated NER datasets for German, and prepare and merge them to increase the amount of final training data. This includes the major NER datasets of CoNLL-2003 [6] and GermEval-2014 [7], and the smaller datasets of Europarl-2010 [8] and EuropeanaNewspapers-2016 [9]. To this collection, we add the dataset of the Tübingen Treebank (TüBa-D/Z) [10], which, to the knowledge of the authors, is utilized for the first time for the task of neural NER.

Making models openly accessible is an increasingly common scientific practice. New models appear almost daily, for example in the Deep Learning (DL) community. As a consequence, modifying existing models and trying out different hybrid setups is becoming a scientific practice involving more and more scientists. This is advantageous, since attempts to improve existing models can contribute to their validation. However, it is often forgotten that data is the gold of science: it is the availability of resources such as CoNLL, SNLI [11] and SQuAD [12] for the tasks of NER, natural language inference and QA that leads to significant improvements and stands behind the recent success of neural networks in NLP. Therefore, it is important to take all available resources into account, to annotate them according to the task, and to optimize them where necessary. This task is often time-consuming and costly.

The present paper assesses the impact of resources on NER, taking the rather low-resource language German as an example. We show the influence of different training sets on the performance of neural NER, of different combinations of these datasets, and, above all, of different levels of their preprocessing. We address the aspect of resource optimization with regard to lemmatization and Part-of-Speech (POS) tagging and analyze their influence alongside the training of word embeddings and task-specific neural networks. Our main finding is: an increase in size and quality of the (task-independent) word embedding corpus and of the (task-specific) training dataset leads to a significant improvement on sequence labeling tasks like NER, which can be larger than a mere amendment of the underlying neural architecture. For the future of neural NER for less- or low-resource languages this means: collecting unlabeled corpora for training morphology-aware, high-quality embeddings is a good alternative for increasing the performance of downstream tasks.

The remainder of the paper is organized as follows: Section 2 reviews related work, Section 3 presents a sketch of the underlying model, Section 4 describes our threefold experimental setup of a) single, b) joint, and c) resource optimized training, Section 5 reports and discusses our results, and, finally, Section 6 draws a conclusion.

II Related Work

Compared to high-resource languages, less emphasis has been put on the task of neural NER for German. Noteworthy work has so far been done only by [13] on GermEval and by [1] on CoNLL; both will be used as baselines here. Reimers et al. [13] were among the first to apply neural networks to German NER. However, they did not consider GermEval in combination with CoNLL. Apart from them, the remaining studies (predominantly conducted by non-native speakers) treat this task as a side product of dealing with various other languages. In this way, the state-of-the-art on German neural NER was established by [1] in 2016.

Gillick et al. [14] consider German as one variant in a multilingual training setup, additionally using the datasets of two Germanic languages (English and Dutch) and one Romance language (Spanish) from the CoNLL shared task; as a result, they reach 76.22% F-score. However, for single training on the German part of CoNLL they stay below the results of [13].

From the point of view of resource optimization, the recent work of [15] is worth mentioning. Klimek et al. also observe the gap between languages and therefore carry out a detailed analysis of the difficulties of the German NER task, using the GermEval dataset as an example. They come to the conclusion that “the task of German NER could benefit from integrating morphological processing” [15]. This is the starting point of our analysis: we apply our morphological processing approach to all text corpora and NER datasets.

III Model

Our neural model consists of two separately trained components: a) foundational word embeddings, modeling general knowledge from large unlabeled text corpora, and b) a task-specific neural network, modeling domain knowledge from the labeled training data. In this section, both components are presented briefly.

Word Embeddings

The language model of continuous-space word representations (word2vec) [16] and its variations [17, 18] are the foundation of most ongoing research in NLP with neural networks. Based on the context, the model embeds words, phrases or sentences into high-dimensional vector spaces. In such a space, the semantics of associations of words and phrases are captured to such an extent that algebraic operations lead to meaningful relationships (e.g. vec(king) − vec(man) + vec(woman) ≈ vec(queen) [16]). This property is immensely useful for our application. We use the word2vec model and its extension wang2vec [19], which exploits syntactic information and thus better suits the task of NER.

Neural Model

We give a brief sketch of the neural LSTM-CRF model which we use throughout this paper. The model is similar to the one used in [1], which goes back to the works of [20, 21, 22]. It consists of stacked LSTM and CRF layers. The base layer is made of two parts: (i) a preprocessing sublayer generating the character-based embeddings with a cell of forward and backward LSTMs (biLSTM) [23], and the word embeddings from the input sentence, (ii) followed by an encoding sublayer, again with a biLSTM cell, extracting features and generating compressed hidden representations. The prediction layer is made of a CRF and takes the previous hidden representations to finally produce the Named Entity (NE) tag predictions.

Let $s = (w_1, \ldots, w_n)$ be the list of words of a sentence from the input corpus of texts. Furthermore, let $c^{(i)} = (c^{(i)}_1, \ldots, c^{(i)}_{m_i})$ be the list of characters of the word $w_i$, consisting of $m_i$ characters with $c^{(i)}_j$ being its $j$-th character. For a given word $w_i$ and its NE-tag (gold label) $y_i \in$ {PER, LOC, ORG, MISC, O}, the data flow within the neural network is as follows:

$e^{(i)}_j = \mathrm{char2vec}(c^{(i)}_j)$   (1)
$h^{\mathrm{char}}_i = \mathrm{biLSTM}(e^{(i)}_1, \ldots, e^{(i)}_{m_i})$   (2)
$x_i = [\mathrm{word2vec}(w_i) ; h^{\mathrm{char}}_i]$   (3)
$(h_1, \ldots, h_n) = \mathrm{biLSTM}(x_1, \ldots, x_n)$   (4)
$(\hat{y}_1, \ldots, \hat{y}_n) = \mathrm{CRF}(h_1, \ldots, h_n)$   (5)

where char2vec is a (randomly initialized) lookup table embedding all characters into a corresponding vector space, and $x_i$ is the concatenation of the embedding vector of word $w_i$ and its character-based hidden representation $h^{\mathrm{char}}_i$. The model is trained to predict the NE-tag of each word after seeing the whole input sentence at once.
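For illustration, the following is a minimal PyTorch sketch of this data flow; it is not the exact implementation of [1], all names and dimensions are illustrative, and the CRF layer is abstracted as a per-token linear scoring layer (a full CRF adds transition scores and Viterbi decoding):

import torch
import torch.nn as nn

class LstmCrfSketch(nn.Module):
    """Char-biLSTM + word-biLSTM tagger; the CRF on top is replaced
    by a linear scoring layer for brevity."""
    def __init__(self, n_chars, n_words, n_tags,
                 char_dim=25, word_dim=100, lstm_dim=100):
        super().__init__()
        self.char2vec = nn.Embedding(n_chars, char_dim)   # randomly initialized lookup table
        self.word2vec = nn.Embedding(n_words, word_dim)   # initialized from pretrained vectors
        self.char_lstm = nn.LSTM(char_dim, char_dim,
                                 bidirectional=True, batch_first=True)
        self.word_lstm = nn.LSTM(word_dim + 2 * char_dim, lstm_dim,
                                 bidirectional=True, batch_first=True)
        self.scores = nn.Linear(2 * lstm_dim, n_tags)     # stand-in for the CRF layer

    def forward(self, word_ids, char_ids):
        # char_ids: (n_words, max_word_len) -> one char-based vector per word, Eq. (1)-(2)
        _, (h_char, _) = self.char_lstm(self.char2vec(char_ids))
        char_repr = torch.cat([h_char[0], h_char[1]], dim=-1)        # fwd+bwd final states
        x = torch.cat([self.word2vec(word_ids), char_repr], dim=-1)  # concatenation, Eq. (3)
        h, _ = self.word_lstm(x.unsqueeze(0))                        # sentence encoding, Eq. (4)
        return self.scores(h.squeeze(0))                             # tag scores, cf. Eq. (5)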

IV Experimental Setup

IV-A Datasets

In order to evaluate the model of Section III for neural NER on German data, we put emphasis on the major datasets of CoNLL (German part) and GermEval. However, further German resources are available that have so far gone unnoticed in the DL community. In Table I, we gather all of these NER datasets that are freely accessible to date and list them along with their numbers of sentences. Additionally, for each dataset the total number of NE tokens is provided across the four categories defined in the CoNLL-2003 shared task (CoNLL format). Table I shows that the TüBa-D/Z dataset is by far the largest, both in terms of the number of sentences and of tokens, ideally fitting the needs of deep neural networks.

Corpus               Sent.     PER      LOC      ORG      MISC
CoNLL-2003           18,024    8,309    7,864    7,621    4,748
Europarl-2010        4,395     514      724      874      966
GermEval-2014        31,300    16,204   16,675   12,885   9,254
Europ.Newsp.-2016    8,879     7,914    6,143    2,784    3
TüBa-D/Z-2018        104,787   55,746   28,582   32,224   12,865
TABLE I: NER Datasets

Preprocessing of Training Data

Apart from CoNLL, most corpora had to be further processed to fit the CoNLL format. For GermEval, we consider only the top-level NEs, disregarding nested NEs to stay in line with the remaining datasets. As the tagging scheme, we prefer the BIO (IOB2) scheme, as it has been shown to perform better [24]. All datasets are given in the BIO scheme, except CoNLL (IOB1) and Europarl (IOB1), which we converted into the target scheme.
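The IOB1-to-BIO conversion can be sketched as follows (a minimal Python version operating on the tag column alone):

def iob1_to_bio(tags):
    """Convert an IOB1 tag sequence to BIO (IOB2): every entity-initial
    token gets a B- prefix, not only tokens that split two adjacent
    entities of the same type."""
    bio = []
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            prev = tags[i - 1] if i > 0 else "O"
            # entity start: previous token is outside or of a different type
            if prev == "O" or prev[2:] != tag[2:]:
                tag = "B-" + tag[2:]
        bio.append(tag)
    return bio

assert iob1_to_bio(["I-PER", "I-PER", "O", "I-LOC"]) == ["B-PER", "I-PER", "O", "B-LOC"]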

For EuropeanaNewspapers, we take the two datasets written in standard German orthography, namely enp_DE.lft.bio and enp_DE.sbb.bio, based on historic newspapers from the Dr. Friedrich Tessmann Library and the Berlin State Library, respectively, and omit the Austrian historic newspapers, whose orthography differs heavily from the former samples. The original dataset is not provided in the 4-column CoNLL format, which lists each word of a sentence on a separate line along with its lemma, POS tag and NE label, and separates sentences by an empty line. Therefore, we convert the data into our target format using spaCy v2.0 (http://spacy.io), which in its recent release supports preprocessing German texts by providing language models for sentence boundary detection, lemmatization and POS tagging.
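A sketch of this conversion with spaCy (the model name and the ne_labels mapping from token index to NE tag are placeholders; the exact lemmas and tags depend on the loaded pipeline):

import spacy

nlp = spacy.load("de_core_news_sm")  # placeholder German pipeline

def to_conll(text, ne_labels):
    """Emit 4-column CoNLL lines (word, lemma, POS tag, NE label),
    separating sentences by an empty line."""
    lines = []
    for sent in nlp(text).sents:
        for tok in sent:
            lines.append(f"{tok.text}\t{tok.lemma_}\t{tok.tag_}\t{ne_labels.get(tok.i, 'O')}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)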

For TüBa-D/Z, we extracted the NE tags from the tuebadz-11.0-conll2010 version. In the case of nested NEs, we use a filtering heuristic to extract the longest spanning NE, which gives us more robust training data by not splitting well-known entities into parts (e.g. [Goethe Universität Frankfurt]_ORG vs. [Goethe]_PER Universität [Frankfurt]_LOC). We converted the tagging scheme of TüBa-D/Z to our target format. Lastly, to allow comparisons with the other NER datasets, we mapped the NE category Geo-Political Entity (GPE) to LOC.
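A sketch of such a filtering heuristic over (start, end, label) token spans (names are illustrative):

def longest_spans(spans):
    """Keep only the longest NE span among nested annotations by
    dropping every span contained in a longer one."""
    spans = sorted(spans, key=lambda s: s[0] - s[1])  # longest first
    kept = []
    for start, end, label in spans:
        if not any(ks <= start and end <= ke for ks, ke, _ in kept):
            kept.append((start, end, label))
    return kept

# [Goethe Universität Frankfurt]_ORG wins over the nested PER/LOC spans
assert longest_spans([(0, 3, "ORG"), (0, 1, "PER"), (2, 3, "LOC")]) == [(0, 3, "ORG")]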

Data Splitting & Merging

For CoNLL and GermEval, we use the splits as provided in the original datasets. Further, we split TüBa-D/Z into train/dev/test sets according to the common ratio of 80/10/10 percent. Due to the smaller size of the Europarl and EuropeanaNewspapers datasets, we did not consider them for the first experimental setup of single training; rather, we merged them with the training data for the second experimental setup of joint training. For this setup, we aligned all datasets by mapping the NE category OTH to MISC to fit the CoNLL format. In this way, we generated the currently largest training dataset for German NER, with a size of 133,258 sentences (CoNLL (12,152) + GermEval (24,000) + Europarl (4,395) + EuropeanaNewspapers (8,879) + TüBa-D/Z (83,832)).
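A sketch of the splitting and alignment steps, assuming sentences are lists of (token, tag) pairs:

import random

def split_80_10_10(sentences, seed=42):
    """Shuffle and split a dataset into train/dev/test at 80/10/10."""
    random.Random(seed).shuffle(sentences)
    a, b = int(0.8 * len(sentences)), int(0.9 * len(sentences))
    return sentences[:a], sentences[a:b], sentences[b:]

def align_to_conll(sentence):
    """Map the NE category OTH to MISC to fit the CoNLL format."""
    return [(tok, tag.replace("OTH", "MISC")) for tok, tag in sentence]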

IV-B Word Embeddings

German is a highly inflected language compared to English or Chinese, whose syntax is more analytic. For languages like German, the embedding of a single word (e.g. klein) is dispersed across its various morphological and spelling variants (stem klein: kleiner, kleinste, kleine, kleines, kleinen, kleinem, Klein, etc.), thereby reducing the number of its samples and weakening its information value if it is not lemmatized appropriately. Languages with a rather analytic syntax, on the other hand, show such morphological variants to a lesser extent, if at all. We assume that this difference is the reason why their embeddings are of higher quality and, therefore, their performance in downstream tasks is considerably higher than in less analytic languages.

Corpus             Sentences
Leipzig40-2018     40.00 million
WMT-2010-German    19.36 million
COW-2016           617.28 million
TABLE II: Text Corpora

In order to mitigate the negative effect of this factor for German, we need embeddings of higher quality. In the experimental setup of single training, we tackle this by using more text data. Table II lists the corpora we use for training our word embeddings. Leipzig40-2018 contains the largest possible extract from the so-called Leipzig Corpora Collection in 2018, which was generated by its maintainers on demand for our study, omitting any duplicate sentences. To increase the corpus size, we combine this extract with WMT-2010-German, forming our so-called LeipzigMT corpus. Besides, we consider the COW-2016 corpus, arguably the largest text collection for German. This corpus is not limited to textbook-like language as found, for example, in Wikipedia. Therefore, we assume that it fits well with the NER datasets used here, which in turn come from various sources (news, web, wikis, etc.). Both corpora are already preprocessed and split into sentences containing words, numbers and punctuation. We do not remove punctuation marks, but separate them from words and numbers by surrounding them with spaces, to avoid introducing token variants that differ only in attached punctuation. In addition, as a preprocessing step, we write all words in lowercase to account for spelling and morphological variations.
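A minimal sketch of this normalization step:

import re

def normalize(sentence):
    """Separate punctuation from words and numbers by surrounding it
    with spaces, then lowercase everything to merge spelling variants."""
    sentence = re.sub(r"([^\w\s])", r" \1 ", sentence)
    return " ".join(sentence.lower().split())

assert normalize("Kleine Kinder sind mutiger.") == "kleine kinder sind mutiger ."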

In the third variant of our experiment, we deepen the optimization of resources by taking lemmatization and POS tagging into account, in combination with lowercasing. While lemmatization increases the observation frequency of words, POS tagging allows a more precise specification of their syntactic roles in sentences and consequently differentiates the individual observations that enter the calculation of the embeddings. Lowercasing, in turn, removes ambiguities, as they are induced in German especially by capitalization at the beginning of sentences. Table III shows the variants we use for this setup.

We apply lemmatization and POS tagging in combination with lowercasing to all resources before they are used in training. Exactly the same conversions are applied to the NER datasets of the respective experiment, to avoid mismatches and to increase the overlap with the trained embeddings. Again, we use spaCy and its language models for lemmatization and POS tagging. Listing 1 shows an example of this approach.

raw sentence  : Kleine Kinder sind mutiger.
lemma         : Klein Kind sein mutig .
lemmapos      : Klein_ADJA Kind_NN sein_VAFIN mutig_ADJD ._$.
lemmapos_lower: klein_ADJA kind_NN sein_VAFIN mutig_ADJD ._$.
Listing 1: Example for Lemma & POS

These conversions are intended to standardize any text input and thus to resolve the above-mentioned problems with morphological variation.
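The variants of Listing 1 can be generated along the following lines (a sketch; the spaCy model name is a placeholder, and the exact lemmas and STTS tags depend on the loaded pipeline):

import spacy

nlp = spacy.load("de_core_news_sm")  # placeholder German pipeline

def embedding_variant(text, pos=False, lower=False):
    """Produce the lemma / lemmapos / *_lower input variants of Listing 1."""
    tokens = []
    for tok in nlp(text):
        lemma = tok.lemma_.lower() if lower else tok.lemma_
        tokens.append(f"{lemma}_{tok.tag_}" if pos else lemma)
    return " ".join(tokens)

# expected shape of the output, cf. Listing 1:
# embedding_variant("Kleine Kinder sind mutiger.", pos=True, lower=True)
#   -> "klein_ADJA kind_NN sein_VAFIN mutig_ADJD ._$."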

Experimental Setup    Variant   Features
Single Training       1         lower
Joint Training        1         lower
                      2         lemma
Optimized Training    3         lemma_lower
                      4         lemmapos
                      5         lemmapos_lower
TABLE III: Embedding Variants per Experimental Setup

IV-C Training Parameters

To remain comparable with the baseline models on CoNLL [1] and GermEval [13], we train the word embeddings with dimension 100 (Lample et al. [1] use dimension 100 for English, but 64 for German; we increase this dimension to close the gap), a window size of 8 and a minimum word count threshold of 4, consequently setting the LSTM dimension to 100 as well (for word2vec, we performed an extensive search over numerous embeddings with varying dimension, minimum word count threshold and window size values; however, no major differences were observed in the final results). We choose dimension 25 for the character-based embeddings and the final CRF layer, and train the network for 100 epochs with a batch size of 1 and a dropout rate of 0.5. As optimization method, we use stochastic gradient descent with a learning rate of 0.005. Apart from setting the LSTM dimension to 300 when using the 300-dimensional pretrained German fastText embeddings [25], the model is fixed to these settings throughout our experiments. Any further sophisticated hyperparameter tuning (e.g. Population Based Training) is left for future work.
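As an illustrative sketch, these embedding hyperparameters map onto gensim's word2vec implementation as follows (the experiments themselves use the original word2vec and wang2vec tools; the toy corpus and the skip-gram choice are assumptions):

from gensim.models import Word2Vec

# toy stand-in for the preprocessed corpus (iterable of token lists)
corpus_sentences = [["kleine", "kinder", "sind", "mutiger", "."]] * 10

model = Word2Vec(
    sentences=corpus_sentences,
    vector_size=100,  # embedding dimension
    window=8,         # context window size
    min_count=4,      # minimum word count threshold
    sg=1,             # skip-gram variant (assumption)
    workers=4,
)
model.wv.save_word2vec_format("embeddings.100d.txt")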

V Results

In this section, we present the results obtained for our three experimental setups. As described in [24], we run every experiment up to 6 times, starting from different random seeds, in order to arrive at significant final values on the respective test dataset. We evaluate the NER results using the official evaluation script of the CoNLL-2003 shared task. All our experiments were run on Nvidia GTX 1080 Ti GPUs.
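A sketch of this evaluation protocol (train_and_score is a hypothetical callable that trains one model with the given seed and returns its test F-score):

import statistics

def repeated_fscore(train_and_score, n_runs=6):
    """Repeat the experiment from different random seeds and report the
    mean and standard deviation of the test F-score, following [24]."""
    scores = [train_and_score(seed=s) for s in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)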

V-A Single Training

We compare our results with the current top performing models on CoNLL and GermEval. Table IV shows the highest results we achieve on the single training setup (first experimental setting).

Data       Embeddings               Features   F-score [%]
CoNLL      pre-trained Leipzig      wang2vec   78.76 [1]
GermEval   pre-trained UKP2014      word2vec   75.90 [13]
CoNLL      self-trained LeipzigMT   wang2vec   80.81
CoNLL      self-trained COW         wang2vec   83.29
GermEval   self-trained LeipzigMT   wang2vec   81.97
GermEval   self-trained COW         wang2vec   83.14
TüBa-D/Z   self-trained LeipzigMT   wang2vec   88.95
TüBa-D/Z   self-trained COW         wang2vec   89.26
TABLE IV: Single Training

We achieve an improvement across all datasets, outperforming all previous results on German neural NER and establishing a new state-of-the-art on each of them. Increasing the corpus size by means of the LeipzigMT corpus yields a corresponding performance increase over the CoNLL baseline. Increasing the corpus size further through the COW corpus finally gives us the best results on CoNLL. From this perspective, looking at the three data points for CoNLL (or GermEval), we observe a logarithmic growth of the F-score as a function of the size of the underlying embedding corpus. Corpora even larger than COW would be needed to further support this observation.

On the side of the training data, we observe a similar but stronger effect. On LeipzigMT, increasing the training data size from CoNLL to GermEval, and then to TüBa-D/Z, leads to improvements of +1.16% and +6.98% in F-score, respectively. For COW, this behavior re-emerges for TüBa-D/Z, closing the gap to high-resource languages like English and almost crossing the 90% barrier. Besides, we see that performance on the larger training dataset TüBa-D/Z does not depend heavily on the embedding corpus size, implying that it is beneficial to invest in annotation efforts.

We also find that wang2vec generally performs better than word2vec. This shows that a task-specific embedding algorithm is important (in our case, one taking syntax into account for NER).

Last but not least, our experiments show that keeping capitalization information can even degrade the quality of the word embeddings. Likewise, we observe that integrating capitalization information as an additional input feature of our neural network does not lead to better results. We assume that this is due to the inflectional morphology of German, according to which all nouns are capitalized, in contrast to English, where mainly proper names (named entities) are written this way.

V-B Joint Training

As a first step towards joint training, we report the best results for fastText embeddings and compare them to the UKP2014 embeddings, using only the two datasets from the baseline models. Next, we approach the full joint setup and perform training on all German NER datasets. Building on the results of the last section, we consider only COW for this setup. Table V shows the top results for this setup.

For fastText, we get the best results among all settings we examined (the results in the single training setup were worse). However, they are still below those obtained with UKP2014, which themselves were trained with the original word2vec model back in 2014. This shows that the fastText algorithm, although a promising extension of word2vec, does not suit our NER task well, even though it uses a more informative 300-dimensional vector space. Hence, we discard it for further experiments.

For COW, transfer learning on a single task works well, and the performance for CoNLL and GermEval improves further, lying slightly above the single training values. It can be noted that the final performance leans towards the lower-performing values. We assume that it depends more on the datasets with the lower single training performance (which make up a large part of the joint training dataset), since the data merging introduces additional variety into the final training dataset. This makes the task more difficult and brings it closer to a real-world scenario. Still, the slightly improved performance indicates that the neural network is generalizing and successfully performing task-related transfer learning across datasets, i.e. the model improves the same task on a heterogeneous dataset, given that it performs well on a single large homogeneous dataset.

Overall, the results are promising; they indicate that we have a good candidate for applying a jointly trained tagger to large resources where labeled data is scarce.

Data             Embeddings             Features   F-score [%]
CoNLL+GermEval   pre-trained UKP2014    word2vec   78.06
CoNLL+GermEval   pre-trained fastText   300dim     77.00
all              self-trained COW       wang2vec   83.47
TABLE V: Joint Training

V-C Resource Optimization via Lemmatization & POS tagging

In this final setup of resource optimization, we examine various constellations. Table VI reports the corresponding list of results.

Data       Embeddings   Features         F-score [%]
CoNLL      LeipzigMT    lemma            82.57
CoNLL      LeipzigMT    lemma_lower      82.94
CoNLL      LeipzigMT    lemmapos         81.22
CoNLL      LeipzigMT    lemmapos_lower   81.20
CoNLL      COW          lemma            83.64
CoNLL      COW          lemma_lower      83.14
CoNLL      COW          lemmapos         82.38
CoNLL      COW          lemmapos_lower   82.47
GermEval   LeipzigMT    lemma            82.53
GermEval   LeipzigMT    lemma_lower      82.47
GermEval   LeipzigMT    lemmapos         81.46
GermEval   LeipzigMT    lemmapos_lower   81.05
GermEval   COW          lemma            82.87
GermEval   COW          lemma_lower      82.53
GermEval   COW          lemmapos         81.96
GermEval   COW          lemmapos_lower   81.38
TüBa-D/Z   LeipzigMT    lemma            88.50
TüBa-D/Z   LeipzigMT    lemma_lower      88.27
TüBa-D/Z   LeipzigMT    lemmapos         87.85
TüBa-D/Z   LeipzigMT    lemmapos_lower   87.83
TüBa-D/Z   COW          lemma            89.08
TüBa-D/Z   COW          lemma_lower      89.24
TüBa-D/Z   COW          lemmapos         88.43
TüBa-D/Z   COW          lemmapos_lower   88.02
TABLE VI: Optimized Training via Lemma & POS

Intuitively, using POS-tagged sentences for training word embeddings may appear unusual; however, the results show a different picture. We obtain results very close to the top performances of the previous sections, and a common pattern emerges across all experiments. The lemmatization variant on COW consistently delivers top scores for the three major datasets, and even produces the highest value for CoNLL across all setups. Lemmatization alone performs better than lemmatization combined with POS tagging. This shows that dispersing the semantics of a given word across the various roles it can take does not improve the quality of the final embeddings. Rather, it is better to reduce the (redundant) variety in the vector space by first assembling all morphological variants into a common base form, which is then mapped to a common semantic vector. Once lemmatization is performed, lowercasing does not lead to a notable further improvement. We assume that lemmatization already performs a good normalization of the raw text, making lowercasing almost ineffective.

Regarding the size of the corpus used for generating the word embeddings, we come to the conclusion that lemmatization and POS tagging reduce the performance differences of the previous sections, which so far depended on that size. This confirms our assumption that the word2vec algorithm in its original form is not well suited to morphologically rich languages. The results of this setup show that the values for LeipzigMT and COW now lie closer to each other, making the performance to some extent independent of the size of the embedding corpus. This is an important finding, opening up promising opportunities and applications for low-resource languages.

VI Conclusion & Future Work

In this paper, we performed a far-reaching study of neural NER, taking the low-resource language German as an example. The study focused on a monolingual experimental setup. Nevertheless, the improved results pave the way for related languages with characteristics similar to German.

There are various ways to improve existing neural models. Instead of merely designing deeper and wider hybrid models, we showed the high importance of gathering and merging resources and how their careful optimization can overcome the lack of progress. In particular, we found that increasing the size and improving the quality of the raw corpora for word embeddings by applying morphological processing such as lemmatization and POS tagging leads to meaningful improvements. In addition, we demonstrated the effect of transfer learning by merging datasets for a joint training setup, which also produced good results and makes this approach a promising candidate for NER applications in areas where annotated datasets are scarce.

Overall, we conducted the first comprehensive study of German NER on all existing training datasets and resources, including common pre-trained embeddings such as fastText. In this context, we established a new state-of-the-art on all open-source datasets for German NER, which exceeds the 80% F-score barrier and closes the gap to high-resource languages such as English.

For future work, we plan to further refine the training process of the word embeddings and, in particular, to investigate how the performance of downstream tasks can become more independent of the size of the embedding corpora by using linguistic methods such as lemmatization and POS tagging. To this end, we intend to examine the recently published ELMo embeddings [26] for German. Finally, we will examine the role of the multilingual COW corpus for word embeddings in other languages such as Dutch, French, Spanish and English.

Acknowledgment

This work was funded by the German Research Foundation (DFG) as part of the BIOfid project (DFG-326061700). We plan to upload our source code and the trained embeddings to GitHub for the research community. Special thanks go to G. Lample for his directions on the procedure for training the embeddings, and to Prof. G. Heyer and F. Helfer for providing the extract of the Leipzig40-2018 corpus.

References

  • [1] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural Architectures for Named Entity Recognition,” in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.   Association for Computational Linguistics, 2016, pp. 260–270.
  • [2] L. Ouyang, Y. Tian, H. Tang, and B. Zhang, “Chinese Named Entity Recognition Based on B-LSTM Neural Network with Additional Features,” in International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage.   Springer, 2017, pp. 269–279.
  • [3] D. Goldhahn, T. Eckart, and U. Quasthoff, “Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages,” in LREC, 2012.
  • [4] C. Callison-Burch, P. Koehn, C. Monz, K. Peterson, M. Przybocki, and O. F. Zaidan, “Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation,” in Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR.   Association for Computational Linguistics, 2010, pp. 17–53.
  • [5] R. Schäfer, “Processing and querying large web corpora with the COW14 architecture,” in Proceedings of Challenges in the Management of Large Corpora 3 (CMLC-3), P. Bański, H. Biber, E. Breiteneder, M. Kupietz, H. Lüngen, and A. Witt, Eds., UCREL.   Lancaster: IDS, 2015.
  • [6] E. F. Tjong Kim Sang and F. De Meulder, “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition,” in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4.   Association for Computational Linguistics, 2003, pp. 142–147.
  • [7] D. Benikova, C. Biemann, and M. Reznicek, “Nosta-d named entity annotation for german: Guidelines and dataset,” in LREC, 2014.
  • [8] M. Faruqui and S. Padó, “Training and Evaluating a German Named Entity Recognizer with Semantic Generalization,” in Proceedings of KONVENS 2010, Saarbrücken, Germany, 2010.
  • [9] C. Neudecker, “An Open Corpus for Named Entity Recognition in Historic Newspapers,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), N. Calzolari (Conference Chair), K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis, Eds.   Paris, France: European Language Resources Association (ELRA), May 2016.
  • [10] H. Telljohann, E. W. Hinrichs, S. Kübler, H. Zinsmeister, and K. Beck, “Stylebook for the Tübingen treebank of written German (TüBa-D/Z).”
  • [11] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
  • [12] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2383–2392.
  • [13] N. Reimers, J. Eckle-Kohler, C. Schnober, J. Kim, and I. Gurevych, “GermEval-2014: Nested Named Entity Recognition with Neural Networks,” in Workshop Proceedings of the 12th Edition of the KONVENS Conference, G. Faaß and J. Ruppenhofer, Eds.   Universitätsverlag Hildesheim, October 2014, pp. 117–120.
  • [14] D. Gillick, C. Brunk, O. Vinyals, and A. Subramanya, “Multilingual language processing from bytes,” in HLT-NAACL, 2016.
  • [15] B. Klimek, M. Ackermann, A. Kirschenbaum, and S. Hellmann, “Investigating the Morphological Complexity of German Named Entities: The Case of the GermEval NER Challenge,” in Language Technologies for the Challenges of the Digital Age, G. Rehm and T. Declerck, Eds.   Cham: Springer International Publishing, 2018, pp. 130–145.
  • [16] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
  • [17] O. Levy and Y. Goldberg, “Dependency-Based Word Embeddings.” in ACL (2), 2014, pp. 302–308.
  • [18] A. Komninos and S. Manandhar, “Dependency Based Embeddings for Sentence Classification Tasks.” in HLT-NAACL, 2016, pp. 1490–1500.
  • [19] W. Ling, C. Dyer, A. Black, and I. Trancoso, “Two/Too Simple Adaptations of word2vec for Syntax Problems,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.   Association for Computational Linguistics, 2015.
  • [20] J. P. C. Chiu and E. Nichols, “Named Entity Recognition with Bidirectional LSTM-CNNs,” TACL, vol. 4, pp. 357–370, 2016.
  • [21] Z. Huang, W. Xu, and K. Yu, “Bidirectional LSTM-CRF Models for Sequence Tagging,” CoRR, vol. abs/1508.01991, 2015.
  • [22] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
  • [23] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2013, pp. 6645–6649.
  • [24] N. Reimers and I. Gurevych, “Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging,” in EMNLP, 2017.
  • [25] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
  • [26] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. of NAACL, 2018.