An Evaluation of Recent Neural Sequence Tagging Models in Turkish Named Entity Recognition

05/14/2020 · Gizem Aras, et al. · Istanbul Technical University, MEF Üniversitesi

Named entity recognition (NER) is an extensively studied task that extracts and classifies named entities in a text. NER is crucial not only in downstream language processing applications such as relation extraction and question answering but also in large-scale big data operations such as real-time analysis of online digital media content. Recent research efforts on Turkish, a less studied language with a morphologically rich structure, have demonstrated the effectiveness of neural architectures on well-formed texts and yielded state-of-the-art results by formulating the task as a sequence tagging problem. In this work, we empirically investigate recent neural architectures proposed for Turkish NER tagging (bidirectional long short-term memory and transformer-based networks) in the same experimental setting. Our results demonstrate that transformer-based networks, which can model long-range context, overcome the limitations of BiLSTM networks in which different input features at the character, subword, and word levels are utilized. We also propose a transformer-based network with a conditional random field (CRF) layer that leads to the state-of-the-art result (95.95% f-measure) on a common dataset. Our study contributes to the literature that quantifies the impact of transfer learning on processing morphologically rich languages.

1 Introduction

Named entity recognition (NER) aims to recognize named entities in a given text by determining their boundaries and classifying them into predefined categories (e.g., person, location, and temporal expression). NER is a crucial step in various natural language processing applications such as event extraction (Chen:15) and question answering (Molla:06), as well as in big data analytics (Saju:17). Early studies addressed the recognition of named entities as a sequence labeling problem, and extensive research efforts have been devoted to developing solutions using machine learning techniques (Lin:06; Ekbal:08), hidden Markov models (Zhou:02), and conditional random fields (Yao:09; Zirikly:15). Recently, neural models have been introduced for the named entity recognition task in both well-formed and noisy texts (Alnabki:20). Despite these advances, NER remains a challenging problem for several reasons, such as the recognition of overlapping or nested entities, infrequent entities in user-generated noisy texts, and semantically ambiguous entities in different contexts.

In the current era, the amount of online content has exploded, which makes searching over a vast, distributed source of information exhausting. Search tools and expert systems can effectively alleviate the problem of accessing the available content on the web. However, the continuous alteration of natural languages due to heavy social media usage, socio-cultural factors, and daily events (e.g., political changes and major sporting events) is reflected in written texts and leads to the constant evolution of words, expressions, and, importantly, named entities. Correctly identified named entities from unstructured or semi-structured content form a basis for developing more effective and intelligent information management, text mining, and relation extraction systems (Marrero:13). For instance, mining daily news content by digital media applications to extract information about a person or a location requires querying an enormous number of news articles, which can be facilitated by the automatic detection of named entities in written texts. Paving the road toward interpretable and reusable information through semantically annotated online content is another particular benefit of extracting named entities and their relations from raw text.

NER is a well-studied task for several languages including Turkish, and recent successes of neural architectures have greatly advanced the performance achieved in recognizing Turkish named entities (Gunes:18; Gungor:19). In these studies, bidirectional long short-term memory (BiLSTM) networks with different word representations were widely used and evaluated on a common dataset consisting of person, location, and organization names (Tur:03). A conditional random field (CRF) layer was shown to contribute positively to these networks, which minimize the need for feature engineering. There is also recent interest in applying deep bidirectional transformers (Stefan:20) and transfer learning (Akkaya:20) to Turkish entity tagging. In this work, we present a comprehensive evaluation of two notable neural architectures, namely BiLSTM networks and transformer-based networks, and compare their performance in the same experimental setting. In BiLSTM models, we explore different combinations of four kinds of embeddings as input (i.e., character, morphological, subword, and word embeddings) and experiment with different pretrained embeddings as initializations of the word embeddings. In transformer-based models, we benefit from three different transformer-based language models, namely multilingual cased BERT (mBERT), Turkish BERT (BERTurk), and XLM-RoBERTa (XLMR), and study the effectiveness of both linear and CRF layers on top of the network. As our second contribution, we propose a transformer-based neural architecture accompanied by a CRF top layer (an extension of the BERTurk model) that sets a new state-of-the-art f-measure of 95.95%. Our study not only extends the current Turkish NER literature but also validates the usability of transfer learning for processing a morphologically rich language.

The rest of this article is organized as follows. Section 2 discusses related research on named entity recognition with a particular focus on Turkish NER studies. Section 3 describes the neural architectures utilized in this work. Section 4 presents our dataset and the parameter initializations used for building the neural architectures. Section 5 discusses the conducted experiments and the results we obtained. Finally, Section 6 concludes the article and presents our future work.

2 Literature Review

Study Approach Embedding F1 Score
Ma:16 CNNChar-BiLSTM-CRF - 80.76
Collobert:11 Tanh-CRF - 81.47
Huang:15 BiLSTM-CRF - 84.26
Huang:15 BiLSTM-CRF Senna 84.74
Ma:16 CNNChar-BiLSTM-CRF Skip-Gram 84.91
Collobert:11 Tanh-CRF Senna 88.67
Huang:15 BiLSTM-CRF Senna 88.83
Ma:16 CNNChar-BiLSTM-CRF Senna 90.28
Lample:16 LSTMChar-BiLSTM-CRF Skip-Gram 90.96
Ma:16 CNNChar-BiLSTM-CRF GloVe 92.21
Akbik:18 Flair-Char-BiLSTM-CRF GloVe 93.09
Table 1: English NER Studies

2.1 Neural Models for Named-Entity Recognition

Earlier traditional named entity recognition systems relied heavily on feature engineering and employed hand-crafted, language-dependent features, large gazetteers, and tagged datasets (Collobert:11). A significant branch of research has utilized a range of statistical approaches to address the problem, such as maximum entropy classifiers (Chieu:03), decision trees (Paliouras:00), and conditional random fields (Finkel:09). In recent years, however, the focus of NER research has shifted to neural models, in parallel with the improvements observed on multiple language processing benchmarks such as question answering and language generation.

Neural NER models have been guided by distributional approaches, in which the meaning of a word is carried by its surroundings via vector representations (Harris:54; Firth:57; Mikolov:13-2). Initial attempts considered words as separate tokens and represented each token with a fixed-length vector (Mikolov:13; Pennington:14). Other studies explored different ways of representing words, such as concatenating embeddings of characters (Santos:15), morphemes (Luong:13), or other word subparts to fixed-length word embeddings. In recent NER studies, the problem was formulated as a sequence labeling task, and different Seq2Seq models were shown to achieve state-of-the-art results, where the final embeddings of words are encoded by gated recurrent units (GRUs) or long short-term memory units (LSTMs) (Lample:16; Ma:16; Chen:18). Using conditional random fields on top of neural networks was shown to work as well as (Collobert:11; Huang:15; Chiu:16) or better than previous methods. Moreover, BiLSTM-CRF models with character and word embeddings were shown to be effective for multiple languages including Chinese (Zhang:19) and are arguably considered a baseline model for tagging (Jurafsky:08). However, in these models a word is represented with the same final embedding regardless of the context in which it appears. A recent study utilized contextual string embeddings (Akbik:18), where the final word embeddings are contextualized according to the entire sentence: all characters in the sentence up to the last character of a word are processed by a forward LSTM, all characters from the end of the sentence back to the beginning are processed by a backward LSTM, and the resulting hidden states are concatenated to produce the final embedding of the word in focus. This kind of word embedding was shown to improve not only NER tagging but also other sequence labeling tasks such as part-of-speech tagging and phrase chunking. The f1 scores achieved by some of these English NER studies are given in Table 1.

Transformer-based approaches outperformed the state of the art on several NLP tasks and achieved performance improvements that might be attributed to the use of the attention mechanism (Vaswani:17). Bidirectional Encoder Representations from Transformers (BERT) (Devlin:18) is a bidirectional transformer that learns contextualized input representations. BERT differs from earlier work in four aspects. First, it uses transformers (Vaswani:17) instead of LSTMs to encode inputs. Second, its training objective is masked language modelling; instead of predicting the next word, BERT predicts a randomly masked word from a given sentence. Third, BERT uses subword tokens instead of word tokens, so some infrequent words are eliminated and their common sub-parts are utilized (word sub-parts were shown to reduce the data sparsity problem in morphologically rich languages such as German (Kudo:18a) and Turkish (Akkaya:20)). Finally, the pretrained language model can be fine-tuned for a specific language task by adding one last layer on top of the utilized neural architecture. Thereafter, multilingual BERT (mBERT) was released to support many languages in a single model. However, some studies demonstrated that a BERT model trained for a single language outperforms mBERT on several tasks such as dependency parsing and natural language inference (Martin:19). Robustly optimized BERT pretraining (RoBERTa) (Liu:19) demonstrated that longer training with careful hyperparameter selection can achieve better results than earlier studies. Another transformer, XLM-RoBERTa (XLMR), combined the robustly optimized BERT pretraining approach with cross-lingual language model pretraining (Lample:19), while using a larger dataset, and outperformed mBERT in most tasks. In other studies, transformer-based architectures were explored both with and without the addition of a CRF layer. For instance, named entity recognition in Slavic languages (Arkhipov:19) and in Portuguese (Souza:19) was confirmed to improve once a trained BERT model is accompanied by a CRF layer.

Study Approach F1 Score
Kuru:16 LSTM 91.30
Demir:14 Reg. Avg. Percp. 91.85
Gungor:19 BiLSTM-CRF 92.93
Gunes:18 Deep-BiLSTM 93.69
Stefan:20 BERT 95.40
Table 2: Turkish NER Studies

2.2 Turkish Named Entity Recognition

Turkish is an agglutinative language with rather complex morphotactics, where much information (such as the syntactic roles and relations of words) is encoded in morphology. Many Turkish words can be derived by appending multiple suffixes (i.e., inflectional and derivational) to a nominal or verbal root, as is often the case in other morphologically rich languages such as Finnish, Hungarian, and Czech. The morphological structure of a Turkish word can be represented as a sequence of inflectional groups (IGs) separated by derivation boundaries (^DB). Each IG has its own part of speech (POS) and inflectional features, and the beginning of a new IG is marked by a derivation boundary where a change in part of speech occurs. A word might have multiple such representations due to morphological ambiguity. For instance, the following is one possible representation of the word “haberleşmeliyiz” (we should communicate), where the first IG shows that the root is a verb and it is transformed into a noun with the addition of the infinitive suffix “-me”:

haberleş+Verb+Pos

^DB+Noun+Inf2+A3sg+Pnon+Nom

^DB+Adj+With

^DB+Noun+Zero+A3sg+Pnon+Nom

^DB+Verb+Zero+Pres+A1pl

Although surface forms are constrained by morphological rules (e.g., vowel harmony and vowel drops) (Oflazer:94), the number of words derived from a single root is still too large to be handled easily, and lexical sparsity is often experienced in learning-based NLP applications. For instance, in a Turkish dataset of 10 million words, the vocabulary size is measured as 474,957, whereas that number drops to 97,734 unique words in an English dataset of the same size (Hakkani:00). The Turkish vocabulary size shrinks to 94,235 unique words once only the root forms of words are considered over the same dataset. As a common practice to handle data sparsity, Turkish NLP studies therefore often utilize disambiguated morphological representations of words rather than their surface forms.

Named entity recognition in Turkish has been studied for many years (Kucuk:17). The first statistical Turkish NER study (Tur:03) trained an HMM model to tag person, location, and organization names appearing in well-written texts by leveraging morphological, lexical, and contextual information of words. In another study (Kucuk:10), a rule-based approach was explored in which knowledge resources such as dictionaries of person and location names, and pattern extraction rules for temporal and numeric expressions, are heavily utilized; the system was later enriched with the ability to learn knowledge resources from annotated data. A CRF-based NER system (Yeniterzi:11) highlighted the impact of morphology on the tagging process and benefited from the roots and morphological features of words as separate tokens instead of words. An automated rule learning system (Tatar:11), a CRF-based system relying on gazetteers and hand-crafted, morphology-dependent features (Seker:12), and a classification system in which six different models are trained with both discrete and continuous features of words (Ertopcu:17) are among recent Turkish NER studies. Although we use the same dataset for training and testing (Tur:03), our work utilizes a neural network based solution and hence differs significantly from these earlier rule-based and statistical approaches.

The first neural network based study (Demir:14) used a regularized averaged perceptron algorithm and combined continuous vector representations of words with some language-independent features (such as context, previous tags, and case features) in a semi-supervised fashion. The use of character embeddings rather than word embeddings was later explored in a stacked bidirectional LSTM network (Kuru:16). For each input character, that system outputs a tag probability, and a Viterbi decoder converts character-level probabilities into word-level tag probabilities. The results demonstrated that good tagging performance can be achieved without an extensive list of word features and language-dependent knowledge resources. The current state-of-the-art systems utilize bidirectional LSTM networks and experiment with different word representations. The first BiLSTM study (Gungor:19) concatenated word, character, and morphological embeddings as encoder inputs and used a CRF layer on top of the decoder. The tagging model was also tested on four other morphologically rich languages (i.e., Czech, Hungarian, Spanish, and Finnish), and the results demonstrated that word representations achieve the highest performance once augmented with morphological and character embeddings. The second BiLSTM study (Gunes:18) combined word embeddings with writing-style embeddings (e.g., all uppercase letters or sentence-case letters) as input representations and experimented with stacked layers of varying depth. There is only one work in which deep bidirectional transformers were utilized (Stefan:20); in that study, both cased and uncased BERT models were evaluated on the Turkish NER task. The performances of these systems are listed in Table 2. Our work follows the path opened by these BiLSTM studies, where different embeddings are learned and sequentially encoded by LSTMs. However, to the best of our knowledge, this is the first Turkish NER study that compares language models learned by transformers with BiLSTM models in the same experimental setup. Moreover, our work explores the impact of context on task performance by utilizing both context-sensitive and context-insensitive word embeddings.

In recent years, another branch of Turkish NER studies has focused on noisy data, specifically from social media. A limited number of approaches (Celikkaya:13; Eken:15; Okur:16; Akkaya:20) have provided different solutions to this task, each improving on the state of the art. Still, the current state of the art on noisy data, with an f-score of 67.39%, lags behind the performance observed on clean data.

3 System Architecture

Named entity recognition is a labeling task over a text that consists of a sequence of words; hence, any approach that tags every single word with a label from a predetermined set is a reasonable solution. In this work, we address the task as a sequence-to-sequence (Seq2Seq) learning problem and build two different architectures for tagging. The first architecture utilizes a bidirectional long short-term memory (BiLSTM) network, whereas the second architecture uses a transformer-based neural network. A CRF layer is employed on top of both architectures as an optimization layer for predicting the best label sequence. Our study has similarities with some earlier works (Gunes:18; Gungor:19; Akkaya:20), but the main difference lies in the utilization of a context-sensitive language model and its performance comparison with well-studied LSTM-based models.

3.1 BiLSTM Network

BiLSTM architectures utilize two separate LSTM networks (Hochreiter:97), a specialized form of recurrent neural network that can cope with the vanishing gradient problem. The first LSTM processes the input in the forward direction to keep a history from the beginning of the sequence, whereas the second LSTM processes all words in the sequence starting from the end of the input.

Figure 1: BiLSTM-CRF Architecture.

Our problem is formulated as follows: given an input sentence S = {s_1, s_2, …, s_n} consisting of n words, obtain a sequence of labels L = {l_1, l_2, …, l_n} such that each l_i is from the set of NER tags. As shown in Figure 1, each word in the sequence is encoded into an embedding (x_i) and then fed to the network. In the current implementation, each word is encoded by a combination of four different embeddings:

  • Word embedding: Vector representation of the word as a single token (x^{word})

  • Subword embedding: Vector representation of the word's chunks as tokens (x^{sub})

  • Character embedding: Vector representation of the word at the character level (x^{char})

  • Morphological embedding: Vector representation of the word at the morphological level (x^{morph})

We use a context-insensitive language model to obtain the word embedding (x^{word}) of each token, where every word in the sequence is taken as a single token. This embedding captures neither the position of the word in the sequence nor the contextual content of the input. We obtain subword, character, and morphological embeddings of words using distinct BiLSTM networks. For instance, the network that produces character embeddings processes every character in a word as a separate token, as shown in Figure 2-a. The morphological BiLSTM network, with a similar architecture, produces embeddings that reflect the morphological subunit information of each word in the sequence. Subword embeddings let different words that share common chunks exploit that similarity. These four kinds of embeddings are utilized in order to capture the morphologically rich nature of Turkish and the information encoded in characters, morphemes, and word chunks. Keeping the embeddings separate allows us to explore different ways of concatenating them to obtain the final input embedding used by our architecture. For instance, the model shown in Figure 2 concatenates the word, character, and morphological embeddings to obtain the final input word embedding (i.e., x = [x^{word}; x^{char}; x^{morph}]).
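To make the concatenation concrete, the following is a minimal PyTorch sketch of how per-word character embeddings can be produced by a small BiLSTM and concatenated with word embeddings before the sentence-level BiLSTM; the class names, dimensions, and module layout are illustrative assumptions rather than the exact implementation used in this work.

```python
# Minimal sketch: character-level BiLSTM embeddings concatenated with word
# embeddings, then fed to a sentence-level BiLSTM (dimensions are assumptions).
import torch
import torch.nn as nn

class CharBiLSTMEmbedder(nn.Module):
    """Encodes a word from its characters with a small BiLSTM (cf. Figure 2-a)."""
    def __init__(self, n_chars, char_dim=200):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.bilstm = nn.LSTM(char_dim, char_dim // 2,
                              bidirectional=True, batch_first=True)

    def forward(self, char_ids):                 # char_ids: (n_words, max_chars)
        _, (h, _) = self.bilstm(self.char_emb(char_ids))
        return torch.cat([h[0], h[1]], dim=-1)   # (n_words, char_dim)

class WordEncoder(nn.Module):
    """Builds x = [x_word; x_char] and encodes the sentence with a BiLSTM."""
    def __init__(self, n_words, n_chars, word_dim=300, char_dim=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim)   # optionally pretrained
        self.char_enc = CharBiLSTMEmbedder(n_chars, char_dim)
        self.sent_bilstm = nn.LSTM(word_dim + char_dim, 256,
                                   bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        x = torch.cat([self.word_emb(word_ids), self.char_enc(char_ids)], dim=-1)
        out, _ = self.sent_bilstm(x.unsqueeze(0))  # one sentence as a batch of 1
        return out                                 # hidden states for the tag/CRF layer
```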

Figure 2: a) Character Embedding and b) Input Embedding of the Word “dedi” (he/she said).

The computations performed in our architecture with LSTM cells are as follows:

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)
c_t = f_t \odot c_{t-1} + i_t \odot g_t        (1)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
h_t = o_t \odot \tanh(c_t)

where h_t is the hidden state, c_t is the cell state, x_t is the input at time t, and i_t, f_t, g_t, and o_t are the input, forget, cell, and output gates, respectively.

3.2 Transformer-Based Network

Transformer-based language models replace recurrent neural network cells with self-attention and fully connected layers. As a result, the content of a whole sentence and the location of each word in the sentence are effectively captured to encode contextual information and long-range dependencies. Conditioning on both the left and right contexts of a word results in dissimilar encodings for the same word in different sentences. Moreover, these models enable us to benefit from embeddings shared across multiple natural languages and from subword units in monolingual settings. In this architecture, we use pretrained masked language models and fine-tune them on the NER task. As shown in Figure 3, the input sequence is first tokenized into subword units and then fed to the network.

3.3 CRF Layer

The CRF layer is utilized as the top hidden layer in both architectures. It takes the concatenation of the last hidden states from the underlying network, and its role is to model the joint probability of the entire label sequence in order to impose constraints over neighbouring tokens (Lafferty:01). A standard implementation is carried out (Zhang:19), and the probability of a label sequence y = l_1, \ldots, l_n over an input sentence s with hidden states h_1, \ldots, h_n is calculated as follows:

P(y \mid s) = \frac{\exp\big(\sum_{i} (W_{CRF}^{\,l_i} h_i + b_{CRF}^{\,(l_{i-1}, l_i)})\big)}{\sum_{y'} \exp\big(\sum_{i} (W_{CRF}^{\,l'_i} h_i + b_{CRF}^{\,(l'_{i-1}, l'_i)})\big)}        (2)

where y' = l'_1, \ldots, l'_n represents an arbitrary label sequence, W_{CRF}^{\,l_i} is a model parameter specific to l_i, and b_{CRF}^{\,(l_{i-1}, l_i)} is a bias specific to l_{i-1} and l_i.

Figure 3: Transformer-Based Network.

For decoding, a first-order Viterbi algorithm is used to find the most probable label sequence over the input sequence, and a sentence-level log-likelihood loss with L_2 regularization is used to train the model:

L(\Theta) = \sum_{j=1}^{N} \log P(y_j \mid s_j) - \frac{\lambda}{2} \lVert \Theta \rVert^2        (3)

where D = \{(s_j, y_j)\}_{j=1}^{N} is the set of manually labeled data, \lambda is the regularization parameter, and \Theta is the parameter set.
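As a concrete illustration of the decoding step, the sketch below runs first-order Viterbi decoding over per-token tag scores and a tag-transition matrix; the variable names, score shapes, and toy inputs are illustrative assumptions rather than the exact CRF implementation used here.

```python
# First-order Viterbi decoding over emission and transition scores (sketch).
import numpy as np

def viterbi_decode(emissions, transitions):
    """emissions: (n_tokens, n_tags) scores; transitions[i, j] scores moving
    from tag i at position t-1 to tag j at position t."""
    n_tokens, n_tags = emissions.shape
    score = emissions[0].copy()                    # best score ending in each tag
    backpointers = np.zeros((n_tokens, n_tags), dtype=int)

    for t in range(1, n_tokens):
        # candidate[i, j]: best path ending in tag i at t-1, then tag j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)

    # Follow backpointers from the best final tag to recover the full sequence.
    path = [int(score.argmax())]
    for t in range(n_tokens - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return path[::-1]

# Toy usage: 3 tokens, 3 tags (e.g., O, B-PERSON, I-PERSON)
tags = viterbi_decode(np.random.randn(3, 3), np.random.randn(3, 3))
```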

4 Experimental Setup

4.1 Dataset

The dataset used in this study (Tur:03) is a collection of articles from the national newspaper Milliyet, covering the period between 1 January 1997 and 12 September 1998. The dataset contains Turkish sentences tagged with the BIO2 scheme in CoNLL format, together with morphological analyses of all sentence tokens. For instance, the sentence from the dataset “Meliha Düzağaç’ın resimleri 7 Ekim’e dek Ankara TCDD Sanat Galerisi’nde sergilenecek.” (Meliha Düzağaç’s paintings will be exhibited at the Ankara TCDD Arts Gallery until the 7th of October.) is tagged as follows:

Meliha B-PERSON

Düzağaç’ın I-PERSON

resimleri O  

7 O

Ekim’e O

dek O

Ankara B-ORGANIZATION

TCDD I-ORGANIZATION

Sanat I-ORGANIZATION

Galerisi’nde I-ORGANIZATION

sergilenecek O

. O 

In the BIO2 tagging scheme, the first token of a named entity of a particular type (type) is labeled with the beginning tag (B-type), and the remaining tokens of the same entity are labeled with the inside tag (I-type). All other tokens that do not belong to a named entity are labeled with the outside tag (O). In this work, we split the dataset into a training set of 32,171 sentences, 20% of which is reserved as a validation set, and a test set of 3,328 sentences.
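To make the scheme concrete, the sketch below groups a BIO2-tagged token sequence into typed entity spans; the helper name and the example sequence are ours and are not part of the dataset tooling.

```python
# Groups BIO2 tags into (entity_type, start, end) spans; illustrative helper.
def bio2_to_spans(tags):
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes an open span
        if tag == "O" or tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:
                spans.append((etype, start, i - 1))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:  # tolerate stray I- tags
            start, etype = i, tag[2:]
    return spans

print(bio2_to_spans(["B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "I-ORGANIZATION"]))
# [('PERSON', 0, 1), ('ORGANIZATION', 3, 4)]
```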

4.2 Building BiLSTM Networks

Name Training Method Dataset Size Vocabulary Size Dimension Window Size Negative Sampling
Hur Skip-gram 170M 500K 128 1 2
Huaw Skip-gram 941M 1.2M 300 5 10
FastText CBOW - 2M 300 5 10
Random Randomly initialized - 1.2M 300 - -
Table 3: Sets of Word Embeddings

To generate the input embeddings of the encoder, we first obtain vector representations of each token in our dataset. For character, morphological, and subword embeddings, the vectors are randomly initialized, whereas four different initializations are experimented with for word embeddings (Table 3).

To form character embeddings, a random embedding is initialized for each character. These embeddings are then fed into a character-LSTM that encodes sequences of characters. The embedding of a word derived from its characters is the concatenation of its forward and backward representations from the bidirectional character-LSTM (Figure 2-a). Since our dataset also contains the morphological analysis of each word, morphological embeddings are utilized as well. Character-level morphological embeddings are used, as this representation was shown to work best in previous work (Gungor:19). Morphological embeddings have the same architecture as character embeddings, except that they encode the full morphological analysis of the given word instead of the word itself (Figure 2-b). Word chunks used in subword embeddings are obtained via a unigram SentencePiece tokenizer (Kudo:18) (https://github.com/google/sentencepiece). The SentencePiece tokenizer is trained on a news archive of 14,995,202 tokens consisting of articles published in the Hürriyet newspaper from 22 November 2018 to 22 November 2019. The resulting unigram tokenizer, which we refer to as the Turkish SentencePiece (TR SentPiece) tokenizer, has a vocabulary of 50,000 tokens. In our experiments, we use an embedding size of 300 for words and subwords, and 200 for characters and morphological units. Using different combinations of these embeddings, we train several BiLSTM networks with a stochastic gradient descent optimizer and an initial learning rate of 0.05 (some of these models are listed in Table 5). In these trainings, gradient clipping of 0.5 is used, and dropout is applied to the concatenated embeddings with a probability of 0.5. Each model is trained for 50 epochs with a momentum of 0.9, and learning rate decay is applied at the end of every epoch, as sketched below.
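A minimal sketch of this training configuration follows; the decay schedule shown is an assumed inverse-time form (the exact decay function is not reproduced here), and the model/loader interfaces are illustrative.

```python
# Sketch of the BiLSTM-CRF training loop described above (SGD, lr 0.05,
# momentum 0.9, gradient clipping 0.5, 50 epochs). The per-epoch decay is an
# assumed inverse-time schedule, not necessarily the paper's exact function.
import torch

def train(model, train_loader, epochs=50, lr0=0.05, decay_rate=0.05):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0.9)
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(**batch)                      # assumed: model returns the CRF loss
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
            optimizer.step()
        # assumed inverse-time decay applied at the end of every epoch
        new_lr = lr0 / (1.0 + decay_rate * (epoch + 1))
        for group in optimizer.param_groups:
            group["lr"] = new_lr
```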



4.3 Building Transformer-Based Networks

To build our transformer-based networks, we utilize the pretrained language models multilingual cased BERT (mBERT), Turkish BERT (BERTurk) (https://huggingface.co/dbmdz/bert-base-turkish-cased), and XLM-RoBERTa (XLMR). For each model, we experiment with two different settings:

  • The model is followed by a linear layer, and cross-entropy is used as the loss function

  • The model is followed by a CRF layer, and negative log-likelihood is used as the loss function

In both settings, the language models are fine-tuned and the sentences are tokenized by the default tokenizers. However, in the first setting, subwords that do not appear in the first position of a word are treated as padding tokens in the loss calculations during training. In the evaluation phase, a BIO2 tag is assigned only to the first subword token of a word, and the remaining subwords (treated as padding in training) are labeled with the same tag (a minimal alignment sketch is given after the tokenizer outputs below). It is worth mentioning that the default tokenizers provided with the language models produce different subword tokens for the same sentence. For instance, the outputs produced by all tokenizers used in this study for the sentence “Meliha Düzağaç’ın resimleri 7 Ekim’e dek Ankara TCDD Sanat Galerisi’nde sergilenecek.” are as follows:


Morphological Analysis (+ and ++ represent inflectional and derivational suffixes, respectively):
["Meliha", "Düzağaç", "’", "+ın", "resim", "+ler", "+i", "7", "Ekim", "’", "+e", "dek", "Ankara", "TCDD", "Sanat", "Galeri", "+si", "’", "+n" , "+de", "ser", "++gi", "++len", "+ecek", "."]

BERTurk Tokenizer:
["Melih", "##a", "Düz", "##ağaç", "’", "ın", "resimleri", "7", "Ekim", "’", "e", "dek", "Ankara", "TCDD", "Sanat", "Galerisi", "’", "nde", "sergilen", "##ecek", "."]

mBERT Tokenizer:
["Mel", "##ih", "##a", "D", "##üz", "##a", "##ğa", "##ç", "’", "ın", "res", "##im", "##leri", "7", "Ekim", "’", "e", "dek", "Ankara", "TC", "##D", "##D", "Sanat", "Gale","##risi", "’", "nde", "ser", "##gile", "##nec", "##ek", "."]


XLMR Tokenizer:
["   Meli", "ha", "   Düz", "ağa", "ç", "’", "ın", "   resim", "leri", "   7", "   Ekim", "’", "e", "   de", "k", "   Ankara", "   TC", "DD", "   Sanat", "   Galeri", "si", "’", "nde", "   sergi", "lenecek", "."]]

TR SentPiece Tokenizer:
["   Melih", "a", "   Düz", "ağaç", "’", "ın", "   resimleri", "   7", "   Ekim", "’", "e", "   dek", "   Ankara", "   TCDD", "   Sanat", "   Galerisi", "’", "nde", "   sergilenecek", "."]

In our preliminary experiments, we observed that the tokens produced by the multilingual BERT tokenizer do not correlate well with the morphological units given in the dataset. Although this is not the case for the other tokenizers, BERTurk has a small vocabulary size and XLMR is not trained solely on Turkish. For these reasons, we do not report any results on the use of the default tokenizers in BiLSTM networks. Moreover, in our preliminary experiments we observed around 20%-40% mismatches between the subword tokens obtained by our trained TR SentPiece tokenizer and the vocabularies used in the pretrained models. Thus, we do not report results on the use of the TR SentPiece tokenizer in any of our transformer-based networks. The HuggingFace transformers library (https://github.com/huggingface/transformers) with PyTorch (Wolf:19) is used for the implementations. Networks are trained with the Adam optimizer with fixed weight decay, an initial learning rate of , and gradient clipping of 1 (a minimal sketch of this setup is given after Table 4). Table 4 provides the details of all language models used in the transformer-based networks.

Model # Hidden Layers # Hidden Units # Attention Heads Vocabulary Size
mBERT 12 768 12 119,547
BERTurk 12 768 12 32,000
XLMR-b 12 768 12 250,002
XLMR-l 24 1024 16 250,002
Table 4: Settings of Masked Language Models
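For reference, the following is a minimal sketch of the fine-tuning setup in the first configuration (linear layer with cross-entropy) using the HuggingFace transformers library; the learning rate, epoch count, and label set size are illustrative assumptions rather than the exact values used in our experiments.

```python
# Sketch of the fine-tuning setup described above (AdamW with fixed weight
# decay, gradient clipping of 1). Checkpoint, lr, epochs, and label count are
# illustrative assumptions.
import torch
from transformers import AutoModelForTokenClassification

NUM_LABELS = 7   # O plus B-/I- tags for PERSON, LOCATION, ORGANIZATION
model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased", num_labels=NUM_LABELS)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

def fine_tune(train_loader, epochs=3):
    model.train()
    for _ in range(epochs):
        for batch in train_loader:      # batch: input_ids, attention_mask, labels
            optimizer.zero_grad()
            out = model(**batch)        # cross-entropy over first-subword labels
            out.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
```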

4.4 Evaluation Metrics

In this study, the evaluation scores are reported using the standard CoNLL (the Conference on Natural Language Learning, organized by SIGNLL, ACL's Special Interest Group on Natural Language Learning) precision, recall, and F1 metrics. The boundaries of all entities in a test sentence are determined by grouping the tokens that form a single entity (i.e., a token sequence with B- and I- tags), and scores are computed at the entity level. The seqeval library (https://pypi.org/project/seqeval/) is used to compute F1 scores using the formula shown below:

F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}        (4)
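As an illustration, entity-level scores can be computed with seqeval as follows; the tag sequences below are toy examples.

```python
# Entity-level evaluation with seqeval: scores are computed over whole entity
# spans, not individual tokens.
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "I-ORGANIZATION", "O"]]
y_pred = [["B-PERSON", "I-PERSON", "O", "B-ORGANIZATION", "O", "O"]]

print(precision_score(y_true, y_pred),
      recall_score(y_true, y_pred),
      f1_score(y_true, y_pred))
```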

5 Results and Discussion

Model # Model Description Embedding Valid/Test F1 Precision Recall Accuracy Trn. Time
1 Word-Char-BiLSTM-CRF Random Valid 85.75 84.41 87.15 97.84 11:08:25
Test 85.20 84.10 86.33 97.76
2 Word-Char-BiLSTM Huaw Valid 86.08 83.59 88.72 98.21 02:07:40
Test 85.28 83.06 87.62 98.16
3 Subword-Char-BiLSTM-CRF Random Valid 86.26 85.29 87.25 97.09 06:28:04
Test 86.37 84.93 87.85 97.10
4 Word-Char-BiLSTM-CRF Hur Valid 87.21 86.14 88.19 98.04 01:59:36
Test 87.92 87.30 88.56 98.09
5 Word-BiLSTM-CRF Huaw Valid 89.10 89.77 88.44 98.36 01:12:20
Test 88.70 89.70 87.73 98.26
6 Word-Char-BiLSTM-CRF FastText Valid 89.44 88.36 90.55 98.38 02:15:58
Test 89.99 89.39 90.60 98.41
7 Word-Char-Morph-BiLSTM-CRF Huaw Valid 91.52 90.58 92.48 98.70 05:17:29
Test 91.65 91.38 91.92 98.71
8 Word-Char-BiLSTM-CRF Huaw Valid 91.57 90.64 92.52 98.72 02:00:39
Test 91.84 91.17 92.52 98.78
Table 5: Performance Scores of BiLSTM-CRF Models on Our Validation and Test Sets

The literature on Turkish NER has benefited from BiLSTM networks and transformer-based networks in different settings. Although these studies share a common dataset, they differ in their design considerations, parameter settings, and initializations. In this work, we not only provide the most comprehensive performance evaluation comparing the two architectures in the same experimental setup but also report the impact of design choices that have not been explored before in these architectures. Following an ablation study, we present our findings and quantify the effect of each design consideration on the different architectures. Finally, we contribute to the literature by introducing a transformer-based model with a CRF layer on top and demonstrate that it outperforms the current state-of-the-art Turkish NER systems.

5.1 Experiments on BiLSTM-CRF Networks

We built several BiLSTM models using different configurations and conducted experiments to assess the impact of a single design parameter at each turn. Table 5 presents the performance scores of selected models (in increasing order of F1 score) on the validation and test sets. We chose these models because they reflect the general effect of varying parameters between models.

We observe that previous Turkish research has spent tremendous effort finding the best way to form input word embeddings and has explored different combinations of vectors that represent word tokens from different perspectives. Character and morphological embeddings were shown to have a positive effect on the performance of BiLSTM networks (Gungor:19; Akkaya:20). However, to the best of our knowledge, no previous research has measured the contribution of subword information to the encoding. The character sequence of a word is often longer than its subword sequence, and longer sequences present significant modeling challenges for Seq2Seq models. Moreover, subword representations result in a modest vocabulary size and have the potential to form the basis for robust feature representations once accompanied by character-based representations. Thus, we argue that the unexplored effect of subword representations on BiLSTM performance is worth studying.

The comparison between Model 1 and Model 3 shows a slight performance increase of 0.51% on the validation set and 1.17% on the test set when subword embeddings are used instead of word embeddings. The observed increase might stem from a smaller vocabulary (reduced data sparsity), which circumvents the problem of out-of-vocabulary words to some extent. However, the rise is not as significant as we expected. This might be due to the presence of character embeddings, which may efficiently encode the information carried in suffixes and thus diminish the advantages of subword units. Although average scores over 5 different runs are reported, one particular reason for performance differences might be that randomly initialized word or subword embeddings converge differently at each run. Performance differences are also observed on the training and validation sets over different epochs, as shown in Figure 4.

Figure 4: Performance Comparisons of Word and Subword Embeddings During Training and Validation.

Our second set of experiments is designed to assess the contribution of word embeddings, in particular the initialization of word embeddings, to tagging performance. Model 1, which uses randomly initialized word vectors, achieves an f1 measure of 85.75% on our validation set. Although the Hur and FastText pretrained embeddings contribute to that performance with increases of 1.46% and 3.69%, respectively, the highest addition of 5.82% comes from the Huaw embeddings. We observe similar performance improvements on the test set, as shown in Table 5. The results obtained from Models 4, 6, and 8 as compared to Model 1 motivate the need for pretrained embeddings as a good starting point. Moreover, a bigger pretraining corpus and larger word embeddings result in substantial improvements in the measured performance. Moving from Hur embeddings (Model 4) to Huaw word embeddings (Model 8) provides an increase of 4.36% on the validation set and 3.92% on the test set. An increase of 2.23% on the validation set and 2.07% on the test set is observed when we shift from Hur embeddings (Model 4) to FastText embeddings (Model 6), and we relate this change to the dimensional differences between these embeddings. Huaw embeddings (Model 8) improve f1 scores by 2.13% on the validation set and 1.85% on the test set compared to FastText embeddings (Model 6). This is possibly due to the different methods used in learning the representations, since FastText treats each word as a composition of character n-grams, whereas Huaw embeddings are obtained by treating each word as a single token.

Our final set of experiments, in line with previous research, confirms that the use of a CRF layer on top of the underlying architecture significantly improves f1 measures on both the validation and test sets (Models 2 and 8). In a sequence labeling task, a positive effect of modeling dependencies between consecutive labels is not surprising. In addition, the f1 score obtained from Model 5 by utilizing a CRF layer is higher than that obtained from Model 2, where character embeddings are used. However, in our experiments we do not measure any notable improvement once morphological embeddings are incorporated (Models 7 and 8), which does not support the findings of Gungor:19, where a higher improvement was obtained with the addition of morphological information. As shown in Figure 5, the performances of these two models are very similar during training and validation. One particular reason for this divergence might be that, in that previous study, a character-only model was not used with a dimension as large as in our work.

This experimental study reveals that the learning method used to obtain word embeddings matters, and so do their dimensions, which is supported by the work of Melamud:16. The importance of embeddings is also noted in the work of Ma:16, where GloVe embeddings lead to the highest performance on English. Finally, our experiments reinforce the effectiveness of utilizing character embeddings, as demonstrated in the work of Kuru:16. This finding might be attributed to the morphological information possibly carried by characters.

Figure 5: Performance Comparisons of Character and Character-Morphological Embeddings During Training and Validation.

5.2 Experiments on Transformer-Based Networks

*For XLMR-large, the reported training time is with a larger instance type, C5.9xlarge. All other models are trained with the C5.4xlarge instance type on the AWS Elastic Compute Cloud service (https://aws.amazon.com/ec2/instance-types/c5/); these machines have Intel Xeon Scalable Processors (Cascade Lake) with a sustained all-core Turbo CPU frequency of 3.6 GHz, 16 vCPUs, and 32 GiB of memory. Results are averaged over 5 random initializations.

Model # Model Description Valid/Test F1 Precision Recall Accuracy Trn. Time
1 mBERT-CRF Valid 92.65 91.90 93.42 98.92 03:47:00
Test 92.35 91.54 93.17 98.90
2 mBERT Valid 92.73 92.07 93.40 98.96 01:48:25
Test 92.59 91.74 93.45 98.94
3 XLMR-b-CRF Valid 93.29 92.65 93.95 99.02 03:52:08
Test 93.89 93.10 94.69 99.11
4 XLMR-b Valid 93.29 92.61 93.99 99.05 01:55:42
Test 94.01 93.10 94.93 99.15
5 XLMR-l* Valid 94.56 93.90 95.24 99.21 03:21:20
Test 94.82 93.99 95.66 99.28
6 BERTurk Valid 94.87 94.37 95.38 99.28 01:39:42
Test 95.75 95.41 96.10 99.41
7 BERTurk-CRF Valid 94.90 94.48 95.33 99.28 03:44:16
Test 95.95 95.60 96.31 99.42

Table 6: Performance Scores of Transformer-Based Models on Our Validation and Test Sets

In the literature, there is only one work in which a transformer-based language model, in particular BERT, was applied to the Turkish NER task, and it was shown to outperform the state-of-the-art results obtained by BiLSTM networks (as shown in Table 2). However, there are several other transformer-based language models whose performance has not been reported for the Turkish tagging task. Additionally, to the best of our knowledge, the impact of a CRF layer on such models has not been evaluated before. Thus, our experiments on transformer-based networks are oriented around these research questions.

For this set of experiments, we trained three different networks: in the first architecture, a multilingual cased BERT language model was used with and without a CRF layer on top, while the XLM-RoBERTa (XLMR) and Turkish BERT (BERTurk) language models were similarly utilized along with CRF layers in the second and third networks, respectively. The results of these experiments on the validation and test sets are reported in Table 6.

Our first observation is that mBERT (Models 1 and 2) performs comparably poorer than the other models, while BERTurk (Models 6 and 7) obtains the highest f1 scores on both sets. The results are as expected for the XLMR models: a higher performance is obtained with XLMR large (Model 5) than with XLMR base (Model 4), which has fewer hidden units and layers. Comparing the multilingual models, we find that XLMR-b performs better than mBERT (with 0.56% and 1.42% increases on the validation and test sets), and XLMR-l enhances this improvement by an additional 1.27% on the validation set and 0.81% on the test set. In the literature, XLMR was shown to improve NER benchmarks in multiple languages (Conneau:19), and our findings provide additional support by showing that Turkish is better represented by XLMR than by mBERT. Some of this difference might be attributed to better subword token production by XLMR. Moreover, the XLMR tokenizer produces tokens more similar to those of the BERTurk tokenizer and uses larger model settings and a larger corpus. We argue that, for these reasons, it achieves a performance closer to BERTurk (0.93% difference on the test set) than mBERT does (2.23% difference on the test set).

It is quite surprising to measure lower performances for the mBERT (Model 1) and XLMR (Model 3) models when they are accompanied by a CRF layer. However, the CRF on BERTurk (Model 7) slightly improves f1 scores, with increases of 0.03% on the validation set and 0.2% on the test set (as compared to Model 6). Although none of these performance changes are significant, our results correlate with previous studies that perform well without using a CRF layer (Devlin:18; Conneau:19).

Our final and most important finding is that all transformer-based models outperform the BiLSTM models on the Turkish NER task, as shown in Figure 6. The comparison between Model 8 from Table 5 and Model 1 from Table 6 shows an increase of at least 1.08% on the validation set and 0.54% on the test set. The improvement achieved with BERTurk-CRF (Model 7 in Table 6) reaches 3.33% on the validation set and 4.11% on the test set. One particular disadvantage of transformer-based models is their slower inference time (98-211 seconds) as compared to BiLSTM models (8-13 seconds).

Figure 6: Performance Comparisons of BiLSTM and Transformer-Based Models on Test Set

6 Conclusion and Future Work

Recent years have witnessed a surge of interest in Turkish named entity recognition. This study presents our empirical evaluation of recent neural sequence tagging models on the Turkish NER task, providing a high-level comparison of different model settings and design considerations. Our results provide insights into the importance of word representations (i.e., character, morphological, subword, and word embeddings) and their initialization (i.e., random or pretrained) in BiLSTM networks. Our experiments also include a comprehensive evaluation of neural architectures that utilize popular multilingual transformer-based language models for Turkish entity tagging. Their comparison with BiLSTM models reveals their superior performance on the evaluation sets and highlights the positive impact of transfer learning. In this work, we also propose a state-of-the-art transformer-based architecture with a CRF layer that achieves the highest f-measures of 94.90% and 95.95% on the validation and test sets, respectively.

As future work, we plan to aggregate character and morphological embeddings with transformer-based language models and assess their impact on the overall performance. We also intend to study other word embeddings in BiLSTM networks, especially those shown to be effective for other morphologically rich languages, such as Flair (Akbik:18). Finally, we plan to develop new subword tokenizers, such as one that returns the morphemes attached to a word as produced by a morphological analyzer.

References