Semantic role labeling with subwords (character, character-ngram and morphology)
Character-level models have become a popular approach, especially for their accessibility and ability to handle unseen data. However, little is known about their ability to reveal the underlying morphological structure of a word, which is a crucial skill for high-level semantic analysis tasks such as semantic role labeling (SRL). In this work, we train various types of SRL models that use word-, character- and morphology-level information and analyze how the performance of characters compares to that of words and morphology for several languages. We conduct an in-depth error analysis for each morphological typology and analyze the strengths and limitations of character-level models that relate to out-of-domain data, training data size, long-range dependencies and model complexity. Our exhaustive analyses shed light on important characteristics of character-level models and their semantic capability.
Encoding of words is perhaps the most important step towards a successful end-to-end natural language processing application. Although word embeddings have been shown to benefit such models, they commonly treat words as the smallest meaning-bearing unit and assume that each word type has its own vector representation. This assumption has two major shortcomings, especially for languages with rich morphology: (1) inability to handle unseen or out-of-vocabulary (OOV) word forms and (2) inability to exploit the regularities among word parts.
The limitations of word embeddings are particularly pronounced in sentence-level semantic tasks, especially in languages where word parts play a crucial role. Consider the Turkish sentences “Köy+lü-ler (villagers) şehr+e (to town) geldi (came)” and “Sendika+lı-lar (union members) meclis+e (to council) geldi (came)”. Here the stems köy (village) and sendika (union) function similarly in semantic terms with respect to the verb come (as the origin of the agents of the verb), whereas şehir (town) and meclis (council) both function as the end point. These semantic similarities are determined by the common word parts, i.e., the suffixes attached to the stems (-lü-ler/-lı-lar and -e). However, orthographic similarity does not always correspond to semantic similarity. For instance, the orthographically similar words knight and night have large semantic differences. Therefore, for a successful semantic application, the model should be able to capture both the regularities, i.e., morphological tags, and the irregularities, i.e., lemmas of the word.
Morphological analysis already provides the aforementioned information about the words. However, access to useful morphological features may be problematic due to software licensing issues, lack of robust morphological analyzers and high ambiguity among analyses. Character-level models (CLM), being a cheaper and more accessible alternative to morphology, have been reported as performing competitively on various NLP tasks Ling et al. (2015); Plank et al. (2016); Lee et al. (2017). However, the extent to which these tasks depend on morphology is small, and their relation to semantics is weak. Hence, little is known about their true ability to reveal the underlying morphological structure of a word and about their semantic capabilities. Furthermore, their behaviour across languages from different families, and their limitations and strengths, such as handling of long-range dependencies, reaction to model complexity or performance on out-of-domain data, are unknown. Analyzing such issues is key to fully understanding character-level models.
To achieve this, we perform a case study on semantic role labeling (SRL), a sentence-level semantic analysis task that aims to identify predicate-argument structures and assign meaningful labels to them as follows:
[Villagers]_comers came [to town]_end point
We use a simple method based on bidirectional LSTMs to train three types of base semantic role labelers that employ (1) words, (2) characters and character sequences and (3) gold morphological analysis. The gold morphology serves as the upper bound against which we compare and analyze the performance of character-level models on languages of varying morphological typologies. We carry out an exhaustive error analysis for each language type and analyze the strengths and limitations of character-level models compared to morphology. Following the diversity hypothesis, which states that diversity of systems in ensembles leads to further improvement, we combine character- and morphology-level models and measure the performance of the ensemble to better understand how similar they are.
We experiment with several languages with varying degrees of morphological richness and typology: Turkish, Finnish, Czech, German, Spanish, Catalan and English. Our experiments and analysis reveal insights such as:
CLMs provide great improvements over whole-word-level models despite not being able to match the performance of morphology-level models (MLMs) for in-domain datasets. However, their performance surpasses that of all MLMs on out-of-domain data,
Limitations and strengths differ by morphological typology. Their limitations for agglutinative languages are related to rich derivational morphology and high contextual ambiguity, whereas for fusional languages they are related to the number of morphological tags (morpheme ambiguity),
CLMs can handle long-range dependencies as well as MLMs,
In the presence of more training data, the performance of CLMs is expected to improve faster than that of MLMs.
Neural networks were first introduced to the SRL scene by Collobert et al. (2011), who use a unified end-to-end convolutional network to perform various NLP tasks. Later, the combination of neural networks (LSTMs in particular) with traditional SRL features (categorical and binary) was introduced FitzGerald et al. (2015). Recently, it has been shown that careful design and tuning of deep models can achieve state-of-the-art results with no or minimal syntactic knowledge for English and Chinese SRL. Although the architectures vary slightly, they are mostly based on a variation of bi-LSTMs. Zhou and Xu (2015); He et al. (2017) connect the layers of the LSTM in an interleaving pattern, whereas in Wang et al. (2015); Marcheggiani et al. (2017) regular bi-LSTM layers are used. Commonly used features for the encoding layer are: pretrained word embeddings; distance from the predicate; predicate context; predicate region mark or flag; POS tag; and predicate lemma embedding. Only a few of the models Marcheggiani et al. (2017); Marcheggiani and Titov (2017) perform dependency-based SRL. Furthermore, all methods focus on languages with rich resources and less morphological complexity, like English and Chinese.
Character-level models have proven themselves useful for many NLP tasks such as language modeling Ling et al. (2015); Kim et al. (2016), POS tagging Santos and Zadrozny (2014); Plank et al. (2016), dependency parsing Dozat et al. (2017) and machine translation Lee et al. (2017). However, the number of comparative studies that analyze their relation to morphology is rather limited. Recently, Vania and Lopez (2017) presented a unified framework in which they investigated the performance of different subword units, namely characters, morphemes and morphological analysis, on a language modeling task. They experimented with languages of varying morphological typologies and concluded that the performance of character models cannot yet match that of the morphological models, although it comes very close. Similarly, Belinkov et al. (2017) analyzed how different word representations help learn better morphology and model rare words on a neural MT task and concluded that character-based representations are much better for learning morphology.
Formally, we generate a label sequence $\vec{l}$ for each sentence and predicate pair $(s, p)$. Each $l_t \in \vec{l}$ is chosen from $\mathcal{L} = \mathit{roles} \cup \{\mathit{nonrole}\}$, where $\mathit{roles}$ are language-specific semantic roles (mostly consistent with PropBank) and $\mathit{nonrole}$ is a symbol that represents tokens that are not arguments. Given $\theta$ as model parameters and $g_t$ as the gold label for the $t$-th token, we find the parameters that minimize the negative log likelihood of the sequence:

$$\hat{\theta} = \arg\min_{\theta} \left( -\sum_{t=1}^{n} \log p(g_t \mid \theta, s, p) \right) \tag{1}$$

Label probabilities, $p(l_t \mid \theta, s, p)$, are calculated with the equations given below. First, the word encoding layer splits tokens into subwords via the $\rho$ function.

$$\rho(w) = s_0, s_1, \ldots, s_m \tag{2}$$

As proposed by Ling et al. (2015), we treat words as a sequence of subword units. Then, the sequence is fed to a simple bi-LSTM network Graves and Schmidhuber (2005); Gers et al. (2000) and hidden states from each direction are weighted with a set of parameters which are also learned during training. Finally, the weighted vector is used as the word embedding given in Eq. 4.

$$hs_f, hs_b = \text{bi-LSTM}(s_0, s_1, \ldots, s_m) \tag{3}$$

$$\vec{w} = W_f \cdot hs_f + W_b \cdot hs_b + b \tag{4}$$

There may be more than one predicate in the sentence, so it is crucial to inform the network of which arguments we aim to label. In order to mark the predicate of interest, we concatenate a predicate flag $pf_t$ to the word embedding vector.

$$\vec{x}_t = [\vec{w}; pf_t] \tag{5}$$

The final vector, $\vec{x}_t$, serves as an input to another bi-LSTM unit.

$$\vec{h}_f, \vec{h}_b = \text{bi-LSTM}(\vec{x}_t) \tag{6}$$

Finally, the label distribution is calculated via a softmax function over the concatenated hidden states from both directions.

$$p(l_t \mid \theta, s, p) = \text{softmax}(W_l \cdot [\vec{h}_f; \vec{h}_b] + \vec{b}_l) \tag{7}$$

For simplicity, we assign the label with the highest probability to the input token. (Our implementation can be found at https://github.com/gozdesahin/Subword_Semantic_Role_Labeling.)
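The last two steps of the model (a softmax over label scores, a negative log likelihood objective and greedy label assignment) can be illustrated in isolation. The following is a minimal sketch in plain Python, not the authors' implementation: the label inventory, the score values and the function names are hypothetical, and the scores stand in for what the final linear layer of the labeler would produce.

```python
import math

def softmax(scores):
    """Turn raw label scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_nll(score_rows, gold_ids):
    """Negative log likelihood of the gold label sequence (one score row per token)."""
    return -sum(math.log(softmax(row)[g]) for row, g in zip(score_rows, gold_ids))

def greedy_decode(score_rows):
    """Assign each token the index of its highest-scoring label."""
    return [max(range(len(row)), key=row.__getitem__) for row in score_rows]

# Hypothetical label set and scores for a three-token sentence.
LABELS = ["A0", "A1", "nonrole"]
scores = [[2.0, 0.1, 0.3], [0.2, 0.1, 3.0], [0.0, 1.5, 0.4]]
gold = [0, 2, 1]

predicted = [LABELS[i] for i in greedy_decode(scores)]
loss = sequence_nll(scores, gold)
```

Greedy decoding picks each label independently, which matches the "highest probability per token" choice described above; it does not enforce any consistency between the labels of a sentence.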
We use three types of units: (1) words, (2) characters and character sequences and (3) outputs of morphological analysis. Words serve as a lower bound, while morphology is used as an upper bound for comparison. Table 1 shows sample outputs of the various subword-splitting functions.
The char function simply splits the token into its characters. Similar to n-gram language models, char3 slides a character window of width 3 over the token. Finally, gold morphological features are used as outputs of the language-specific morph function. Throughout this paper, we use morph and oracle interchangeably, i.e., morphology-level models (MLM) have access to gold tags unless otherwise stated. For all languages, morph outputs the lemma of the token followed by language-specific morphological tags. As an exception, it outputs additional information for some languages, such as part-of-speech tags for Turkish. Word segmenters such as Morfessor and Byte Pair Encoding (BPE) are other commonly used subword units. Due to low scores obtained in our preliminary experiments and unsatisfactory results in previous studies Vania and Lopez (2017), we excluded these units.
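The three subword-splitting functions described above can be sketched in a few lines of Python. This is an illustrative sketch only: the boundary markers `<` and `>`, the function names and the example tags are assumptions, and the exact output format of Table 1 may differ.

```python
START, END = "<", ">"  # hypothetical start/end-of-token markers

def split_char(token):
    """char: split a lowercased, boundary-marked token into single characters."""
    return list(START + token.lower() + END)

def split_char3(token, n=3):
    """char3: slide a window of width n over the boundary-marked token."""
    marked = START + token.lower() + END
    if len(marked) <= n:
        return [marked]
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

def split_morph(lemma, tags):
    """morph: the lemma followed by its (gold) morphological tags."""
    return [lemma] + list(tags)

chars = split_char("Ev")
trigrams = split_char3("evde")
# The tag names below are made up for illustration.
analysis = split_morph("ev", ["Noun", "Loc"])
```

Each function returns a list of units that a subword composition network could embed and read in order.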
We use the datasets distributed by LDC for Catalan (CAT), Spanish (SPA), German (DEU), Czech (CZE) and English (ENG) Hajič et al. (2012b, a); and datasets made available by Haverinen et al. (2015); Şahin and Adalı (2017) for Finnish (FIN) and Turkish (TUR) respectively. (Turkish PropBank is based on previous efforts Atalay et al. (2003); Sulubacak et al. (2016); Sulubacak and Eryiğit (2018); Oflazer et al. (2003); Şahin (2016b, a).) Datasets are provided with syntactic dependency annotations and semantic roles of verbal predicates. In addition, English supplies nominal predicates annotated with semantic roles and does not provide any morphological features.
Statistics for the training split for all languages are given in Table 2. Here, #pred is the number of predicates, and #role refers to the number of distinct semantic roles that occur more than 10 times. More detailed statistics about the datasets can be found in Hajič et al. (2009); Haverinen et al. (2015); Şahin and Adalı (2017).
To fit the requirements of the SRL task and of our model, we performed the following:
Multiword expressions (MWE) are represented as a single token (e.g., Confederación_Francesa_del_Trabajo), which causes notably long character sequences that are hard for LSTMs to handle. For the sake of memory efficiency and performance, we used an abbreviation (e.g., CFdT) for each MWE during training and testing.
The original dataset defines its own format of semantic annotation, such as 17:PBArgM_mod19:PBArgM_mod, meaning the node is an argument of the 17th and 19th tokens with the ArgM-mod (temporary modifier) semantic role. These annotations have been converted into the CoNLL-09 tabular format, where each predicate’s arguments are given in a specific column.
Words are split at derivational boundaries in the original dataset, where each inflectional group is represented as a separate token. We first merge the tokens that belong to the same word, then we use our own function to split words into subwords.
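The multiword-expression abbreviation from the first preprocessing step can be sketched as follows. This is a minimal sketch that assumes the abbreviation is formed from the first character of each underscore-separated part, which reproduces the CFdT example; the function name is hypothetical and the authors' exact rule may differ.

```python
def abbreviate_mwe(token, sep="_"):
    """Shorten a multiword expression to the initials of its parts.

    Tokens without a separator are returned unchanged.
    """
    parts = [p for p in token.split(sep) if p]
    if len(parts) < 2:
        return token
    return "".join(p[0] for p in parts)

short = abbreviate_mwe("Confederación_Francesa_del_Trabajo")
```

The abbreviation keeps the character sequence short for the subword bi-LSTM, at the cost of discarding the internal spelling of the expression.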
We lowercase all tokens beforehand and place special start-of-token and end-of-token characters. For all experiments, we initialized weight parameters orthogonally and used one-layer bi-LSTMs with a hidden size of 200 both for subword composition and for argument labeling. The subword embedding size is set to 200. We used gradient clipping and early stopping to prevent overfitting. Stochastic gradient descent is used as the optimizer. The initial learning rate is set to 1 and halved if scores on the development set do not improve after 3 epochs.
We use the provided splits and evaluate the results with the official evaluation script provided by the CoNLL-09 shared task. In this work (and in most recent SRL work), only the scores for argument labeling are reported, which may cause confusion for readers comparing with older SRL studies. Most early SRL work reports combined scores (argument labeling with predicate sense disambiguation (PSD)). However, PSD is considered a simpler task with higher F1 scores. (For instance, in the English CoNLL-09 dataset, 87% of the predicates are annotated with their first sense, hence even a dummy classifier would achieve 87% accuracy. The best system from the CoNLL-09 shared task reports 85.63 F1 on the English evaluation dataset; however, when the results of PSD are discarded, it drops to 81.) Therefore, we believe omitting PSD helps us gain more useful insights into character-level models.
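The learning-rate schedule from the training setup (start at 1, halve when the development score has not improved for 3 epochs) can be made concrete with a short sketch. The reading of "do not improve after 3 epochs" as three consecutive epochs without a new best score is an assumption, as are the function name and the example scores.

```python
def halve_on_plateau(dev_scores, initial_lr=1.0, patience=3):
    """Return the learning rate in effect after each epoch.

    The rate is halved whenever `patience` consecutive epochs pass
    without a new best development score.
    """
    lr = initial_lr
    best = float("-inf")
    stale = 0
    history = []
    for score in dev_scores:
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                lr /= 2
                stale = 0
        history.append(lr)
    return history

# Made-up development F1 scores, one per epoch.
rates = halve_on_plateau([60.0, 62.0, 61.0, 61.5, 61.9, 63.0, 62.0, 62.0, 62.0])
```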
Our main results on test and development sets for models that use words, characters (char), character trigrams (char3) and morphological analyses (morph) are given in Table 3. We calculate improvement over word (IOW) for each subword model and improvement over the best character model (IOC) for the morph. IOW and IOC values are calculated on the test set.
As expected, the biggest improvement over the word baseline is achieved by the models that have access to morphology for all languages (except for English). Character trigrams consistently outperformed characters by a small margin. The same pattern is observed in the results on the development set. IOW ranges between 0% and 38%, while IOC ranges between 2% and 10%, depending on the properties of the language and the dataset. We analyze the results separately for agglutinative and fusional languages and reveal the links between certain linguistic phenomena and the IOC and IOW values.
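Since IOW and IOC are reported as percentages, a natural reading is that they are relative improvements in F1. The sketch below computes them under that assumption; the text does not spell out the formula, and the function name and the F1 values are invented for illustration.

```python
def improvement_over(baseline_f1, model_f1):
    """Relative improvement of model_f1 over baseline_f1, in percent."""
    return 100.0 * (model_f1 - baseline_f1) / baseline_f1

# Hypothetical scores, not taken from Table 3.
word_f1, char_f1, morph_f1 = 50.0, 60.0, 63.0
iow_char = improvement_over(word_f1, char_f1)   # improvement over word
ioc_morph = improvement_over(char_f1, morph_f1)  # improvement over best character model
```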
Agglutinative languages have many morphemes attached to a word like beads on a string. This leads to a high number of OOV words and causes word lookup models to fail. Hence, the highest IOWs by character models are achieved on these languages: Finnish and Turkish. This language type has a one-to-one morpheme-to-meaning mapping with small orthographic differences (e.g., mış, miş, muş, müş for past perfect tense) that can be easily extracted from the data. Even though each morpheme has only one interpretation, each word (consisting of many morphemes) usually has more than one. For instance, two possible analyses for the Turkish word “dolar” are (1) “dol+Verb+Positive+Aorist+3sg” (it fills) and (2) “dola+Verb+Positive+Aorist+3sg” (he/she wraps). For a syntactic task, models are not obliged to learn the difference between the two, whereas for a semantic task like SRL, they are. We will refer to this issue as contextual ambiguity. Another important linguistic issue for agglutinative languages is the complex interaction between morphology and syntax, which is usually achieved via derivational morphemes. In other words, unlike inflectional morphemes, which only give information on word-level semantics, derivational morphemes provide more clues about sentence-level semantics. The effects of these two phenomena on model performance are shown in Fig. 1. Scores given in Fig. 1 are absolute F1 scores for each model. For the analysis in Fig. 1(a), we separately calculated the F1 scores of each model on words that have been observed with at least two different sets of morphological features (ambiguous) and with one set of features (non-ambiguous). Due to the low number of ambiguous words in the Turkish dataset (100), this analysis has been carried out for Finnish only. Similarly, for the derivational morphology analysis in Fig. 1(b), we separately calculated scores for sentences containing derived words (derivational) and for simple sentences without any derivations.
Both analyses show that access to gold morphological tags (oracle) provides large performance gains on arguments with contextual ambiguity and on sentences with derived words. The moderate IOC signals that char and char3 learn to imitate the “beads” and their “predictable order” on the string (in the absence of the aforementioned issues).
Fusional languages may have many morphemes in a word. Spanish and Catalan have a relatively low morpheme-per-word ratio, which results in a low OOV% (5.63 and 5.40 for Spanish and Catalan respectively), whereas German and Czech have OOV% of 7.93 and 7.98 Hajič et al. (2009). We observe that the IOWs of character models are well aligned with the OOV percentages of the datasets. Unlike in agglutinative languages, a single morpheme can serve multiple purposes in fusional languages. For instance, “o” (e.g., habl-o) may signal first person singular present tense or third person singular past tense. We count the number of surface forms with at least two different feature sets and use their ratio (#ambiguous forms/#total forms) as a proxy for the morphological complexity of the language. The complexities are approximated as 22%, 16% and 15% for Czech, Spanish and Catalan respectively, which is aligned with the observed IOCs.
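The complexity proxy above (#ambiguous forms/#total forms) is straightforward to compute from a list of observed (surface form, feature set) pairs. The sketch below is one possible reading, counting distinct surface forms; the function name and the toy observations, including their tag names, are hypothetical.

```python
from collections import defaultdict

def ambiguity_ratio(observations):
    """Share of distinct surface forms seen with two or more feature sets.

    `observations` is an iterable of (surface_form, features) pairs,
    where `features` is any iterable of morphological tags.
    """
    analyses = defaultdict(set)
    for form, features in observations:
        analyses[form].add(frozenset(features))
    if not analyses:
        return 0.0
    ambiguous = sum(1 for feats in analyses.values() if len(feats) >= 2)
    return ambiguous / len(analyses)

# Toy data with invented tags: only "hablo" has two different analyses.
toy = [
    ("hablo", ("1", "sg", "pres")),
    ("hablo", ("3", "sg", "past")),
    ("casa", ("noun", "sg")),
    ("casa", ("noun", "sg")),
    ("casas", ("noun", "pl")),
    ("come", ("3", "sg", "pres")),
]
ratio = ambiguity_ratio(toy)
```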
Since there is no unique morpheme-to-meaning mapping, multiple morphological tags are generally used to resolve the morpheme ambiguity. Therefore, there is an indirect relation between the number of morphological tags used and the ambiguity of the word. To demonstrate this phenomenon, we calculate targeted F1 scores on arguments with varying numbers of morphological features. Results using feature bins of [1-2], [3-4] and [5-6] are given in Fig. 2. As the number of features increases, the performance gap between oracle and character models grows dramatically for Czech and Spanish, while it stays almost fixed for Finnish. This finding suggests that a high number of morphological tags signals the vague or complex cases in fusional languages, where character models struggle; it also shows that the complexity cannot be directly explained by the number of morphological tags for agglutinative languages. German is known for having many compound words and compound lemmas, which lead to a high OOV% for lemmas, and it is also less ambiguous (9%). Therefore, we would expect a lower IOC. However, the evaluation set consists of only 550 predicates and 1073 arguments, hence small changes in prediction lead to dramatic percentage changes.
One way to infer similarity is to measure diversity. Consider a set of baseline models that are not diverse, i.e., that make similar errors on similar inputs. In such a case, a combination of these models would not be able to overcome the biases of the learners, hence the combination would not achieve a better result. In order to test whether character and morphological models are similar, we combine them and measure the performance of the ensemble. Suppose that a prediction $p_i$ is generated for each token by a model $m_i$, $i \in \{1, \ldots, n\}$; then the final prediction is calculated from these predictions by:

$$p_{final} = f(p_1, p_2, \ldots, p_n \mid \phi)$$

where $f$ is the combining function with parameter $\phi$. The simplest global approach is averaging (AVG), where $f$ is simply the mean function and the $p_i$ are the log probabilities. The mean function combines model outputs linearly and therefore ignores the nonlinear relation between base models/units. In order to exploit nonlinear connections, we learn the parameters $\phi$ of $f$ via a simple linear layer followed by a sigmoid activation. In other words, we train a new model that learns how to best combine the predictions from subword models. This ensemble technique is generally referred to as stacking or stacked generalization (SG). (To train the SG model, we have used one linear layer with 64 hidden units followed by a sigmoid nonlinear activation. Weights are orthogonally initialized and optimized via the Adam algorithm with a learning rate of 0.02 for 25 epochs.)
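The averaging (AVG) combiner can be illustrated with a few lines of Python. This is a sketch under assumptions: the label set, the two base-model distributions and the function names are invented, and the stacking variant (a learned layer over the same inputs) is not shown.

```python
import math

def average_log_probs(model_log_probs):
    """Mean of per-label log probabilities across base models."""
    n_models = len(model_log_probs)
    n_labels = len(model_log_probs[0])
    return [sum(m[j] for m in model_log_probs) / n_models for j in range(n_labels)]

def ensemble_predict(model_log_probs, labels):
    """Pick the label with the highest averaged log probability."""
    avg = average_log_probs(model_log_probs)
    return labels[max(range(len(avg)), key=avg.__getitem__)]

# Hypothetical distributions for one token from two base models.
LABELS = ["A0", "A1", "nonrole"]
char_model = [math.log(p) for p in (0.6, 0.3, 0.1)]
morph_model = [math.log(p) for p in (0.2, 0.7, 0.1)]
choice = ensemble_predict([char_model, morph_model], LABELS)
```

Averaging log probabilities is equivalent to ranking labels by the geometric mean of the base-model probabilities, so a label must be reasonably likely under both models to win.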
Although not guaranteed, diverse models can be achieved by altering the input representation, the learning algorithm, training data or the hyperparameters. To ensure that the only factor contributing to the diversity of the learners is the input representation, all parameters, training data and model settings are left unchanged.
Our results are given in Table 4. IOB shows the improvement over the best of the baseline models in the ensemble. The averaging and stacking methods gave similar results, meaning that there are no immediate nonlinear relations between units. We observe two language clusters: (1) Czech and the agglutinative languages and (2) Spanish, Catalan, German and English. The properties that separate the clusters are (1) high OOV% and (2) relatively low OOV%. In the first cluster, we observe that the improvement gained by character-morphology ensembles is higher (shown in green) than that of ensembles between characters and character trigrams (shown in red), whereas the opposite is true for the second cluster. This can be interpreted as character-level models being more similar to the morphology-level models for the first cluster, i.e., languages with high OOV%, and characters and morphology being more diverse for the second cluster.
To expand our understanding and reveal the limitations and strengths of the models, we analyze their ability to handle long range dependencies, their relation with training data and model size; and measure their performances on out of domain data.
Long-range dependency is considered an important linguistic issue that is hard to solve. Therefore, the ability to handle it is a strong performance indicator. To gain insights into this issue, we measure how models perform as the distance between the predicate and the argument increases. The unit of measure is the number of tokens between the two, and the argument is defined as the head of the argument phrase in accordance with the dependency-based SRL task. For that purpose, we created distance bins of [0-4], [5-9], [10-14] and [15-19]. Then, we calculated F1 scores for the arguments in each bin. Due to the low number of predicate-argument pairs in the bins, we could not analyze German and Turkish; also, the bin [15-19] is only used for Czech. Our results are shown in Fig. 3. We observe that either char or char3 closely follows the oracle for all languages. The gap between the two does not increase with the distance, suggesting that the performance gap is not related to long-range dependencies. In other words, both characters and the oracle handle long-range dependencies equally well.
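Assigning predicate-argument pairs to the distance bins used in this analysis can be sketched as below. The function names are hypothetical, and the distance is taken as an input so that the sketch does not commit to a particular way of counting the tokens between predicate and argument head.

```python
from collections import defaultdict

def bin_label(distance, width=5, max_distance=19):
    """Map a distance to its bin label, or None if it falls outside all bins."""
    if distance < 0 or distance > max_distance:
        return None
    low = (distance // width) * width
    return f"[{low}-{low + width - 1}]"

def group_by_bin(distances):
    """Count how many predicate-argument pairs fall into each bin."""
    counts = defaultdict(int)
    for d in distances:
        label = bin_label(d)
        if label is not None:
            counts[label] += 1
    return dict(counts)

# Made-up distances for a handful of predicate-argument pairs.
counts = group_by_bin([0, 3, 4, 7, 12, 19, 25])
```

F1 would then be computed separately over the arguments that fall into each bin.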
We analyzed how the char3 and oracle models perform with respect to the training data size. For that purpose, we trained them on chunks of increasing size and evaluated them on the provided test split. We used units of 2000 sentences for German and Czech, and 400 for Turkish. Results are shown in Fig. 4. As the data size increases, the performance of both models increases logarithmically, at varying speeds. To quantify this, we fit a logarithmic curve to the observed F1 scores (shown with transparent lines) and check the x coefficients, where x refers to the number of sentences. This coefficient can be considered an approximation of the speed of growth with data size. We observe that the coefficient is higher for char3 than for the oracle for all languages. This can be interpreted as follows: in the presence of more training data, char3 may surpass the oracle, i.e., char3 relies on data more than the oracle does.
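The curve fit described above can be reproduced with ordinary least squares on the log-transformed training size. The sketch assumes the functional form F1 = a·ln(x) + b, which the text does not state explicitly; the function name and the synthetic data points are for illustration only.

```python
import math

def fit_log_curve(sizes, scores):
    """Least-squares fit of scores = a * ln(size) + b; returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Synthetic points generated from a known curve, not measured F1 scores.
sizes = [2000, 4000, 6000, 8000]
scores = [5.0 * math.log(x) + 10.0 for x in sizes]
a, b = fit_log_curve(sizes, scores)
```

Under this form, the coefficient `a` plays the role of the growth speed compared between char3 and oracle: a larger `a` means more F1 gained per multiplicative increase in training data.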
As part of the CoNLL-09 shared task Hajič et al. (2009), out-of-domain (OOD) test sets are provided for three languages: Czech, German and English. We test our models, trained on the regular training dataset, on these OOD data. The results are given in Table 5. Here, we clearly see that the best model has shifted from the oracle to the character-based models. The dramatic drop in the German oracle model is due to the high lemma OOV rate, which is a consequence of keeping compounds as a single lemma. The Czech oracle model performs reasonably well but is unable to beat the generalization power of the char3 model. Furthermore, the scores of the character models in Table 5 are higher than the best OOD scores reported in the shared task Hajič et al. (2009), even though our main results on the evaluation set are not (except for Czech). This shows that character-level models have increased robustness to out-of-domain data due to their ability to learn regularities in the data.
Throughout this paper, our aim was to gain insights into how models perform on different languages rather than to score the highest F1. For this reason, we used a model that can be considered small when compared to recent neural SRL models and avoided parameter search. However, we wonder how the models behave when given a larger network. To answer this question, we trained char3 and oracle models with more layers for two fusional languages (Spanish, Catalan) and two agglutinative languages (Finnish, Turkish). The results given in Table 6 clearly show that model complexity provides relatively more benefit to morphological models. This indicates that morphological signals help to extract more complex linguistic features that carry semantic clues.
Although models with access to gold morphological tags achieve better F1 scores than character models, they can be less useful in a real-life scenario since they require gold tags at test time. To predict the performance of morphology-level models in such a scenario, we train the same models with the same parameters on predicted morphological features. Predicted tags were only available for German, Spanish, Catalan and Czech. Our results, given in Fig. 5, show that (except for Czech) predicted morphological tags are not as useful as characters alone.
Character-level neural models are becoming the de facto standard for NLP problems due to their accessibility and ability to handle unseen data. In this work, we investigated how they compare to models with access to gold morphological analysis on a sentence-level semantic task. We evaluated their quality on semantic role labeling in a number of agglutinative and fusional languages. Our results lead to the following conclusions:
For in-domain data, character-level models cannot yet match the performance of morphology-level models. However, they still provide considerable advantages over whole-word models,
Their shortcomings depend on the morphology type. For agglutinative languages, their performance is limited on data with rich derivational morphology and high contextual ambiguity (morphological disambiguation); for fusional languages, they struggle on tokens with a high number of morphological tags,
Similarity between character and morphology-level models is higher than the similarity within character-level (char and char-trigram) models on languages with high OOV%; and vice versa,
Their ability to handle long-range dependencies is very similar to morphology-level models,
They rely relatively more on training data size. Therefore, given more training data, their performance is expected to improve faster than that of morphology-level models,
They perform well on out-of-domain data, surpassing all morphology-level models. However, relatively less improvement is expected when model complexity is increased,
They generally perform better than models that only have access to predicted/silver morphological tags.
Gözde Gül Şahin was a PhD student at Istanbul Technical University and a visiting research student at University of Edinburgh during this study. She was funded by Tübitak (The Scientific and Technological Research Council of Turkey) 2214-A scholarship during her visit to University of Edinburgh. She was granted access to CoNLL-09 Semantic Role Labeling Shared Task data by Linguistic Data Consortium (LDC). This work was supported by ERC H2020 Advanced Fellowship GA 742137 SEMANTAX and a Google Faculty award to Mark Steedman. We would like to thank Adam Lopez for fruitful discussions, guidance and support during the first author’s visit.
What do Neural Machine Translation Models Learn about Morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. pages 861–872.
Journal of Machine Learning Research 12:2461–2505.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 Shared Task: Syntactic and Semantic Dependencies in Multiple Languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task. Association for Computational Linguistics, Stroudsburg, PA, USA, CoNLL ’09, pages 1–18.
Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers.
Chinese Semantic Role Labeling with Bidirectional Recurrent Neural Networks. In EMNLP. pages 1626–1631.