In 2018, chun2018building published three Korean dependency treebanks that followed the then-latest guidelines from the Universal Dependencies (UD) project, UDv2. These treebanks were automatically derived from existing treebanks, namely the Penn Korean Treebank (PKT; han2001penn), the Google UD Treebank mcdonald2013universal, and the KAIST Treebank choi1994kaist, using head-finding rules and heuristics.
This paper first addresses the known issues in the original Penn Korean UD Treebank, henceforth PKT-UD v2018, through a sampling-based analysis (Section 3), and then describes the revised guidelines for both part-of-speech tags and dependency relations to handle those issues (Section 4). Then, a transformer-based dependency parsing approach using biaffine attention is introduced (Section 5) to experiment on both PKT-UD v2018 and the revised version, henceforth PKT-UD v2020 (Section 6). Our analysis shows a significantly reduced number of mispredicted labels by the parsing model developed on PKT-UD v2020 compared to the one developed on PKT-UD v2018, confirming the benefit of this revision in parsing performance. The contributions of this work are as follows:
Issue checking in PKT-UD v2018.
Revised annotation guidelines for Korean and the release of the new corpus, PKT-UD v2020.
Development of a robust dependency parsing model using the latest transformer encoder.
2 Related Works
2.1 Korean UD Corpora
According to the UD project website (https://universaldependencies.org), three Korean treebanks are officially registered and released: the Google Korean UD Treebank mcdonald2013universal, the Kaist UD Treebank choi1994kaist, and the Parallel Universal Dependencies Treebank zeman-etal-2017-conll. These treebanks were created by converting and modifying previously existing treebanks. The Korean portion of the Google UD Treebank was re-tokenized at the morpheme level in accordance with other Korean corpora, and systematically corrected for several errors chun2018building. The Kaist Korean UD Treebank was derived by automatic conversion using head-finding rules and linguistic heuristics chun2018building. The Parallel Universal Dependencies Treebank was designed for the CoNLL 2017 shared task on Multilingual Parsing and consists of 1K sentences extracted from newswire and Wikipedia articles.
The Penn Korean UD Treebank and the Sejong UD Treebank were registered on the UD website as well but remain unreleased due to license issues. Similar to the Kaist UD Treebank, the Penn Korean UD Treebank was automatically converted into UD structures from phrase structure trees chun2018building; its annotation with word forms can be found at https://github.com/emorynlp/ud-korean. The Sejong UD Treebank was also automatically converted from the Sejong Corpus, a phrase structure treebank consisting of 60K sentences from 6 genres choi2011statistical.
|Genre||Blog, News||Litr, News, Acdm, Mscr||Blog, News||Milt||Litr, News, Acdm, Mscr|
In a related effort, the Electronics and Telecommunications Research Institute (ETRI) in Korea conducted research on standardizing dependency relations and structures lim:15a. This effort resulted in standard annotation guidelines for Korean dependencies and gave rise to several related efforts to establish Korean UD guidelines that better represent the linguistic features unique to Korean. These studies include park2018, who focused on the mapping between the UD part-of-speech (POS) tags and the POS tags in the Sejong Treebank, and Lee2019 and Oh2019, who provided in-depth discussions of the applicability and relevance of UD's dependency relations to Korean.
2.2 Penn Korean UD Treebank
As mentioned in Section 2.1, the Penn Korean UD Treebank (PKT-UD v2018) was automatically derived from the phrase-structure-based Penn Korean Treebank, and the results were published by chun2018building. Even so, it is currently not among the Korean UD treebanks officially released on the UD project website.
Our effort to officially release chun2018building's PKT-UD v2018 uncovered numerous mechanical errors caused by the automatic conversion, along with a few other unaddressed issues, leading us to a full revision of this corpus. PKT-UD v2018 made targeted attempts at addressing a number of language-specific issues regarding complex structures such as empty categories, coordination structures, and allocation of POS tags with respect to dependency relations. However, these efforts were limited, leaving other issues unanswered, such as the handling of copulas, the proper allocation of verbs according to their verbal endings, and grammaticalized multi-word expressions. Thus, this paper aims to address those remaining issues while revising PKT-UD v2018 to clearly represent phenomena in Korean.
3 Observations in PKT-UD v2018
The Penn Korean Treebank (PKT) was originally published as a phrase-structure based treebank by han2001penn. PKT consists of 5,010 sentences from Korean newswire, comprising 132,041 tokens. (Whereas most Korean resources treat the white-space-delimited Eojeol as a token, PKT additionally splits off symbols, punctuation, and occasionally morphemes where strictly required by the syntactic structure.) Following the UDv2 guidelines, chun2018building systematically converted PKT to PKT-UD v2018. While this effort achieved a measure of success at providing phrase-structure-to-dependency conversion in a manner consistent across three different treebanks with distinct grammatical frameworks, it stopped short of addressing more nuanced issues that arise from aligning the grammatical features of Korean, which is a heavily agglutinative language, to the universal standards put forth by UDv2. In building PKT-UD v2018, the POS tags were largely mapped in a categorical manner from the Penn Korean POS tagset. The dependency relations, on the other hand, were established via head-finding rules that relied on the Penn Korean Treebank's existing function tags, phrasal tags, and morphemes.
chun2018building did make a few targeted attempts at teasing apart more fine-grained nuances of grammatical functions. For example, the PKT POS tag (XPOS) DAN was subdivided into the UD POS tag (UPOS) DET for demonstrative prenominals (e.g., 이 (this), 그 (the)) and the UPOS ADJ for attributive adjectives (e.g., 새 (new), 헌 (old)), in recognition that the XPOS DAN, focusing primarily on grammatical distribution, conflated two semantically distinct elements. However, such efforts were limited in scope, and the project did not examine the full breadth of language-specific issues.
Moreover, the converted annotation was found to contain a share of mechanical errors. As a case in point, the 5,010 sentences were found to contain 5,036 roots, suggesting low-level conversion errors. Additionally, a manual examination of the first five sentences in the corpus uncovered a variety of syntactic errors that raised an alarm. The worst of the five examined sentences is shown in Figure 1 (and continued in Figure 2), with errors in both the UPOS and the dependency relation labels (DEPREL). While we will not delve into the particulars of each error in this example, it gives a general sense of the extent of the errors that merited our attention.
These observed issues inspired us to revise PKT-UD v2018, with the aim of producing cleaner syntactic annotations that would be more faithful to the Korean grammar. The following section provides specifics of the revision content.
4 PKT-UD Revisions
4.1 UPOS Revision
Revision of the UPOS portion of the resource was done from the ground up. That is, instead of correcting PKT-UD v2018's UPOS annotations, we implemented a new mapping from XPOS to UPOS after a careful re-examination of the original mapping schema. In particular, we consulted the POS mapping guidelines by park2018, whose morphological tagset, carried over from the Sejong Project (kim2006korean), differs from PKT's in some key aspects. Nevertheless, we found their nuanced view of the grammatical characteristics and typology of Korean in reference to UDv2 very much applicable. The following illustrates the key ideas of our UPOS revision approach. Below and throughout this paper, we italicize XPOS labels (e.g., DAN) so they are visually distinct from UPOS labels (e.g., ADJ).
Copulas mapped to Adj
One major target of revision was the scope of the UPOS adjective label ADJ in Korean, which includes typical predicative adjectives such as '예쁘-' (pretty) and '다르-' (different). As mentioned in Section 3, PKT-UD v2018 already extended the ADJ label to include the closed class of adjectives whose distribution is limited to pre-nominal, attributive use, which had been grouped together with the determiner category DAN in the original PKT. In our current work, we further extend the ADJ label to encompass the copula CO ('-이-' (be)). In Korean, '-이-' (be) is a copula particle that attaches to a nominal to produce a predicate, much like the English 'be'. However, such copula-derived predicates in Korean are known to share semantic and syntactic traits with adjectives rather than verbs, chief among which is their inability to take the present/habitual aspect verbal ending '-는다' (do), which is only allowed on verbs. In light of this, we decided to map all instances of the XPOS CO to the UPOS ADJ.
Consistent Noun focusing on morpheme roles
Korean is well-known as an agglutinative language, and Josas (postpositions) are extremely common nominal suffixes that can indicate a variety of syntactic roles of the whole Eojeol unit (Figure 3). For example, when an adverbial case particle (‘에’, PAD) attaches to a noun, the resulting Eojeol serves the syntactic role of an adverb. When a conjunctive particle (‘와’, PCJ) is used, the Eojeol functions as a noun conjunct. Consequently, PKT-UD v2018 mapped ADV to the former and CCONJ to the latter.
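[Figure 3: a noun glossed as (school), the same noun with an adverbial case particle (at school), and with a conjunctive particle (school and).]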
However, this distinction underscores a syntactic role rather than a morphological one: while the syntactic role changes with the attachment of the postposition, the POS of the noun itself remains unaffected. UPOS, as a marker that solely demonstrates morphological characteristics of Eojeol rather than its syntactic function, should reflect the morphological status of the nominal. Therefore, we made a decision to allocate the NOUN label to these cases.
Verbal endings signal Verb
Korean has verbal endings on predicates that dictate the syntactic role of Eojeol (Figure 4). In PKT-UD v2018, predicates marked with ENM (nominalization verbal ending) and ECS (conjunctive ending) are mapped to NOUN and SCONJ, respectively. However, as with the earlier case involving nominals, these verbal ending suffixes should not be treated as fundamentally altering the underlying POS of the predicate itself. This work revises both cases of UPOS to VERB. Extending the same principle, parallel cases with the same verbal endings involving an adjective or a copula were likewise re-assigned to ADJ.
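The mapping principle described in this section can be illustrated with a minimal sketch. It assumes that an Eojeol's XPOS is available as a list of morpheme tags; the stem-tag groups `NOMINAL_STEMS`, `VERB_STEMS`, and `ADJ_STEMS` and the function `revised_upos` are hypothetical names introduced here, and only CO, PAD, PCJ, ENM, and ECS are tags named in the text. It is not the conversion script used to build PKT-UD v2020.

```python
# Hypothetical stem-tag groups (assumptions); only CO is named in the text.
NOMINAL_STEMS = {"NNC", "NPR"}  # assumed noun tags
VERB_STEMS = {"VV"}             # assumed verb tag
ADJ_STEMS = {"VJ"}              # assumed adjective tag


def revised_upos(morpheme_tags):
    """Choose a UPOS from the content morpheme of an Eojeol.

    Postpositions (e.g. PAD, PCJ) and verbal endings (e.g. ENM, ECS)
    are ignored, so they no longer change the UPOS of the stem.
    """
    # A copula turns a nominal into a predicate, which is mapped to ADJ.
    if "CO" in morpheme_tags:
        return "ADJ"
    for tag in morpheme_tags:
        if tag in VERB_STEMS:
            return "VERB"
        if tag in ADJ_STEMS:
            return "ADJ"
        if tag in NOMINAL_STEMS:
            return "NOUN"
    return "X"  # tags not covered by this sketch


if __name__ == "__main__":
    print(revised_upos(["NNC", "PAD"]))  # noun + adverbial case particle
    print(revised_upos(["VV", "ENM"]))   # verb + nominalizing ending
```

Under these assumptions, a noun followed by PAD or PCJ stays NOUN, and a verb followed by ENM or ECS stays VERB, which mirrors the revisions described above.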
Statistics of v2018 and v2020
The complete distributions of PKT-UD v2018 and v2020 are listed in Table 2.
4.2 DEPREL Revision
In re-examining PKT-UD v2018's dependency relations, we consulted two existing dependency annotation guidelines for Korean: Lee2019 and Oh2019. They offer a thorough analysis of the applicability of the universal dependency relation labels to Korean, and further identify a list of dependency relations such as iobj, xcomp, expl, and cop (among others) as not suited for capturing characteristics of Korean grammar. Additionally, where applicable, we took into consideration the UD Japanese Treebank (asahara2018universal), since Japanese, as another strictly head-final agglutinative language, exhibits many parallel syntactic phenomena (kanayama2018coordinate).
Reevaluation of iobj
We turned our attention to iobj, the DEPREL label for indirect object. We found PKT-UD v2018's decision to assign nominals with dative case markings to iobj questionable, for the following reasons. First, unlike English, where word order distinguishes indirect objects from direct objects (e.g. "She gave me:iobj a box:obj"), Korean has no such structural constraint that could form the basis for identifying instances of iobj. The only potential identifier, then, is dative postpositions such as '-에게' (to) and '-한테' (by), which correspond roughly to the English preposition 'to' as in "She gave it to me". The problem is that these markers do not exclusively encode the dative case, as seen in examples such as "개에게 물렸다" ("I was bitten by a dog").
Hence, we adopted a new approach of reassigning all instances of iobj to the oblique relation obl. This move brings language-internal consistency, as postpositions, in many instances, can simply be dropped if contextually recoverable, rendering any such nominals practically indistinguishable from other nominal adverbials that are assigned to obl. This overall approach is also in line with the UD Japanese Treebank, where iobj is categorically absent and 'に (ni)', a postposition whose usage largely parallels the two Korean postpositions above, maps to obl.
Standardizing verbal predicates
As shown in Figure 5, Korean predicates take on various syntactic functions depending on the attached verbal ending. Predicates with the declarative verbal ending ‘-다’ (ta) are assigned to root, which is straightforward. Endings ‘-은’ (un) and ‘-을’ (ul) on the other hand turn the verb into a modifier to an upcoming noun; the acl relation therefore is the best fit here. Predicates with endings such as ‘-어서’ (ese) and ‘-게’ (key) modify other predicates, which calls for an advcl assignment. In PKT-UD v2018, these cases had received an array of inconsistent allocations such as clausal complements (ccomp/xcomp), auxiliaries (aux), and conjuncts (conj). These were corrected to acl and advcl.
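The standardization above amounts to a lookup from verbal ending to dependency relation. The following is a minimal sketch covering only the five endings named in the text; the table `ENDING_TO_DEPREL` and the function `deprel_for_predicate` are hypothetical names, and a real implementation would need to handle allomorphs and the many endings not listed here.

```python
# Endings named in the text and the DEPREL each one signals.
ENDING_TO_DEPREL = {
    "다": "root",     # declarative ending '-다' (ta)
    "은": "acl",      # adnominal ending '-은' (un), modifies an upcoming noun
    "을": "acl",      # adnominal ending '-을' (ul), modifies an upcoming noun
    "어서": "advcl",  # ending '-어서' (ese), modifies another predicate
    "게": "advcl",    # ending '-게' (key), modifies another predicate
}


def deprel_for_predicate(ending, fallback=None):
    """Return the DEPREL for a predicate given its verbal ending.

    Endings outside the table return `fallback`, so unknown cases can be
    routed to manual review instead of being guessed.
    """
    return ENDING_TO_DEPREL.get(ending, fallback)


if __name__ == "__main__":
    for ending in ("다", "은", "어서", "지만"):
        print(ending, deprel_for_predicate(ending, fallback="REVIEW"))
```

Returning a fallback rather than a default relation keeps the sketch from silently reintroducing the inconsistent ccomp/xcomp, aux, and conj assignments that the revision removed.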
Orphaned postpositions and verbal endings
In Korean, verbal endings and postpositions are bound to verbs and nominals, respectively, and cannot occupy their own Eojeol. In natural text, however, they can occasionally be separated from the constituent they attach to via quotation marks, white spaces, or parentheses. PKT-UD v2018 had assigned such orphaned bound morphemes to UPOS of PART (particle) and ADP (adposition) with the DEPREL of mark (marker) and case (case marker), respectively as seen in Figure 6.
However, verbal endings and postpositions can express a syntactic function only if they are attached to the predicates and nominals they modify. While PKT-UD v2018's assignments of the UPOS and DEPREL are not categorically incorrect, they address the morphological relationship between these morphemes rather than their syntactic relationship. That is, even if these bound morphemes are notationally distanced from their heads by punctuation or white space, they form a single syntactic unit with their predicates and nominals. Hence, mark and case were updated to goeswith, which is used for divided words as seen in Figure 7, making it clear that the seemingly separate Eojeols (e.g. nominal and postposition) are actually one unit.
Similar revisions were applied to copulas. The Korean copula morpheme '-이-' (i) combines with a nominal on the left and a verbal ending on the right. These copulas too can occasionally be detached via intervening punctuation or white space. To such cases, PKT-UD v2018 had assigned cop as the DEPREL. These instances have been updated to goeswith in accordance with the treatment given to verbal endings and postpositions.
roots and flats
The number of root relations is adjusted from 5,036 to 5,010 after correcting sentences with zero or multiple roots. Additionally, the DEPREL of Eojeols that used to be incorrectly mapped to compound is now assigned to flat.
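A root-count mismatch of this kind can be detected with a simple sanity check over a CoNLL-U file. The sketch below assumes the standard ten-column CoNLL-U layout (HEAD in column 7, DEPREL in column 8); the function names `count_roots` and `sentences_with_bad_roots` are hypothetical and the snippet is not part of the released corpus tooling.

```python
def count_roots(conllu_text):
    """Return the number of root-labelled tokens for each sentence."""
    counts = []
    for block in conllu_text.strip().split("\n\n"):
        roots = 0
        for line in block.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and sentence-level comments
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword-token ranges and empty nodes
            if cols[6] == "0" and cols[7] == "root":
                roots += 1
        counts.append(roots)
    return counts


def sentences_with_bad_roots(conllu_text):
    """Indices (0-based) of sentences that do not have exactly one root."""
    return [i for i, n in enumerate(count_roots(conllu_text)) if n != 1]


if __name__ == "__main__":
    sample = (
        "1\tA\t_\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
        "2\tB\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
        "\n"
        "1\tC\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
        "2\tD\t_\tVERB\t_\t_\t0\troot\t_\t_\n"
    )
    print(count_roots(sample), sentences_with_bad_roots(sample))
```

A well-formed treebank should yield an empty list from `sentences_with_bad_roots`, so the total number of roots equals the number of sentences.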
Statistics of v2018 and v2020
The complete DEPREL distributions of PKT-UD v2018 and v2020 are listed in Table 3.
5 Parsing Approach
Our dependency parsing model is based on the biaffine parser with contextualized embeddings such as BERT devlin-etal-2019-bert, which has shown state-of-the-art results on both syntactic and semantic dependency parsing tasks in multiple languages he-choi-2019. That model is simplified from the original biaffine parser introduced by dozat:17a such that trainable token embeddings are removed and lemmas are used instead of word forms. This section proposes an even simpler model. It no longer uses embeddings from POS tags, so it can be easily adapted to languages that do not have dedicated POS taggers. It also drops the bidirectional LSTM encoder and integrates the transformer layers directly into the biaffine decoder, which removes the redundancy of having multiple encoders generate contextualized embeddings.
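Given an input sentence, every token $w_i$ is first segmented into one or more sub-tokens by the SentencePiece tokenizer kudo-richardson-2018-sentencepiece and fed into a transformer. The output embedding corresponding to the first sub-token of $w_i$ is treated as the embedding representation of $w_i$, say $e_i$, and fed into four types of multilayer perceptron (MLP) layers to extract features for $w_i$ being a head (*-h) or a dependent (*-d) for the arc relations (arc-*) and the labels (rel-*) ($k$ and $l$ are the dimensions of the arc and label representations, respectively):
$$h_i^{(\text{arc-h})} = \mathrm{MLP}^{(\text{arc-h})}(e_i) \in \mathbb{R}^{k}, \qquad h_i^{(\text{arc-d})} = \mathrm{MLP}^{(\text{arc-d})}(e_i) \in \mathbb{R}^{k}$$
$$h_i^{(\text{rel-h})} = \mathrm{MLP}^{(\text{rel-h})}(e_i) \in \mathbb{R}^{l}, \qquad h_i^{(\text{rel-d})} = \mathrm{MLP}^{(\text{rel-d})}(e_i) \in \mathbb{R}^{l}$$
All feature vectors, $h_1^{(*)}, \ldots, h_n^{(*)}$, from each representation are stacked into a matrix ($n$ is the number of tokens in a sentence); these matrices together are used to predict dependency relations among all token pairs. Note that bias terms are appended to the feature vectors that represent dependent nodes to estimate the likelihood of a certain relation given only the head node:
$$H^{(\text{arc-h})} = (h_1^{(\text{arc-h})}, \ldots, h_n^{(\text{arc-h})}) \in \mathbb{R}^{k \times n}, \qquad H^{(\text{arc-d})} = (h_1^{(\text{arc-d})}, \ldots, h_n^{(\text{arc-d})}) \oplus \mathbf{1} \in \mathbb{R}^{(k+1) \times n}$$
$$H^{(\text{rel-h})} = (h_1^{(\text{rel-h})}, \ldots, h_n^{(\text{rel-h})}) \in \mathbb{R}^{l \times n}, \qquad H^{(\text{rel-d})} = (h_1^{(\text{rel-d})}, \ldots, h_n^{(\text{rel-d})}) \oplus \mathbf{1} \in \mathbb{R}^{(l+1) \times n}$$
The bilinear and biaffine classifiers are then used for the arc and label predictions respectively, where $U^{(\text{arc})} \in \mathbb{R}^{(k+1) \times k}$ and $U^{(\text{rel})} \in \mathbb{R}^{m \times (l+1) \times l}$ are trainable parameters, and $m$ is the number of dependency labels. In particular, a separate weight matrix $U_c^{(\text{rel})} \in \mathbb{R}^{(l+1) \times l}$ is dedicated for the prediction of each label $c$:
$$S^{(\text{arc})} = {H^{(\text{arc-d})}}^{\top} U^{(\text{arc})} H^{(\text{arc-h})} \in \mathbb{R}^{n \times n}$$
$$S_c^{(\text{rel})} = {H^{(\text{rel-d})}}^{\top} U_c^{(\text{rel})} H^{(\text{rel-h})} \in \mathbb{R}^{n \times n}, \qquad c = 1, \ldots, m$$
Once the arc score matrix $S^{(\text{arc})}$ and the label score tensor $S^{(\text{rel})} \in \mathbb{R}^{m \times n \times n}$ are generated by those classifiers, the Chu-Liu-Edmonds maximum spanning tree (MST) algorithm is applied to $S^{(\text{arc})}$ for the arc prediction, then the label with the largest score in $S^{(\text{rel})}$ corresponding to the arc is taken for the label prediction:
$$\hat{a}_i = \mathrm{MST}(S^{(\text{arc})})_i, \qquad \hat{r}_i = \arg\max_{c} S^{(\text{rel})}_{c,\, i,\, \hat{a}_i}$$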
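The arc-scoring step can be illustrated with a small, dependency-free sketch. It assumes feature vectors of dimension k per token, a weight matrix `U` of shape (k + 1) x k whose extra row pairs with the bias term appended to each dependent vector, and score entries indexed as `scores[dependent][head]`. The function names are hypothetical. Note that `greedy_heads` picks the best head per token independently; it is a simplification swapped in for the Chu-Liu-Edmonds MST decoder used in the paper, and it does not guarantee a well-formed tree.

```python
def arc_scores(H_dep, H_head, U):
    """Bilinear arc scores: scores[d][h] = [dep_d ; 1]^T U head_h.

    H_dep, H_head: lists of n feature vectors, each of length k.
    U: (k + 1) x k matrix; its last row scores a head on its own.
    """
    k = len(U[0])
    scores = []
    for dep in H_dep:
        dep_aug = list(dep) + [1.0]  # append the bias term to the dependent
        # Project once per dependent: v = dep_aug^T U (a length-k vector).
        v = [sum(dep_aug[a] * U[a][b] for a in range(len(dep_aug)))
             for b in range(k)]
        scores.append([sum(v[b] * head[b] for b in range(k))
                       for head in H_head])
    return scores


def greedy_heads(scores):
    """Best-scoring head index per dependent (no tree constraint)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


if __name__ == "__main__":
    H = [[1.0, 0.0], [0.0, 1.0]]
    U = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]  # bias row is zero here
    S = arc_scores(H, H, U)
    print(S, greedy_heads(S))
```

Because the bias is appended only on the dependent side, the last row of `U` contributes a score that depends on the head alone, which is the head-only likelihood mentioned above.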
To extrinsically assess the quality of our revision, parsing models are separately developed on PKT-UD v2018 and v2020; in other words, v2018 models are trained and evaluated on PKT-UD v2018 whereas v2020 models are trained and evaluated on PKT-UD v2020. The transformer-based parsing approach in Section 5 is used to develop all models. For each version of the corpus, three models are developed by initializing neural weights with different random seeds, and the average accuracy and its standard deviation are reported for each version. The entire corpus is divided into the training (TRN), development (DEV), and evaluation (TST) sets by following the 80/10/10% split (Table 4).
|Set||TRN||DEV||TST|
|# of Sentences||4,010||501||500|
|# of Tokens||105,947||13,088||13,023|
The multilingual BERT (https://github.com/google-research/bert/blob/master/multilingual.md) is used as the transformer encoder in our parsing models devlin-etal-2019-bert. All models are optimized by the sum of softmax cross-entropy losses on the gold dependency heads and labels. AdamW loshchilov2018decoupled is used as the optimizer with a learning rate of 5e-06 for the BERT weights and 5e-05 for the rest. The learning rate is scheduled as a combination of linear warm-up and decay phases. The models are trained for 100 epochs with a batch size of 150. Following standard practice, we evaluate our best models with Unicode punctuation ignored, using the unlabeled attachment score (UAS) and the labeled attachment score (LAS).
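The two metrics can be sketched as follows. The exact punctuation criterion used in the experiments is not stated, so this sketch assumes that a token is ignored when every character of its form belongs to a Unicode punctuation category (P*); the function names and the tuple layout are hypothetical.

```python
import unicodedata


def is_punct(form):
    """True if the form is non-empty and consists only of Unicode P* chars."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent, ignoring punctuation tokens.

    gold, pred: parallel lists of (form, head, deprel) tuples.
    """
    total = uas = las = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punct(form):
            continue
        total += 1
        if g_head == p_head:
            uas += 1  # correct head
            if g_rel == p_rel:
                las += 1  # correct head and correct label
    if total == 0:
        return 0.0, 0.0
    return 100.0 * uas / total, 100.0 * las / total


if __name__ == "__main__":
    gold = [("나는", 2, "nsubj"), ("간다", 0, "root"), (".", 2, "punct")]
    pred = [("나는", 2, "obj"), ("간다", 0, "root"), (".", 1, "punct")]
    print(attachment_scores(gold, pred))
```

LAS counts a token only when both its head and its label are correct, so it can never exceed UAS.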
Table 5 shows the results achieved by the v2018 and v2020 models. The v2020 model shows a significant improvement of 3.0% in LAS over the v2018 model. This makes sense because the major parts of the revision are dedicated to DEPREL consistency, yielding more robust labeling performance. The v2020 model also gives an improvement of 0.6% in finding dependency arcs (UAS). These improved parsing results are encouraging evidence of the higher annotation quality in PKT-UD v2020.
|Model||UAS||LAS|
|v2018||90.7 (0.2)||86.0 (0.1)|
|v2020||91.3 (0.1)||89.0 (0.1)|
7 Error analysis
We perform an error analysis on the parsing outputs generated by the v2018 model. Our analysis shows that head errors occurred in 1,360 Eojeols and label errors occurred in 4,292 Eojeols. Table 6 shows the distribution of head and label errors per label based on the revised test set. The relations advcl, nummod, acl, and obl have high error rates, which are due to the inconsistencies in the data that we handled by establishing clear criteria. Moreover, the labels goeswith and flat saw a 100% error rate, again due to the errors we observed during the revision process.
There is an observable trend in these errors. For example, a number of error cases report advcl as xcomp, conj, or ccomp, while nummod tends to be wrongly parsed as compound, acl as ccomp, and obl as advcl. Multiple cases of parsing errors caused by errors in the UPOS are also found: an incorrect UPOS appears to induce errors in both arc and DEPREL assignment. The annotation guideline based on XPOS is described in Section 4.
After revising the data according to the criteria presented in Section 4, many improvements have been made. The error rate of advcl decreased from 98.93% to 2.36%, that of nummod decreased from 97.37% to 0.5%, and that of acl from 86.73% to 0.9%. The error rate of obl was also reduced from 79.14% to 5.5%. In addition, the error rate is reduced for goeswith and flat. In the case of ccomp, the error rate decreased by more than 35 percentage points, from 44.51% to 8.67%. These results are indicative of the effect of improving training data by ensuring consistency of annotations.
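Per-label error rates of this kind can be computed with a short routine. The sketch below covers label errors only and assumes that the rate for a label is the share of tokens with that gold label whose predicted label differs; this definition, like the function name `label_error_rates`, is an assumption for illustration and may differ from the procedure behind Table 6.

```python
from collections import Counter


def label_error_rates(gold_labels, pred_labels):
    """Percentage of tokens per gold label whose predicted label differs."""
    totals, errors = Counter(), Counter()
    for gold, pred in zip(gold_labels, pred_labels):
        totals[gold] += 1
        if gold != pred:
            errors[gold] += 1
    return {label: 100.0 * errors[label] / totals[label] for label in totals}


if __name__ == "__main__":
    gold = ["advcl", "advcl", "obl", "acl"]
    pred = ["xcomp", "advcl", "advcl", "acl"]
    for label, rate in sorted(label_error_rates(gold, pred).items()):
        print(f"{label}: {rate:.2f}%")
```

Running the same routine on the outputs of the v2018 and v2020 models against the revised test set would give a before-and-after comparison in the style reported above.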
In this study, we revise the Penn Korean Universal Dependency Treebank (PKT-UD) and compare parsing performance between models trained on the original and revised versions of PKT. Our new guidelines follow the UDv2 guidelines. UPOS and DEPREL are revised to reflect Korean morphological features and flexible word-order aspects with reference to Korean UD studies such as park2018, Lee2019, and Oh2019. In UPOS, ADJ, NOUN, and VERB are revised extensively. In DEPREL, iobj, acl, advcl, and goeswith are thoroughly revised. The revision results showing the percentage change of each label are presented in Table 2 and Table 3.
In the parsing experiment, the v2020 model improves UAS by 0.6% and LAS by 3.0% over the v2018 model. In particular, obl, acl, nummod, and advcl errors are significantly reduced. This study, which improves parsing accuracy by reflecting the characteristics of Korean, can also contribute to improving the quality of other Korean UD treebanks. In the future, we will explore the possibility of extending PKT-UD with enhanced dependency types (https://universaldependencies.org/u/overview/enhanced-syntax.html) by incorporating empty categories from the original PKT.