
Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects the performance of NMT is the availability of high-quality parallel corpora. However, high-quality parallel corpora for Korean are relatively scarce compared to those for high-resource languages, such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora through Linguistic Inquiry and Word Count (LIWC) and several relevant experiments. LIWC is a word-counting software program that can analyze corpora in multiple ways and extract linguistic features based on a dictionary. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Our findings suggest directions for further research toward obtaining higher-quality parallel corpora through our correlation analysis between LIWC features and NMT performance.


1 Introduction

In recent years, the demand for machine translation (MT) systems has been continuously increasing, and their importance is growing, especially for industrial services vieira2021understanding; zheng2019testing. Companies such as Google, Facebook, Microsoft, Amazon, and Unbabel continue to conduct research and formulate plans to commercialize applications related to MT.

From the late 1950s, numerous MT-related projects proceeded, mainly focusing on rule-based and statistical approaches, before the advent of deep learning technology. As deep learning based neural machine translation (NMT) was proposed and adopted in several studies, it gradually became clear that superior performance can be derived through the NMT approach bahdanau2014neural; vaswani2017attention; lample2019cross; song2019mass.

Following the adoption of deep learning based techniques, improvements in computing power (e.g., GPUs) and the corresponding enhancement of parallel processing accelerated the advancement of NMT. Recently, the release of open-source frameworks such as PyTorch NEURIPS2019_9015 and the lowered barriers to accessing big data have further facilitated vigorous and diverse research.

However, several issues concerning the enhancement of NMT systems still remain. Most notably, ensuring the quality of data is an unresolved issue. As previous studies have shown, the quality of the training data is deeply related to NMT performance park2020toward; park2021study. The major problem is that the process of building a high-quality parallel corpus is time-consuming and expensive, and it is significantly more difficult for low-resource languages, such as Korean. Although data-augmentation techniques, such as back translation edunov2018understanding and copied translation currey2017copied, have been introduced, the quality of such pseudo-generated parallel corpora cannot be guaranteed because human supervision is generally minimized or excluded in the data generation process burlot2019using; epaliyana2021improving. This restricts the usage of pseudo-generated parallel corpora to complements of human-labeled gold parallel corpora, rather than their substitutes imankulova2017improving.

To alleviate the above limitations, numerous studies on the collection of high-quality training data have been conducted, such as parallel corpus filtering (PCF) research and the Data Dam project. PCF refers to a research field that aims to filter out low-quality noisy data (i.e., sentence pairs) residing in a parallel corpus and thereby improve the overall quality of the corpus. PCF is currently being applied to various NMT studies and has contributed to the advancement of NMT systems koehn2019findings; park2020quality. While the amount of training data had a significant impact on statistical MT approaches, the quality of data is generally treated as more important than its quantity in deep learning-based MT approaches khayrallah2018impact; koehn-EtAl:2020:WMT. Moreover, Data Dam projects (http://www.data-alliance.kr/default/) for building high-quality parallel corpora at the national level are in progress. In the Republic of Korea, a large number of parallel corpora are open to the public through AI Hub (http://aihub.or.kr/), which is organized by the National Information Society Agency (NIA) park2020study.
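To make the PCF idea concrete, the following is a minimal sketch of a heuristic sentence-pair filter in Python. The thresholds and rules are illustrative assumptions, not the filters used by the studies cited above; production PCF systems combine many more signals (language identification, alignment scores, etc.).

```python
# Minimal sketch of a rule-based parallel corpus filter (hypothetical thresholds).

def keep_pair(src: str, tgt: str,
              min_len: int = 1, max_len: int = 200,
              max_ratio: float = 3.0) -> bool:
    """Return True if the sentence pair passes simple noise heuristics."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not (min_len <= len(src_tokens) <= max_len):
        return False
    if not (min_len <= len(tgt_tokens) <= max_len):
        return False
    # Length-ratio heuristic: wildly mismatched lengths suggest misalignment.
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = max(1, min(len(src_tokens), len(tgt_tokens)))
    if longer / shorter > max_ratio:
        return False
    # Drop pairs where source and target are identical (copy noise).
    return src.strip() != tgt.strip()

pairs = [("나는 빵을 먹었다", "I ate bread"),
         ("안녕", "Hello hello hello hello hello hello")]
filtered = [p for p in pairs if keep_pair(*p)]  # keeps only the first pair
```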

Following these research trends, where quality is treated as more important than quantity in the data construction process, we analyzed the Korean-English parallel corpora distributed by AI Hub. Despite the sufficient amount of data, the quality of these corpora has not been clearly confirmed. This may restrict their unconstrained utilization in NMT models, as low-quality data may degrade the overall performance. In this study, we conducted several quality verification experiments, including Linguistic Inquiry and Word Count (LIWC) pennebaker2001linguistic; tausczik2010psychological, and clarified the quality and characteristics of these corpora. By analyzing various factors that can affect NMT performance, we propose a method that can be applied in future research using the analysis results.

LIWC is a text-analysis tool that automatically analyzes the number of words in a sentence and classifies words with similar meanings and sentimental characteristics. LIWC extracts various interpersonal variables related to clinical, social, physiological, cognitive, psychological, and developmental contexts that cannot be detected using previous text-analysis programs. Additionally, LIWC comprises a variety of features for analyzing text. LIWC is generally used to recognize linguistic markers for mental health studies in psychopathology, such as detecting narcissism holtzman2019linguistic, schizophrenia bae2021schizophrenia, and bipolar disorder sekulic2018not. However, LIWC also provides various linguistic features, word counts, gender-bias indicators, and so on, so it can be used for various analyses. In this study, we use LIWC to analyze parallel corpora based on diverse properties. To the best of our knowledge, this is also the first time a corpus has been analyzed using LIWC.

In addition, we conduct baseline translation experiments by training the transformer-base model structure vaswani2017attention on all the parallel corpora provided by AI Hub. By analyzing the MT performance of the corresponding models, we propose further research directions on MT for the Korean language. The contributions of this study are as follows:

  • We conduct the first in-depth data analysis of the AI Hub corpora. To the best of our knowledge, this is also the first time LIWC has been used to analyze parallel corpora. This study can act as a milestone for further studies on NMT with respect to the Korean language.

  • We conduct baseline translation experiments on all the data in the AI-Hub parallel corpus. Our experiments provide a foundation for further research on Korean-based NMT.

  • We discovered many factors that might degrade model performance, and we show how those factors could be filtered out through our correlation analysis between LIWC features and model performance, as sketched below.
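As a hedged illustration of this last point, the snippet below shows one way such a correlation analysis could be computed; the LIWC feature values and BLEU scores are placeholders, not results from this study.

```python
from scipy.stats import pearsonr

# Relate a LIWC feature measured on each training corpus to the BLEU score
# of the NMT model trained on it. Values below are hypothetical placeholders.
liwc_feature = [26.5, 30.65, 29.02, 25.69]   # e.g., Sixltr per corpus
bleu_scores = [31.2, 28.4, 29.1, 33.0]       # hypothetical model scores

r, p_value = pearsonr(liwc_feature, bleu_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```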

2 Related Works and Background

2.1 Machine Translation

Machine translation (MT) refers to a computer system that translates source sentences into target sentences, and it has achieved significant performance improvements with the advent of deep learning. In 1951, Yehoshua Bar-Hillel first started research on MT at MIT kasher2012language, and the field has gradually developed through rule-based, statistical, and deep learning-based MT, in that order.


Rule-based Machine Translation Rule-based MT (RBMT) dugast2007statistical; forcada2011apertium is a translation method based on linguistic rules established by linguists, as well as traditional natural language processing steps such as lexical analysis, syntax analysis, and semantic analysis. For example, Korean-English RBMT is a methodology that transfers a Korean sentence in accordance with English grammatical rules, based on morphological and syntactic analysis, and translates the Korean source sentence into an English target sentence. This method has the advantage of producing ideal translations for sentences that conform to the rules, but it has the disadvantage that extracting grammatical rules is difficult and requires a great deal of linguistic knowledge. It is also difficult to expand the set of translatable language pairs, and numerous rules must be considered.


Statistical Machine Translation Statistical MT (SMT) zens2002phrase; koehn2009statistical is a method of translating using statistical prior knowledge learned from a large-scale parallel corpus. This method utilizes alignment and co-occurrence information between words computed from the parallel corpus.

SMT comprises translation, reordering, and language models. It extracts the alignment between the source sentence and target sentence through the translation model and predicts the probability of the target sentence through the language model. Unlike RBMT, this methodology can be developed without linguistic knowledge, and higher performance can generally be obtained by increasing the amount of data. However, building large amounts of data is a challenging task, and the context is difficult to capture because translation is carried out on a word or phrase basis.

In the case of SMT, the methodology has changed according to the unit of translation. At the beginning, translation was performed word by word. However, in 2003, a translation method over bundles of words (i.e., phrase units) was proposed and showed better performance than word units. The introduction of variables within phrases is referred to as "Hierarchical Phrase-Based SMT", which does not fix a specific word, as in "eat bread", but rather expresses it with the variable X, as in "eat X". The advantage of this approach is that the variable X can accommodate a variety of substitute words, such as apple and pineapple. Pre-reordering-based SMT changes the word order before translation. In the case of Korean, the word order of a sentence is Subject-Object-Verb (SOV), while English is Subject-Verb-Object (SVO). When the word orders differ, this methodology alters the word order of the source sentence to match that of the target language before proceeding with the translation (see the sketch below). Syntax-based SMT is a translation technique that changes "eat X" in Hierarchical Phrase-Based SMT to "eat NP (Noun Phrase)". In other words, not all phrases can enter the candidate group; only noun phrases can be placed in the candidate group, which eliminates unnecessary translation candidates in advance zens2002phrase; koehn2009statistical.
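The following toy sketch illustrates the pre-reordering idea under a deliberately simplified assumption: each token is already labeled with a subject/object/verb role. Real systems derive reordering rules from parse trees rather than flat role tags.

```python
# Toy illustration of pre-reordering: move the verb of an SOV source sentence
# ahead of its object so the token order matches SVO English.

def preorder_sov_to_svo(tagged):
    """tagged: list of (token, role) pairs with roles in {'S', 'O', 'V'}."""
    subjects = [w for w, t in tagged if t == "S"]
    objects = [w for w, t in tagged if t == "O"]
    verbs = [w for w, t in tagged if t == "V"]
    return subjects + verbs + objects

# "나는(S) 빵을(O) 먹었다(V)" -> subject, verb, object order
print(preorder_sov_to_svo([("나는", "S"), ("빵을", "O"), ("먹었다", "V")]))
# ['나는', '먹었다', '빵을']
```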


Neural Machine Translation

NMT applies deep neural networks to the translation task. Based on the sequence-to-sequence model, the source language is vectorized through an encoder, and the latent vector is decoded through a decoder to generate the target language. It is a method that uses deep neural networks to uncover the most appropriate representations and translation results from a single pair of input and output sentences. For text-to-text sequential modeling sutskever2014sequence, an NMT model generally comprises an encoder-decoder structure that takes an input sequence and generates the output sequence auto-regressively. The architecture has developed from Recurrent Neural Networks (RNN) cho2014learning; bahdanau2014neural to Convolutional Neural Networks (CNN) gehring2017convolutional; wu2019pay, and to the Transformer-based model vaswani2017attention, which outperforms the other existing methods. Furthermore, fine-tuning approaches for pre-trained language models have recently shown the best performance, including Cross-lingual Language Model Pre-training (XLM) lample2019cross, Masked Sequence to Sequence Pre-training for Language Generation (MASS) song2019mass, and Multilingual BART (mBART) liu2020multilingual. However, the parameters and model sizes of these pre-trained language models are extremely large for real-world industries to deploy as services. To address this issue, we take the Transformer as the optimal model for services, considering overall factors such as model performance, speed, and memory reported in recently published papers, and conduct experiments based on that model.
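For reference, the transformer-base configuration of vaswani2017attention can be instantiated as follows. This is a minimal sketch using PyTorch's built-in module; it omits the embedding, positional-encoding, and output-projection layers that a full NMT model needs.

```python
import torch.nn as nn

# transformer-base hyperparameters (vaswani2017attention)
transformer_base = nn.Transformer(
    d_model=512,            # hidden size
    nhead=8,                # attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,   # inner feed-forward size
    dropout=0.1,
)
```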

Category: Features (Label)
Summary language variables: Analytical thinking (Analytic), Clout (Clout), Authenticity (Authentic), Emotional tone (Tone)
Linguistic dimensions: Words per sentence (WPS), Percent of target words captured by the dictionary (Dic), Percent of words in the text that are longer than six letters (Sixltr), Word count (WC), Articles (article), Prepositions (prep), Total pronouns (pronoun), Personal pronouns (ppron), 1st pers singular (i), 1st pers plural (we), 2nd person (you), 3rd pers singular (shehe), 3rd pers plural (they), Impersonal pronouns (ipron)
Grammar: Auxiliary verbs (auxverb), Common verbs (verb), Common adverbs (adverb), Conjunctions (conj), Negations (negate), Common adjectives (adj), Comparisons (compare), Interrogatives (interrog), Numbers (number), Quantifiers (quant)
Affect process: Total affect process (affect), Positive emotion (posemo), Negative emotion (negemo), Anxiety (anx), Anger (anger), Sadness (sad)
Cognitive process: Total cognitive process (cogproc), Insight (insight), Cause (cause), Discrepancies (discrep), Tentativeness (tentat), Certainty (certain), Differentiation (differ)
Social process: Total social process (social), Family (family), Friends (friend), Female referents (female), Male referents (male)
Perceptual process: Total perceptual process (percept), Seeing (see), Hearing (hear), Feeling (feel)
Biological process: Total biological process (bio), Body (body), Health/Illness (health), Sexuality (sexual), Ingesting (ingest)
Drives: Total drives (drives), Affiliation (affiliation), Achievement (achieve), Power (power), Reward focus (reward), Risk focus (risk)
Time orientations: Past focus (focuspast), Present focus (focuspresent), Future focus (focusfuture)
Relativity: Total relativity (relativ), Motion (motion), Space (space), Time (time)
Personal concerns: Work (work), Home (home), Money (money), Leisure activities (leisure), Religion (relig), Death (death)
Informal language markers: Total informal language markers (Informal), Assents (assent), Fillers (filler), Swear words (swear), Netspeak (netspeak), Nonfluencies (nonfl)
Punctuations: Total punctuation (Allpunc), Semicolons (SemiC), Commas (Comma), Colons (Colon), Parentheses (Parenth), Question marks (QMark), Exclamation marks (Exclam), Periods (Period), Apostrophes (Apostro), Quotation marks (Quote), Dashes (Dash), Other punctuation (OtherP)
Table 1: Overview of features in LIWC

2.2 AI Hub

With the advent of the fourth industrial revolution schwab2017fourth, the inter-language exchange of information has rapidly increased, accelerating the demand for the development of advanced translation systems. Despite the development of automatic translation systems accompanied by the growth of Information Technology (IT), several difficulties remain in the industrial service of machine translation. Cost and time barriers to building early translation solutions exist, and it is difficult to obtain quality data. Moreover, there are challenges in maintaining NMT performance quality, obstacles to obtaining domain-specific language pairs, and struggles to provide domain-specific NMT solutions. In other words, most of the difficulties are due to the lack of translation data, namely parallel corpora. Additionally, intellectual property issues make it complicated to secure data, and collection incurs numerous costs, which is a major challenge for start-ups in the artificial intelligence-based industry or companies preparing for innovation park2021study.

In general, a monolingual corpus is relatively uncomplicated to obtain and a sufficient amount can be secured, but a parallel corpus is much tougher to acquire. Furthermore, constructing a parallel corpus requires a number of high-level techniques for refining and pre-processing the original corpus, and translating a monolingual corpus into the desired heterogeneous language demands considerable expense.

To mitigate these limitations, AI Hub constructs and continuously distributes public data at the national level. AI Hub is a platform that integrates AI infrastructure, such as AI data, AI software, algorithms, and computing resources, that is essential for developing AI technologies, products, and services. It releases data related to image recognition as well as machine reading, machine translation, and voice recognition. By disclosing high-quality, high-volume artificial intelligence data, this platform contributes to the creation of an intelligent information society and an artificial intelligence industrial ecosystem comprising small and medium-sized venture enterprises, research institutes, and individuals in Korea.

AI Hub has released several datasets on MT, including the high-quality Korean-English corpora released in 2019 and 2021. Subsequently, the construction of parallel corpora, including Korean-Japanese, Korean-Chinese, and other Korean-language parallel corpora, has been actively pursued. However, close verification of these data has not been carried out specifically, and we seek to confirm their quality by applying LIWC and building a real-world NMT model.

2.3 Parallel Corpus Quality Assessment

Accompanied by the increase in publicly released parallel corpora such as FLORES-101 goyal2021flores and AI Hub, evaluating and improving the quality of parallel corpora has become more important. Especially for the data construction process, assessing the quality of the corpus is regarded as an essential step. For example, in the case of AI Hub, all publicly available corpora were constructed through a multi-phase process of machine translation followed by human examination. In the corresponding examination process, semantic coherence and sentence alignment are mainly inspected. This can be viewed as checking the corpus's suitability for the intended purpose of the data construction.

However, as data acquisition becomes more accessible and the amount of data used for training increases, examining each datum with human labor incurs considerable cost. For instance, as the total amount of parallel corpora released by AI Hub is approximately 7M sentence pairs, tremendous time and cost would be required to examine the whole corpus.

To alleviate this limitation, corpus evaluation studies have mainly focused on minimizing direct human examination. Representative methods include the use of several translation rules established in advance espla2019paracrawl, the Gale-Church algorithm gale1993program that evaluates the overall alignment of sentences, and the Bilingual Sentence Aligner simard1998bilingual. The validity of these corpus evaluation methodologies is generally assessed based on the performance of the MT system trained on the corresponding corpus. In particular, evaluation criteria such as sentence alignment were confirmed to be effective corpus evaluation metrics through the performance verification of SMT models generated from the corpus abdul2012extrinsic. Furthermore, with the development of deep neural modeling, these methodologies are now evaluated by the performance of NMT models espla2019paracrawl.
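As a hedged sketch of the length-based idea behind the Gale-Church algorithm gale1993program: under its model, the target length given the source length is approximately normal, so a large standardized deviation flags a suspicious pair. The constants below are the commonly cited English-oriented values and would need refitting for a Korean-English setting.

```python
import math

# Simplified Gale-Church length statistic: |delta| large -> likely misalignment.
def gale_church_delta(l1: int, l2: int, c: float = 1.0, s2: float = 6.8) -> float:
    """l1, l2: character lengths of the source and target sentences."""
    if l1 == 0 and l2 == 0:
        return 0.0
    mean = (l1 + l2 / c) / 2.0
    return (l2 - l1 * c) / math.sqrt(mean * s2)

print(abs(gale_church_delta(24, 30)))  # ~0.44: plausible alignment
print(abs(gale_church_delta(24, 90)))  # ~3.35: likely misaligned
```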

However, most of these studies aim to improve the performance of the MT system itself as trained on the corresponding parallel corpus. This has often led to inconsistent results, where data that was not considered noisy when training an SMT system considerably deteriorated the performance of an NMT system when utilized in its training process khayrallah2018impact. Thus, these corpus evaluation criteria may not be consistent enough to be directly related to actual quality assessment. In this study, we analyze the corpora using a sentence analysis tool called LIWC, which has not previously been utilized for parallel corpus inspection or as an objective evaluation index of corpus quality. Also, following previous studies, we check the performance of the NMT systems trained on the parallel corpora and analyze the characteristics and quality of the corresponding corpora based on the results.

2.4 Korean Neural Machine Translation Research

Recently, various MT services, as well as MT-related research, have been provided in South Korea. Along with the Papago translation service lee2016papago operated by Naver Corporation, MT services are offered by many companies and laboratories, including the Electronics and Telecommunications Research Institute (ETRI), Kakao, SYSTRAN, and Genie Talk at Hancom Interfree.

Research on NMT data pre-processing is mainly conducted in academia at Korea University. Several related studies exist, including Onepiece, which proposes sub-word tokenization specialized for Korean park2021should, and the first application of PCF to Korean-English NMT pcj01. They also propose a methodology for training with a relative ratio when configuring batches, rather than simply applying back translation or copied translation for data augmentation pcj02. This results in higher performance than simply using back translation. In addition, based on machine translation, they have developed various applications, such as a Korean spelling corrector park2020neural, an English grammar corrector park2020ancient, and cross-lingual transfer learning lee2021exploring. In conclusion, there are various experiments and studies based on the importance of pre-processing and data augmentation, as well as research on NMT models.

3 Analyzing the AI Hub Corpus Using LIWC

3.1 Linguistic Inquiry and Word Count (LIWC)

LIWC is natural-language analysis software that allows for the investigation of the various emotional, cognitive, and structural components of specific sentences pennebaker2015development. LIWC offers corpus analysis by referring to a dictionary comprising 93 features. The features are listed in Table 1. Every feature can be classified into one of 14 categories: summary language variables, linguistic dimensions, grammar, affect process, cognitive process, social process, perceptual process, biological process, drives, time orientations, relativity, personal concerns, informal language markers, and punctuations. This differs from the classification presented in the LIWC manual, which uses 16 categories; we consolidate them into new categories for more intuitive analysis and to avoid confusion. We merged the auxiliary verbs, common adverbs, conjunctions, and negations among the function words with the "Other Grammar" features, such as common verbs and adjectives, defined in the manual's initial categories, forming our grammar category. Pronouns, articles, and prepositions, which are also function words, were joined with the linguistic dimensions, thereby helping in the understanding of text through the rules of sentence structure. Grammar represents the grammatical components of a sentence and comprises several parts of speech. The summary language variables represent summarized values over all the linguistic features, capturing the overall character of a sentence. The affect process quantifies emotions and feelings. The biological process category represents biological topics in text, such as body, health, and ingestion. The drives category represents motivations and needs that appear in text. The time-orientation category helps in understanding the tense used in text because LIWC contains both verb tenses and general time orientations. The relativity category represents relatively trivial topics, and the personal concerns category represents the literal meanings of the topics concerned in text. The punctuations and informal language markers play similar roles. In this study, we conducted an in-depth analysis with respect to the following five aspects.
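To illustrate the dictionary-based counting that LIWC performs, below is a minimal sketch with a toy three-category dictionary. The real LIWC 2015 dictionary is proprietary and far larger, so the categories and words here are illustrative assumptions.

```python
from collections import Counter

# Toy LIWC-style dictionary: category -> set of words.
toy_dict = {
    "posemo": {"happy", "good", "love"},
    "negemo": {"sad", "bad", "hate"},
    "ppron": {"i", "we", "you", "she", "he", "they"},
}

def liwc_percentages(text: str) -> dict:
    """Return the percentage of tokens falling into each dictionary category."""
    tokens = text.lower().split()
    counts = Counter()
    for tok in tokens:
        for category, words in toy_dict.items():
            if tok in words:
                counts[category] += 1
    total = max(1, len(tokens))
    return {cat: 100.0 * n / total for cat, n in counts.items()}

print(liwc_percentages("I love this good day but she was sad"))
# {'ppron': 22.2..., 'posemo': 22.2..., 'negemo': 11.1...}
```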

First, morphological analysis can be conducted by referring to morphological features, such as the grammar and linguistic dimensions categories. Second, investigating the summary language variables, general descriptors, time orientations, punctuations, and informal language markers categories enables the analysis of sentence syntax. Third, semantic analysis can be implemented through the inspection of various topics, including the cognitive, social, perceptual, and biological processes, as well as relativity and personal concerns. Fourth, we can conduct sentiment analysis through the affect process, which involves positive and negative emotions. Finally, the social process category, which contains male and female referents, enables the analysis of gender bias. Specifically, in the field of NMT, numerous studies have been conducted to reduce the prevalence of gender bias prates2018assessing; saunders2020reducing. In the future, this approach can be used to inspect the performance of MT systems.

LIWC is mainly leveraged in the field of psychology, especially in the investigation of linguistic characteristics revealed in the writings of psychiatric patients tausczik2010psychological; coppersmith2014quantifying. Furthermore, LIWC has recently been utilized in numerous studies on natural language processing (NLP), and its effectiveness and relevance in the field of NLP have been demonstrated. For instance, the performance of misinformation detection su2020motivations, sentiment analysis, and plagiarism detection garcia2020using can be improved by applying LIWC, and the effectiveness of LIWC can be evaluated through comparison to BERT biggiogera2021bert. Following these trends, we aim to investigate all the parallel corpora for Korean released by AI Hub through morphological, semantic, sentence-syntactic, sentimental, and gender-bias aspects.

3.2 Korean-English Parallel Corpus

Figure 1: Data-domain statistics of the Korean-English Parallel Corpus.

Corpus Description

The Korean-English parallel corpus (https://aihub.or.kr/aidata/87) is a parallel corpus from AI Hub, which was released in 2019. The corresponding corpus was built through the cooperation of Saltlux Partners (http://saltlux.com), Flitto (https://www.flitto.com), and Evertran (http://www.evertran.com). The total number of sentence pairs in the constructed corpus is 1.6M, which comprises 800K news articles, 100K website contents from the government, 100K instances of by-law data, 100K instances of Korean cultural content, 400K instances of colloquial-style data, and 100K instances of dialogic data. The ratio of each domain to the entire corpus is shown in Figure 1. It can be considered the most representative Korean-English parallel corpus, and many Korean-related studies on MT have been conducted based on this corpus moon2021filter.

For a more thorough data analysis, we conducted an in-depth investigation of this corpus with respect to various features, such as morphemes, syntax information, and the characteristics of the corpus. We analyzed such features using LIWC, and the results are shown in Table 2.

Categories | Features | Ko-En Domain-Specialized Parallel Corpus: train / valid / total | Ko-En Parallel Corpus: total
summary language variables Analytic 97.14 97.26 97.16 94.16
Clout 61.1 61.17 61.11 65.39
Authentic 25.7 25.37 25.66 27.76
Tone 49.14 49.08 49.13 54.48
Linguistic dimensions WC 33,750,816 4,222,191 37,973,007 38,481,936
WPS 27.33 27.29 27.33 25.39
Sixltr 26.5 26.49 26.5 25.69
Dic 76.33 76.3 76.32 79.25
function 43.78 43.78 43.78 45.24
pronoun 4.9 4.87 4.89 6.73
ppron 1.54 1.54 1.54 3.08
i 0.21 0.21 0.21 1.02
we 0.23 0.23 0.23 0.38
you 0.3 0.3 0.3 0.56
shehe 0.42 0.42 0.42 0.67
they 0.38 0.38 0.38 0.45
ipron 3.36 3.33 3.35 3.64
article 10.4 10.42 10.4 10.2
prep 15.51 15.52 15.51 15.1
grammar auxverb 6.07 6.05 6.06 6.75
adverb 2.46 2.47 2.46 2.55
conj 5.92 5.92 5.92 5.43
negate 0.63 0.62 0.63 0.7
verb 9.94 9.91 9.94 11.36
adj 4.49 4.46 4.49 4.47
compare 2.35 2.34 2.35 2.25
interrog 1.21 1.2 1.21 1.3
number 3.27 3.28 3.27 2.53
quant 1.36 1.36 1.36 1.4
affective process affect 3.78 3.75 3.77 4.03
posemo 2.49 2.48 2.49 2.75
negemo 1.24 1.23 1.24 1.23
anx 0.17 0.17 0.17 0.19
anger 0.21 0.2 0.21 0.29
sad 0.3 0.3 0.3 0.28
social process social 5.07 5.04 5.06 6.7
family 0.23 0.23 0.23 0.25
friend 0.12 0.12 0.12 0.15
female 0.18 0.17 0.18 0.34
male 0.46 0.47 0.46 0.63
cognitive process cogproc 6.75 6.73 6.74 7.48
insight 1.56 1.56 1.56 1.75
cause 1.69 1.68 1.69 1.73
discrep 0.77 0.77 0.77 0.99
tentat 1.14 1.14 1.14 1.41
certain 0.65 0.65 0.65 0.7
differ 1.92 1.91 1.92 1.96
perceptual process percept 1.53 1.53 1.53 1.82
see 0.58 0.58 0.58 0.68
hear 0.48 0.48 0.48 0.65
feel 0.32 0.32 0.32 0.35
Biological process bio 2.31 2.31 2.31 1.63
body 0.46 0.46 0.46 0.46
health 1.35 1.36 1.35 0.67
sexual 0.04 0.04 0.04 0.05
ingest 0.51 0.51 0.51 0.46
drives drives 7.39 7.4 7.4 8.26
affiliation 1.52 1.52 1.52 1.77
achieve 1.92 1.92 1.92 1.96
power 3.33 3.33 3.33 3.94
reward 1.02 1.02 1.02 1.05
risk 0.79 0.78 0.79 0.63
time-orientations focuspast 3.27 3.25 3.26 3.4
focuspresent 5.79 5.78 5.79 6.63
focusfuture 0.97 0.96 0.96 1.49
relativity relativ 14.53 14.55 14.53 14.29
motion 1.72 1.72 1.72 1.81
space 8.31 8.32 8.31 7.83
time 4.59 4.6 4.59 4.73
personal concerns work 5.64 5.64 5.64 6.37
leisure 1.52 1.52 1.52 1.43
home 0.52 0.53 0.52 0.5
money 2.19 2.18 2.19 1.87
relig 0.23 0.23 0.23 0.29
death 0.14 0.14 0.14 0.14
informal language informal 0.23 0.23 0.23 0.26
swear 0 0 0 0.01
netspeak 0.1 0.11 0.1 0.1
assent 0.06 0.06 0.06 0.09
nonflu 0.08 0.08 0.08 0.09
filler 0 0 0 0
punctuations AllPunc 14.72 14.72 14.72 14.68
Period 3.81 3.81 3.81 4.33
Comma 6.23 6.23 6.23 5.24
Colon 0.03 0.03 0.03 0.08
SemiC 0.01 0.01 0.01 0.01
QMark 0.01 0.01 0.01 0.22
Exclam 0 0 0 0
Dash 1.88 1.89 1.89 1.6
Quote 1.11 1.1 1.11 1.16
Apostro 0.85 0.85 0.85 1.11
Parenth 0.56 0.57 0.56 0.7
OtherP 0.22 0.23 0.22 0.23
Table 2: LIWC results of the Korean-English Domain-Specialized Parallel Corpus and the Korean-English Parallel Corpus.

Corpus Analysis

For the morphological analysis, we examined the linguistic dimensions and grammar categories of the LIWC results. In the linguistic dimensions category, 'prepositions (prep)' and 'articles (article)' show high frequencies among the parts of speech, at 15.1% and 10.2%, respectively. Additionally, 'auxiliary verbs (auxverb)' in the grammar category appear at 6.75%, a prevalence higher than most other grammar features, while 'common verbs (verb)' show the highest frequency, at 11.36%. This result indicates the continuous prevalence of be verbs (am, are, is, was, and were), and that English characteristics such as diverse tenses and conjugations, including the perfect tense, are reflected in the data, rather than only base verb forms.

In the personal-pronoun analysis, which covers {'1st pers singular (i)', '1st pers plural (we)', '2nd person (you)', '3rd pers singular (shehe)', '3rd pers plural (they)', and 'impersonal pronouns (ipron)'}, the frequency of impersonal pronouns (ipron) is similar to that of 'personal pronouns (ppron)', and '1st pers singular (i)' has the highest frequency among the personal pronouns. The '2nd person' and '3rd pers singular' pronouns come next in order of prevalence. Unlike other corpora in which impersonal pronouns are predominant, this corpus includes both colloquial and dialogic sentences because it comprises interactive conversations between first-person and other-person perspectives.

Syntactic analysis is defined as an analytic approach that informs us about the grammatical makeup of specific sentences or parts of such sentences. This approach reveals the type of tone and atmosphere used in sentences through the summary language variables category. We also obtain the sentence length from 'word count (WC)' and 'words per sentence (WPS)', and the long-word count using 'Sixltr'. We determine whether sentences are represented as statements, questions, or quotations using the punctuations category. We use the time orientations category to understand the point of view, and we investigate 'assents (assent)', 'fillers (filler)', and 'swear words (swear)' in the informal language markers category. Together, these features contribute to understanding the syntactic information of the corpus.

The 'analytical thinking (analytic)', 'clout (clout)' (i.e., the representation of trust), 'authenticity (authentic)' (i.e., the representation of sincerity), and 'emotional tone (tone)' values are 94.16, 65.39, 27.76, and 54.48, respectively. We find that the prevalence of 'analytic' is relatively low, whereas those of 'clout', 'authentic', and 'tone' are relatively high. This reflects the characteristics of a corpus that contains various descriptive styles (i.e., colloquial and literary), rather than one focused on conveying and explaining specific domain knowledge. The variety of descriptive styles in this corpus can also be seen by inspecting the punctuations category, which shows a relatively high appearance rate of 'question marks (QMark)'.

From the results of the linguistic dimensions category, we establish that the average word count per sentence is 25.39, and 'Sixltr' accounts for about 25% of the total word count. The 'article' and 'prep' features also account for large proportions, at 10.4% and 15.5%, respectively, and the analysis of the time orientations category shows high ratios in the order of present focus, past focus, and future focus. Additionally, in the grammar category, words that directly represent numbers ('number') are used approximately twice as often as 'quantifiers (quant)', which represent quantities.

In the informal language markers category, the values are higher than those of the other corpora. It is noteworthy that 'swear words', such as 'damn' and 'shit', and 'fillers', such as 'you know' and 'i mean', which are used as interludes in conversation, are close to zero in the results for the other corpora. This seems to be because this corpus contains written, colloquial, and dialogic language, unlike general data, which consists only of written or spoken language.

Semantics is the study of meaning in units of texts, sentences, and phrases. In our semantic analysis, we attempt to understand the corpus in depth by checking which of the various topic categories, such as drives and the biological process, has a high ratio.

In this corpus, the relativity category, which corresponds to relatively trivial topics, was found at the highest level, at 14.29%, with 'space (space)' accounting for the highest prevalence within it, at 7.83%. Next come the drives, cognitive process, personal concerns, and social process categories, in that order. Specifically, in the personal concerns category, 'work (work)' occupies more than half. This is because, unlike other corpora, this corpus includes colloquial words and dialogues. For this reason, relatively trivial topics of conversation and individuals' sense of purpose, thoughts, and interests are revealed relatively clearly.

Sentiment analysis involves analyzing the degrees of positivity and negativity appearing in text. LIWC supports the analysis of the scales of 'positive emotion (posemo)' and 'negative emotion (negemo)', the latter including 'anger', 'anxiety', and 'sadness'. 'posemo' occurs twice as often as 'negemo' in this corpus. Moreover, words expressing emotions are used the most here compared to the other data, again revealing the characteristics of a corpus that includes dialogues and spoken words.

Gender bias is an important factor in determining the quality of MT. The 'female referents (female)' and 'male referents (male)' features in the social process category represent how often each gender is referred to in the text. 'Male' appears at a frequency of 0.63%, twice that of 'female' at 0.34%. Although the gender balance of the corpus is not effectively achieved, this matches the ratio of male and female referents reported in the LIWC average analysis of various corpora, such as blogs, novels, and Twitter, in the LIWC 2015 manual.

3.3 Korean-English Domain-Specialized Parallel Corpus

Corpus Description

The Korean-English Domain-Specialized Parallel Corpus (https://aihub.or.kr/aidata/7974) provides various parallel corpora specialized in several domains. This corpus was released in 2021, and three companies cooperated in its construction: Saltlux Partners, Flitto, and Evertran.

Figure 2: Data-domain statistics of the Korean-English Domain-Specialized Parallel Corpus.

The corresponding corpus consists of 1.5M sentence pairs, including 250K instances of medical/health data, 200K instances of financial/stock market data, 100K instances of parent-notice data, 200K instances of international sports event data, 100K instances of IT technology data, 200K instances of festival event content, 150K judicial precedents, and 200K instances of data on traditional culture/food. The percentage of each domain within the entire corpus is shown in Figure 2.

Corpus Analysis

This corpus was released with separate training and validation datasets, and Table 2 shows the LIWC results for each. The differences in linguistic features between the training and validation datasets are generally less than 0.1. Although there are exceptional differences in some summary language variables, such as 'clout', which indicates confidence, and 'authentic', which indicates authenticity, these differences remain small, at approximately 1 point or less. Therefore, we can conclude that the training and validation datasets are released in balance; a sketch of this balance check follows below.
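The following is a minimal sketch of how such a train/validation balance check could be computed from the LIWC feature tables; the tolerance and the three example features (values taken from Table 2) are illustrative.

```python
# Flag LIWC features whose train/validation gap exceeds a tolerance (0.1 here,
# matching the gap we observe for most features in Table 2).

def unbalanced_features(train: dict, valid: dict, tol: float = 0.1) -> dict:
    return {f: (train[f], valid[f]) for f in train
            if f in valid and abs(train[f] - valid[f]) > tol}

train_feats = {"Analytic": 97.14, "Clout": 61.10, "negate": 0.63}
valid_feats = {"Analytic": 97.26, "Clout": 61.17, "negate": 0.62}
print(unbalanced_features(train_feats, valid_feats))
# {'Analytic': (97.14, 97.26)}  -> only the summary variable exceeds 0.1
```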

In the morphological analysis, the results are generally similar to those of Section 3.2. However, impersonal pronouns are used twice as much as personal pronouns. Moreover, although the use of personal pronouns is low in all the results, 'i' and 'you' show the lowest rates compared with the other corpora.

Focusing on the syntactic categories, 'analytic', which represents analytical thinking in text, is the highest among the four summary language variables. The sentences are the longest, and 'commas (Comma)' are the most frequently used, among all the Korean-English corpora we analyzed. This is because such corpora contain many long sentences with several phrases explaining domain-specialized concepts. Time orientation is still highest in the present and lowest in the future. However, the difference between the present and past tenses is the smallest among all the corpora because the data contain past cases, such as judicial precedents and financial/stock market data.

Additionally, in semantics, the biological process category has the highest prevalence among the entire set of corpora. Within this category, 'health/illness (health)' is especially high because this corpus contains several domains that include both medical and international sports data. The results for 'money (money)' and 'leisure activities (leisure)' in the personal concerns category likewise reflect the international sports and financial/stock market data in this corpus.

In the results of the sentiment analysis, 'posemo' appears twice as much as 'negemo'. Additionally, the use of words representing sentiment is about 7% lower than that presented in Section 3.2. Finally, the male and female referents of this corpus reveal a gender bias, as male referents appear twice as often as female referents.

3.4 Korean-English Parallel Corpus (Technology)

Corpus Description

AI Hub released the technology-science domain-specialized Korean-English translation corpus (https://aihub.or.kr/aidata/30719) in 2021 through the cooperation of Twigfarm (https://twigfarm.net/), Lexcode (https://lexcode.co.kr/), Naver (https://www.naver.com), the Korean Telecommunications Technology Association (TTA) (https://www.tta.or.kr/), and the Fun & Joy company (FNJ) (http://www.fnj.or.kr/home/index.html). The corresponding corpus was constructed to support ICT companies in translating technical documents and localizing products.

Figure 3: Data-domain statistics of the Korean-English Parallel Corpus (Technology).

The number of sentence pairs in the entire corpus is 1.5M, which comprises five domains, as shown in Figure 3: 350K instances of ICT domain data, 150K instances of electricity domain data, 150K instances of electronics domain data, 350K instances of mechanical domain data, and 500K instances of medical domain data. To construct this high-quality corpus, expert-level revision by ICT professionals and several professors in translation fields was conducted after the initial corpus was compiled using a computer system. In this study, we leveraged the 788K sentence pairs available so far (ICT (35.2%), mechanical (31.2%), electricity (13.8%), electronics (11.2%), and medical (8.6%)) because not all the data has been released yet.

Corpus Analysis

The entire corpus comprises training and validation datasets, and the LIWC results are shown in Table 3.

The LIWC results show small differences between the training and validation datasets in each feature value. We can also establish that the corresponding corpus contains more adverbs than the other parallel corpora, and that few personal pronouns are used, to the extent that impersonal pronouns ('ipron') appear approximately 10 times more often than personal pronouns. This identifies the characteristics of a corpus that describes technology and phenomena, in contrast to corpora of other fields that deal with people and culture.

Because the 'Sixltr' rate is relatively high whereas 'WPS' is low, we can infer that the corpus mainly contains short sentences consisting of long words. Furthermore, the corpus shows low 'authentic' and emotional tone ('tone') rates. This indicates that, reflecting the characteristics of the technology domain, the representations in each sentence are concise. The present tense appears more frequently than the past or future tenses, supporting the attribute of the technology domain that texts mainly describe current technology, future complementary points, and expectations.

Compared with other corpora, the prevalence of the drives category, which represents the motivation of a sentence, is relatively low, whereas the biological process category remains the most frequent. These results are contrary to those of the social-domain corpora, which reflects the characteristics of the corresponding corpus.

The ratios of the cognitive process and perceptual process categories are also the highest throughout all the corpora. Therefore, it can be confirmed that articles in the technology domain are written in a manner that describes perceptions and cognitive processes, such as 'insight', 'causation', and 'certainty', about technology. One notable characteristic is the absence of gender bias. The ratio of the domains may vary once all 1.5M sentences are completely constructed. Therefore, we can obtain the linguistic features of each domain by comparing the completed corpus to the present corpus in later experiments.

3.5 Korean-English Parallel Corpus (Social Science)

Corpus Description

Similar to the technology-science specialized corpus, the social-science specialized Korean-English parallel corpus (https://aihub.or.kr/aidata/30720) was also published in 2021 through the cooperation of Twigfarm, Lexcode, Naver, TTA, and FNJ.

Figure 4: Data-domain statistics of the Korean-English Parallel Corpus (Social Science).

The entire corpus comprises 1.5M sentence pairs, including 300K instances of economic data, 90K instances of cultural content, 100K instances of tourism content, 400K instances of education data, 500K instances of law data, and 110K instances of art domain content. The ratio of each domain to the corresponding corpus is shown in Figure 4. The data was revised by specialists in each domain and by translation experts. In this study, we leveraged the 537K sentence pairs available so far (law (37.2%), economy (24.2%), education (24.1%), tourism (5.9%), culture (4.5%), art (4%), and medical (0.08%)) because not all the data has been released yet.

Corpus Analysis

The data was also split into training and validation datasets, and the results of running LIWC on the training, validation, and entire datasets are shown in Table 3. The difference in each linguistic feature between the training and validation datasets is mostly within 0.1, although, as exceptions, some features, such as 'authentic' and 'emotional tone', show relatively sizable dissimilarities. This results from the incidental blending of datasets from various domains, such as law, culture, and economy.

First, the morphological analysis is homogeneous with the results presented in Section 3.2, but negations ('negate'), such as 'not' and 'never', are used the most among all the corpora. This result is contrary to the outcome for the technical-science corpus discussed in Section 3.4, suggesting that the frequency of plain and negative statements varies depending on the domain. In the case of pronouns, impersonal pronouns ('ipron') are used four times more than personal pronouns ('ppron'), as in the results of Section 3.2, which is caused by the characteristics of a corpus in which the written sentences mostly describe objects.

Considering the syntactic characteristics, the rate of analytical explanation is relatively high, indicating that this corpus logically describes domains such as economics, law, and education. There is a higher ratio of 'Sixltr' compared with the other corpora, demonstrating the increased use of long words on average. The low 'WPS' also suggests that short sentences are used. We also find that future focus ('focusfuture') accounts for the smallest percentage in the time orientations category. The reason for this is that the social and cultural corpora contain present-state-oriented explanations rather than future predictions.

As a characteristic of the semantic perspective, the biological process category shows the lowest score compared with the other corpora. Specifically, the 'body (body)' figure is within 0.2% as a result of the nature of the social science domain, which is far from biology-related topics. Additionally, the cognitive process, reflecting human thinking, is the highest among the corpora, with 1.5 to 1.8 times higher 'insight' and 'cause'. This confirms that 'insight (insight)' and 'cause (cause)' are attributes of written sentences in the social science domain, in contrast to the technical science domain and other specialized fields. The prevalence of 'posemo' is twice as high as that of 'negemo'. Regarding the phenomenon of gender bias, male-related pronouns are approximately twice as frequent as female-related pronouns, similar to the other corpora.

Similarly, because this corpus currently has a different ratio of domains from the dataset that will be completed at 1.5M, as described in Section 3.4, we can infer and analyze the linguistic features of each domain by comparing these results with those of the dataset once it is updated to 1.5M.

Categories | Features | Ko-En Parallel Corpus (Social Science): train / valid / total | Ko-En Parallel Corpus (Technology): train / valid / total
summary language variables Analytic 97.38 97.38 97.38 99 99 99
Clout 53.86 53.87 53.86 49.13 49.17 49.13
Authentic 23.99 24.26 24.02 16.69 16.49 16.67
Tone 47.66 47.91 47.69 34.02 33.99 34.02
Linguistic dimensions WC 9,991,408 1,250,115 11,241,523 13,621,209 1,702,722 15,323,931
WPS 20.77 20.79 20.77 17.87 17.87 17.87
Sixltr 30.65 30.61 30.65 29.02 29.03 29.02
Dic 80.71 80.71 80.71 67.19 67.17 67.19
function 46.81 46.81 46.81 42.15 42.19 42.15
pronoun 5.32 5.31 5.32 2.04 2.06 2.04
ppron 0.96 0.96 0.96 0.14 0.15 0.14
i 0.14 0.14 0.14 0.03 0.03 0.03
we 0.22 0.22 0.22 0.02 0.02 0.02
you 0.1 0.1 0.1 0.02 0.01 0.02
shehe 0.16 0.17 0.16 0.01 0.02 0.01
they 0.34 0.34 0.34 0.06 0.07 0.06
ipron 4.35 4.34 4.35 1.9 1.91 1.9
article 11.42 11.42 11.42 14.95 14.95 14.95
prep 16.14 16.13 16.14 13.2 13.23 13.2
grammar auxverb 7.23 7.21 7.23 7.68 7.7 7.68
adverb 2.44 2.42 2.44 1.35 1.34 1.35
conj 5.41 5.44 5.41 3.77 3.76 3.77
negate 0.85 0.86 0.85 0.3 0.3 0.3
verb 10.3 10.25 10.29 9.51 9.53 9.51
adj 4.73 4.73 4.73 3.32 3.32 3.32
compare 2.6 2.57 2.6 2.15 2.15 2.15
interrog 0.88 0.88 0.88 0.54 0.54 0.54
number 1.89 1.87 1.89 6.28 6.27 6.28
quant 1.58 1.58 1.58 1.63 1.62 1.63
affective process affect 3.78 3.79 3.78 1.73 1.73 1.73
posemo 2.45 2.46 2.45 1.09 1.09 1.09
negemo 1.27 1.26 1.27 0.62 0.62 0.62
anx 0.21 0.2 0.21 0.15 0.15 0.15
anger 0.24 0.25 0.24 0.05 0.05 0.05
sad 0.22 0.22 0.22 0.23 0.23 0.23
social process social 4.27 4.28 4.27 1.83 1.84 1.83
family 0.12 0.13 0.12 0.05 0.05 0.05
friend 0.06 0.07 0.06 0.09 0.09 0.09
female 0.1 0.09 0.1 0.04 0.04 0.04
male 0.18 0.19 0.18 0.03 0.03 0.03
cognitive process cogproc 11.14 11.13 11.14 9.6 9.59 9.6
insight 3.34 3.34 3.34 2.09 2.07 2.09
cause 2.78 2.76 2.78 2.29 2.3 2.29
discrep 1.06 1.06 1.06 0.35 0.34 0.35
tentat 1.89 1.89 1.89 3.71 3.72 3.71
certain 1.09 1.09 1.09 0.45 0.44 0.45
differ 2.63 2.64 2.63 1.72 1.72 1.72
perceptual process percept 1.16 1.17 1.16 2.17 2.17 2.17
see 0.56 0.56 0.56 1.37 1.37 1.37
hear 0.26 0.25 0.26 0.2 0.2 0.2
feel 0.18 0.19 0.18 0.42 0.42 0.42
Biological process bio 0.8 0.81 0.8 1.42 1.42 1.42
body 0.18 0.19 0.18 0.41 0.41 0.41
health 0.45 0.44 0.45 0.8 0.79 0.8
sexual 0.03 0.03 0.03 0.03 0.03 0.03
ingest 0.15 0.15 0.15 0.22 0.22 0.22
drives drives 7.95 8.02 7.96 4.59 4.58 4.59
affiliation 1.3 1.32 1.3 0.66 0.66 0.66
achieve 1.87 1.9 1.87 1.46 1.46 1.46
power 3.87 3.91 3.87 2.08 2.08 2.08
reward 0.83 0.84 0.83 0.37 0.37 0.37
risk 0.83 0.82 0.83 0.34 0.34 0.34
time-orientations focuspast 2.66 2.66 2.66 1.34 1.36 1.34
focuspresent 6.98 6.92 6.97 6.15 6.14 6.15
focusfuture 0.76 0.76 0.76 2.85 2.85 2.85
relativity relativ 11.88 11.93 11.89 11.72 11.69 11.72
motion 1.49 1.52 1.49 1.43 1.42 1.43
space 7.34 7.36 7.34 7.29 7.29 7.29
time 3 3.01 3 3.16 3.15 3.16
personal concerns work 8.42 8.41 8.42 2.25 2.26 2.25
leisure 0.76 0.76 0.76 0.44 0.42 0.44
home 0.32 0.32 0.32 0.28 0.28 0.28
money 3.2 3.17 3.2 0.38 0.38 0.38
relig 0.13 0.12 0.13 0.02 0.02 0.02
death 0.08 0.08 0.08 0.04 0.05 0.04
informal language informal 0.13 0.13 0.13 0.18 0.18 0.18
swear 0.01 0.01 0.01 0.03 0.03 0.03
netspeak 0.06 0.06 0.06 0.12 0.12 0.12
assent 0.02 0.02 0.02 0.01 0.01 0.01
nonflu 0.05 0.04 0.05 0.04 0.04 0.04
filler 0 0 0 0 0 0
punctuations AllPunc 11.75 11.76 11.75 11.17 11.16 11.17
Period 4.81 4.81 4.81 5.65 5.65 5.65
Comma 4.66 4.66 4.66 3.54 3.54 3.54
Colon 0.01 0.01 0.01 0.01 0.01 0.01
SemiC 0 0.01 0 0 0 0
QMark 0.04 0.04 0.04 0 0 0
Exclam 0 0 0 0 0 0
Dash 0.83 0.84 0.83 0.82 0.8 0.82
Quote 0.12 0.12 0.12 0.02 0.02 0.02
Apostro 0.63 0.65 0.63 0.12 0.12 0.12
Parenth 0.37 0.37 0.37 0.69 0.7 0.69
OtherP 0.26 0.26 0.26 0.31 0.31 0.31
Table 3: LIWC results of the Korean-English Parallel Corpus (Social Science) and the Korean-English Parallel Corpus (Technology).

3.6 Korean-Chinese Parallel Corpus (Technology)

Corpus Description

AI Hub also provides the technology-domain specialized Korean-Chinese parallel corpus (https://aihub.or.kr/aidata/30722). This corpus is the first publicly released Korean-Chinese parallel corpus. To build it, six companies cooperated: Saltlux Partners, Flitto, Evertran, Onasia (https://on-asialang.com/), Yoon's Information Development Company, and dmtlabs (http://dmtlabs.co.kr/).

Figure 5: Data-domain statistics of the Korean-Chinese Parallel Corpus (Technology).

The entire corpus comprises 1.3M sentence pairs, including 250K instances of medical/health data, 150K instances of patent/technology data, 300K instances of car/traffic/material data, and 600K instances of IT/computer/mobile-related content. Figure 5 shows the ratio of each domain. This corpus is subdivided into the training and validation datasets, and Table 4 shows the LIWC analysis results of both datasets.

Owing to the characteristics of the Chinese language, there are differences between the training and validation datasets in count-based analyses, such as 'WPS' and 'Sixltr', but they are subtle. We inspect the linguistic characteristics of English and Chinese by comparing this corpus with the technology-specialized Korean-English parallel corpus analyzed in Section 3.4.

Corpus Analysis

In the morphological analysis, unlike the result of Section 3.4, where 'common verbs' appear 1.7 times more often than 'auxverb', this corpus shows the smallest difference between the two features, at 1.45 times. Additionally, the results indicate that it rarely uses quantifiers ('quant') and negations ('negate'). We also establish that this corpus shows notable differences in the pronoun analysis. Unlike Section 3.4, where personal pronouns were rarely used, personal pronouns are used approximately three times as often as impersonal pronouns; among them, '1st pers plural (we)' is the most common, and the 3rd person pronouns (i.e., 3rd pers singular and plural) are rarely used.

Due to the nature of data in the technical field, declarative texts take up a large portion of the whole dataset; thereby, 'semicolons (SemiC)', 'colons (Colon)', 'dashes (Dash)', and 'QMark' are rarely used, as demonstrated in Section 3.4.

In terms of syntactic analysis, 'analytic' and the confidence measure, 'Clout', are the highest among all the Korean-English corpora we analyzed. As the primary purpose of data in the technology domain is to convey existing information proposed previously, the present tense is less focal than the past and future tenses.

Notably, 'analytic' is 5.9 points lower than in Section 3.4, whereas 'WPS' and 'Sixltr' are much higher. These results show that the sentences and words used in Chinese are longer than those used in English. Additionally, unlike most Korean-English parallel corpora, including that of Section 3.4, 'article' and 'prep' are scarcely used, and the use of all tenses in the time orientations category with nearly equal weight is also a characteristic of Chinese.

In the punctuations category, the usage frequency of 'colon' is similar to that of Section 3.4's results. This characteristic reflects the existence of multiple contents in one sentence. Additionally, 'number', which directly represents a number, is higher than 'quant', which represents a quantitative description. However, informal language markers are rarely used. It is noteworthy that 'quotes (Quote)', which are hardly used in the Korean-English parallel corpora, account for 12.7%. The presence of many quotations in this corpus illustrates the differences between the English and Chinese corpora.

Considering the semantic aspects, 'work' and 'leisure' have the highest ratios in the personal concerns category. In addition, the perceptual process, biological process, and cognitive process categories are higher than in the other Korean-Chinese corpus, which is similar to the results presented in Section 3.4.

Through this qualitative inspection, we conclude that corpora covering similar domains are semantically almost identical, apart from the morphological and syntactic distinctions of each language.

Overall, in the sentiment analysis, all outcomes in the affective process category, including emotional tone in the summary language variables category, are low. This is because, as discussed in Section 3.4, the corpus consists mainly of sentences that describe knowledge and phenomena. Notably, ‘posemo’ appeared approximately six times more often than ‘negemo’, which is similar to the results presented in Section 3.4.

Regarding gender bias, most of the Korean-English parallel corpora analyzed so far exhibited gender bias, whereas none of the Chinese corpora did.

3.7 Korean-Chinese Parallel Corpus (Social Science)

Corpus Description

Along with the technology-domain specialized corpus, AI Hub also released a social science-domain specialized Korean-Chinese parallel corpus (https://aihub.or.kr/aidata/30721). To build this corpus, six companies cooperated, including Saltlux partners, Flitto, Evertran, Onasia, Yoon’s information development company, and dmtlabs.

Figure 6: Data-domain statistics of the Korean-Chinese Parallel Corpus (Social Science).

The corpus contains 1.3M sentence pairs in total, including 200K instances of financial/stock market content, 200K instances of social/welfare domain data, 100K instances of education data, 150K instances of cultural heritage/local/K-food content, 250K by-law texts, 250K instances of political/administration data, and 200K instances of K-POP/culture content. The ratio of each domain to the entire corpus is shown in Figure 6.

Corpus Analysis

As shown in Table 4, the overall characteristics of the training and validation datasets are almost identical, and the overall analysis results are generally similar to those presented in Section 3.6, except for a few aspects. The corpus shows a similar ratio of ‘conjunctions (conj)’, ‘negations (negate)’, ‘comparisons (compare)’, and ‘interrogatives (interrog)’ in the grammar category. Through the syntactic analysis, we establish that each sentence contains relatively frequent ‘preposition’, ‘comma’, ‘question mark (QMark)’, and ‘quote’ features. This indicates that sentences are rather short and that the proportions of questions and quotes are relatively high; we can infer that descriptive methods that sequentially list various types of information are commonly used.

We can point out a feature in common with Section 3.6: personal pronouns are used three times as often as impersonal pronouns. However, this corpus is distinguishable in that first- and second-person singular pronouns are used more frequently than first-person plural pronouns. These results reflect the domains of the corpus, in which descriptions of society, culture, and politics mainly focus on “I” and “You”.

Furthermore, in this corpus, the prevalence of ‘posemo’ is higher than that of ‘negemo’, and gender bias rarely exists. In future work, by analyzing written and colloquial Chinese corpora from various fields, we will verify whether this is a linguistic characteristic of Chinese or a special case arising from descriptions in a specialized field.

Categories | Features | Ko-Zh Parallel Corpus (Social Science): train, valid, total | Ko-Zh Parallel Corpus (Technology): train, valid, total
summary language variables Analytic 93.25 93.24 93.25 93.15 93.09 93.14
Clout 50.44 50.49 50.45 51.73 52.14 51.78
Authentic 1 1 1 1 1 1
Tone 26.97 27.05 26.98 29.18 29.75 29.25
Linguistic dimensions WC 3,907,897 482,441 4,390,338 3,954,651 512,873 4,467,524
WPS 598.73 724.39 612.54 2569.62 2947.55 2613.01
Sixltr 66.89 66.68 66.87 66.35 64.84 66.18
Dic 1.02 1.07 1.03 3.27 3.81 3.33
function 0.21 0.21 0.21 0.55 0.67 0.56
pronoun 0.07 0.08 0.07 0.27 0.35 0.28
ppron 0.06 0.06 0.06 0.2 0.25 0.21
i 0.03 0.03 0.03 0.05 0.06 0.05
we 0.01 0.01 0.01 0.01 0.01 0.01
you 0.02 0.02 0.02 0.15 0.19 0.15
shehe 0 0 0 0 0 0
they 0 0 0 0 0 0
ipron 0.02 0.02 0.02 0.07 0.09 0.07
article 0.06 0.05 0.06 0.08 0.09 0.08
prep 0.05 0.05 0.05 0.15 0.17 0.15
grammar auxverb 0.01 0.02 0.01 0.02 0.03 0.02
adverb 0.01 0.01 0.01 0.02 0.03 0.02
conj 0.01 0.01 0.01 0.04 0.04 0.04
negate 0.01 0.01 0.01 0 0 0
verb 0.08 0.08 0.08 0.16 0.17 0.16
adj 0.07 0.08 0.07 0.27 0.3 0.27
compare 0.01 0.01 0.01 0.02 0.02 0.02
interrog 0.01 0.01 0.01 0.02 0.02 0.02
number 1.45 1.41 1.45 1.33 1.16 1.31
quant 0.01 0.01 0.01 0.04 0.04 0.04
affective process affect 0.12 0.13 0.12 0.29 0.35 0.3
posemo 0.1 0.1 0.1 0.25 0.29 0.25
negemo 0.02 0.03 0.02 0.04 0.06 0.04
anx 0 0 0 0 0.01 0
anger 0 0.01 0 0.02 0.02 0.02
sad 0.01 0.01 0.01 0.01 0.01 0.01
social process social 0.13 0.14 0.13 0.34 0.41 0.35
family 0.01 0.01 0.01 0.01 0.01 0.01
friend 0.01 0.02 0.01 0.04 0.04 0.04
female 0.02 0.01 0.02 0.02 0.03 0.02
male 0.02 0.02 0.02 0.02 0.02 0.02
cognitive process cogproc 0.05 0.06 0.05 0.2 0.24 0.2
insight 0.02 0.02 0.02 0.09 0.11 0.09
cause 0.02 0.02 0.02 0.08 0.11 0.08
discrep 0 0.01 0 0 0.01 0
tentat 0.01 0.01 0.01 0.01 0.01 0.01
certain 0.01 0.01 0.01 0.03 0.03 0.03
differ 0 0 0 0.01 0.01 0.01
perceptual process percept 0.09 0.09 0.09 0.22 0.25 0.22
see 0.04 0.04 0.04 0.11 0.13 0.11
hear 0.03 0.03 0.03 0.05 0.06 0.05
feel 0.01 0.01 0.01 0.03 0.03 0.03
Biological process bio 0.06 0.07 0.06 0.18 0.21 0.18
body 0.01 0.01 0.01 0.04 0.05 0.04
health 0.02 0.02 0.02 0.08 0.1 0.08
sexual 0 0 0 0.01 0.01 0.01
ingest 0.02 0.02 0.02 0.06 0.06 0.06
drives drives 0.16 0.18 0.16 0.57 0.71 0.59
affiliation 0.06 0.06 0.06 0.22 0.27 0.23
achieve 0.03 0.03 0.03 0.11 0.14 0.11
power 0.07 0.08 0.07 0.2 0.26 0.21
reward 0.02 0.02 0.02 0.06 0.07 0.06
risk 0 0 0 0.03 0.04 0.03
time-orientations focuspast 0.01 0.01 0.01 0.02 0.03 0.02
focuspresent 0.06 0.07 0.06 0.14 0.16 0.14
focusfuture 0.01 0.01 0.01 0.02 0.02 0.02
relativity relativ 0.17 0.18 0.17 0.65 0.72 0.66
motion 0.03 0.04 0.03 0.13 0.15 0.13
space 0.09 0.08 0.09 0.35 0.38 0.35
time 0.05 0.06 0.05 0.17 0.19 0.17
personal concerns work 0.11 0.12 0.11 0.52 0.56 0.52
leisure 0.11 0.11 0.11 0.35 0.49 0.37
home 0.01 0.01 0.01 0.06 0.06 0.06
money 0.06 0.06 0.06 0.26 0.24 0.26
relig 0.01 0.01 0.01 0.02 0.02 0.02
death 0 0 0 0.01 0.02 0.01
informal language informal 0.09 0.09 0.09 0.32 0.41 0.33
swear 0 0 0 0 0.01 0
netspeak 0.07 0.08 0.07 0.31 0.4 0.32
assent 0.03 0.03 0.03 0.03 0.03 0.03
nonflu 0.01 0.01 0.01 0.01 0.01 0.01
filler 0 0 0 0 0 0
punctuations AllPunc 66.01 63.97 65.79 63.19 63.24 63.2
Period 0.15 0.15 0.15 0.1 0.12 0.1
Comma 42.3 41.26 42.19 38.9 37.37 38.72
Colon 1.09 0.94 1.07 1.03 0.82 1.01
SemiC 0.01 0.01 0.01 0.01 0.01 0.01
QMark 0.18 0.13 0.17 0.05 0.05 0.05
Exclam 0.05 0.06 0.05 0.02 0.02 0.02
Dash 0.52 0.52 0.52 1.14 1.15 1.14
Quote 15.52 14.31 15.39 12.68 12.89 12.7
Apostro 0.52 0.46 0.51 0.57 0.5 0.56
Parenth 3.79 4.28 3.84 6.65 8.53 6.87
OtherP 1.89 1.86 1.89 2.05 1.79 2.02
Table 4: LIWC results for the Korean-Chinese (Zh) Parallel Corpus (Social Science) and the Korean-Chinese (Zh) Parallel Corpus (Technology).

3.8 Korean-Japanese Parallel Corpus

AI Hub released this public Korean-Japanese parallel corpus (https://aihub.or.kr/aidata/30723) for the first time in Korea. Each sentence pair in the corpus was generated by translating Korean sentences from various domains into Japanese using MT systems, after which the outputs were revised by human experts. The corpus is not biased toward a specific industrial domain and is constructed from raw data sources; therefore, it is free from copyright problems. These attributes enable the corpus to be widely utilized in NLP industrial services that deal with various domains. To build this corpus, six companies cooperated, including Saltlux partners, Flitto, Evertran, Onasia, Yoon’s information development company, and dmtlabs.

Figure 7: Data-domain statistics of the Korean-Japanese Parallel Corpus.

The entire corpus comprises 1.3M sentence pairs, including 150K instances of cultural heritage/local/K-food content, 200K instances of K-POP/culture content, 200K instances of IT/computer/mobile domain data, 200K instances of finance/stock market content, 200K instances of social/welfare data, 100K instances of education data, 150K instances of patent/technology domain data, 100K instances of medical/health content, and 200K instances of car-related data. The ratio of each domain to the entire corpus is shown in Figure 7. As LIWC does not support Japanese (no Japanese dictionary has been publicly released), we skip the LIWC analysis for this corpus.

4 Experiments and Results

4.1 Dataset Details

In this study, we utilize the seven types of Korean parallel corpora released by AI Hub as training data for our experiments. For each corpus, we measure the total number of sentences and the minimum, maximum, and average lengths at both the word and character levels. Statistics for the seven newly released parallel corpora are listed in Table 5. In the case of the social science and technology fields of the Korean-English parallel corpus, only 470K and 690K instances were released owing to unintentional circumstances of the organizers. We leverage the official training and validation datasets for training and evaluation. Before training, we set aside 3K instances of the training set as a test set for measuring MT performance. The performance of each NMT model in our experiments is measured by the BLEU score papineni2002bleu, the common metric in the NMT field; for precise evaluation, we adopt Jieba (https://github.com/fxsjy/jieba) and MeCab (https://github.com/taku910/mecab) as tokenizers for the Chinese and Japanese output sequences, respectively.
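To make this evaluation setup concrete, the following minimal sketch shows the language-specific tokenization step applied before BLEU scoring; it assumes the jieba and mecab-python3 packages, and the file names are hypothetical.

```python
# Sketch of the tokenization applied to Chinese/Japanese output before BLEU.
import jieba
import MeCab

mecab = MeCab.Tagger("-Owakati")  # wakati mode: space-separated tokens

def tokenize_zh(line: str) -> str:
    # Jieba segments Chinese text, which contains no spaces, into words
    return " ".join(jieba.cut(line.strip()))

def tokenize_ja(line: str) -> str:
    # MeCab's -Owakati output is already space-separated
    return mecab.parse(line.strip()).strip()

# Space-tokenized hypothesis/reference files can then be scored with any
# word-level BLEU implementation (e.g., multi-bleu.perl).
with open("hyp.zh", encoding="utf-8") as fin, \
     open("hyp.tok.zh", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(tokenize_zh(line) + "\n")
```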

Corpus | Split | Lang | # of Sents | # of min toks | # of max toks | Avg toks per S | # of min chars | # of max chars | Avg chars per S
Korean-English Parallel Corpus Train KO 1,399,116 1 78 12.97 4 359 55.07
EN 1,399,116 2 180 22.57 10 999 135.24
Valid KO 200,302 2 32 17.46 9 220 75.36
EN 200,302 3 110 30.73 14 706 187.89
Test KO 3,000 7 25 9.02 25 113 35.71
EN 3,000 5 12 10.46 31 101 66.50
Korean-English Parallel Corpus (Social Science) Train KO 474,967 2 138 12.57 11 629 52.94
EN 474,967 2 271 20.71 9 1,617 128.78
Valid KO 59,746 6 144 12.58 21 636 52.96
EN 59,746 6 280 20.73 29 1,550 128.88
Test KO 3,000 6 45 12.60 22 236 53.00
EN 3,000 7 69 20.86 35 451 129.33
Korean-English Parallel Corpus (Technology) Train KO 697,665 4 37 12.25 21 180 51.97
EN 697,665 1 52 19.23 9 311 115.88
Valid KO 87,583 4 35 12.25 23 155 51.95
EN 87,583 5 48 19.24 37 310 115.86
Test KO 3,000 6 34 12.23 26 161 51.98
EN 3,000 6 42 19.17 45 294 115.75
Korean-English Domain-Specialized Parallel Corpus Train KO 1,197,000 3 35 15.30 11 304 66.18
EN 1,197,000 1 145 27.53 1 1,001 167.23
Valid KO 150,000 5 32 15.30 14 192 66.21
EN 150,000 4 97 27.55 26 691 167.34
Test KO 3,000 7 30 15.50 27 147 67.79
EN 3,000 7 69 28.56 46 433 175.57
Korean-Japanese Parallel Corpus Train KO 1,197,000 3 35 15.65 11 216 67.56
JA 1,197,000 NA NA NA 9 250 61.63
Valid KO 150,000 4 31 15.70 18 243 67.87
JA 150,000 NA NA NA 13 241 61.69
Test KO 3,000 4 30 14.15 20 168 61.99
JA 3,000 NA NA NA 16 186 58.87
Korean-Chinese Parallel Corpus (Social Science) Train KO 1,037,000 3 78 15.95 12 359 69.03
ZH 1,037,000 NA NA NA 5 259 46.73
Valid KO 130,000 4 52 15.66 12 283 68.04
ZH 130,000 NA NA NA 7 200 46.29
Test KO 3,000 6 30 14.35 25 151 62.12
ZH 3,000 NA NA NA 11 117 37.77
Korean-Chinese Parallel Corpus (Technology) Train KO 1,037,000 2 35 15.82 10 236 69.01
ZH 1,037,000 NA NA NA 7 296 48.22
Valid KO 130,000 3 31 15.93 17 213 69.71
ZH 130,000 NA NA NA 9 199 49.07
Test KO 3,000 4 30 15.07 22 163 65.75
ZH 3,000 NA NA NA 14 181 45.91
Table 5: Summary of all AI Hub datasets. For token statistics, we denote NA for Japanese and Chinese because these languages do not use spaces.

4.2 Model Details

To verify the quality of the datasets provided by AI Hub, we constructed a transformer-based NMT model vaswani2017attention trained on each dataset. The transformer is an auto-regressive model that comprises an encoder-decoder architecture and is widely utilized in many NLP research fields, including NMT, to achieve SOTA performance. The model eschews recurrence and builds its encoder-decoder architecture mainly from attention structures, which considerably reduces the required training time by allowing significantly more parallelization during training. The attention-based structure of the transformer can also relieve the long-term dependency problem of RNNs and LSTMs hochreiter1997long. The output of the attention structure can be described as Equation (1).

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QW^{Q}\,(KW^{K})^{\top}}{\sqrt{d_{k}}}\right)VW^{V} \tag{1}$$

In Equation (1), $W^{Q}$, $W^{K}$, and $W^{V}$ refer to trainable parameters. The attention structure takes three inputs, query, key, and value, denoted as $Q$, $K$, and $V$, respectively. Through this structure, the transformer obtains the relational information between the input sentence and the sentence being generated: the embedding obtained from the input sentence is fed to the attention structure as $K$ and $V$, and the embedding from the generated sentence is regarded as $Q$. The attention structure is also leveraged to obtain bidirectional contextual information within the input sentence and the generated sentence through the self-attention mechanism, which takes an identical embedding value as $Q$, $K$, and $V$ simultaneously.
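As an illustration of Equation (1), the following is a minimal PyTorch sketch of a single attention head; the class and dimension names are illustrative assumptions, not the authors' implementation.

```python
# Minimal single-head attention corresponding to Equation (1).
import math
import torch
import torch.nn as nn

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_model: int = 512, d_k: int = 64):
        super().__init__()
        # W^Q, W^K, W^V: the trainable parameters from Equation (1)
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.w_v = nn.Linear(d_model, d_k, bias=False)

    def forward(self, query, key, value):
        # Encoder-decoder attention: key/value come from the input sentence,
        # query from the sentence being generated. Self-attention: all three
        # receive the identical embedding.
        q, k, v = self.w_q(query), self.w_k(key), self.w_v(value)
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
        return torch.softmax(scores, dim=-1) @ v
```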

We construct a transformer NMT model trained on each AI Hub dataset and regard the performance of the NMT model as a proxy for the quality of the parallel corpus, by keeping all training conditions identical except for the training dataset. The training objective of a transformer-based NMT model trained on a parallel corpus can be described as Equation (2).

$$\mathcal{L}(\theta) = -\sum_{i=1}^{|D|}\sum_{t=1}^{T}\log P\!\left(y^{i}_{t}\mid y^{i}_{<t},\, x^{i};\,\theta\right) \tag{2}$$

The overall process is similar to the training of a sequence-to-sequence sutskever2014sequence MT model. In Equation (2), $x^{i}$ and $y^{i}$ indicate the source and target sentences in the training corpus $D$, respectively. Each target sentence comprises $T$ tokens, denoted as $y^{i}_{t}$, and through this training process, the model learns to generate $y^{i}_{t}$ auto-regressively.
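The objective in Equation (2) amounts to token-level cross-entropy with teacher forcing, as in the minimal sketch below; the model interface and variable names are assumptions.

```python
# Sketch of Equation (2): negative log-likelihood over target tokens.
import torch
import torch.nn.functional as F

def nll_loss(model, src, tgt, pad_id: int = 0):
    # Predict token y_t from the target prefix y_<t and the source x
    logits = model(src, tgt[:, :-1])   # assumed shape: (batch, T-1, vocab)
    gold = tgt[:, 1:]                  # shifted gold tokens y_t
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        gold.reshape(-1),
        ignore_index=pad_id,           # padding tokens are not scored
    )
```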

In our training process, we use the Adam optimizer with Noam decay, and the batch size is set to 4,096 in all experiments. The transformer NMT model in our experiments consists of six encoder and six decoder layers (six attention blocks) with eight attention heads, and the model dimensionality and embedding size are 512.
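For reference, the Noam decay schedule can be sketched as follows; the warmup value is an assumption, since the exact setting is not stated above.

```python
# Noam learning-rate schedule: linear warmup followed by inverse-sqrt decay.
# Can be plugged into torch.optim.lr_scheduler.LambdaLR with a base lr of 1.0.
def noam_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```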

For pre-processing the training data, we utilize the SentencePiece kudo2018sentencepiece subword tokenization method with a vocabulary size of 32,000. We randomly extract 5,000 and 3,000 samples from the training data for the validation and test sets, respectively. The performance of all translation results is evaluated with the BLEU score, using the multi-bleu.perl script provided by Moses (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl).
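A minimal sketch of this subword preprocessing with the sentencepiece package follows; the file names are hypothetical, while the 32,000 vocabulary size matches the setting above.

```python
# Train a SentencePiece model and segment text into subwords.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.ko",        # raw training sentences, one per line (hypothetical)
    model_prefix="spm_ko",
    vocab_size=32000,        # vocabulary size used in our experiments
)

sp = spm.SentencePieceProcessor(model_file="spm_ko.model")
pieces = sp.encode("예시 문장입니다.", out_type=str)  # subword tokens
```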

4.3 Main Results

Corpus Language BLEU
Korean-English Parallel corpus KO-EN 28.36
EN-KO 13.53
Korean-English Parallel corpus (Social Science) KO-EN 45.64
EN-KO 17.71
Korean-English Parallel corpus (Technology) KO-EN 63.88
EN-KO 39.17
Korean-English Domain-specialized Parallel corpus KO-EN 51.88
EN-KO 21.99
Korean-Japanese Parallel corpus KO-JA 68.88
JA-KO 49.05
Korean-Chinese Parallel corpus (Social Science) KO-ZH 48.74
ZH-KO 25.16
Korean-Chinese Parallel corpus (Technology) KO-ZH 46.70
ZH-KO 25.75
Table 6: Experimental results for seven datasets and three language pairs published by AI Hub.

Performance analysis

The baseline results for the seven AI Hub parallel corpora are listed in Table 6. The Korean-English NMT model trained on the Korean-English parallel corpus achieved a BLEU score of 28.36. The NMT model trained on the Korean-English parallel corpus (technology) achieved a BLEU score of over 60, which indicates that words and expressions in a specific domain recur frequently. In similar contexts, the models for the Korean-English parallel corpus (social science) and the Korean-English domain-specialized parallel corpus also demonstrated high performance, with scores of 45.64 and 51.88, respectively.

Considering the significant performance gap between the domain-specific and general corpora, we can point out a probable limitation of the corpus construction. Although the performance on all domain corpora is overwhelmingly higher than on the general corpus, this result does not guarantee that the NMT model operates well, because we randomly extracted the test set from within the training set. Corpora built around a particular domain typically contain substantial overlap among their sentences, yet many other expressions and words exist in the field, which can cause difficulties in translating them. Therefore, our experimental results suggest that corpus builders should include much more diverse expressions, especially in specific domains, given that a well-constructed corpus makes a model smarter.

The Korean to Japanese NMT model based on the Korean-Japanese parallel corpus scored 68.88. The BLEU scores of the NMT models trained using the Korean-Chinese parallel corpus (Social Science) and the Korean-Chinese parallel corpus (Technology) are 48.74 and 46.70, respectively.

Language direction analysis

Conducting both Korean-to-English translation and the opposite direction on the four Korean-English parallel corpora, we find that the results for translating Korean to English differ significantly from those of the opposite direction, with BLEU gaps ranging from 14.83 to 29.89.

This can be interpreted in terms of data construction. With parallel corpora built by translating sentences from one language into another, translation results can be awkward when training a model in the opposite direction. Thus, a reasonable process for training direction-robust NMT models is to build the parallel corpus by taking about half of the sentences from one language and the other half from the other language and translating each. In other words, given the significant differences in performance when the translation direction is reversed, it is highly likely that the corpus was built by translating only a monolingual corpus in the source language, without considering the opposite direction. Similarly, for the Korean-Japanese and Korean-Chinese models, performance in the opposite direction was significantly reduced. These aspects should be considered when building parallel datasets in the future.

In this paper, such a problem is defined as “data imbalance” cai2015challenges; park2021study, and it must be addressed when constructing data in the future. For high-quality data, it is important that the various elements are ultimately built in a balanced manner, and we conducted further analysis in this respect.

Correlation Analysis between LIWC and BLEU score

We analyzed the correlation between LIWC features and BLEU scores to observe the connection between them. To do this, we employed the BLEU scores derived from the Korean-English corpus results in Table 6 and used only the LIWC features of the Korean-English case. As shown in Figure 8, we calculated the Pearson correlation benesty2009pearson among all the LIWC features, the BLEU score (KO-EN), and the BLEU score (EN-KO).
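The computation can be sketched as follows, assuming a hypothetical table with one row per corpus, LIWC feature columns, and the two BLEU columns:

```python
# Pearson correlation between LIWC features and BLEU scores.
import pandas as pd

df = pd.read_csv("liwc_bleu_ko_en.csv")  # hypothetical input table
corr = df.corr(method="pearson", numeric_only=True)

# Each LIWC feature's correlation with the two translation directions
print(corr["BLEU_KO_EN"].sort_values())
print(corr["BLEU_EN_KO"].sort_values())
```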

Figure 8: Results of the correlation between LIWC features and BLEU score (KO-EN). Blue indicates a positive correlation, while red indicates a negative correlation.
Figure 9: Negative-correlation results for the important factors between the BLEU score (KO-EN) and LIWC features. The empty (white) cells indicate values cut off because they were positive. Note that these results are statistically significant.

We can infer the following results from Figure 8. First, the overall tendency of the correlations among LIWC features largely agrees with the analysis in Section 3. For example, ‘Analytic’ and the sentiment-related levels show a negative correlation; this is a consistent result, since ‘Analytic’ indicates whether emotions are excluded and the tone of the text is logical. In other words, these results lend validity to the LIWC analysis.

Second, we show that the correlations between LIWC features and the BLEU score are strongly negative. Training data can be considered good when it is balanced in tone, length, gender, and so on; however, there is much to improve in the AI Hub corpora, such as word count, punctuation usage, and sentiment, because of the numerous negative effects of data imbalance. This suggests the direction in which we should build data and indicates that performance can be improved through data cleaning, such as parallel corpus filtering (PCF) koehn2020findings; park-etal-2021-bts.

Additionally, in Figure 9, we distilled the features showing statistically significant negative correlations. The most striking result to emerge from the KO-EN and EN-KO BLEU scores is that these features exert a stronger negative effect on English-Korean translation, so more features need to be filtered than for Korean-English translation. This result motivates further research on which factors should be removed during the data filtering process. Our findings are also supported by the BLEU scores in Table 6, which are lower for EN-KO than for KO-EN, in terms of data imbalance.
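As a purely hypothetical illustration of such LIWC-informed filtering, the sketch below drops sentence pairs scoring high on features with significant negative correlations; the feature names, input file, and quantile cutoff are all assumptions rather than the method used here.

```python
# Hypothetical LIWC-informed filtering of a parallel corpus.
import pandas as pd

pairs = pd.read_csv("pairs_with_liwc.csv")   # src, tgt, per-pair feature scores
negative_feats = ["WPS", "Sixltr"]           # assumed examples of flagged features
cutoffs = {f: pairs[f].quantile(0.95) for f in negative_feats}

mask = pd.Series(True, index=pairs.index)
for f in negative_feats:
    mask &= pairs[f] <= cutoffs[f]           # keep pairs below each cutoff
filtered = pairs[mask]
```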

Finally, this paper identified the association between LIWC and the BLEU score in terms of data filtering, which may assist in establishing guidelines for building future datasets.

5 Discussion and Positive Impact of This Study

This paper conducted in-depth analyses of the various parallel corpora published by AI Hub. The structural components that directly determine the quality of each corpus were closely investigated through LIWC, and the practical usability of each corpus was quantitatively evaluated through the NMT model trained on it. Through these analyses, we aim to make a positive impact on the machine translation research field and identify a desirable direction for data construction. Specifically, the main contributions of our paper can be described as follows:

First, to the best of our knowledge, we are the first to perform quantitative and qualitative in-depth analyses of the AI Hub data. We adopted LIWC as an investigation tool for parallel corpora and derived various meaningful information (e.g., the quality of a parallel corpus) by newly interpreting each component obtained from LIWC. Although LIWC has generally been used in psychological research, it enables many aspects of corpus analysis, such as morphological and syntactic analysis. We confirmed that the results were suitable for the features of each corpus in most cases. For example, in Section 3.2, informal language markers such as swear words and fillers were used relatively often, although they were rarely used in the other corpora, because that corpus includes dialogues and spoken language. Additionally, the word count and comma results in Section 3.3 showed that the domain-specialized parallel corpus tends to explain terminology in long sentences with commas. In Section 3.4, the emotional tone of the technology corpus was relatively low, as it concisely explicates terminology rather than describing emotions. Since many texts in Section 3.5 concern the economy, the ‘money’ topic in the personal concerns category has the highest rate among the corpora. Furthermore, words per sentence in Sections 3.6 and 3.7 were the longest, owing to the characteristics of Chinese, which rarely uses spaces. Moving away from model-centric machine translation studies, this paper encourages data-centric research, which can positively impact the NMT research field by presenting a new perspective.

Second, we pointed out problems in the data construction process by revealing a significant discrepancy in performance between English-Korean and Korean-English NMT models trained on the identical parallel corpus. It can be inferred that this is caused by an improper construction strategy. When constructing a parallel corpus comprising two languages (e.g., Korean and English), it is desirable to build a balanced corpus by translating half of the sentences from the first language into the second and the remaining half from the second language into the first. Through our empirical analysis, we point out that this aspect may be underestimated.

Lastly, we revealed several important factors that determine the quality of a corpus. In Section 3.5, we can infer that domain uniformity was neglected, because the social science corpus contains medical text. Gender bias, which has a major influence on corpus quality, was also overlooked in several corpora; in particular, in Sections 3.3 and 3.5, there was a double gender bias. Additionally, we proposed that subject omission and coreference resolution problems should be further considered to ensure high-quality data.

Ultimately, this paper clearly analyzed the strengths and weaknesses of the existing AI Hub data and provided insight into the future direction of data construction.

In general, data filtering is approached mathematically or through modeling koehn2020findings; zhang2020parallel. Such approaches can also be reflected in our future direction of data construction. There are also LIWC-based analyses of text pope2016analysis; fast2016empath, and this previous research could be one option for enhancing our approach; for instance, a topic-related approach is useful for filtering out unrelated topics via LIWC analysis fast2016empath.

There are still limitations in the languages supported by LIWC. LIWC supports diverse languages, including Arabic, Chinese, Dutch, English, French, German, Italian, Portuguese, Russian, Serbian, Spanish, and Turkish, which are used in psychological or linguistic research in various countries garzon2020validacion; paixao2020fake. However, LIWC does not support low-resource languages such as Korean or Japanese, since dictionaries for them have not been created. In the case of Korean, K-LIWC Lee2004korean was once available, and there are some studies using it Kim2016k; however, we could not analyze the Korean corpora because K-LIWC is currently unavailable.

6 Conclusions

In this work, we evaluated the quality of all the Korean-related parallel corpora released by AI Hub. For model-centric performance validation, we constructed a transformer-based NMT model trained on each parallel corpus. Through quantitative and qualitative analysis of these NMT models, we pointed out some probable limitations in constructing the corpora. First, for an NMT model to learn a specific field well, domain corpora should contain diverse words and expressions, in consideration of the excessive performance difference between the domain and general corpora. Second, given the significant performance gap with respect to language direction, half of the parallel data should be composed in the source language and the other half in the target language, with each half translated accordingly.

Moving away from model-centric analysis, we encouraged data-centric research through LIWC analysis, identifying the association between LIWC features and model performance in terms of data filtering. Through this analysis, we suggested directions for further work to improve model performance. A national-level re-examination of the various standards and building processes should be undertaken to encourage AI data construction research. In the future, we plan to investigate efficient beam search strategies and new decoding methods utilizing these AI Hub data. Also, to measure model performance more accurately, we plan to build an official Korean-English test set.

References