Empirical Analysis of Korean Public AI Hub Parallel Corpora and in-depth Analysis using LIWC

10/28/2021
by Chanjun Park, et al.

A machine translation (MT) system aims to translate a source language into a target language. Recent studies on MT systems mainly focus on neural machine translation (NMT). One factor that significantly affects NMT performance is the availability of high-quality parallel corpora. However, high-quality parallel corpora involving Korean are scarce compared with those available for high-resource languages such as German or Italian. To address this problem, AI Hub recently released seven types of parallel corpora for Korean. In this study, we conduct an in-depth verification of the quality of these parallel corpora using Linguistic Inquiry and Word Count (LIWC) and several related experiments. LIWC is a word-counting software program that analyzes corpora in multiple ways and extracts linguistic features on a dictionary basis. To the best of our knowledge, this study is the first to use LIWC to analyze parallel corpora in the field of NMT. Through a correlation analysis between LIWC features and NMT performance, our findings suggest directions for further research toward obtaining higher-quality parallel corpora.
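
To make the idea of dictionary-based word counting concrete, the following is a minimal sketch and not the LIWC software itself. The category names, the toy word lists, and the function `category_rates` are all hypothetical and chosen only for illustration; the actual LIWC dictionary and its categories are not reproduced here. The sketch reports, for each category, the percentage of tokens in a text that match that category's word list.

```python
import re
from collections import Counter

# Hypothetical toy dictionary (category -> word set); NOT the real LIWC lexicon.
TOY_DICTIONARY = {
    "pronoun": {"i", "you", "we", "they", "it"},
    "negation": {"no", "not", "never"},
    "posemo": {"good", "happy", "great"},
}


def category_rates(text, dictionary=TOY_DICTIONARY):
    """Return the percentage of tokens that fall into each dictionary category."""
    # Simple lowercase tokenization on letters and apostrophes (an assumption).
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    # Guard against empty input to avoid division by zero.
    return {
        category: (100.0 * counts[category] / total if total else 0.0)
        for category in dictionary
    }
```

Applied to each side of a parallel corpus, per-category rates of this kind could then be compared against translation quality scores, which is the sort of correlation analysis the abstract describes.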

