An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

10/06/2020
by Kyubyong Park, et al.

Tokenization is typically the first step in text processing. Because a token serves as the atomic unit that carries the contextual information of a text, how tokens are defined plays a decisive role in model performance. Although Byte Pair Encoding (BPE) is considered the de facto standard tokenization method because of its simplicity and universality, it remains unclear whether BPE works best across all languages and tasks. In this paper, we test several tokenization strategies to answer our primary research question: "What is the best tokenization strategy for Korean NLP tasks?" Experimental results show that a hybrid approach of morphological segmentation followed by BPE works best for Korean-to/from-English machine translation and for natural language understanding tasks such as KorNLI, KorSTS, NSMC, and PAWS-X. The exception is KorQuAD, the Korean extension of SQuAD, for which BPE segmentation proves most effective.
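To make the hybrid idea concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the text has already been split into morphemes (the segmentations here are hand-written for illustration, not produced by a morphological analyzer), and it learns BPE merges only inside each morpheme, so no subword crosses a morpheme boundary. The function names `learn_bpe`, `merge_pair`, `apply_bpe`, and `hybrid_tokenize` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(units, num_merges):
    """Learn up to `num_merges` BPE merges over a list of units (here: morphemes).

    Each unit is treated separately, so merges never span two units.
    """
    vocab = Counter(tuple(u) for u in units)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(unit, merges):
    """Split one unit into characters, then apply the learned merges in order."""
    symbols = tuple(unit)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def hybrid_tokenize(morphemes, merges):
    """Morphological segmentation (given as input) followed by BPE inside each morpheme."""
    tokens = []
    for morpheme in morphemes:
        tokens.extend(apply_bpe(morpheme, merges))
    return tokens


# Hand-segmented toy corpus (assumption): noun + particle pairs.
corpus_morphemes = ["학교", "에", "학교", "는", "학생", "은", "학생", "이"]
merges = learn_bpe(corpus_morphemes, num_merges=2)

seen = hybrid_tokenize(["학교", "에"], merges)    # morphemes seen in training
unseen = hybrid_tokenize(["학원", "에"], merges)  # "학원" was not in the toy corpus
```

In this sketch, a frequent morpheme is kept as a single token once its merge has been learned, while an unseen morpheme falls back to smaller pieces. Plain BPE would differ only in that it runs over unsegmented words, so its merges may straddle morpheme boundaries.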


