Unsupervised Tokenization Learning

05/23/2022
by Anton Kolonin, et al.

In this study, we find that the so-called "transition freedom" metric appears superior to statistical metrics such as mutual information and conditional probability for unsupervised tokenization, yielding F-measure scores in the range of 0.71 to 1.0 across the multilingual corpora explored. Different languages require different variants of that metric (such as its derivative, variance, and "peak values") for successful tokenization. Larger training corpora do not necessarily result in better tokenization quality, while compressing the models by eliminating statistically weak evidence tends to improve performance. The proposed unsupervised tokenization technique provides quality better than or comparable to that of lexicon-based techniques, depending on the language.
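
To make the idea concrete, here is a minimal Python sketch of how a "peak value" of transition freedom could mark a token boundary. It rests on assumptions not stated in the abstract: transition freedom is taken to be the number of distinct characters observed immediately after a given character n-gram in the training text, and a boundary is placed after any position whose freedom is strictly higher than that of both neighbours. The function names (`train_forward_freedom`, `freedom_profile`, `split_on_peaks`) and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def train_forward_freedom(corpus, n=1):
    """Collect, for every character n-gram, the set of characters seen right after it."""
    successors = defaultdict(set)
    for text in corpus:
        for i in range(len(text) - n):
            successors[text[i:i + n]].add(text[i + n])
    return successors


def freedom_profile(text, successors, n=1):
    """Forward transition freedom at each position (n-gram ending at that position).

    Positions too close to the start to hold a full n-gram, and n-grams never
    seen in training, get a freedom of 0 (an assumption of this sketch).
    """
    profile = []
    for i in range(len(text)):
        start = i - n + 1
        if start < 0:
            profile.append(0)
        else:
            profile.append(len(successors.get(text[start:i + 1], ())))
    return profile


def split_on_peaks(text, profile):
    """Cut the text after every interior position that is a strict local peak."""
    tokens, last = [], 0
    for i in range(1, len(text) - 1):
        if profile[i] > profile[i - 1] and profile[i] > profile[i + 1]:
            tokens.append(text[last:i + 1])
            last = i + 1
    tokens.append(text[last:])
    return tokens


# Toy training data: "the" is followed by three different letters,
# so freedom peaks on its final character.
model = train_forward_freedom(["thecat", "thedog", "thepig"])
profile = freedom_profile("thecat", model)
print(profile)                            # [1, 1, 3, 1, 1, 1]
print(split_on_peaks("thecat", profile))  # ['the', 'cat']
```

In this toy case the letter "e" can be followed by "c", "d", or "p", so its freedom of 3 stands out against neighbours with a freedom of 1. The derivative and variance variants mentioned in the abstract would replace the peak test in `split_on_peaks` with a different criterion over the same profile.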

research
05/09/2012

Multilingual Topic Models for Unaligned Text

We develop the multilingual topic model for unaligned text (MuTo), a pro...
research
05/01/2020

Multilingual Unsupervised Sentence Simplification

Progress in Sentence Simplification has been hindered by the lack of sup...
research
03/22/2016

Multi-domain machine translation enhancements by parallel data extraction from comparable corpora

Parallel texts are a relatively rare language resource, however, they co...
research
12/05/2015

PJAIT Systems for the IWSLT 2015 Evaluation Campaign Enhanced by Comparable Corpora

In this paper, we attempt to improve Statistical Machine Translation (SM...
research
12/05/2015

Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents

The multilingual nature of the world makes translation a crucial require...
research
06/11/2018

Learning Multilingual Topics from Incomparable Corpus

Multilingual topic models enable crosslingual tasks by extracting consis...
research
02/16/2017

Fast and unsupervised methods for multilingual cognate clustering

In this paper we explore the use of unsupervised methods for detecting c...
