Self-tuning hyper-parameters for unsupervised cross-lingual tokenization

03/04/2023
by Anton Kolonin, et al.

We explore the possibility of meta-learning for language-independent unsupervised tokenization of English, Russian, and Chinese. We implement a meta-learning approach that automatically determines the hyper-parameters of the unsupervised tokenization model proposed in earlier works. The approach relies on human-independent fitness functions: normalised anti-entropy, compression factor, and cross-split F1 score, as well as additive and multiplicative combinations of these three metrics. We test each of them against the conventional F1 tokenization score. We find a fairly good correlation between the conventional F1 score and the additive combination of the three metrics for English and Russian. In the case of Chinese, we find a significant correlation between the F1 score and the compression factor. Our results suggest that robust unsupervised tokenization of low-resource and dead languages is possible. They also allow us to think about human languages in terms of the evolution of efficient symbolic communication codes, with different structural optimisation schemes having evolved in different human cultures.
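As a rough illustration of the quantities named in the abstract, the sketch below computes a span-level tokenization F1 score and combines three already-computed fitness values additively and multiplicatively. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the F1 definition (exact character-span match) is one common convention, and the unweighted sum and product are assumed forms of the "additive" and "multiplicative" combinations. The anti-entropy and compression-factor values are taken as given inputs, since the abstract does not define them.

```python
from typing import List, Set, Tuple


def token_spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Map a token sequence to the set of (start, end) character spans it covers.

    Assumes the tokens, concatenated in order, reproduce the tokenized text.
    """
    spans = set()
    pos = 0
    for tok in tokens:
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def tokenization_f1(predicted: List[str], reference: List[str]) -> float:
    """Span-level F1: a predicted token is correct only if the reference
    contains a token with exactly the same character span."""
    pred, ref = token_spans(predicted), token_spans(reference)
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


def additive_fitness(anti_entropy: float, compression: float, cross_split_f1: float) -> float:
    """Assumed additive combination: an unweighted sum of the three metrics."""
    return anti_entropy + compression + cross_split_f1


def multiplicative_fitness(anti_entropy: float, compression: float, cross_split_f1: float) -> float:
    """Assumed multiplicative combination: a plain product of the three metrics."""
    return anti_entropy * compression * cross_split_f1


if __name__ == "__main__":
    # Toy example on the text "thecatsat"; only the first token span agrees.
    f1 = tokenization_f1(["the", "cat", "sat"], ["the", "cats", "at"])
    print(f"F1 = {f1:.3f}")
```

In a hyper-parameter search of the kind the abstract describes, a composite such as `additive_fitness` would be evaluated for each candidate setting without reference tokenizations, and `tokenization_f1` would only be used afterwards to check how well the unsupervised fitness tracks the conventional score.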

research
06/04/2023

Evolution of Efficient Symbolic Communication Codes

The paper explores how the human natural language structure can be seen ...
research
04/10/2021

Meta-learning for fast cross-lingual adaptation in dependency parsing

Meta-learning, or learning to learn, is a technique that can help to ove...
research
05/18/2022

Persian Natural Language Inference: A Meta-learning approach

Incorporating information from other languages can improve the results o...
research
04/17/2022

Ìtàkúròso: Exploiting Cross-Lingual Transferability for Natural Language Generation of Dialogues in Low-Resource, African Languages

We investigate the possibility of cross-lingual transfer from a state-of...
research
01/27/2021

Multilingual and cross-lingual document classification: A meta-learning approach

The great majority of languages in the world are considered under-resour...
