A Study on the Appropriate size of the Mongolian general corpus

07/12/2023
by Sunsoo Choi, et al.

This study aims to determine the appropriate size of a Mongolian general corpus using the Heaps function and the type-token ratio (TTR). The sample corpus of 906,064 tokens comprised texts from 10 domains: newspaper articles on politics, economy, society, culture, sports, and world news; laws; middle and high school literature textbooks; interview articles; and podcast transcripts. First, we estimated the Heaps function from this sample corpus. Next, using the estimated function, we observed changes in the number of types and in the TTR while increasing the number of tokens in increments of one million. We found that the TTR hardly changed once the number of tokens exceeded 39 to 42 million. Thus, we conclude that an appropriate size for a Mongolian general corpus is 39 to 42 million tokens.
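The procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the usual power-law form of Heaps' law, V(N) = K * N^beta, fitted by least squares in log-log space, and defines TTR as V(N) / N. The function names, the synthetic measurements, the values K = 30 and beta = 0.6, and the tolerance `tol` are all hypothetical. The abstract reports neither the fitted parameters nor the stopping criterion, so the result below is not expected to reproduce the 39 to 42 million figure.

```python
import math


def fit_heaps(token_counts, type_counts):
    """Fit V = K * N**beta by least squares on log V = log K + beta * log N."""
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(v) for v in type_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


def predicted_types(k, beta, n_tokens):
    """Number of types predicted by the fitted Heaps function."""
    return k * n_tokens ** beta


def ttr(k, beta, n_tokens):
    """Type-token ratio implied by the fitted Heaps function."""
    return predicted_types(k, beta, n_tokens) / n_tokens


def saturation_size(k, beta, step=1_000_000, tol=1e-3, max_tokens=200_000_000):
    """Smallest multiple of `step` at which adding `step` more tokens
    lowers the TTR by less than `tol`; None if never reached.
    `tol` is a hypothetical threshold, not taken from the paper."""
    n = step
    while n + step <= max_tokens:
        if ttr(k, beta, n) - ttr(k, beta, n + step) < tol:
            return n
        n += step
    return None


# Synthetic (hypothetical) measurements generated from K=30, beta=0.6.
# In practice these would be type counts observed at growing sample sizes.
sample_tokens = [100_000, 300_000, 600_000, 906_064]
sample_types = [30 * n ** 0.6 for n in sample_tokens]

k_hat, beta_hat = fit_heaps(sample_tokens, sample_types)
size = saturation_size(k_hat, beta_hat)
print(f"K={k_hat:.3f}, beta={beta_hat:.3f}, saturation at {size} tokens")
```

The estimated size depends strongly on the chosen tolerance, because under a power law the TTR keeps falling at every corpus size and only its rate of decline shrinks. Any such threshold should therefore be stated explicitly alongside the result.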


