The Development of a Labelled te reo Māori-English Bilingual Database for Language Technology

08/21/2022
by Jesin James, et al.

Te reo Māori (referred to as Māori), New Zealand's indigenous language, is under-resourced in language technology. Māori speakers are bilingual, and Māori is code-switched with English. Unfortunately, minimal resources are available for Māori language technology, language detection, and code-switch detection for the Māori-English pair. Both English and Māori use Roman-derived orthography, which makes rule-based systems for detecting language and code-switching restrictive. Most Māori language detection is done manually by language experts. This research builds a Māori-English bilingual database of 66,016,807 words with word-level language annotation. The New Zealand Parliament Hansard debate reports were used to build the database. The language labels are assigned using language-specific rules and expert manual annotations. Some words have the same spelling in Māori and English but different meanings. These words could not be categorised as Māori or English using word-level language rules, so manual annotation was necessary. An analysis of the database is also reported, covering metadata, year-wise trends, frequently occurring words, sentence length, and N-grams. The database developed here is a valuable tool for future language and speech technology development for Aotearoa New Zealand. The labelling methodology can also be applied to other low-resourced language pairs.
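
To make the idea of word-level language rules concrete, here is a minimal Python sketch. It is not the authors' rule set, which the abstract does not spell out. It assumes only the standard facts of Māori orthography: five vowels (optionally with macrons), the consonants h, k, m, n, p, r, t, w, the digraphs ng and wh, and every syllable ending in a vowel. The names `label_word`, `label_sentence`, and the small `AMBIGUOUS` set are hypothetical and purely illustrative.

```python
import re
import unicodedata

# Māori vowels, short and long (macronised).
VOWELS = "aeiouāēīōū"

# Assumed shape rule: a Māori word is a run of units, each an optional
# consonant (h k m n p r t w, or the digraphs ng / wh) plus one vowel.
MAORI_SHAPE = re.compile(rf"(?:(?:ng|wh|[hkmnprtw])?[{VOWELS}])+")

# Illustrative (hypothetical, far from complete) list of spellings that
# are valid words in both languages and so need manual annotation.
AMBIGUOUS = {"a", "i", "he", "me", "to", "one", "more", "mate", "take"}


def label_word(word: str) -> str:
    """Return 'maori', 'english', or 'ambiguous' for a single word."""
    # NFC keeps macronised vowels as single code points.
    w = unicodedata.normalize("NFC", word).lower()
    if w in AMBIGUOUS:
        return "ambiguous"
    if MAORI_SHAPE.fullmatch(w):
        return "maori"
    return "english"


def label_sentence(text: str) -> list[tuple[str, str]]:
    """Split text into alphabetic tokens and label each one."""
    tokens = re.findall(r"[^\W\d_]+", unicodedata.normalize("NFC", text))
    return [(tok, label_word(tok)) for tok in tokens]


if __name__ == "__main__":
    for token, label in label_sentence("Kia ora Mr Speaker, he whare tēnei"):
        print(f"{token}\t{label}")
```

A shape rule like this over-labels: any English word that happens to consist of permitted consonants and vowel-final syllables passes the pattern. That is the limitation the abstract describes, and it is why the sketch routes known shared spellings to an "ambiguous" label for a human annotator rather than guessing.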
