Sinhala-English Parallel Word Dictionary Dataset

08/04/2023
by Kasun Wickramasinghe, et al.

Parallel datasets are vital for performing and evaluating any kind of multilingual task. However, when one language of the pair is low-resource, existing top-down parallel data such as corpora are lacking in both quantity and quality because of the dearth of human annotation. For low-resource languages, it is therefore more feasible to move in the bottom-up direction, where finer-grained pairs such as dictionary datasets are developed first. These may then be used for mid-level tasks such as supervised multilingual word embedding alignment, which in turn can guide higher-level tasks such as aligning the sentence or paragraph text corpora used for Machine Translation (MT). Although building a dictionary is more approachable than generating and aligning a massive corpus for a low-resource language, the same apathy from larger research entities means that even these finer-grained datasets are lacking for some low-resource languages. We have observed that there is no free and open dictionary dataset for the low-resource language Sinhala. Thus, in this work, we introduce three parallel English-Sinhala word dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) that help in multilingual Natural Language Processing (NLP) tasks involving English and Sinhala. In this paper, we explain the dataset creation pipeline as well as the experimental results of the tests we carried out to verify the quality of the datasets. The datasets and the related scripts are available at https://github.com/kasunw22/sinhala-para-dict.
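
As a rough illustration of how a parallel word dictionary can serve as supervision or as an evaluation reference for cross-lingual word embeddings, the sketch below groups word pairs into a one-to-many lookup and computes precision@1 for nearest-neighbour retrieval. It is not the authors' pipeline. The tab-separated `english<TAB>sinhala` line format, the function names, the example word pairs, and the two-dimensional vectors are all assumptions made up for this example; the files in the repository may be laid out differently.

```python
import math
from collections import defaultdict


def load_pairs(lines):
    """Group 'english<TAB>sinhala' lines (assumed format) into a one-to-many lookup."""
    lookup = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        en, si = line.split("\t")
        lookup[en.lower()].add(si)
    return lookup


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def precision_at_1(lookup, src_vecs, tgt_vecs):
    """Fraction of source words whose nearest target word is a listed translation."""
    hits = total = 0
    for en, gold in lookup.items():
        if en not in src_vecs:
            continue  # skip dictionary entries that have no source vector
        best = max(tgt_vecs, key=lambda si: cosine(src_vecs[en], tgt_vecs[si]))
        hits += best in gold
        total += 1
    return hits / total if total else 0.0


# Toy pairs and vectors, invented for illustration (not taken from the dataset).
pairs = ["water\tවතුර", "water\tජලය", "book\tපොත", "house\tගෙදර"]
lookup = load_pairs(pairs)

en_vecs = {"water": [1.0, 0.1], "book": [0.1, 1.0], "house": [0.7, 0.7]}
si_vecs = {
    "වතුර": [0.9, 0.2],
    "ජලය": [0.8, 0.1],
    "පොත": [0.2, 0.9],
    "ගෙදර": [1.0, 0.0],
}

score = precision_at_1(lookup, en_vecs, si_vecs)
print(f"precision@1 = {score:.2f}")
```

The lookup keeps every listed Sinhala translation for an English word, so a retrieval counts as correct when it matches any of them. In this toy setup "house" is deliberately placed so that its nearest neighbour is not its listed translation, which shows how a miss lowers the score.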

research
12/16/2020

Improving Multilingual Neural Machine Translation For Low-Resource Languages: French-, English- Vietnamese

Prior works have demonstrated that a low-resource language pair can bene...
research
12/21/2020

Subword Sampling for Low Resource Word Alignment

Annotation projection is an important area in NLP that can greatly contr...
research
04/24/2020

Practical Comparable Data Collection for Low-Resource Languages via Images

We propose a method of curating high-quality comparable training data fo...
research
12/08/2021

ADBCMM : Acronym Disambiguation by Building Counterfactuals and Multilingual Mixing

Scientific documents often contain a large number of acronyms. Disambigu...
research
10/12/2022

SilverAlign: MT-Based Silver Data Algorithm For Evaluating Word Alignment

Word alignments are essential for a variety of NLP tasks. Therefore, cho...
research
08/11/2020

A parallel evaluation data set of software documentation with document structure annotation

This paper accompanies the software documentation data set for machine t...
research
12/31/2020

Open Korean Corpora: A Practical Report

Korean is often referred to as a low-resource language in the research c...