Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment

06/12/2021
by Dilan Sachintha, et al.

Multilingual sentence representations offer a great advantage for low-resource languages that do not have enough data to build monolingual models on their own. A few studies have exploited these representations separately for document alignment and for sentence alignment. However, most low-resource languages are under-represented in the pre-trained models. Thus, for low-resource languages, these models have to be fine-tuned for the task at hand using additional data sources. This paper presents a weighting mechanism that uses available small-scale parallel corpora to improve the performance of multilingual sentence representations on document and sentence alignment. Experiments are conducted on two low-resource languages, Sinhala and Tamil. Results on a newly created dataset covering Sinhala-English, Tamil-English, and Sinhala-Tamil show that the new weighting mechanism significantly improves both document and sentence alignment. The dataset and the source code are publicly released.

research
10/31/2022

Very Low Resource Sentence Alignment: Luhya and Swahili

Language-agnostic sentence embeddings generated by pre-trained models su...
research
10/12/2019

Zero-shot Dependency Parsing with Pre-trained Multilingual Sentence Representations

We investigate whether off-the-shelf deep bidirectional sentence represe...
research
08/21/2021

Metric Learning in Multilingual Sentence Similarity Measurement for Document Alignment

Document alignment techniques based on multilingual sentence representat...
research
06/16/2020

How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation

Sentence encoders map sentences to real valued vectors for use in downst...
research
12/13/2016

Vicinity-Driven Paragraph and Sentence Alignment for Comparable Corpora

Parallel corpora have driven great progress in the field of Text Simplif...
research
09/19/2018

Monolingual sentence matching for text simplification

This work improves monolingual sentence alignment for text simplificatio...
research
12/08/2021

ADBCMM : Acronym Disambiguation by Building Counterfactuals and Multilingual Mixing

Scientific documents often contain a large number of acronyms. Disambigu...
