PESTS: Persian_English Cross Lingual Corpus for Semantic Textual Similarity

05/13/2023
by   Mohammad Abdous, et al.

Semantic textual similarity is a component of natural language processing that has received considerable attention recently. In computational linguistics and natural language processing, assessing the semantic similarity of words, phrases, paragraphs, and texts is crucial. Semantic similarity is the task of calculating the degree of semantic resemblance between two pieces of text, paragraphs, or phrases, in either a monolingual or a cross-lingual setting. Cross-lingual semantic similarity requires corpora of sentence pairs that span the source and target languages, with each pair annotated with a degree of semantic similarity. Because cross-lingual semantic similarity datasets are unavailable, many existing cross-lingual semantic similarity models rely on machine translation, and the propagation of machine translation errors reduces the accuracy of the model. Moreover, when semantic similarity features are to be used for machine translation, the same machine translations should not be used to compute semantic similarity. For Persian, a low-resource language, no effort has been made in this regard, and the need for a model that can understand the context of both languages is felt more than ever. In this article, a corpus of semantic textual similarity between Persian and English sentences is produced for the first time with the help of linguistic experts. We named this dataset PESTS (Persian English Semantic Textual Similarity). The corpus contains 5375 sentence pairs. In addition, several transformer-based models have been fine-tuned on this dataset. The results show that, using the PESTS dataset, the Pearson correlation of the XLM-RoBERTa model increases from 85.87 to 95.62.
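The evaluation metric named in the abstract, Pearson correlation between a model's predicted similarity scores and the human-assigned scores, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `pearson`, the toy scores, and the 0 to 5 gold scale (borrowed from the SemEval STS convention) are assumptions made for illustration, and the multiplication by 100 reflects the assumption that the figures reported above are on a 0 to 100 scale.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: gold similarity labels (assumed 0-5 scale) and the
# scores a model might predict for the same five sentence pairs.
gold = [0.0, 1.5, 3.0, 4.5, 5.0]
pred = [0.4, 1.2, 2.9, 4.0, 4.8]

r = pearson(gold, pred)
print(f"Pearson r = {r:.4f} ({100 * r:.2f} on a 0-100 scale)")
```

Pearson correlation measures linear agreement only, so a model whose scores are shifted or rescaled relative to the gold labels can still score highly.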


research
01/19/2018

A Resource-Light Method for Cross-Lingual Semantic Textual Similarity

Recognizing semantically similar sentences or paragraphs across language...
research
07/31/2017

SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

Semantic Textual Similarity (STS) measures the meaning similarity of sen...
research
07/16/2019

Language comparison via network topology

Modeling relations between languages can offer understanding of language...
research
03/16/2017

Neobility at SemEval-2017 Task 1: An Attention-based Sentence Similarity Model

This paper describes a neural-network model which performed competitivel...
research
12/16/2022

Detecting and Mitigating Hallucinations in Machine Translation: Model Internal Workings Alone Do Well, Sentence Similarity Even Better

While the problem of hallucinations in neural machine translation has lo...
research
08/24/2023

Text Similarity from Image Contents using Statistical and Semantic Analysis Techniques

Plagiarism detection is one of the most researched areas among the Natur...
research
07/11/2018

Linear Transformations for Cross-lingual Semantic Textual Similarity

Cross-lingual semantic textual similarity systems estimate the degree of...
