Detecting Cross-Lingual Plagiarism Using Simulated Word Embeddings

12/29/2017
by Victor Thompson, et al.

Cross-lingual plagiarism (CLP) occurs when text written in one language is translated into another language and used without acknowledgement. One of the most common methods for detecting CLP relies on online machine translators (such as Google Translate or Microsoft Translator), which are not always available. Because plagiarism detection typically involves comparing large numbers of documents, the number of translations required would overwhelm an online machine translator, especially when detecting plagiarism over the web. In addition, when words in the translated text are replaced with synonyms, detection based on online machine translators performs poorly. This paper addresses the problem of cross-lingual plagiarism detection (CLPD) by proposing a model that uses simulated word embeddings to reproduce the predictions of an online machine translator (Google Translate) when detecting CLP, without relying on online translators. The simulated embeddings consist of translated words in different languages mapped into a common space and replicated to increase the probability of retrieving the translations of a word (and their synonyms) from the model. Unlike most existing models, the proposed model does not require parallel corpora and accommodates multiple languages (it is multilingual). We demonstrated the effectiveness of the proposed model in detecting CLP on standard datasets that contain CLP cases, and evaluated its performance against a state-of-the-art baseline that relies on an online machine translator (the T+MA model). Evaluation results revealed that the proposed model is not only effective in detecting CLP but also outperformed the baseline. The results indicate that CLP can be detected with state-of-the-art performance by combining the prediction accuracy of an internet translator with word embeddings, without relying on internet translators at detection time.
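
The idea of retrieving a word's translations and synonyms from a common embedding space can be illustrated with a small sketch. This is not the paper's model: the vectors are hand-picked toy values rather than learned or simulated embeddings, the English/Spanish pair is an assumption, and the names `EMBEDDINGS`, `nearest_in_language`, and `overlap_score` are hypothetical. It only shows how nearest-neighbour lookup across languages could stand in for a call to an online translator.

```python
import math

# Hypothetical toy vectors: English (en) and Spanish (es) words placed in one
# shared space, with translations and synonyms close together. The values are
# hand-picked for illustration, not produced by the paper's method.
EMBEDDINGS = {
    ("en", "house"): (0.90, 0.10, 0.00),
    ("en", "home"):  (0.85, 0.15, 0.05),
    ("es", "casa"):  (0.88, 0.12, 0.02),
    ("en", "dog"):   (0.10, 0.90, 0.00),
    ("es", "perro"): (0.12, 0.88, 0.03),
    ("en", "book"):  (0.00, 0.10, 0.90),
    ("es", "libro"): (0.02, 0.08, 0.91),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_in_language(word, src, tgt):
    """Return the tgt-language word closest to (src, word) in the shared space."""
    query = EMBEDDINGS[(src, word)]
    candidates = [(w, vec) for (lang, w), vec in EMBEDDINGS.items() if lang == tgt]
    return max(candidates, key=lambda c: cosine(query, c[1]))[0]


def overlap_score(suspicious_words, source_words, src, tgt):
    """Fraction of known suspicious words whose nearest tgt-language
    neighbour appears in the source document's word set."""
    known = [w for w in suspicious_words if (src, w) in EMBEDDINGS]
    if not known:
        return 0.0
    hits = sum(1 for w in known if nearest_in_language(w, src, tgt) in source_words)
    return hits / len(known)
```

In this toy space both "house" and "home" map to "casa", which mirrors the abstract's point that synonyms of a translation should still be retrievable; words missing from the table (out-of-vocabulary) are simply skipped by `overlap_score`.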
