Detecting Cross-Language Plagiarism using Open Knowledge Graphs

11/18/2021
by   Johannes Stegmüller, et al.

Identifying cross-language plagiarism is challenging, especially for distant language pairs and sense-for-sense translations. We introduce the new multilingual retrieval model Cross-Language Ontology-Based Similarity Analysis (CL-OSA) for this task. CL-OSA represents documents as entity vectors obtained from the open knowledge graph Wikidata. Unlike other methods, CL-OSA requires neither computationally expensive machine translation nor pre-training on comparable or parallel corpora. It reliably disambiguates homonyms and scales to Web-scale document collections. We show that CL-OSA outperforms state-of-the-art methods for retrieving candidate documents from five large, topically diverse test corpora that include distant language pairs such as Japanese-English. For identifying cross-language plagiarism at the character level, CL-OSA primarily improves the detection of sense-for-sense translations. For these challenging cases, CL-OSA's performance in terms of the well-established PlagDet score exceeds that of the best competitor by more than a factor of two. The code and data of our study are openly available.
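
The idea of comparing documents as language-independent entity vectors can be illustrated with a minimal sketch. This is not the CL-OSA implementation: the lookup table `TOY_LEXICON`, the entity IDs in it, the helper names `entity_vector` and `cosine`, and the choice of cosine similarity are all assumptions made for illustration. A real system would extract and disambiguate entities against Wikidata rather than use a fixed dictionary.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy lexicon mapping surface forms in two languages to shared
# entity IDs. The IDs are illustrative placeholders, not verified Wikidata
# lookups, and a dictionary lookup stands in for real entity disambiguation.
TOY_LEXICON = {
    "berlin": "Q64",
    "ベルリン": "Q64",
    "germany": "Q183",
    "ドイツ": "Q183",
    "capital": "Q5119",
    "首都": "Q5119",
}


def entity_vector(tokens):
    """Map tokens to entity IDs and count them; unknown tokens are dropped."""
    ids = (TOY_LEXICON.get(token.lower()) for token in tokens)
    return Counter(entity_id for entity_id in ids if entity_id is not None)


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Pre-tokenized toy documents; tokenization itself is out of scope here.
english = ["Berlin", "is", "the", "capital", "of", "Germany"]
japanese = ["ベルリン", "は", "ドイツ", "の", "首都"]
unrelated = ["weather", "forecast"]

print(cosine(entity_vector(english), entity_vector(japanese)))
print(cosine(entity_vector(english), entity_vector(unrelated)))
```

Because both toy documents map to the same entity IDs, their vectors coincide even though they share no surface words, which is the property that makes a translation step unnecessary in this kind of representation.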
