WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia

05/10/2023
by Kenichiro Ando, et al.

Wikipedia can be edited by anyone and therefore contains sentences of varying quality. Some edits are of poor quality, and other editors often mark them up. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text. Assisting in this process is very important, but no large, comprehensive dataset for studying it currently exists. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE contains about 3.4 M sentences with 153 quality labels. In automatic classification experiments with competitive machine learning models, sentences with problems in citation, syntax/semantics, or propositions were found to be more difficult to detect. In addition, we conducted automated essay scoring experiments to evaluate the generalizability of the dataset. We show that models trained on WikiSQE perform better than the vanilla model, indicating the dataset's potential usefulness in other domains. WikiSQE is expected to be a valuable resource for other tasks in NLP.


