WikiCSSH: Extracting and Evaluating Computer Science Subject Headings from Wikipedia

09/10/2021
by   Kanyao Han, et al.

Hierarchical domain-specific classification schemas (or subject heading vocabularies) are often used to identify, classify, and disambiguate concepts that occur in scholarly articles. In this work, we develop, apply, and evaluate a human-in-the-loop workflow that first extracts an initial category tree from crowd-sourced Wikipedia data, and then combines community detection, machine learning, and hand-crafted heuristics or rules to prune the initial tree. This work resulted in WikiCSSH, a large-scale, hierarchically organized vocabulary for the domain of computer science (CS). Our evaluation suggests that WikiCSSH outperforms alternative CS vocabularies both in vocabulary size and in the performance of lexicon-based key-phrase extraction from scholarly data. WikiCSSH can further distinguish between coarse-grained and fine-grained CS concepts. The outlined workflow can serve as a template for building hierarchically organized subject heading vocabularies for other domains that are covered in Wikipedia.
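
As an illustration of the first step of the workflow (extracting an initial category tree), the following is a minimal sketch and not the authors' implementation. It assumes the Wikipedia category graph is already available in memory as a mapping from a category to its subcategories; the function name `extract_category_tree`, the `max_depth` cutoff, and the toy graph are all hypothetical. It uses a breadth-first traversal so that each category keeps the first (shallowest) parent that reaches it, which turns a graph with shared children and cycles into a tree. The pruning steps (community detection, machine learning, rules) are not shown.

```python
from collections import deque


def extract_category_tree(subcats, root, max_depth):
    """Breadth-first traversal of a category graph starting at `root`.

    Returns two dicts: `parent` (category -> its parent in the tree) and
    `depth` (category -> distance from the root). A category that is
    reachable along several paths keeps only the first parent found.
    """
    parent = {root: None}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand below the depth cutoff
        for child in subcats.get(node, []):
            if child not in parent:  # skip categories already in the tree
                parent[child] = node
                depth[child] = depth[node] + 1
                queue.append(child)
    return parent, depth


# Hypothetical toy graph, not real Wikipedia data.
toy_graph = {
    "Computer science": ["Algorithms", "Machine learning"],
    "Algorithms": ["Sorting algorithms", "Machine learning"],
    "Machine learning": ["Neural networks"],
    "Neural networks": ["Computer science"],  # a cycle back to the root
}

parent, depth = extract_category_tree(toy_graph, "Computer science", max_depth=2)
for category in parent:
    print("  " * depth[category] + category)
```

In this sketch, "Machine learning" is attached directly under the root even though it is also listed under "Algorithms", and the cycle back to the root is ignored because the root is already in the tree.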