Topic Segmentation of Research Article Collections

05/18/2022
by Erion Çano, et al.

Collections of research article data harvested from the web have recently become common because they are important resources for experiments on tasks such as named entity recognition, text summarization, and keyword generation. Certain types of experiments require collections that are both large and topically structured, with records assigned to separate research disciplines. Unfortunately, the publicly available collections of research articles are either small or heterogeneous and unstructured. In this work, we perform topic segmentation of a collection of paper data that we crawled and produce a multitopic dataset of roughly seven million paper data records. We construct a taxonomy of topics extracted from the data records and then annotate each document with its corresponding topic from that taxonomy. As a result, the proposed dataset can be used in two modalities: as a heterogeneous collection of documents from various disciplines or as a set of homogeneous collections, each from a single research topic.
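
To make the two usage modalities concrete, here is a minimal Python sketch of annotating records against a topic taxonomy and then splitting them into per-topic collections. It is only an illustration under stated assumptions: the taxonomy contents, the record fields (`title`, `keywords`), the keyword-overlap rule, and the function names `annotate` and `split_by_topic` are all hypothetical and are not taken from the paper, whose actual procedure is described in the full text.

```python
from collections import defaultdict

# Hypothetical taxonomy: topic name -> indicative keyword terms.
# These entries are illustrative assumptions, not the paper's taxonomy.
TAXONOMY = {
    "computer science": {"neural network", "algorithm", "machine learning"},
    "medicine": {"clinical trial", "patient", "therapy"},
    "physics": {"quantum", "particle", "thermodynamics"},
}


def annotate(record, taxonomy=TAXONOMY):
    """Return a copy of the record with a 'topic' field.

    The topic is the taxonomy entry sharing the most keywords with the
    record; it is None when no keyword matches any topic.
    """
    keywords = {k.lower() for k in record.get("keywords", [])}
    best_topic, best_overlap = None, 0
    for topic, terms in taxonomy.items():
        overlap = len(keywords & terms)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return {**record, "topic": best_topic}


def split_by_topic(records):
    """Group annotated records into one homogeneous collection per topic."""
    groups = defaultdict(list)
    for record in records:
        groups[record["topic"]].append(record)
    return dict(groups)


# Toy records standing in for crawled paper data (assumed field names).
records = [
    {"title": "A", "keywords": ["Machine Learning", "algorithm"]},
    {"title": "B", "keywords": ["patient", "therapy"]},
    {"title": "C", "keywords": ["gardening"]},
]

annotated = [annotate(r) for r in records]   # heterogeneous, topic-labelled
by_topic = split_by_topic(annotated)         # homogeneous, one list per topic
```

In this sketch, `annotated` plays the role of the heterogeneous collection and `by_topic` the set of single-topic collections; records that match no topic are grouped under `None` rather than dropped.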

research
02/11/2020

Two Huge Title and Keyword Generation Corpora of Research Articles

Recent developments in sequence-to-sequence learning with neural network...
research
05/06/2018

Dynamic and Static Topic Model for Analyzing Time-Series Document Collections

For extracting meaningful topics from texts, their structures should be ...
research
08/19/2015

Fast, Flexible Models for Discovering Topic Correlation across Weakly-Related Collections

Weak topic correlation across document collections with different number...
research
06/19/2020

Neural Topic Modeling with Continual Lifelong Learning

Lifelong learning has recently attracted attention in building machine l...
research
11/25/2019

My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections

Comparative text mining extends from genre analysis and political bias d...
research
06/18/2018

The Off-Topic Memento Toolkit

Web archive collections are created with a particular purpose in mind. A...
research
03/29/2018

Computer-Assisted Text Analysis for Social Science: Topic Models and Beyond

Topic models are a family of statistical-based algorithms to summarize, ...
