SECTOR: A Neural Model for Coherent Topic Segmentation and Classification

02/13/2019
by   Sebastian Arnold, et al.
0

When searching for information, a human reader first glances over a document, spots relevant sections and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates to identify the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available dataset with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR LSTM model with bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 compared to state-of-the-art CNN classifiers with baseline segmentation.

READ FULL TEXT
research
12/16/2016

Automatic Labelling of Topics with Neural Embeddings

Topics generated by topic models are typically represented as list of te...
research
10/28/2022

Toward Unifying Text Segmentation and Long Document Summarization

Text segmentation is important for signaling a document's structure. Wit...
research
09/29/2020

Neural Topic Modeling with Cycle-Consistent Adversarial Training

Advances on deep generative models have attracted significant research i...
research
10/12/2022

Lbl2Vec: An Embedding-Based Approach for Unsupervised Document Retrieval on Predefined Topics

In this paper, we consider the task of retrieving documents with predefi...
research
12/21/2014

Extraction of Salient Sentences from Labelled Documents

We present a hierarchical convolutional document model with an architect...
research
12/07/2020

Topical Change Detection in Documents via Embeddings of Long Sequences

In a longer document, the topic often slightly shifts from one passage t...
research
12/28/2017

Topic Compositional Neural Language Model

We propose a Topic Compositional Neural Language Model (TCNLM), a novel ...

Please sign up or login with your details

Forgot password? Click here to reset