Topical Change Detection in Documents via Embeddings of Long Sequences

12/07/2020
by   Dennis Aumiller, et al.

In a longer document, the topic often shifts slightly from one passage to the next, and topic boundaries usually delimit semantically coherent segments. Discovering this latent structure in a document improves readability and is essential for passage retrieval and summarization tasks. We formulate text segmentation as an independent supervised prediction task, which makes it suitable for training with Transformer-based language models. By fine-tuning on paragraphs of similar sections, we show that the learned features encode topic information, which can be used to find section boundaries and divide the text into coherent segments. Unlike previous approaches, which mostly operate at the sentence level, we consistently use the broader context of an entire paragraph and assume topical independence of the preceding and succeeding text. Lastly, we introduce a novel large-scale dataset constructed from online Terms-of-Service documents, on which we compare against various traditional and deep learning baselines and show that Transformer-based methods perform significantly better.
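To make the boundary-detection idea concrete, here is a minimal sketch of segmenting a document by comparing each paragraph with the one before it. It is not the authors' implementation: the `embed` function is a toy bag-of-words stand-in for the fine-tuned Transformer embeddings described above, and the cosine score with a fixed `threshold` is an assumed replacement for the supervised same-topic prediction. All function names and the threshold value are hypothetical.

```python
import math
import re
from collections import Counter


def embed(paragraph):
    """Toy stand-in for a learned paragraph embedding (assumption):
    a bag-of-words count vector over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", paragraph.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment(paragraphs, threshold=0.2):
    """Group consecutive paragraphs into segments.

    A new segment starts whenever the similarity between a paragraph
    and its predecessor falls below `threshold` (a hypothetical value).
    """
    if not paragraphs:
        return []
    vectors = [embed(p) for p in paragraphs]
    segments = [[paragraphs[0]]]
    for prev, cur, text in zip(vectors, vectors[1:], paragraphs[1:]):
        if cosine(prev, cur) < threshold:
            segments.append([text])      # predicted topic change
        else:
            segments[-1].append(text)    # same topic as predecessor
    return segments
```

Only adjacent paragraphs are compared, which mirrors the assumption that each boundary decision is independent of the rest of the document. In a realistic setup, `embed` and the threshold test would be replaced by a trained model that scores whether two paragraphs belong to the same section.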

