Text segmentation on multilabel documents: A distant-supervised approach

04/14/2019
by Saurav Manchanda, et al.

Segmenting text into semantically coherent segments is an important task with applications in information retrieval and text summarization. Developing accurate topical segmentation requires training data with ground truth information at the segment level. However, generating such labeled datasets, especially for applications in which the meaning of the labels is user-defined, is expensive and time-consuming. In this paper, we develop an approach that, instead of using segment-level ground truth information, uses the set of labels associated with each document. These labels are easier to obtain, as the training data essentially corresponds to a multilabel dataset. Our method, which can be thought of as an instance of distant supervision, improves upon previous approaches by exploiting the fact that consecutive sentences in a document tend to talk about the same topic and hence probably belong to the same class. Experiments on the text segmentation task on a variety of datasets show that the segmentation produced by our method beats the competing approaches on four out of five datasets and performs on par on the fifth. On the multilabel text classification task, our method performs on par with the competing approaches while requiring significantly less time to estimate.
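
The intuition that consecutive sentences tend to share a class can be illustrated with a small sketch. This is not the method developed in the paper; it is a generic Viterbi-style smoothing, written under the assumption that some model already provides a score for each sentence and each document-level label. The function name `segment_with_smoothing`, the `switch_penalty` parameter, and the toy scores are all hypothetical.

```python
from typing import Dict, List, Sequence


def segment_with_smoothing(
    sentence_scores: Sequence[Dict[str, float]],
    doc_labels: Sequence[str],
    switch_penalty: float = 1.0,
) -> List[str]:
    """Assign one of the document's labels to each sentence.

    sentence_scores[i][label] is an assumed per-sentence score (higher is
    better). Changing label between neighbouring sentences costs
    `switch_penalty`, which encourages contiguous segments.
    """
    labels = list(doc_labels)
    if not sentence_scores or not labels:
        return []

    # best[i][l]: best total score for sentences 0..i when sentence i has label l
    best = [{l: sentence_scores[0].get(l, 0.0) for l in labels}]
    back = []  # back[i-1][l]: label of sentence i-1 on that best path

    for scores in sentence_scores[1:]:
        prev = best[-1]
        cur, ptr = {}, {}
        for l in labels:
            def carried(p: str, target: str = l) -> float:
                # Score carried over from label p, minus the penalty if we switch.
                return prev[p] - (0.0 if p == target else switch_penalty)

            p_best = max(labels, key=carried)
            cur[l] = carried(p_best) + scores.get(l, 0.0)
            ptr[l] = p_best
        best.append(cur)
        back.append(ptr)

    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda l: best[-1][l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy document with labels {sports, politics}; the second sentence is noisy.
toy_scores = [
    {"sports": 2.0, "politics": 0.0},
    {"sports": 0.4, "politics": 0.6},
    {"sports": 2.0, "politics": 0.0},
    {"sports": 0.0, "politics": 2.0},
    {"sports": 0.0, "politics": 2.0},
]
smoothed = segment_with_smoothing(toy_scores, ["sports", "politics"], switch_penalty=1.0)
independent = segment_with_smoothing(toy_scores, ["sports", "politics"], switch_penalty=0.0)
```

With a penalty of 1.0 the noisy second sentence is absorbed into the surrounding "sports" segment, whereas with a penalty of 0.0 each sentence is labeled independently and the segment is fragmented. Restricting the candidates to `doc_labels` mirrors the distant-supervision setting, in which only document-level labels are known.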


