Text Segmentation based on Semantic Word Embeddings

03/18/2015
by Alexander A. Alemi, et al.

We explore the use of semantic word embeddings in text segmentation algorithms, including the C99 segmentation algorithm and new algorithms inspired by the distributed word vector representation. By developing a general framework for discussing a class of segmentation objectives, we study the effectiveness of greedy versus exact optimization approaches and suggest a new iterative refinement technique for improving the performance of greedy strategies. We compare our results against established benchmarks using standard metrics. We demonstrate state-of-the-art performance for an untrained method with our Content Vector Segmentation (CVS) on the Choi test set. Finally, we apply the segmentation procedure to an in-the-wild dataset consisting of text extracted from scholarly articles in the arXiv.org database.
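
To make the idea of a greedy strategy over a segmentation objective concrete, here is a minimal sketch in Python. It is not the paper's CVS algorithm: the scoring rule (the Euclidean norm of the summed vectors in each segment) and the names `segment_score`, `total_score`, and `greedy_segment` are illustrative assumptions, and the toy vectors stand in for real word or sentence embeddings.

```python
from math import sqrt

def segment_score(vectors, start, end):
    """Assumed objective: Euclidean norm of the summed vectors in vectors[start:end]."""
    total = [0.0] * len(vectors[0])
    for v in vectors[start:end]:
        for i, x in enumerate(v):
            total[i] += x
    return sqrt(sum(x * x for x in total))

def total_score(vectors, boundaries):
    """Sum of segment scores for the segmentation defined by the split positions."""
    edges = [0] + sorted(boundaries) + [len(vectors)]
    return sum(segment_score(vectors, a, b) for a, b in zip(edges, edges[1:]))

def greedy_segment(vectors, num_splits):
    """Greedily add, one at a time, the split position that most improves the total score."""
    boundaries = []
    for _ in range(num_splits):
        candidates = [p for p in range(1, len(vectors)) if p not in boundaries]
        if not candidates:
            break
        best = max(candidates, key=lambda p: total_score(vectors, boundaries + [p]))
        boundaries.append(best)
    return sorted(boundaries)

# Toy input: three items pointing one way, then two pointing another.
toy = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
print(greedy_segment(toy, 1))  # [3]
```

Under this assumed score, adding a split never lowers the total (by the triangle inequality), so the sketch fixes the number of splits in advance. A greedy pass commits to each split once it is chosen, whereas an exact approach would search over all split combinations jointly.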

research
08/05/2020

An exploration of the encoding of grammatical gender in word embeddings

The vector representation of words, known as word embeddings, has opened...
research
05/17/2017

Utility of general and specific word embeddings for classifying translational stages of research

Conventional text classification models make a bag-of-words assumption r...
research
04/06/2023

Affect as a proxy for literary mood

We propose to use affect as a proxy for mood in literary texts. In this ...
research
11/21/2021

More Romanian word embeddings from the RETEROM project

Automatically learned vector representations of words, also known as "wo...
research
08/09/2017

Identifying Reference Spans: Topic Modeling and Word Embeddings help IR

The CL-SciSumm 2016 shared task introduced an interesting problem: given...
research
09/21/2021

InvBERT: Text Reconstruction from Contextualized Embeddings used for Derived Text Formats of Literary Works

Digital Humanities and Computational Literary Studies apply text mining ...
research
04/28/2017

Neural Word Segmentation with Rich Pretraining

Neural word segmentation research has benefited from large-scale raw tex...
