Language Modeling by Clustering with Word Embeddings for Text Readability Assessment

09/05/2017
by Miriam Cha, et al.

We present a clustering-based language model using word embeddings for text readability prediction. We assume that a Euclidean semantic-space hypothesis holds for word embeddings trained by observing word co-occurrences. We argue that clustering word embeddings in this metric space yields feature representations in a higher semantic space that are well suited to text regression. Moreover, by representing features as histograms, our approach naturally handles documents of varying lengths. An empirical evaluation on the Common Core Standards corpus shows that features derived from our clustering-based language model significantly improve the previously reported readability-prediction results on the same corpus. We also evaluate sentence matching based on semantic relatedness using the Wiki-SimpleWiki corpus and find that our features lead to superior matching performance.
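The pipeline the abstract describes, clustering pretrained word embeddings and representing each document as a normalized histogram over cluster assignments, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the toy vocabulary, random stand-in vectors, the choice of k-means, and the cluster count k are all assumptions; in practice pretrained vectors (e.g. word2vec) and a tuned k would be used, with the histograms fed to a regressor for readability prediction.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; random vectors stand in for pretrained
# word embeddings (word2vec/GloVe would be used in practice).
vocab = ["the", "cat", "sat", "ran", "dog", "quantum", "entropy", "physics"]
emb = {w: rng.normal(size=8) for w in vocab}

# Cluster the embedding space; each cluster acts as a semantic word class.
k = 3  # assumed cluster count for illustration
km = KMeans(n_clusters=k, n_init=10, random_state=0)
km.fit(np.array([emb[w] for w in vocab]))

def doc_histogram(tokens):
    """Assign each in-vocabulary token to its nearest cluster and return a
    normalized histogram: a fixed-length feature vector regardless of
    document length."""
    ids = km.predict(np.array([emb[t] for t in tokens if t in emb]))
    hist = np.bincount(ids, minlength=k).astype(float)
    return hist / hist.sum()

# Two documents of different lengths map to same-dimensional features.
h = doc_histogram(["the", "cat", "sat", "the", "dog"])
print(h.shape)  # (3,)
```

Because every document, short or long, reduces to a length-k distribution over clusters, the representation sidesteps the variable-length problem the abstract mentions; the histograms can then be used directly as inputs to a text-regression model.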


Related research:
- 11/11/2020 · Exploring the Value of Personalized Word Embeddings
- 04/18/2016 · Clustering Comparable Corpora of Russian and Ukrainian Academic Texts: Word Embeddings and Semantic Fingerprints
- 07/03/2019 · Clustering of Medical Free-Text Records Based on Word Embeddings
- 10/06/2020 · Automatic Metaphor Interpretation Using Word Embeddings
- 05/21/2017 · Learning Semantic Relatedness From Human Feedback Using Metric Learning
- 12/28/2017 · Corpus specificity in LSA and Word2vec: the role of out-of-domain documents
- 01/31/2021 · Short Text Clustering with Transformers
