We present a clustering-based language model using word embeddings for text readability prediction. Presumably, a Euclidean semantic space hypothesis holds true for word embeddings whose training is done by observing word co-occurrences. We argue that clustering with word embeddings in the metric space should yield feature representations in a higher semantic space appropriate for text regression. Also, by representing features in terms of histograms, our approach can naturally address documents of varying lengths. An empirical evaluation using the Common Core Standards corpus reveals that the features formed on our clustering-based language model significantly improve the previously known results for the same corpus in readability prediction. We also evaluate the task of sentence matching based on semantic relatedness using the Wiki-SimpleWiki corpus and find that our features lead to superior matching performance.
Predicting the reading difficulty of a document is an enduring problem in natural language processing (NLP). Approaches based on shallow-length features of text date back to the 1940s (Flesch, 1948). Remarkably, they are still being used and extended with more sophisticated techniques today. In this paper, we use word embeddings to compose semantic features that are presumably beneficial for assessing text readability. Encouraged by the recent literature on applying language models for better prediction, we aim to build a clustering-based language model using word vectors learned from corpora. The resulting model is expected to reveal semantics at a higher level than word embeddings and provide discriminative features for text regression.
As pioneering work in text difficulty prediction, Flesch (Flesch, 1948) explored shallow-length features computed by averaging the number of words per sentence and the number of syllables per word. The intent was to capture sentence complexity with the number of words, and word complexity with the number of syllables. Chall (Chall, 1958) modeled reading difficulty as a linear function of shallow-length features. Kincaid (Kincaid, 1975) introduced a linear weighting scheme that became the most common measure of reading difficulty based on shallow-length features. More sophisticated algorithms that measure semantics by word frequency counts and syntax from sentence length (Stenner, 1996) and language modeling (Collins-Thompson and Callan, 2005) have shown significant performance gains over classical methods.
Modern approaches treat text difficulty prediction as a discriminative task. Schwarm et al. (Schwarm and Ostendorf, 2005) presented text regression based on the support vector machine (SVM). Petersen et al. (Petersen and Ostendorf, 2009) used both SVM classification and regression for improvement. NLP researchers went beyond shallow features and looked into learning complex lexical and grammatical features. Flor et al. (Flor et al., 2013) proposed an algorithm that measures lexical complexity from word usage. Vajjala et al. (Vajjala and Meurers, 2014) formulated semantic and syntactic complexity features from language modeling, which resulted in some improvement. Class-based language models, trained on the conditional probability of a word given the classes of its previous words, are commonly used in the literature (Turian et al., 2010; Botha and Blunsom, 2014). Brown clustering (Brown et al., 1992), a popular class-based language model, can learn hierarchical clusters of words by maximizing the mutual information of word bigrams.
Our text learning is founded on word embeddings. Bengio et al. (Bengio et al., 2003) proposed an early neural embedding framework. Mikolov et al. (Mikolov et al., 2013) introduced the Skip-gram model for efficient training with large unstructured text; Paragraph Vector (Lau and Baldwin, 2016) and character n-grams (Bojanowski et al., 2016), all of which we use for our implementation in this paper, followed. Most word embedding algorithms build on the distributional hypothesis (Harris, 1954) that word co-occurrences imply similar meaning and context. Word embeddings span a high-dimensional semantic space where the Euclidean distance between word vectors measures their semantic dissimilarity (Hashimoto et al., 2016).
Under the Euclidean semantic space hypothesis, we argue that clustering of word vectors should unveil a clustering-based language model. In particular, we propose two clustering methods to construct language models: Brown clustering and K-means on word vectors. Our methods are language-independent and data-driven, and we have empirically validated their superior performance in text readability assessment. Specifically, our experiment on the Common Core Standards corpus reveals that the language model learned by K-means significantly improves readability prediction over contemporary approaches using lexical and syntactic features. In another experiment with the Wiki-SimpleWiki corpus, we show that our features can correctly identify sentence pairs of similar meaning but written with different vocabulary and grammatical structure.
For easy-to-read text, differences in reading difficulty result from document length, sentence structure, and word usage. For documents at higher reading levels, however, features with richer linguistic context about domain, grammar, and style are known to be more relevant. For example, based on shallow features, “To be or not to be, that is the question” would likely be considered easier than “I went to the store and bought bread, eggs, and bacon, brought them home, and made a sandwich.” Therefore, we need to capture all semantic, lexical, and grammatical features to distinguish documents at all levels.
We organize the rest of this paper as follows. In Section 2, we describe our approach centered around neural word embedding and probabilistic language modeling. We will explain each component of our approach in detail. Section 3 presents our experimental methodology for evaluation. We will also discuss the empirical results. Section 4 concludes the paper.
We review the embedding schemes, clustering algorithms, and regression method used in the paper, and describe our overall pipeline.
Skip-gram. Mikolov et al. (Mikolov et al., 2013) proposed the Skip-gram method based on a neural network that maximizes

$$\frac{1}{T}\sum_{t=1}^{T} \sum_{-c \le j \le c,\ j \neq 0} \log p(w_{t+j} \mid w_t),$$

where the training word sequence $w_1, w_2, \dots, w_T$ has a length $T$. With $w_t$ as the center word, $c$ is the training context window. The conditional probability can be computed with the softmax function

$$p(w_O \mid w_I) = \frac{\exp\big(s(w_I, w_O)\big)}{\sum_{w=1}^{W} \exp\big(s(w_I, w)\big)}$$

with the scoring function $s(w_I, w_O) = \mathbf{u}_{w_O}^{\top}\mathbf{v}_{w_I}$. The embedding $\mathbf{v}_w$ is a vector representation of the word $w$.
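As a concrete illustration, the softmax above can be computed over a toy vocabulary. The vectors here are made up for illustration, not trained embeddings:

```python
import math

def score(u, v):
    # Dot-product scoring function s(w_I, w_O) = u_{w_O} . v_{w_I}
    return sum(a * b for a, b in zip(u, v))

def softmax_prob(v_center, u_vocab, target_idx):
    # p(w_O | w_I): exponentiated score of the target word, normalized
    # over the scores of every word in the vocabulary.
    scores = [score(u, v_center) for u in u_vocab]
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return exps[target_idx] / sum(exps)

# Hypothetical vocabulary of 3 words with 2-d output vectors.
u_vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_center = [0.5, 0.5]
probs = [softmax_prob(v_center, u_vocab, i) for i in range(3)]
```

Word 2 has the largest dot product with the center word, so it receives the highest conditional probability; the three probabilities sum to one.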
Bag of character n-grams. Bojanowski et al. (Bojanowski et al., 2016) proposed an embedding method that represents each word as the sum of the vector representations of its character n-grams. To capture the internal structure of words, a different scoring function is introduced:

$$s(w, w_O) = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g^{\top} \mathbf{v}_{w_O}.$$

Here, $\mathcal{G}_w$ is the set of n-grams in $w$. A vector representation $\mathbf{z}_g$ is associated to each n-gram $g$. This approach has an advantage in representing unseen or rare words in a corpus. If the training corpus is small, character n-grams can outperform the Skip-gram (of words) approach.
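The n-gram set $\mathcal{G}_w$ can be sketched as follows, following the fastText convention of marking word boundaries with `<` and `>` (restricted here to 3-grams for brevity):

```python
def char_ngrams(word, n_min=3, n_max=6):
    # fastText-style: wrap the word in boundary markers before slicing,
    # so prefixes and suffixes are distinguished from word-internal n-grams.
    w = "<" + word + ">"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # fastText also keeps the full word as a special sequence
    return grams

grams = char_ngrams("where", n_min=3, n_max=3)
# -> {'<wh', 'whe', 'her', 'ere', 're>', '<where>'}
```

Note that the trigram "her" from "where" is distinct from the full word "<her>", which is the point of the boundary markers.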
Distributed bag-of-words. While Skip-gram and character -grams can embed a word into a high-dimensional vector space, we eventually need to compute a feature vector for the whole document. Le et al. (Le and Mikolov, 2014) introduced Paragraph Vector that learns a fixed-length vector representation for variable-length text such as sentences and paragraphs. The distributed bag-of-words version of Paragraph Vector has the same architecture as the Skip-gram model except that the input word vector is replaced by a paragraph token.
Brown clustering. Brown et al. (Brown et al., 1992) introduced a hierarchical clustering algorithm that maximizes the mutual information of word bigrams. The probability for a sequence of words $w_1, \dots, w_n$ can be written as

$$p(w_1, \dots, w_n) = \prod_{i=1}^{n} p\big(w_i \mid c(w_i)\big)\, p\big(c(w_i) \mid c(w_{i-1})\big),$$

where $c(\cdot)$ is a function that maps a word to its class, and $c(w_0)$ is a special start state. Brown clustering hierarchically merges clusters to maximize the quality of $c(\cdot)$. The quality is maximized when the mutual information between all bigram classes is maximized. Although Brown clustering is commonly used, a major drawback is its limitation to learning only bigram statistics.
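A toy computation of this factorization, with a hypothetical two-class model (the class assignments and probabilities below are made up for illustration):

```python
def class_bigram_prob(words, word_class, p_word_given_class, p_class_bigram,
                      start="<S>"):
    # p(w_1..w_n) = prod_i p(w_i | c(w_i)) * p(c(w_i) | c(w_{i-1}))
    prob = 1.0
    prev = start
    for w in words:
        c = word_class[w]
        prob *= p_word_given_class[(w, c)] * p_class_bigram[(c, prev)]
        prev = c
    return prob

# Hypothetical model: class A = {the, a}, class B = {cat, dog}.
word_class = {"the": "A", "a": "A", "cat": "B", "dog": "B"}
p_word_given_class = {("the", "A"): 0.6, ("a", "A"): 0.4,
                      ("cat", "B"): 0.5, ("dog", "B"): 0.5}
p_class_bigram = {("A", "<S>"): 0.9, ("B", "A"): 0.8}

p = class_bigram_prob(["the", "cat"], word_class,
                      p_word_given_class, p_class_bigram)
# 0.6 * 0.9 * 0.5 * 0.8 = 0.216
```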
K-means. Because word embeddings span a semantic space, clusters of word embeddings should give a higher semantic space. We perform K-means on word embeddings. The resulting clusters are word classes grouped by semantic similarity under the Euclidean metric constraint. Given word embeddings learned from a corpus, we find the cluster membership for a word $w$ as

$$c(w) = \arg\min_{k} \|\mathbf{v}_w - \boldsymbol{\mu}_k\|^2,$$

where $\boldsymbol{\mu}_k$ is the $k$th cluster centroid.
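The hard assignment amounts to a few lines of code; the 2-d centroids below are hypothetical:

```python
def assign_cluster(v, centroids):
    # c(w) = argmin_k || v_w - mu_k ||^2
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: sqdist(v, centroids[k]))

centroids = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical cluster centroids
cluster_a = assign_cluster([0.9, 0.8], centroids)   # nearest to centroid 1
cluster_b = assign_cluster([0.1, -0.2], centroids)  # nearest to centroid 0
```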
We consider linear support vector machine (SVM) regression

$$\min_{\mathbf{w}, b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad \big|y_i - (\mathbf{w}^{\top}\mathbf{x}_i + b)\big| \le \epsilon \ \ \forall i,$$

where the regressed estimate $\mathbf{w}^{\top}\mathbf{x}_i + b$ for the $i$th input $\mathbf{x}_i$ is optimized to be bound within an error margin $\epsilon$ from the ground-truth label $y_i$. SVM trains a bias term $b$ to better compensate regression errors along the weight vector $\mathbf{w}$. We train SVM regression using feature vectors formed on word embedding and clustering to predict the readability score.
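A minimal sketch of the quantities involved, assuming hypothetical trained weights w and bias b (actual training is done with a solver, not shown here):

```python
def predict(w, x, b):
    # Linear SVR estimate: w . x + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    # Zero loss inside the epsilon tube around the prediction,
    # linear penalty outside it.
    return max(0.0, abs(y_true - y_pred) - eps)

# Hypothetical trained weights for a 3-d feature vector.
w, b = [0.5, -0.2, 1.0], 2.5
y_hat = predict(w, [1.0, 1.0, 1.0], b)  # 0.5 - 0.2 + 1.0 + 2.5 = 3.8
```

A label of 3.85 falls inside the epsilon tube (zero loss), whereas a label of 4.5 falls outside it and is penalized.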
Our prediction pipeline, depicted in the figure, uses word clusters precomputed by K-means on word embeddings. When a document of an unknown readability level arrives, we preprocess the tokenized text input and compute word vectors using trained word embeddings. We compute cluster membership on word vectors, followed by average pooling. For cluster membership, we perform 1-of-K hard assignment for each word in the document. Then we compute the histogram of cluster membership. By representing features in terms of histograms, our approach can naturally address documents of varying lengths. After some post-processing (e.g., unit normalization), we regress the readability level.
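The histogram step of the pipeline can be sketched as follows; the cluster IDs and K below are hypothetical, and L2 unit normalization is assumed as the post-processing choice:

```python
import math

def histogram_features(cluster_ids, num_clusters, binary=False):
    # Count cluster-membership occurrences for one document.
    counts = [0] * num_clusters
    for c in cluster_ids:
        counts[c] += 1
    if binary:
        # Binary (on/off) bins: record only whether a cluster appears.
        counts = [1 if c > 0 else 0 for c in counts]
    # Unit (L2) normalization makes documents of different lengths comparable.
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

# A 6-word document whose words fall into clusters 0, 2, 2, 1, 2, 0 (K = 4).
feat = histogram_features([0, 2, 2, 1, 2, 0], num_clusters=4)
bfeat = histogram_features([0, 2, 2, 1, 2, 0], num_clusters=4, binary=True)
```

The resulting feature vector always has length K regardless of document length, which is how the histogram representation handles documents of varying lengths.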
| Method | Spearman | Pearson |
|---|---|---|
| Brown clustering | 0.546 (0.430) | 0.534 (0.443) |
| word2vec + K-means | 0.711 (0.670) | 0.705 (0.664) |
| fastText + K-means | 0.825 (0.758) | 0.822 (0.810) |
| Flor et al. (Flor et al., 2013) | – | -0.44 |
| Reading Maturity (http://readingmaturity.com) | 0.69 | – |
| Vajjala et al. (Vajjala and Meurers, 2014) | 0.69 | 0.61 |
Following Vajjala et al. (Vajjala and Meurers, 2014), we evaluate readability level prediction with the Common Core Standards corpus (Council of Chief State School Officers, 2010) and sentence matching with the Wiki-SimpleWiki corpus (Zhu et al., 2010).
This corpus of 168 English excerpts is available as Appendix B of the Common Core Standards reading initiative of the US education system. Each text excerpt is labeled with a level in five grade bands (2-3, 4-5, 6-8, 9-10, 11+) as established by educational experts. Grade levels 2.5, 4.5, 7, 9.5, and 11.5 are used as ground-truth labels. We cut the corpus into train and test sets in a uniformly random 80-20 split, resulting in 136 documents for training and 32 for testing.
Evaluation metric. For fair comparison with other work, we adopt Spearman’s rank correlation and Pearson correlation computed between the ground-truth label and regressed value.
Preprocessing. We convert all characters to lowercase, strip punctuation, and remove extra whitespace, URLs, currency, numbers, and stopwords using the NLTK Stopwords Corpus (Loper and Bird, 2002).
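A minimal sketch of this preprocessing, using a tiny stand-in stopword set rather than the full NLTK Stopwords Corpus:

```python
import re

# Stand-in stopword list; the paper uses the NLTK Stopwords Corpus instead.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "is"}

def preprocess(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[$€£¥]\S*|\d+", " ", text)  # strip currency and numbers
    text = re.sub(r"[^\w\s]", " ", text)        # strip punctuation
    tokens = text.split()                       # also collapses whitespace
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The cat sat on a mat, see http://example.com for $5!")
# -> ['cat', 'sat', 'on', 'mat', 'see', 'for']
```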
Features. There are two levels of features. At the word-vector level, we perform weighted average pooling of word embeddings to compose a per-document feature vector. We have tried tf-idf and uniform weighting schemes. Brown clustering of words yields word-vector-level features as well. In contrast, K-means clustering of word vectors yields higher-level features in terms of cluster structures. For Brown and K-means, we replace each word in a document with its numeric cluster ID and compute the histogram of cluster membership as the per-document feature vector. For histogram computation, we consider binary (on/off) and traditional bin counts.
Word and paragraph embeddings. We use word2vec for the Skip-gram word embeddings. We first tried the wiki and ap-news pretrained word2vec models. Eventually, we use TensorFlow to train a word2vec model from the Common Core Standards corpus. We have optimized the word-vector dimension hyperparameter between 32 and 300.
We use fastText for character n-gram word embeddings. Similar to our word2vec experiment, we tried the wiki and ap-news pretrained models for fastText before training our own. For training, we use the negative sampling loss function with word-vector dimensions from 32 to 300 and a context window size of 5.
We use doc2vec, which implements Paragraph Vector. We have not trained our own doc2vec model and opted for the wiki and ap-news pretrained doc2vec models.
Brown clustering. We use an open-source implementation by Liang (Liang, P., 2012). We have fine-tuned the number-of-clusters hyperparameter by varying it between 10 and 200.
K-means clustering. After embedding all words in each document, we run K-means. We fine-tune K between 10 and 200.
SVM regression. We use LIBLINEAR (Fan et al., 2008) for SVM regression, configured as the L2-regularized L2-loss linear solver with unit bias. The SVM complexity hyperparameter C is optimized over a range of values.
Results and discussion. Our baseline results with pretrained models are shown in Table 1. Bag-of-words performs poorly, and word2vec performs better than doc2vec. We suspect that the benefit of doc2vec is not realized on this corpus due to its limited length. We find fastText superior to word2vec and doc2vec. Pretrained wiki outperforms ap-news; we report only wiki results.
Table 2 presents results on clustering-based language models: Brown clustering on words and K-means on word vectors trained with the corpus. Correlation values are presented for traditional bin counts, with binary counts in parentheses. While binary counts could be robust against ambiguities resulting from repeated text in a document, this advantage is not present in the corpus we use here. Brown clustering on words performs similarly to the baseline embedding schemes. The comparable performance is expected, because both Brown clustering and the baseline embedding schemes operate on raw words. We can improve performance further with K-means clustering on word vectors. Rather than training word vector models on wiki, training with the Common Core Standards corpus improves the correlation. fastText with K-means works best.
Table 3 presents a summary comparing the performance of our approach and previous work. Flor et al. (Flor et al., 2013) implemented a prediction scheme based on lexical tightness and compared their method against baselines such as text length and Flesch-Kincaid (Kincaid, 1975) in Pearson correlation. Nelson et al. (Nelson et al., 2012) wrote a summary of commercial software performance in Spearman correlation. Most recently, Vajjala et al. (Vajjala and Meurers, 2014) implemented a scheme that uses lexical, syntactic, and psycholinguistic features. Our highest Spearman correlation is 0.83, and our highest Pearson correlation is 0.82, both better than the best results reported in previous work.
We demonstrate that our features derived from clustering of word embeddings are effective in another application concerning sentence matching. The corpus for this application consists of 108,016 aligned sentence pairs of the same meaning drawn from (ordinary) Wikipedia and Simple Wikipedia (http://simple.wikipedia.org). Simple Wikipedia uses basic vocabulary and less complex grammar to make the content of Wikipedia accessible to audiences of all reading skills.
Task and metric. We evaluate whether the feature vector for an ordinary sentence formed by the proposed feature scheme can correctly predict its counterpart sentence. We sample 1,000 sentence pairs. Among all 1,000 pairs, we compute the probability that ordinary sentences and their simple counterparts are within the k nearest neighbors in the semantic space, varying k.
Features. We use our best feature scheme from Section 3.1: word embedding by fastText with K-means. To compute a sentence embedding, we average-pool all word embeddings in the sentence.
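The matching evaluation can be sketched as follows; the 2-d "sentence embeddings" below are hypothetical stand-ins for pooled fastText vectors:

```python
def avg_pool(vectors):
    # Sentence embedding = average of its word embeddings.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def topk_match_rate(ordinary, simple, k):
    # Fraction of ordinary sentences whose aligned simple counterpart
    # is among their k nearest neighbors under Euclidean distance.
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    hits = 0
    for i, q in enumerate(ordinary):
        ranked = sorted(range(len(simple)), key=lambda j: sqdist(q, simple[j]))
        if i in ranked[:k]:
            hits += 1
    return hits / len(ordinary)

# Three hypothetical aligned sentence pairs; the first ordinary sentence is
# average-pooled from two made-up word vectors.
ordinary = [avg_pool([[0.0, 0.2], [0.0, 0.0]]), [1.0, 1.1], [2.0, 2.1]]
simple = [[0.1, 0.0], [1.1, 1.0], [2.1, 2.0]]
rate = topk_match_rate(ordinary, simple, k=1)
```

In this toy setup each ordinary sentence is closest to its own counterpart, so the match rate is already 1.0 at k = 1; on the real corpus the rate grows toward 1 as k increases.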
Results and discussion. As Table 4 shows, using only the nearest neighbor, we already achieve a high matching probability; as k grows, we retrieve differently written sentences of the same meaning with probability approaching 1. This implies that despite differences in grammatical structure and word usage, when underlying semantics are shared between two sentences, they are mapped close to each other in the feature space.
Word vectors learned by neural embedding explicitly exhibit linguistic regularities and patterns. In this paper, we have introduced a regression framework on a clustering-based language model using word embeddings for automatic text readability prediction. Our experiments with the Common Core Standards corpus demonstrate that features derived by clustering word embeddings are superior to classical shallow-length, bag-of-words, and other advanced features previously attempted on the corpus. We have further evaluated our approach on sentence matching using the Wiki-SimpleWiki corpus and showed that our method can effectively capture semantics even when sentences are written with different vocabulary and grammatical structures. For future work, we plan to continue our experiments with more diverse languages and larger datasets.
This work is supported by the MIT Lincoln Laboratory Lincoln Scholars Program and in part by gifts from the Intel Corporation and the Naval Supply Systems Command award under the Naval Postgraduate School Agreements No. N00244-15-0050 and No. N00244-16-1-0018.
A Machine Learning Approach to Reading Level Assessment. Computer Speech and Language 23, 1 (2009), 89–106.
Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 384–394.