Supervised and unsupervised neural approaches to text readability

We present a set of novel neural supervised and unsupervised approaches for determining readability of documents. In the unsupervised setting, we leverage neural language models, while in the supervised setting three different neural architectures are tested in the classification setting. We show that the proposed neural unsupervised approach on average produces better results than traditional readability formulas and is transferable across languages. Employing neural classifiers, we outperform current state-of-the-art classification approaches to readability which rely on standard machine learning classifiers and extensive feature engineering. We tested several properties of the proposed approaches and showed their strengths and possibilities for improvements.


1 Introduction

Readability is concerned with the relation between a given text and the cognitive load a reader needs to comprehend it. This complex relation is influenced by many factors, such as the degree of lexical and syntactic sophistication, discourse cohesion, and background knowledge (Crossley et al., 2017). In order to simplify the problem of measuring readability, traditional readability formulas focused only on lexical and syntactic features by taking into account various statistical factors, such as word length, sentence length, and word difficulty (Davison and Kantor, 1982). These approaches have been criticized because of their reductionism and weak statistical bases (Crossley et al., 2017). Another problem is their objectivity and cultural transferability, since children from different environments master different concepts at different ages. For example, the word television is quite long and contains many syllables but is well-known to most young children who live in families with a television.

Newer approaches to measuring readability consider it as a classification task and build prediction models that predict human-assigned readability scores based on a number of attributes (Schwarm and Ostendorf, 2005; Vajjala and Meurers, 2012; Petersen and Ostendorf, 2009). These more sophisticated and adaptable approaches generally yield better results and are less exposed to critique but require additional external resources, such as labeled readability data sets, which are scarce. Another problem is the transferability of these approaches between different corpora and languages, since little work has been done on multilingual, multi-genre, or even multi-corpora supervised approaches to readability prediction.

Recently, deep neural networks (Goodfellow, Bengio, and Courville, 2016) have shown impressive performance on many language related tasks. In fact, they have achieved state-of-the-art performance in all semantic tasks where sufficient amounts of data were available (Collobert et al., 2011; Zhang, Zhao, and LeCun, 2015). Surprisingly, we are not aware of any work that employs deep neural models for the task of determining readability. Even the most recent studies (Vajjala and Lucic, 2018) still rely on hand-crafted features and standard classifiers, such as Support Vector Machines (SVM), when trying to determine text readability. Furthermore, language model features, which can be found in many of these classification approaches (Schwarm and Ostendorf, 2005; Petersen and Ostendorf, 2009; Vajjala and Meurers, 2012; Xia, Kochmar, and Briscoe, 2016), are generated with traditional n-gram language models, even though language modeling, which can be formally defined as predicting a probability distribution of words from the fixed size vocabulary $V$, for word $w_{t+1}$, given the historical sequence $w_{1:t} = [w_1, \ldots, w_t]$, has been drastically improved with the introduction of neural language models (Mikolov et al., 2011).

The aim of the present study is two-fold. First, we propose a novel approach to readability measurement based on deep neural network based language models that takes into account background knowledge and discourse cohesion, two readability indicators missing from the traditional readability formulas. This approach is unsupervised and requires no labeled training set but only a collection of texts from the given domain. We demonstrate that the proposed approach is capable of contextualizing the readability because of the trainable nature of neural networks, and that it is transferable across different languages. In this scope, we propose a new measure of readability, RSRS (ranked sentence readability score), with good correlation with true readability scores.

Second, we investigate how different neural architectures with automated feature generation can be used for readability classification and compare their performance to standard classification approaches, which rely on hand-crafted features. Three distinct branches of neural architectures (recurrent neural networks (RNN), hierarchical attention networks (HAN), and transfer learning techniques) are tested on four gold standard readability corpora with excellent results.

The paper is structured as follows. Section 2 addresses the related work on readability prediction and also covers more general topics related to our research, such as language modelling and neural text classification. Section 3 describes the datasets used in our experiments, while in Section 4 we present the methodology and results for the proposed unsupervised approach to readability prediction. The methodology and experimental results for the supervised approach are presented in Section 5. The conclusions and directions for further work are addressed in Section 6.

2 Background and related work

Approaches to automated measuring of readability try to find and assess factors that correlate well with human perception of readability. They can be divided into two groups. Traditional readability formulas try to construct a simple human comprehensible formula with a good correlation to what humans perceive as the degree of readability. They take into account various statistical factors, such as word length, sentence length, and word difficulty. We describe the most popular constructs in Section 2.1. Newer approaches train machine learning models on texts with human-annotated readability levels so that they can predict readability levels on new unlabeled texts. These approaches usually rely on extensive feature engineering and construct many features, both human comprehensible and incomprehensible. We describe these approaches in Section 2.2. Many of these features are generated using language models. Since language models form the core of our approach, we briefly describe them and the features they can produce in Section 2.3.

The main novelty of the proposed approach is the use of neural language models and neural classifiers for determining readability. We therefore dedicate Section 2.4 to related work on neural language models and Section 2.5 to neural approaches to text classification.

2.1 Readability formulas

Traditionally, readability in texts was measured by statistical readability formulas. Most of these formulas were originally developed for the English language but are also applicable to other languages with some modifications (Škvorc et al., 2018).

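The Gunning fog index (GFI) (Gunning, 1952) estimates the years of formal education a person needs to understand the text on the first reading. It is calculated with the following expression:

$$\mathrm{GFI} = 0.4 \left( \frac{\mathit{totalWords}}{\mathit{totalSentences}} + 100 \cdot \frac{\mathit{longWords}}{\mathit{totalWords}} \right)$$

where longWords are words longer than 7 characters. Higher values of the index indicate lower readability.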

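Flesch reading ease (FRE) (Kincaid et al., 1975) assigns higher values to more readable texts. It is calculated in the following way:

$$\mathrm{FRE} = 206.835 - 1.015 \cdot \frac{\mathit{totalWords}}{\mathit{totalSentences}} - 84.6 \cdot \frac{\mathit{totalSyllables}}{\mathit{totalWords}}$$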

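The values returned by the Flesch-Kincaid grade level (FKGL) (Kincaid et al., 1975) readability formula correspond to the number of years of education generally required to understand the text for which the formula was calculated. The formula is defined as follows:

$$\mathrm{FKGL} = 0.39 \cdot \frac{\mathit{totalWords}}{\mathit{totalSentences}} + 11.8 \cdot \frac{\mathit{totalSyllables}}{\mathit{totalWords}} - 15.59$$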

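Another readability formula that returns values corresponding to the years of education required to understand the text is the Automated readability index (ARI) (Smith and Senter, 1967):

$$\mathrm{ARI} = 4.71 \cdot \frac{\mathit{totalCharacters}}{\mathit{totalWords}} + 0.5 \cdot \frac{\mathit{totalWords}}{\mathit{totalSentences}} - 21.43$$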

The Dale-Chall readability formula (DCRF) (Dale and Chall, 1948) requires a list of 3000 words that fourth-grade American students could reliably understand. Words that do not appear in this list are considered difficult. If the list of words is not available, it is possible to use the GFI approach and consider all the words longer than 7 characters as difficult. The following expression is used in the calculation:

$$\mathrm{DCRF} = 0.1579 \cdot \left( \frac{\mathit{difficultWords}}{\mathit{totalWords}} \cdot 100 \right) + 0.0496 \cdot \frac{\mathit{totalWords}}{\mathit{totalSentences}}$$

The SMOG grade (Simple Measure of Gobbledygook) (McLaughlin, 1969) is a readability formula mostly used for checking health messages. Similarly to FKGL and ARI, it roughly corresponds to the years of education needed to understand the text. It is calculated with the following expression:

$$\mathrm{SMOG} = 1.0430 \cdot \sqrt{\mathit{numberOfPolysyllables} \cdot \frac{30}{\mathit{totalSentences}}} + 3.1291$$

where the numberOfPolysyllables is the number of words with three or more syllables.
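The formulas in this section reduce to simple arithmetic over a handful of text counts. The following is a minimal sketch using the standard published coefficients; the function and argument names are illustrative and not taken from the paper, counting words, sentences, and syllables is left to the caller, and `gunning_fog` assumes the long-word variant (words longer than 7 characters) described above rather than the syllable-based original. The grade-level adjustment that some Dale-Chall variants add for texts with many difficult words is omitted.

```python
import math


def gunning_fog(total_words, total_sentences, long_words):
    # Average sentence length plus the percentage of long words, scaled by 0.4.
    return 0.4 * (total_words / total_sentences + 100 * long_words / total_words)


def flesch_reading_ease(total_words, total_sentences, total_syllables):
    # Higher score means easier text.
    return (206.835
            - 1.015 * (total_words / total_sentences)
            - 84.6 * (total_syllables / total_words))


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    # Approximate years of education needed.
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)


def automated_readability_index(total_chars, total_words, total_sentences):
    # Uses characters per word instead of syllables per word.
    return (4.71 * (total_chars / total_words)
            + 0.5 * (total_words / total_sentences)
            - 21.43)


def dale_chall(total_words, total_sentences, difficult_words):
    # difficult_words: words missing from the familiar-word list
    # (or, as a fallback, words longer than 7 characters).
    return (0.1579 * (100 * difficult_words / total_words)
            + 0.0496 * (total_words / total_sentences))


def smog_grade(total_sentences, polysyllables):
    # polysyllables: number of words with three or more syllables.
    return 1.0430 * math.sqrt(polysyllables * 30 / total_sentences) + 3.1291
```

For a hypothetical text with 100 words, 5 sentences, and 150 syllables, `flesch_kincaid_grade(100, 5, 150)` evaluates to roughly 9.9, i.e. about a tenth-grade level.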

All of the above mentioned readability measures were designed for the specific use on English texts. There are some rare attempts to adapt these formulas to other languages (Kandel and Moles, 1958) or to create new formulas that could be used on languages other than English (Anderson, 1981).

To show the cross-lingual potential of our approach, we address two languages in this study, English and Slovenian, a Slavic language with rich morphology and orders of magnitude fewer resources than English. For Slovenian, readability studies are scarce. Škvorc et al. (2018) researched how well the above readability formulas work on Slovenian text by trying to categorize text from three distinct sources: children’s magazines, newspapers and magazines for adults, and transcriptions of sessions of the National Assembly of Slovenia. The results of this study indicate that formulas which consider the length of words and/or sentences work better than formulas which rely on word lists. They also noticed that simple indicators of readability, such as the percentage of adjectives and average sentence length, work quite well for Slovenian. To our knowledge, the only other study that employed readability formulas on Slovenian texts was done by Zwitter Vitez (2014). There, the readability formulas were used as features in the author recognition task.

2.2 Classification approach to readability

The alternative to measuring readability with statistical formulas is to consider it a prediction task and predict the level of readability. These approaches usually require extensive feature engineering and thereby address some deficiencies of statistical formulas, such as their reductionism and dismissal of contextual and semantic information.

One of the first classification approaches to readability was proposed by Schwarm and Ostendorf (2005). It relies on a Support Vector Machine (SVM) classifier trained on the WeeklyReader corpus, containing articles grouped into four classes according to the age of the target audience. Statistical language models, statistical readability formulas, and parse trees are used as features in the model. This approach was extended and improved upon in Petersen and Ostendorf (2009).

A successful classification approach to readability was proposed by Vajjala and Meurers (2012). Their multi-layer perceptron classifier is trained on the WeeBit corpus (Vajjala and Meurers, 2012), which contains articles from WeeklyReader and BBC-Bitesize (see Section 3 for more information on the WeeBit corpus). The texts were classified into five classes according to the age group they are targeting. For classification, the authors use 46 manually crafted features roughly grouped into three categories: lexical (e.g., n-grams), syntactic (e.g., parse tree depth), and traditional features (e.g., average sentence length). For the evaluation, they trained the classifier on a train set consisting of 500 documents from each class and tested it on a balanced test set of 625 documents (125 documents per class). They report 93.3% accuracy on the test set. (Later research by Xia, Kochmar, and Briscoe (2016) called the validity of the published experimental results into question, so the reported 93.3% accuracy might not be the objective state-of-the-art result for readability classification.)

Another set of experiments on the WeeBit corpus was carried out by Xia, Kochmar, and Briscoe (2016), who additionally cleaned the corpus, since it contained some texts with broken sentences and additional meta information about the source of the text, such as copyright declarations and links, that was strongly correlated with the target labels. They use similar lexical, syntactic, and traditional features as Vajjala and Meurers (2012) but add language modeling and discourse based features. Their SVM classifier achieves 80.3% accuracy using 5-fold cross-validation. This is one of the few studies where the transferability of the classification models is tested. The authors used an additional CEFR (Common European Framework of Reference for Languages) corpus. This small data set of CEFR-graded texts is tailored for learners of English (Council of Europe, 2001) and also contains 5 readability classes. The SVM classifier trained on the WeeBit corpus and tested on the CEFR corpus achieved a classification accuracy of 23.3%, hardly beating the majority classifier baseline. This low result was attributed to the differences in readability classes in the two corpora, since WeeBit classes target children of different age groups, while CEFR corpus classes target mostly adult foreigners with different levels of English comprehension. However, this result is a strong indication that the transferability of readability classification models across different types of texts is questionable.

The most recent classification approaches to readability still employ standard machine learning classifiers and rely on extensive feature engineering. An approach proposed by Vajjala and Lucic (2018), tested on the recently published OneStopEnglish corpus, relies on 155 hand-crafted features grouped into six categories: n-grams, part-of-speech (POS) tags, psycholinguistic (based on psycholinguistic databases), syntactic, discourse (e.g., coreference chains), and traditional features. A Sequential Minimal Optimization (SMO) classifier with a linear kernel achieved a classification accuracy of 78.13% for three readability classes (elementary, intermediate, and advanced reading level). An even more recent approach to readability classification, conducted on Taiwanese textbooks, was proposed by Tseng et al. (2019). The main novelty of the research was the introduction of a latent-semantic-analysis (LSA)-constructed hierarchical conceptual space that can be used as a feature for training an SVM classifier for domain-specific readability classification. They report significant improvements compared to previous state-of-the-art results when the new feature is combined with other more general linguistic features.

2.3 Statistical language models

The standard task of language modeling can be formally defined as predicting a probability distribution of words from the fixed size vocabulary $V$, for word $w_{t+1}$, given the historical sequence $w_{1:t} = [w_1, \ldots, w_t]$. From a statistical point of view, taking an entire historical sequence of words into consideration is problematic due to data sparsity, since the majority of possible word sequences will not be observed in the training sample. In order to handle sequences that were not seen during training, the standard solution (called the n-gram language model) limits the historical sequence to the $n-1$ previous words, counts the observed n-grams, and employs any of a number of different smoothing techniques (Chen and Goodman, 1999). A special version of the n-gram model is a unigram model ($n = 1$), where the probability of each word depends only on that word’s probability in the document. A recent solution to data sparsity is the introduction of neural language models (Mikolov et al., 2011), which will be explained in Section 2.4.

To measure the performance of language models, traditionally a metric called perplexity is used. A language model $m$ is evaluated according to how well it predicts a separate test sequence of words $w_1, \ldots, w_N$. For this case, the perplexity (PPL) of the language model $m$ is defined as:

$$\mathrm{PPL} = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 m(w_i)} \tag{1}$$

where $m(w_i)$ is the probability assigned to word $w_i$ by the language model $m$, and $N$ is the length of the sequence. The lower the perplexity score, the better the language model predicts the words in a document, i.e. the more predictable and aligned with the training set the text is.
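To make the perplexity definition concrete, the sketch below trains a unigram model with additive smoothing and evaluates the perplexity of a token sequence. It is illustrative only: the function names and the add-alpha smoothing are assumptions chosen for brevity, not the smoothing used in the cited work, and the vocabulary is assumed to contain every token that will be scored.

```python
import math
from collections import Counter


def train_unigram(tokens, vocab, alpha=1.0):
    """Unigram model with add-alpha smoothing over a fixed vocabulary."""
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}


def perplexity(model, tokens):
    """PPL = 2 ** (-(1/N) * sum_i log2 m(w_i)) for a non-empty token list."""
    log2_sum = sum(math.log2(model[w]) for w in tokens)
    return 2 ** (-log2_sum / len(tokens))


train = "the cat sat on the mat".split()
vocab = set(train) | {"quantum"}
model = train_unigram(train, vocab)

in_domain = perplexity(model, ["the", "cat"])
out_of_domain = perplexity(model, ["quantum", "quantum"])
```

Here `in_domain` is lower than `out_of_domain`, matching the interpretation that text resembling the training data is more predictable to the model.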

Of special interest to our method are features of language models, used by many classification approaches (see Section 2.2 above). Schwarm and Ostendorf (2005) train one n-gram language model for each readability class $c$ in the training data set. For each text document $d$, they calculate the likelihood ratio according to the following formula:

$$\mathrm{LR} = \frac{P(d \mid c)\, P(c)}{\sum_{c' \neq c} P(d \mid c')\, P(c')}$$

where $P(d \mid c)$ denotes the probability returned by the language model trained on texts labeled with class $c$, and $P(d \mid c')$ denotes the probability of $d$ returned by the language model trained on the class $c'$. Uniform prior probabilities of classes are assumed. The likelihood ratios are used as features in the classification model along with perplexities achieved by all the models.
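A small sketch of the likelihood-ratio feature follows, under the uniform-prior assumption stated above (so the priors cancel). The function name and the dictionary input are hypothetical; in practice each log-probability would come from the language model trained on the corresponding class, and working in log space avoids underflow for long documents.

```python
import math


def likelihood_ratios(log_probs):
    """Map each class c to P(d|c) / sum over c' != c of P(d|c').

    log_probs: dict mapping class label -> log P(d | class).
    Uniform class priors are assumed, so they cancel out of the ratio.
    """
    # Shift by the maximum so that exponentiation cannot overflow.
    shift = max(log_probs.values())
    scaled = {c: math.exp(lp - shift) for c, lp in log_probs.items()}
    ratios = {}
    for c, p in scaled.items():
        rest = sum(q for other, q in scaled.items() if other != c)
        # If every competing class underflows to zero, the ratio is unbounded.
        ratios[c] = p / rest if rest > 0 else math.inf
    return ratios


ratios = likelihood_ratios({
    "grade2": math.log(0.2),
    "grade3": math.log(0.1),
    "grade4": math.log(0.1),
})
```

In this toy input, the class whose model assigns the document the highest probability also receives the highest ratio.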

In Petersen and Ostendorf (2009), three statistical language models (unigram, bigram and trigram) are trained on four external data resources: Britannica (adult), Britannica Elementary, CNN (adult) and CNN abridged. The resulting twelve n-gram language models are used to calculate perplexities of each target document. It is assumed that low perplexity scores calculated by the language models trained on the adult level texts and high perplexity scores calculated by the language models trained on the elementary/abridged levels would indicate a high reading level, and high perplexity scores calculated by the language models trained on the adult level texts and low perplexity scores calculated by the language models trained on the elementary/abridged levels would indicate a low reading level.

Xia, Kochmar, and Briscoe (2016) train 1- to 5-gram word-based language models on the British National Corpus, and 25 POS-based 1- to 5-gram models on the five classes of the WeeBit corpus. Language models’ log-likelihood and perplexity scores are used as features for the classifier.

Some approaches try to determine readability using only statistical scores derived from language models. Si and Callan (2001) tried to classify scientific web pages using only unigram language models. Further improving this approach, Collins-Thompson and Callan (2005) developed a smoothed unigram language model classifier in order to predict readability grade levels in a manually collected corpus of web pages. The classifier outperformed several other measures of semantic difficulty, such as the fraction of unknown words in the text and the FKGL on the corpus of web pages, although traditional measures performed better on some commercial corpora.

2.4 Neural language models

Mikolov et al. (2011) have shown that neural language models outperform n-gram language models by a large margin on large and also relatively small (less than 1 million tokens) data sets. The achieved differences in perplexity (see Eq. (1)) are attributed to the richer historical contextual information available to neural networks, which are not limited to a small contextual window (usually of up to five previous words), as is the case with n-gram language models. In Section 2.3, we mentioned some approaches that use n-gram language models for readability prediction. However, we are unaware of any approach that employs deep neural network language models for determining the readability of a text.

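The most popular choice of neural architecture for language modelling is the recurrent neural network (RNN), due to its suitability for modelling sequential data. At each time step $t$, an input vector $x_t$ and hidden state vector $h_{t-1}$ are fed into the network, producing the next hidden state vector $h_t$ with the following recursive equation:

$$h_t = f(W x_t + U h_{t-1} + b)$$

where $f$ is a non-linear activation function, $W$ and $U$ are matrices representing weights of the input layer and hidden layer, and $b$ is a bias vector. Learning long-range dependencies with plain RNNs is problematic due to vanishing gradients (Bengio, Simard, and Frasconi, 1994); therefore, in practice, modified recurrent networks, such as Long short-term memory networks (LSTM), are used. At each time step $t$, an LSTM network takes as input $x_t$, the hidden state $h_{t-1}$, and the state of a memory cell $c_{t-1}$ to calculate $h_t$ and $c_t$ according to the following set of equations:

$$
\begin{aligned}
i_t &= \sigma(W^i x_t + U^i h_{t-1} + b^i) \\
f_t &= \sigma(W^f x_t + U^f h_{t-1} + b^f) \\
o_t &= \sigma(W^o x_t + U^o h_{t-1} + b^o) \\
g_t &= \tanh(W^g x_t + U^g h_{t-1} + b^g) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $i_t$, $f_t$ and $o_t$ are referred to as input, forget and output gates, respectively, and $g_t$ is the candidate update of the memory cell. $\sigma$ and $\tanh$ are the element-wise sigmoid and hyperbolic tangent functions, and $\odot$ represents element-wise multiplication.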
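The LSTM update can be traced step by step in a few lines of plain Python. This is a didactic sketch of a single time step using the standard gate formulation (input, forget, and output gates plus a candidate cell update); the function names and the `params` layout are assumptions made for illustration, and a real model would use a tensor library with learned weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xj for w, xj in zip(w_row, x))
        + sum(u * hj for u, hj in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps each gate name ("i", "f", "o", "g") to a tuple (W, U, b).
    Returns the new hidden state h_t and memory cell state c_t.
    """
    pre = {name: affine(W, U, b, x_t, h_prev) for name, (W, U, b) in params.items()}
    i = [sigmoid(z) for z in pre["i"]]    # input gate
    f = [sigmoid(z) for z in pre["f"]]    # forget gate
    o = [sigmoid(z) for z in pre["o"]]    # output gate
    g = [math.tanh(z) for z in pre["g"]]  # candidate cell update
    # Element-wise products: keep part of the old cell, add part of the candidate.
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate evaluates to 0.5 and the candidate update to 0, so the cell state is simply halved; this makes a convenient sanity check for the gating arithmetic.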

In our experiments, we use the LSTM-based language model proposed by Kim et al. (2016). This system is adapted to language modelling of morphologically rich languages, such as Slovenian, by employing an additional character level convolutional neural network (CNN). The convolutional level learns the character structure of words and is connected to the LSTM-based language model, which produces predictions at the word level.

Recently, Bai, Kolter, and Koltun (2018) introduced a new sequence modelling architecture based on convolution, called temporal convolutional network (TCN). TCN uses causal convolution operations, which make sure that there is no information leakage from future time steps to the past. This, and the fact that TCN takes a sequence as an input and maps it into an output sequence of the same size, makes this architecture appropriate for language modelling. TCNs are capable of leveraging long contexts for their prediction by using a very deep network architecture and a hierarchy of dilated convolutions. A single dilated convolution operation on element $s$ of the 1-dimensional sequence $x$ can be defined with the following equation:

$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$$

where $f$ is a filter of size $k$, $d$ a dilation factor and $s - d \cdot i$ accounts for the direction of the past. In this way, the context taken into account during the prediction can be increased by using larger filter sizes and by increasing the dilation factor. The most common practice is to increase the dilation factor exponentially with the depth of the network.
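The dilated causal convolution can be written directly from its definition. The sketch below is a plain-Python illustration; treating positions before the start of the sequence as zeros (left zero-padding) is an assumption made here so that the output has the same length as the input, and `receptive_field` is a hypothetical helper that assumes one convolution per layer.

```python
def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_{i=0}^{k-1} f[i] * x[s - d*i] for every position s.

    Positions before the start of the sequence are treated as zeros, so the
    output has the same length as x and never looks at future elements.
    """
    k = len(f)
    out = []
    for s in range(len(x)):
        total = 0.0
        for i in range(k):
            j = s - d * i
            if j >= 0:
                total += f[i] * x[j]
        out.append(total)
    return out


def receptive_field(k, dilations):
    """Number of past-and-present inputs seen by a stack of such layers."""
    return 1 + (k - 1) * sum(dilations)
```

For example, `dilated_causal_conv([1, 2, 3, 4, 5], [1, 1], 2)` adds to each element the element two steps earlier, and a stack of three layers with filter size 2 and dilations 1, 2, and 4 covers eight time steps, which illustrates why exponentially growing dilation extends the context quickly.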

Another recent approach to language modelling was proposed by Devlin et al. (2018). BERT (Bidirectional Encoder Representations from Transformers) uses both left and right context, which means that a word $w_t$ in a sequence is not determined just from its left sequence $w_{1:t-1}$ but also from its right word sequence $w_{t+1:N}$. This approach introduces a new learning objective, a masked language model, where a predefined percentage of randomly chosen words from the input word sequence are masked, and the objective is to predict these masked words from the unmasked context. This approach uses a transformer architecture, which relies on a self-attention mechanism proposed by Vaswani et al. (2017). The distinguishing feature of this approach is the employment of several parallel attention layers, the so-called attention heads, which reduce the computational cost and allow the system to attend to several dependencies at once.
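The masking step of the masked language model objective can be illustrated with a short sketch. It is a simplification under stated assumptions: the function name, the `[MASK]` placeholder string, and the default fraction of 0.15 are illustrative choices, every selected token is replaced by the placeholder, and the additional replacement rules used in the published BERT procedure are not reproduced.

```python
import random


def mask_tokens(tokens, mask_fraction=0.15, mask_token="[MASK]", seed=None):
    """Replace a random fraction of tokens with a mask placeholder.

    Returns the masked sequence and a dict {position: original token}
    holding the words the model would be trained to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token, but never more than the sequence contains.
    n_mask = min(len(tokens), max(1, round(mask_fraction * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = mask_token
    return masked, targets
```

The returned `targets` dictionary holds exactly the positions over which the training loss would be computed; all other tokens serve only as left and right context.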

All types of neural network language models (TCN, LSTM, and BERT) output a softmax probability distribution calculated over the entire vocabulary and present the probabilities for each word given its historical (and in the case of BERT also future) sequence. Training of these networks usually minimizes the negative log-likelihood (NLL) of the training corpus word sequence $w_{1:N} = [w_1, \ldots, w_N]$ by backpropagation through time:

$$\mathrm{NLL} = -\sum_{t=1}^{N} \log P(w_t \mid w_{1:t-1}) \tag{2}$$

In the case of BERT, the formula for minimizing NLL also uses the right-hand word sequence:

$$\mathrm{NLL} = -\sum_{t \in \mathcal{M}} \log P(w_t \mid w_{1:t-1}, w_{t+1:N})$$

where $w_t$, $t \in \mathcal{M}$, are the masked words.

The following equation, which is used for measuring the perplexity of neural language models, defines the relationship between perplexity (PPL, see Eq. (1)) and NLL (Eq. (2)):

$$\mathrm{PPL} = e^{\mathrm{NLL} / N}$$
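The link between the softmax output, the negative log-likelihood, and perplexity can be checked numerically. The sketch below is illustrative: it assumes natural logarithms, takes raw per-step scores (logits) over the vocabulary as plain lists, and normalizes the summed NLL by the sequence length before exponentiating; all function names are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(step_logits, target_ids):
    """Sum of -log P(w_t | context) over all time steps."""
    return -sum(
        math.log(softmax(logits)[target])
        for logits, target in zip(step_logits, target_ids)
    )


def perplexity_from_nll(nll, n_words):
    """PPL = exp(NLL / N), the per-word exponentiated loss."""
    return math.exp(nll / n_words)
```

A model that is maximally uncertain over a vocabulary of five words (equal logits at every step) yields a perplexity of 5, which agrees with the base-2 definition in Eq. (1) because changing the logarithm base does not change the exponentiated average.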

2.5 Neural text classification

The trend in natural language-related learning is to use deep learning approaches which have demonstrated state-of-the-art performance on a variety of different classification tasks, e.g., sentiment analysis (Tang, Qin, and Liu, 2015; Yang et al., 2016; Conneau et al., 2016), and topic categorization (Kusner et al., 2015; Yang et al., 2016; Conneau et al., 2016). We can divide the most popular neural network approaches to text classification into three groups, according to the architecture and learning technique used:

  • Recurrent neural networks (RNN). Since text is naturally represented as a sequence of characters, tokens, or words, the most frequent neural approach is to process it sequentially from left to right with an RNN, which is capable of memorizing the already seen part of a sequence. Learning long sequences with the plain RNN is difficult due to vanishing gradients (Bengio, Simard, and Frasconi, 1994). Therefore, the most popular RNN variant is the LSTM network described in Section 2.4, which employs the forget gate mechanism to address the vanishing gradient problem. Plain LSTMs are successful at capturing long contextual information but, unfortunately, they also capture a lot of noise, which is often present in unstructured data such as text. Many improvements have been proposed; one of the most successful is to employ a max pooling operation on the LSTM-produced word representations in order to minimize noise and filter out words with low predictive power (Conneau et al., 2017).

  • Hierarchical attention network (HAN) (Yang et al., 2016) takes the hierarchical structure of text into account through the attention mechanism (Bahdanau, Cho, and Bengio, 2014; Xu et al., 2015) applied to word and sentence representations encoded by bidirectional RNNs. The main difference between the attention based approach and the filtering approach proposed by Conneau et al. (2017) is the acknowledgment that the informativeness of words and sentences is context-dependent, so the same words and sentences in different documents might have completely different predictive power.

    Given a sentence $i$ with the word representations $h_{it}$, the attention mechanism on the word level can be described with the following set of equations:

    $$
    \begin{aligned}
    u_{it} &= \tanh(W_w h_{it} + b_w) \\
    \alpha_{it} &= \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)} \\
    s_i &= \sum_{t} \alpha_{it} h_{it}
    \end{aligned}
    $$

    The word representation $h_{it}$ is first fed to a dense layer with the $\tanh$ activation function to get a hidden representation $u_{it}$. The importance of the hidden representation is calculated by measuring the similarity between $u_{it}$ and a randomly initialized context vector $u_w$. The softmax function is applied to derive a normalized similarity weight $\alpha_{it}$, which is used for calculation of the final sequence vector $s_i$ as a weighted sum of the $h_{it}$. The final sequence vector $s_i$, calculated on the word level, is used as an input to the same attention mechanism on the sentence level, which produces a document representation as an output. This output is used as a feature matrix for the final document classification.

  • Transfer learning is the latest state-of-the-art approach to text classification (Howard and Ruder, 2018; Devlin et al., 2018). In this approach, we first pretrain a neural language model on a large general corpus and then fine-tune this model for a specific classification task by adding the final classification layer. The network with an additional layer is trained for a few additional epochs on new data. The syntactic and semantic knowledge of the pretrained language model is transferred and leveraged for the new classification task. An example of this approach is the BERT language model (Devlin et al., 2018) pretrained on the concatenation of BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words), to which an additional linear classification head is added. This model achieved state-of-the-art results on many text classification tasks, such as the question answering task on the SQuAD dataset (Rajpurkar et al., 2016), and several language inference tasks.
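The word-level attention used by the hierarchical attention network in the list above can be sketched in plain Python as a dense tanh layer, a softmax over similarity scores with a context vector, and a weighted sum. The function and parameter names are hypothetical, and in a trained model the dense-layer weights, bias, and context vector would be learned rather than passed in by hand.

```python
import math


def word_attention(H, W, b, context):
    """Attention pooling over the word vectors of one sentence.

    H: list of word representations (one list of floats per word).
    W, b: weights and bias of the dense tanh layer.
    context: context vector used to score each hidden representation.
    Returns the sentence vector and the attention weights.
    """
    # Hidden representation of every word: tanh(W h + b).
    hidden = [
        [math.tanh(sum(w * h for w, h in zip(row, h_t)) + bias)
         for row, bias in zip(W, b)]
        for h_t in H
    ]
    # Similarity of each hidden representation to the context vector.
    scores = [sum(u * c for u, c in zip(u_t, context)) for u_t in hidden]
    # Softmax over the words of the sentence (shifted for stability).
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Sentence vector: attention-weighted sum of the word representations.
    dim = len(H[0])
    sentence = [sum(a * h_t[j] for a, h_t in zip(alpha, H)) for j in range(dim)]
    return sentence, alpha
```

With a zero context vector all words receive equal weight and the sentence vector is the plain average of the word vectors; a non-zero context vector shifts weight toward the words whose hidden representation aligns with it. Applying the same function to the resulting sentence vectors gives the sentence-level attention.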

3 Datasets

All experiments are conducted on four corpora labelled with readability scores:

  • The WeeBit corpus: The articles from WeeklyReader and BBC-Bitesize are classified into five classes according to the age group they are targeting. The classes correspond to age groups between 7-8, 8-9, 9-10, 10-14 and 14-16. In the original corpus of Vajjala and Meurers (2012), the classes are balanced and the corpus contains altogether 3125 documents, 625 per class. In our experiments, we followed recommendations of Xia, Kochmar, and Briscoe (2016) in order to fix broken sentences and remove additional meta information, such as copyright declaration and links, strongly correlated with the target labels. We re-extracted the corpus from the HTML files according to the procedure described in Xia, Kochmar, and Briscoe (2016) and discarded some documents due to the lack of content after the extraction and cleaning process. The final corpus used in our experiments contains altogether 3000 documents, 600 per class.

  • The OneStopEnglish corpus (Vajjala and Lucic, 2018) contains aligned texts of three distinct reading levels (beginner, intermediate, and advanced) that were written specifically for English as Second Language (ESL) learners. The corpus consists of 189 texts, each written in three versions (567 in total). The corpus is freely available.

  • The Newsela corpus (Xu, Callison-Burch, and Napoles, 2015). We use the version of the corpus from 29 January 2016 consisting of altogether 10,786 documents, out of which we only used 9,565 English documents. The corpus contains 1,911 original English articles and up to five simplified versions for every original article. The original and simplified versions correspond to altogether eleven different grade levels (from 2nd to 12th grade). Grade levels are imbalanced; the exact numbers of articles per grade are presented in Table 1.

  • Corpus of Slovenian school books (Slovenian SB): In order to test the transferability of the proposed approaches to other languages, a corpus of Slovenian school books was compiled. The corpus contains 3,639,665 words in 125 school books for nine grades of primary school and four grades of secondary school. For supervised classification experiments, we split the school books into chunks twenty sentences long, in order to build a train and test set with a sufficient number of documents. The exact numbers of school books and chunks per grade are presented in Table 2.
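Splitting a book into fixed-length chunks of sentences, as described for the Slovenian school books above, can be sketched as follows. The function name is hypothetical, sentence segmentation is assumed to have been done already, and the handling of a final chunk shorter than twenty sentences is not specified in the text, so the `drop_last` switch is an assumption added for illustration.

```python
def chunk_sentences(sentences, chunk_size=20, drop_last=False):
    """Split a list of sentences into consecutive chunks of chunk_size.

    If drop_last is True, a trailing chunk shorter than chunk_size is discarded.
    """
    chunks = [
        sentences[start:start + chunk_size]
        for start in range(0, len(sentences), chunk_size)
    ]
    if drop_last and chunks and len(chunks[-1]) < chunk_size:
        chunks.pop()
    return chunks
```

Each chunk inherits the grade label of the book it comes from, which turns a small number of long books into many labeled documents.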

Language models are trained on large corpora of texts. We used the following corpora.

  • Corpus of English Wikipedia and Corpus of Simple Wikipedia articles. We created three corpora for use in our unsupervised English experiments (English Wikipedia and Simple Wikipedia dumps from 26 January 2018 were used for the corpus construction):

    • Wiki-normal contains 130,000 randomly selected articles from the Wikipedia dump;

    • Wiki-simple contains 130,000 randomly selected articles from the Simple Wikipedia dump;

    • Wiki-balanced contains 65,000 randomly selected articles from the Wikipedia dump (dated 26 January 2018) and 65,000 randomly selected articles from the Simple Wikipedia dump.

  • KRES-balanced: KRES corpus (Logar et al., 2012) is a 100 million word balanced reference corpus of Slovenian language. 35% of its content are books, 40% periodicals, and 20% internet texts. From this corpus we took all the available documents from two children magazines (Ciciban and Cicido), all documents from four teenager magazines (Cool, Frka, PIL plus and Smrklja), and documents from three magazines targeting adult audiences (Življenje in tehnika, Radar, City magazine). With these texts we built a corpus with approximately 2.4 million words. The corpus is balanced in a sense that about one third of the sentences come from documents targeting children, one third is targeting teenagers, and the last third is targeting adults.
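The construction of a balanced training corpus such as Wiki-balanced can be illustrated with a short sketch. It assumes both article collections are already loaded as Python lists; the function name, the fixed seed, and the final shuffle are illustrative choices, not details taken from the text.

```python
import random


def build_balanced_corpus(normal_articles, simple_articles, per_source, seed=0):
    """Draw the same number of random articles from each source and mix them."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    selected = rng.sample(normal_articles, per_source) + rng.sample(simple_articles, per_source)
    rng.shuffle(selected)
    return selected


# Toy stand-ins for the two Wikipedia dumps.
normal = [("normal", i) for i in range(1000)]
simple = [("simple", i) for i in range(400)]
balanced = build_balanced_corpus(normal, simple, per_source=65)
```

With `per_source=65000` the same call would give a corpus of the size described for Wiki-balanced.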

Grade #documents #tokens
2nd 224 74,428
3rd 500 197,992
4th 1,569 923,828
5th 1,342 912,411
6th 1,058 802,057
7th 1,210 979,471
8th 1,037 890,358
9th 750 637,784
10th 20 19,012
11th 2 1,130
12th 1,853 1,833,781
all 9,565 7,272,252
Table 1: The number of English articles and tokens per specific grade in the Newsela corpus.
Grade #school books #chunks #tokens
primary school - 1st 8 85 13,034
primary school - 2nd 7 181 30,368
primary school - 3rd 7 334 62,241
primary school - 4th 13 1,258 265,647
primary school - 5th 15 1,480 330,340
primary school - 6th 12 1,196 279,677
primary school - 7th 13 1,837 463,109
primary school - 8th 15 2,304 541,202
primary school - 9th 16 2,689 688,310
secondary school - 1st 11 2,077 578,968
secondary school - 2nd 4 737 206,396
secondary school - 3rd 3 662 166,060
secondary school - 4th 1 56 14,313
all 125 14,896 3,639,665
Table 2: The number of school books, text chunks and tokens per grade in the corpus of Slovenian school books.

4 Unsupervised neural approach

In this section, we explore how language models can be used for determining readability of the text by injecting discourse cohesion and background knowledge information into the measurement of readability. In Section 4.1, we describe the methodology, and in Section 4.2.2, we present the results of the conducted experiments.

4.1 Methodology

The main tools we use for the assessment of readability in an unsupervised setting are neural language models, described in Section 2.4. We use three types of neural language model architectures: recurrent (LSTM), convolutional, and transformer neural networks. The two main questions we wish to investigate in the unsupervised approach are the following:

  • Can language models be used independently for unsupervised readability prediction?

  • Can we develop a robust new readability formula that will outperform traditional readability formulas by relying not only on shallow lexical sophistication indicators but also on background knowledge and discourse cohesion indicators?

4.1.1 Language models for unsupervised readability assessment

The findings of the related research suggest that a separate language model should be trained for each readability class in order to extract features for successful readability prediction (Petersen and Ostendorf, 2009; Xia, Kochmar, and Briscoe, 2016). However, as neural language models capture much more information compared to the traditional n-gram models, we test the possibility of using a single neural language model for the unsupervised readability prediction. We hypothesize that a language model, trained on a corpus with a similar amount of content for different age groups, shall return lower perplexity for more standard, predictable (i.e. readable) texts. The intuition behind this hypothesis is that complex and rare language structures and vocabulary of less readable texts would negatively affect the performance of the language model, expressed via larger perplexity score.
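To make the hypothesis concrete, the sketch below computes perplexity from the probabilities a language model assigned to the words that actually occurred. It assumes the common definition of perplexity as the exponential of the average negative log-likelihood; the probabilities are invented for illustration and do not come from any of the trained models.

```python
import math


def perplexity(token_probs):
    """Exponential of the mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# Hypothetical per-word probabilities for two four-word texts.
predictable_text = [0.5, 0.4, 0.6, 0.5]
text_with_rare_word = [0.5, 0.4, 0.01, 0.5]

print(perplexity(predictable_text) < perplexity(text_with_rare_word))  # True
```

A single poorly predicted word is enough to raise the perplexity of the second text, which is the effect the hypothesis relies on.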

To test this hypothesis, we train language models on Wiki-normal, Wiki-simple, and Wiki-balanced corpora described in Section 3. We expect the following results:

  • Training the language models on a balanced corpus containing the same number of texts for adults and children (Wiki-balanced corpus) would positively affect the correlation between the language model performance and readability, since all our test corpora (WeeBit, OneStopEnglish and Newsela) contain texts meant for children and young adults.

  • The language models trained only on texts for adults (Wiki-normal) will show higher perplexity on texts for children, since their training set did not contain such texts; this will negatively affect the correlation between the language model performance and readability.

  • Training the language models only on texts for children (Wiki-simple corpus) will result in a higher perplexity score of the language model when applied to adult texts. This will positively affect the correlation between the language models’ performance and readability. However, this language model will not be able to reliably distinguish between texts for different age groups of young adults and teenagers, which will have a negative effect on the correlation.

Note that all three Wiki corpora contain the same number of articles, in order to make sure that the training set size does not influence the results of the experiments.

To further test the viability of the hypothesis presented above and to test the limits of using a single language model for unsupervised readability prediction, we also explore the possibility of using a language model trained on a large general corpus for the unsupervised readability prediction.

4.1.2 Ranked sentence readability score

Based on the two considerations below, we propose a new Ranked Sentence Readability Score (RSRS) for measuring the readability with language models.

  • The shallow lexical sophistication indicators, such as the length of a sentence, correlate well with the readability of a text. Using them alongside statistics derived from language models could improve the unsupervised readability prediction.

  • The perplexity score used for measuring the performance of a language model is an unweighted sum of perplexities of words in the predicted sequence. In reality, a small number of unreadable words might drastically reduce the readability of the entire text. Assigning larger weights to such words might improve the correlation of language model scores with the readability.

The proposed readability score is calculated with the following procedure. First, a given text is split into sentences with the default sentence tokenizer from the NLTK library (Bird and Loper, 2004). In order to get a readability estimation for each word in a specific context, we compute, for each word in the sentence, the word negative log-likelihood (WNLL) according to the following formula:

$$\mathrm{WNLL} = -\left(y_t \log y_p + (1 - y_t) \log (1 - y_p)\right)$$

where $y_p$ denotes the probability (from the softmax distribution) predicted by the language model according to the historical sequence, and $y_t$ denotes the true probability distribution of a word. The $y_t$ has the value 1 for the word in the vocabulary that actually appears next in the sequence and the value 0 for all the other words in the vocabulary. Next, we sort all the words in the sentence in ascending order according to their WNLL score and the ranked sentence readability score (RSRS) is calculated with the following expression:

$$\mathrm{RSRS} = \frac{\sum_{i=1}^{S} \sqrt{i} \cdot \mathrm{WNLL}(i)}{S}$$

where $S$ denotes the sentence length and $i$ represents the rank of a word in a sentence according to its WNLL value. The square root of the word rank is used for proportionally weighting words according to their readability, since initial experiments suggested that the use of a square root of a rank represents the best balance between allowing all words to contribute equally to the overall readability of the sentence and allowing only the least readable words to affect the overall readability of the sentence. For out of vocabulary words, square root rank weights are doubled, since these rare words are in our opinion good indicators of non-standard text. Finally, in order to get the readability score for the entire text, we calculate the average of all the RSRS scores in the text. An example of how RSRS is calculated for a specific sentence is shown in Figure 1.

Figure 1: The RSRS calculation for the sentence This could make social interactions easier for them.
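The procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the language model is replaced by a list of probabilities it would assign to the words that actually occur, the WNLL of such a word is taken to be the negative logarithm of that probability (the one-hot case of the cross-entropy form), and the function names `rsrs` and `document_rsrs` are hypothetical.

```python
import math


def rsrs(word_probs, oov_flags=None):
    """Ranked sentence readability score for one sentence.

    word_probs: probability the language model assigned to each actual word.
    oov_flags:  optional booleans marking out-of-vocabulary words.
    """
    if oov_flags is None:
        oov_flags = [False] * len(word_probs)
    # With a one-hot target, the word negative log-likelihood is -log(p).
    scored = [(-math.log(p), oov) for p, oov in zip(word_probs, oov_flags)]
    scored.sort(key=lambda pair: pair[0])  # ascending WNLL, so hard words get high ranks
    total = 0.0
    for rank, (wnll, oov) in enumerate(scored, start=1):
        weight = math.sqrt(rank)
        if oov:
            weight *= 2.0  # square-root rank weight is doubled for OOV words
        total += weight * wnll
    return total / len(scored)


def document_rsrs(sentences):
    """Average the sentence-level scores over all sentences of a text."""
    scores = [rsrs(probs) for probs in sentences]
    return sum(scores) / len(scores)


# Hypothetical probabilities for a three-word and a two-word sentence.
doc = [[0.5, 0.25, 0.125], [0.5, 0.5]]
print(document_rsrs(doc))
```

Because the words are sorted before weighting, the score does not depend on word order within the sentence, only on the set of WNLL values and the sentence length.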

The main idea behind the RSRS score is to avoid the reductionism of traditional readability formulas. We aim to achieve this by including discourse cohesion and background knowledge through language model based statistics. The first assumption is that low discourse cohesion has a negative effect on the performance of the language model, resulting in a higher WNLL for words in complex grammatical and lexical contexts. The second assumption is that the background knowledge is included in the readability calculation: tested documents with semantics dissimilar to the documents in the language model training set will negatively affect the performance of the language model, resulting in the higher WNLL score for words with unknown semantics. The trainable nature of language models allows for customization and personalization of the RSRS for specific tasks, topics and languages. This means that RSRS shall alleviate the problem of cultural non-transferability of traditional readability formulas.

On the other hand, the RSRS also leverages shallow lexical sophistication indicators through the index weighting scheme which makes sure that less readable words contribute more to the overall readability score. This is somewhat similar to the counts of long and difficult words in the traditional readability formulas, such as GFI and DCRF. The value of RSRS also increases for texts containing longer sentences, since the square roots of the word rank weights become larger with increased sentence length. This is similar to the behaviour of traditional formulas such as GFI, FRE, FKGL, ARI, DCRF, where this effect is achieved by incorporating the ratio between the total number of words and the total number of sentences into the equation.
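For comparison, the sketch below shows how two of the traditional formulas use the ratio between the number of words and the number of sentences. The coefficients are the commonly published ones for FRE and FKGL and are assumed to match the definitions referred to in Section 2; the counts are made up, and syllable counting is left to the caller.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """FRE: higher values indicate more readable text."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """FKGL: higher values indicate less readable text."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


# Same words and syllables, split into 5 short or 2 long sentences.
short_sentences = (100, 5, 150)
long_sentences = (100, 2, 150)
print(flesch_reading_ease(*short_sentences) > flesch_reading_ease(*long_sentences))    # True
print(flesch_kincaid_grade(*short_sentences) < flesch_kincaid_grade(*long_sentences))  # True
```

Longer sentences lower FRE and raise FKGL, which is the same direction in which RSRS moves as sentence length grows.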

4.2 Unsupervised experiments

For the presented unsupervised readability assessment methodology based on neural language models, we first present the experimental design followed by the results.

4.2.1 Experimental design

Three different architectures of language models (described in Section 2.4) are used for the experiments: a convolutional word level language model (CLM) proposed by Bai, Kolter, and Koltun (2018), a recurrent language model (RLM) proposed by Kim et al. (2016), and an attention based language model BERT (Devlin et al., 2018). For the experiments on English, we train CLM and RLM on the three Wiki corpora. To explore the possibility of using a language model trained on a general corpus for the unsupervised readability prediction, we use a pretrained BERT language model trained on the Google Books Corpus (Goldberg and Orwant, 2013) (800M words) and Wikipedia (2,500M words) for the experiments on English. For the experiments on Slovenian, corpora containing just texts for children are too small for efficient training of language models; therefore, CLM and RLM were only trained on the KRES-balanced corpus described in Section 3. To explore the possibility of using a general language model for the unsupervised readability prediction, we use a pretrained multilingual BERT language model trained on Wikipedia dumps of the hundred languages with the largest Wikipedias, including Slovenian.

The performance of language models is typically measured with the perplexity (see Eq. (1)). To answer the research question of whether language models can be used independently for unsupervised readability prediction, we investigate how the measured perplexity of language models correlates with the readability labels in the gold-standard WeeBit, OneStopEnglish, Newsela, and Slovenian school books corpora described in Section 3. The correlation with these ground truth readability labels is also used to evaluate the performance of the RSRS measure. For performance comparison, we calculate the traditional readability formula values (described in Section 2) for each document in the gold-standard corpora and also measure the correlation between these values and the manually assigned labels. As a baseline we use the average sentence length in each document.

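The correlation is measured with the Pearson correlation coefficient ($\rho$). Given a pair of distributions $X$ and $Y$, the covariance $\mathrm{cov}$, and the standard deviation $\sigma$, the formula for $\rho$ is:

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$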

A larger positive correlation signifies a better performance for all measures except the FRE readability measure. As this formula assigns higher scores to more readable texts, a larger negative correlation suggests a better performance of the measure.
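The evaluation step can be illustrated with a small pure-Python implementation of the Pearson coefficient, written as covariance divided by the product of the standard deviations. The readability labels and scores below are invented toy values, not data from any of the corpora.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y).

    Raises ZeroDivisionError if either input is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (std_x * std_y)


# Toy example: four documents with grade labels 1 to 4.
labels = [1, 2, 3, 4]
avg_sentence_length = [8.0, 11.0, 13.5, 17.0]  # grows with difficulty
fre_like = [90.0, 75.0, 70.0, 50.0]            # FRE-style score, falls with difficulty

print(pearson(labels, avg_sentence_length) > 0)  # True
print(pearson(labels, fre_like) < 0)             # True
```

The second call shows why a strongly negative coefficient counts as good performance for FRE.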

4.2.2 Experimental results

The results of the experiments are presented in Table 3. The average rankings of measures on the English and Slovenian datasets are presented in Table 4.

Measure/Dataset WeeBit OneStopEnglish Newsela Slovenian SB
RLM perplexity-balanced -0.0819 0.405 0.512 0.303
RLM perplexity-simple -0.115 0.420 0.470 /
RLM perplexity-normal -0.127 0.283 0.341 /
CLM perplexity-balanced -0.0402 0.474 0.528 0.136
CLM perplexity-simple -0.0542 0.524 0.583 /
CLM perplexity-normal -0.117 0.292 0.270 /
BERT perplexity -0.123 -0.162 -0.673 -0.651
RLM RSRS-balanced 0.497 0.551 0.890 0.732
RLM RSRS-simple 0.506 0.569 0.893 /
RLM RSRS-normal 0.490 0.536 0.886 /
CLM RSRS-balanced 0.446 0.599 0.894 0.789
CLM RSRS-simple 0.451 0.615 0.896 /
CLM RSRS-normal 0.414 0.576 0.890 /
BERT RSRS 0.279 0.384 0.674 -0.301
GFI 0.544 0.550 0.849 0.730
FRE -0.433 -0.485 -0.775 -0.614
FKGL 0.544 0.533 0.865 0.697
ARI 0.488 0.520 0.875 0.658
DCRF 0.420 0.496 0.735 0.686
SMOG 0.456 0.498 0.813 0.770
Avg. sentence length 0.508 0.498 0.906 0.683
Table 3: Pearson correlation coefficient between manually assigned readability labels and the readability scores assigned by different readability measures in the unsupervised setting. The highest correlation for each corpus is marked with the bold typeface.
Measure Avg. rank ENG Abs. rank ENG Abs. rank SLO Diff.
RLM RSRS-simple 4.0 1.0 / /
CLM RSRS-simple 4.0 1.0 / /
Avg. sentence length 5.0 3.0 7.0 4.0
CLM RSRS-balanced 5.0 3.0 1.0 2.0
RLM RSRS-balanced 5.0 3.0 3.0 0.0
GFI 5.7 6.0 4.0 2.0
FKGL 6.0 7.0 5.0 2.0
RLM RSRS-normal 6.7 8.0 / /
CLM RSRS-normal 7.0 9.0 / /
ARI 8.3 10.0 8.0 2.0
SMOG 10.0 11.0 2.0 9.0
FRE 12.3 12.0 9.0 3.0
DCRF 12.7 13.0 6.0 7.0
CLM perplexity-simple 13.3 14.0 / /
BERT RSRS 15.3 15.0 12.0 3.0
CLM perplexity-balanced 15.3 15.0 11.0 4.0
RLM perplexity-balanced 17.0 17.0 10.0 7.0
RLM perplexity-simple 17.3 18.0 / /
CLM perplexity-normal 19.3 19.0 / /
RLM perplexity-normal 20.0 20.0 / /
BERT perplexity 20.7 21.0 13.0 8.0
Table 4: Ranking of measures on English and Slovenian datasets. The column Avg. rank ENG presents the average rank on three English datasets, the column Abs. rank ENG presents the ranking of measures according to their average rank on English datasets (absolute ranking according to the average rank score achieved by a specific measure), and the column Abs. rank SLO presents ranking of measures on the Slovenian school books corpus. The column Diff. presents the difference between the Abs. rank ENG and Abs. rank SLO ranking.
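The ranking procedure behind Table 4 can be sketched as follows. The sketch makes two assumptions that the text does not state explicitly: the sign of the FRE correlation is flipped before ranking, and tied measures share the better rank. It uses only four measures from Table 3, so the ranks it produces are relative to that subset and are not the values reported in Table 4.

```python
def rank_measures(correlations, negative_is_better=()):
    """Rank measures on one dataset; rank 1 is the highest (sign-adjusted) correlation."""
    adjusted = {m: (-r if m in negative_is_better else r) for m, r in correlations.items()}
    # Rank = 1 + number of measures that are strictly better, so ties share a rank.
    return {
        m: 1 + sum(1 for other in adjusted.values() if other > value)
        for m, value in adjusted.items()
    }


def average_ranks(per_dataset, negative_is_better=()):
    """Average each measure's rank over all datasets."""
    ranks = [rank_measures(c, negative_is_better) for c in per_dataset.values()]
    return {m: sum(r[m] for r in ranks) / len(ranks) for m in ranks[0]}


# Four measures copied from Table 3 (English corpora only).
subset = {
    "WeeBit": {"GFI": 0.544, "FKGL": 0.544, "FRE": -0.433, "Avg. sentence length": 0.508},
    "OneStopEnglish": {"GFI": 0.550, "FKGL": 0.533, "FRE": -0.485, "Avg. sentence length": 0.498},
    "Newsela": {"GFI": 0.849, "FKGL": 0.865, "FRE": -0.775, "Avg. sentence length": 0.906},
}
avg = average_ranks(subset, negative_is_better=("FRE",))
for measure, rank in sorted(avg.items(), key=lambda item: item[1]):
    print(f"{measure}: {rank:.1f}")
```

Ranking the averaged values once more would give the absolute rank column.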

The correlation coefficients of all measures vary drastically between corpora. The highest values are obtained on the Newsela corpus, where the best performing measure (surprisingly, this is our baseline, the average sentence length) achieves a $\rho$ of 0.906. The highest $\rho$ values on the other two English corpora are much lower. On the WeeBit corpus, the best performance is achieved by the GFI and FKGL measures ($\rho$ of 0.544), and on the OneStopEnglish corpus the best performance is achieved by the proposed CLM RSRS-simple ($\rho$ of 0.615). On the Slovenian school books corpus, the $\rho$ values are higher, and the best performing measure is CLM RSRS-balanced with a $\rho$ of 0.789.

The perplexity-based measures show much lower correlation with the ground truth readability scores. Overall, they perform the worst of all the measures for both languages (see Table 4), but we can observe large differences in their performance across corpora. While there is either no correlation or a low negative correlation between the perplexities of all three language models and readability on the WeeBit corpus, there is some correlation between the perplexities achieved by RLM and CLM and readability on the OneStopEnglish and Newsela corpora (the highest being the $\rho$ of 0.583 achieved by CLM perplexity-simple on the Newsela corpus). The correlation between RLM and CLM perplexity measures and readability classes on the Slovenian school books corpus is low, with RLM perplexity-balanced showing a $\rho$ of 0.303 and CLM perplexity-balanced achieving a $\rho$ of 0.136.

BERT perplexities are negatively correlated with readability, and the negative correlation is relatively strong on the Newsela and Slovenian school books corpora ($\rho$ of -0.673 and -0.651, respectively) and weak on the WeeBit and OneStopEnglish corpora. As BERT was trained on Wikipedia articles and the Google Books corpus, which are mostly aimed at adults, the results seem to suggest that the BERT language model might actually be less perplexed by articles aimed at adults than by documents aimed at younger audiences. This suggests that using language models trained on general corpora for the unsupervised readability prediction is, at least according to our results, not a viable option.

With regard to our hypothesis that a language model trained on a corpus with a similar amount of content for different age groups shall achieve better performance on more readable texts, it is interesting to look at the differences in performance between CLM and RLM perplexity measures trained on the Wiki-normal, Wiki-simple and Wiki-balanced corpora. Results on the WeeBit corpus are hard to interpret, since all perplexity measures show a weak negative correlation with the readability. On the OneStopEnglish corpus, both Wiki-simple perplexity measures perform the best, while on the Newsela corpus, RLM perplexity-balanced outperforms RLM perplexity-simple by 0.042 and CLM perplexity-simple outperforms CLM perplexity-balanced by 0.055. Both Wiki-normal perplexity measures are outperformed by a large margin by the Wiki-simple and Wiki-balanced perplexity measures on the OneStopEnglish and the Newsela corpora. Similar observations can be made with regard to RSRS, which also leverages language model statistics. On all corpora, Wiki-simple RSRS measures outperform Wiki-balanced RSRS measures, and Wiki-balanced RSRS measures consistently outperform Wiki-normal RSRS measures.

These results are not entirely compatible with our initial expectation that Wiki-balanced measures would be the most correlated with readability in most cases. On the other hand, the differences in performance between Wiki-balanced and Wiki-simple measures are not large, and the positive correlations between readability and perplexity measures on the Newsela and OneStopEnglish corpora are quite strong, which supports the hypothesis that more complex language structures and vocabularies of less readable texts would result in higher perplexity on these texts. According to our results, this phenomenon might not be very strong and only works if the training set is balanced in terms of readability classes for different ages. On the other hand, if the training set contains more texts for adults than for children, as in the case of language models trained just on the Wiki-normal corpus (and also BERT), this phenomenon disappears or is even reversed, since language models trained on more complex language structures learn how to handle these difficulties.

The low performance of perplexity measures suggests that discourse cohesion and background knowledge leveraged by language models are not good indicators of readability and should therefore not be used in the readability formulas in the direct form. However, the results of CLM RSRS and RLM RSRS suggest that language models contain quite useful information when combined with other shallow lexical sophistication indicators. For English, the RLM RSRS-simple and the CLM RSRS-simple rank first with the average rank of 4.0. The CLM RSRS-balanced and RLM RSRS-balanced are the second best with the average rank of 5.0, together with the baseline average sentence length measure. CLM RSRS and RLM RSRS on Slovenian corpus also perform well with CLM RSRS-balanced being ranked first and RLM RSRS-balanced being the third. On the other hand, BERT RSRS is not well correlated with readability, with an average rank of 15.3 on the English corpora and the rank of 12.0 on the Slovenian corpus. This is not surprising, since all BERT perplexities are negatively correlated with the readability classes.

When it comes to cross-language transferability of readability measures (see column Diff. in Table 4), the most consistent ranking by performance is achieved by the RLM RSRS-balanced with no difference in ranking on English and Slovenian corpora. CLM RSRS-balanced, the best ranked measure on Slovenian corpus, also performs quite consistently with the difference in ranks of 2.0. Among the traditional measures, GFI presents the best balance in performance and consistency, ranking sixth on English and fourth on Slovenian. On the other hand, SMOG, which ranked very well on Slovenian (rank 2.0), ranked eleventh on English, which is the largest difference in ranking among all measures. The opposite can be said about the simplest readability measure, the average sentence length, which performed well on English (rank 3.0) and badly on Slovenian (rank 7.0).

To sum up, compared to perplexity scores and traditional readability measures, the proposed RSRS scores outperformed other scores on 2 out of 4 gold-standard datasets (see Table 3), achieved the best ranks, and showed the most stable cross-language performance (see Table 4).

5 Supervised neural approach

As mentioned in Section 2.5, recent trends in text classification show the domination of deep learning approaches which internally employ automatic feature construction. Surprisingly, even the most recent approaches to readability classification rely on hand-crafted features and standard machine learning classifiers (Vajjala and Lucic, 2018; Xia, Kochmar, and Briscoe, 2016). In this section, we describe how different types of neural classifiers can predict text readability and evaluate their performance.

The section is divided into Section 5.1, where we describe the methodology, and Section 5.2, where we present the experimental scenario and the results of the conducted experiments.

5.1 Methodology

There exist several successful architectures of neural networks. We tested three distinct neural network approaches to text classification described in Section 2.5:

  • Bidirectional Long short-term memory network (BiLSTM). We use the RNN approach proposed by Conneau et al. (2017) for classification. The bidirectional LSTM layer is a concatenation of forward and backward LSTM layers that read documents in two opposite directions. The max and mean pooling are applied to the LSTM output feature matrix in order to get the maximum and average values of the matrix. The resulting vectors are concatenated and fed to a linear layer responsible for producing final predictions.

  • Hierarchical attention networks (HAN). We use the same architecture as the one described in Yang et al. (2016), which takes the hierarchical structure of text into account through a two-level attention mechanism (Bahdanau, Cho, and Bengio, 2014; Xu et al., 2015) applied to word and sentence representations encoded by bidirectional LSTMs.

  • Transfer learning. We use a pretrained BERT transformer architecture with 12 layers of size 768 and 12 self-attention heads. A linear classification head was added on top of the pretrained language model and the whole classification model was fine-tuned on every data set for 3 epochs. For English data sets, a pretrained uncased language model trained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) was used, while for the Slovenian school books corpus, a multilingual uncased language model trained on Wikipedia dumps of the hundred languages with the largest Wikipedias was used (Devlin et al., 2018).
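The pooling step of the BiLSTM classifier can be shown without a deep learning framework. The sketch below is a simplified illustration, assuming the LSTM output is a matrix with one row per token and that pooling is done over the token dimension; the function name `max_mean_pool` and the toy values are hypothetical.

```python
def max_mean_pool(features):
    """Concatenate max-pooled and mean-pooled vectors.

    features: list of rows, one row of hidden values per token.
    """
    columns = list(zip(*features))  # one tuple per hidden dimension
    max_pooled = [max(column) for column in columns]
    mean_pooled = [sum(column) / len(column) for column in columns]
    return max_pooled + mean_pooled


# Toy LSTM output: 3 tokens, 2 hidden dimensions.
lstm_output = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]
print(max_mean_pool(lstm_output))  # [3.0, 4.0, 2.0, 2.0]
```

The resulting vector has a fixed size regardless of document length, which is what allows it to be fed to a single linear layer.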

We randomly split the Newsela and Slovenian school books corpora into a train (80% of the corpus), validation (10% of the corpus) and test (10% of the corpus) sets. Due to the small number of documents in OneStopEnglish and WeeBit corpora (see description in Section 3), we used five-fold cross validation on these corpora to get more reliable results. For every fold, the corpora were split into the train (80% of the corpus), validation (10% of the corpus) and test (10% of the corpus) sets.
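A random 80/10/10 split of this kind can be sketched with the standard library alone. The function name, the fixed seed, and the use of integer arithmetic for the set sizes are illustrative assumptions; the text does not specify how rounding was handled.

```python
import random


def split_80_10_10(n_documents, seed=0):
    """Return shuffled index lists for the train, validation and test sets."""
    indices = list(range(n_documents))
    random.Random(seed).shuffle(indices)
    n_train = n_documents * 8 // 10
    n_valid = n_documents // 10
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Size of the English part of the Newsela corpus.
train, valid, test = split_80_10_10(9565)
print(len(train), len(valid), len(test))  # 7652 956 957
```

For the cross-validated corpora, the same kind of split would be repeated once per fold.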

BiLSTM and HAN classifiers were trained on the train set and tested on the validation set after every epoch (for a maximum of 100 epochs), and the best performing model on the validation set was selected as the final model and produced predictions on the test sets. The validation sets were also used in a grid search to find the best hyperparameters of the models. For BiLSTM, all combinations of the following hyperparameter values were tested before choosing the best combination, which is written in bold in the list below:

  • Batch size: 8, 16, 32

  • Learning rates: 0.00005, 0.0001, 0.0002, 0.0004, 0.0008

  • Word embedding size: 100, 200, 400

  • LSTM layer size: 128, 256

  • Number of LSTM layers: 1, 2, 3, 4

  • Dropout after every LSTM layer: 0.2, 0.3, 0.4

For HAN, all combinations of the following hyperparameter values were tested (the best combination is written in bold in the list below):

  • Batch size: 8, 16, 32

  • Learning rates: 0.00005, 0.0001, 0.0002, 0.0004, 0.0008

  • Word embedding size: 100, 200, 400

  • Sentence embedding size: 100, 200, 400
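The two search spaces listed above can be enumerated with `itertools.product`. The dictionary keys and the `best_config` helper are hypothetical names, and the scoring function in the usage example is a stand-in for training a model and measuring its validation performance.

```python
from itertools import product

BILSTM_GRID = {
    "batch_size": [8, 16, 32],
    "learning_rate": [0.00005, 0.0001, 0.0002, 0.0004, 0.0008],
    "word_embedding_size": [100, 200, 400],
    "lstm_layer_size": [128, 256],
    "num_lstm_layers": [1, 2, 3, 4],
    "dropout": [0.2, 0.3, 0.4],
}

HAN_GRID = {
    "batch_size": [8, 16, 32],
    "learning_rate": [0.00005, 0.0001, 0.0002, 0.0004, 0.0008],
    "word_embedding_size": [100, 200, 400],
    "sentence_embedding_size": [100, 200, 400],
}


def grid(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def best_config(space, validation_score):
    """Return the combination with the highest validation score."""
    return max(grid(space), key=validation_score)


print(len(list(grid(BILSTM_GRID))))  # 1080
print(len(list(grid(HAN_GRID))))     # 135
```

An exhaustive search therefore means 1,080 training runs for BiLSTM and 135 for HAN per corpus on which the search is carried out.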

We used the same configuration for all the corpora and performed no corpus specific tweaking of classifier parameters. We measured the performance of all the classifiers in terms of accuracy (in order to compare their performance to the performance of the classifiers from the related work), weighted average precision, weighted average recall, and weighted average F-score. We calculate the weighted average precision, weighted average recall, and weighted average F-score by first calculating the precision ($p_i$) and recall ($r_i$) for each class $i$ according to the following formulae:

$$p_i = \frac{TP_i}{TP_i + FP_i}, \qquad r_i = \frac{TP_i}{TP_i + FN_i}$$

$TP_i$ are true positive predictions (documents correctly classified into class $i$), $FP_i$ are false positive predictions (documents incorrectly classified into class $i$), and $FN_i$ are false negative predictions (documents incorrectly classified into other classes instead of class $i$). The weighted average precision ($P_w$) and weighted average recall ($R_w$) are defined with the following equations:

$$P_w = \frac{\sum_{i \in C} n_i \, p_i}{\sum_{i \in C} n_i}, \qquad R_w = \frac{\sum_{i \in C} n_i \, r_i}{\sum_{i \in C} n_i}$$

Given a corpus with readability classes $C$, the precision for class $i$ is weighted with the number of documents belonging to that readability class ($n_i$). The same weighting scheme is used in the calculation of the weighted recall, where the recall for the class $i$ is weighted with the number of documents belonging to that readability class ($n_i$). The weighted average F-score is calculated as a weighted harmonic mean between $P_w$ and $R_w$ according to the following formula:

$$F_w = \frac{2 \, P_w \, R_w}{P_w + R_w}$$
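The weighted measures can be computed directly from a confusion matrix, as in the sketch below. It follows the verbal description in the text: per-class precision and recall are weighted by the number of documents in each class, and the F-score is taken as the harmonic mean of the two weighted averages. Treating an undefined per-class precision as 0 is an assumption. A common alternative averages per-class F-scores instead, which can give slightly different numbers.

```python
def weighted_metrics(confusion):
    """confusion[i][j]: number of documents of true class i predicted as class j."""
    n_classes = len(confusion)
    support = [sum(row) for row in confusion]  # documents per true class
    total = sum(support)
    precisions, recalls = [], []
    for i in range(n_classes):
        tp = confusion[i][i]
        fp = sum(confusion[k][i] for k in range(n_classes)) - tp
        fn = support[i] - tp
        # Assumption: a class that is never predicted gets precision 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p_w = sum(n * p for n, p in zip(support, precisions)) / total
    r_w = sum(n * r for n, r in zip(support, recalls)) / total
    f_w = 2 * p_w * r_w / (p_w + r_w) if p_w + r_w else 0.0
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total
    return {"accuracy": accuracy, "precision": p_w, "recall": r_w, "f_score": f_w}


# Toy three-class confusion matrix with 10 documents per class.
toy = [[8, 2, 0],
       [1, 7, 2],
       [0, 3, 7]]
scores = weighted_metrics(toy)
print(scores)
```

With this weighting, the weighted average recall is always equal to the accuracy, because each class's recall is multiplied by exactly the number of documents it was computed over.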

5.2 Experimental results

The results of supervised readability assessment using different architectures of deep neural networks are presented in Table 5.

Measure/Dataset WeeBit OneStopEnglish Newsela Slovenian SB
BERT accuracy 0.8393 0.5895 0.5810 0.5047
BERT precision 0.8456 0.6041 0.5797 0.5063
BERT recall 0.8393 0.5895 0.5810 0.5047
BERT F1 0.8401 0.5770 0.5759 0.5033
HAN accuracy 0.7700 0.7895 0.8046 0.4859
HAN precision 0.7755 0.8121 0.8070 0.4900
HAN recall 0.7700 0.7895 0.8046 0.4859
HAN F1 0.7679 0.7892 0.8037 0.4818
BiLSTM accuracy 0.7818 0.7214 0.6943 0.5108
BiLSTM precision 0.7869 0.7531 0.7159 0.5269
BiLSTM recall 0.7818 0.7214 0.6943 0.5108
BiLSTM F1 0.7815 0.7200 0.7021 0.5127

Table 5: The results of the supervised approach to readability in terms of accuracy, weighted precision, weighted recall, and weighted F-score for the three neural network classifiers.

On the WeeBit corpus, by far the best performance according to all measures was achieved by BERT. In terms of accuracy, BERT outperforms the second best classifier, BiLSTM, by almost 6 percentage points, achieving an accuracy of 83.93%. HAN performs the worst on the WeeBit corpus according to all measures. BERT also outperforms the best accuracy reported in the literature, the 80.3% achieved by Xia, Kochmar, and Briscoe (2016) in the five-fold cross validation setting, by about 3.6 percentage points.

On the other hand, BERT performs poorly on the OneStopEnglish and Newsela corpora. On both corpora, it is outperformed by the best performing classifier (HAN) by about 20 percentage points according to all criteria. We suspect that the main reason for the bad performance of BERT on these two corpora is the semantic similarity between classes. In these two corpora, the simplified versions of the original texts contain the same message as the original texts, but are written in a simpler way. The results of our experiments suggest that because BERT is pretrained as a language model, it tends to rely more on semantic than structural differences during the classification phase and therefore performs better on problems with distinct semantic differences between readability classes. This is the case with the WeeBit and Slovenian school books corpora but not with the OneStopEnglish and Newsela corpora.

The best performance on the OneStopEnglish corpus is achieved by the HAN classifier with the accuracy of 78.95% in the five-fold cross validation setting. This is slightly better than the state-of-the-art accuracy of 78.13% achieved by Vajjala and Lucic (2018) with their SMO classifier using 155 hand-crafted features. BiLSTM classifier performs substantially better than BERT on this corpus but still 6-7 percentage points lower than HAN.

A very similar ranking of the classifiers can be observed on the Newsela corpus. Here HAN substantially outperforms both BiLSTM and BERT with the F-score of 80.37%. While in the unsupervised setting the $\rho$ values on the Newsela corpus were substantially larger than on other corpora, this is not the case for performance measures in the supervised setting. Most likely, the eleven readability classes of the Newsela corpus present a much harder problem than, for example, the three readability classes of the OneStopEnglish corpus.

On the corpus of Slovenian school books, all classifiers achieve similar performance, but BiLSTM outperforms the other two classifiers according to all criteria. HAN performs the worst according to all criteria. In general, the performance of classifiers is the worst on this corpus, with the F-score of 51.27% achieved by BiLSTM being the best result. This can be partially attributed to the large number (thirteen) of readability classes in this corpus.

Figure 2: Confusion matrices for (a) BERT, (b) HAN, and (c) BiLSTM on the WeeBit corpus.
Figure 3: Confusion matrices for (a) BERT, (b) HAN, and (c) BiLSTM on the OneStopEnglish corpus.
Figure 4: Confusion matrices for (a) BERT, (b) HAN, and (c) BiLSTM on the Newsela corpus.
Figure 5: Confusion matrices for (a) BERT, (b) HAN, and (c) BiLSTM on the Slovenian school books corpus.

Since readability classes are ordinal variables, not all mistakes of classifiers are equal, i.e. classifications into a near readability class are less serious mistakes than classifications into more distant classes. Confusion matrices for classifiers give us a better insight into what kind of mistakes are specific for different classifiers. Confusion matrices for the WeeBit corpus (Figure 2) show that all the classifiers have the most problems with distinguishing between texts for children 8-9 years old and 9-10 years old. The mistakes where the text is falsely classified into an age group that is not neighbouring the correct age group are rare. For example, the best performing BERT classifier misclassified only fifteen documents into non-neighbouring classes.
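
The distinction between neighbouring and non-neighbouring mistakes can be read directly off a confusion matrix. The following is a minimal sketch of that bookkeeping, not the evaluation code used in the experiments. It assumes that rows are true classes, columns are predicted classes, and both are ordered by readability level; the function name `error_distance_counts` and the toy counts are hypothetical.

```python
def error_distance_counts(confusion):
    """Split the entries of a confusion matrix by ordinal distance.

    confusion[i][j] is the number of documents of true class i that were
    predicted as class j. Classes are assumed to be ordered by readability.
    Returns (correct, neighbouring_errors, distant_errors).
    """
    correct = neighbouring = distant = 0
    for i, row in enumerate(confusion):
        for j, count in enumerate(row):
            gap = abs(i - j)
            if gap == 0:
                correct += count        # on the diagonal
            elif gap == 1:
                neighbouring += count   # off by exactly one class
            else:
                distant += count        # off by two or more classes
    return correct, neighbouring, distant


# Hypothetical 3-class matrix with made-up counts (not taken from the figures).
toy = [
    [40, 8, 2],
    [6, 30, 9],
    [1, 7, 45],
]
correct, neighbouring, distant = error_distance_counts(toy)
# Share of all mistakes that land in a neighbouring class.
neighbouring_share = neighbouring / (neighbouring + distant)
print(correct, neighbouring, distant, round(neighbouring_share, 3))
```

A high neighbouring share indicates that most mistakes are the less serious kind, which is the property the confusion matrices are inspected for here.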

Similar findings hold for the OneStopEnglish corpus (Figure 3). Here, the BERT classifier, which performs the worst on this corpus, had the most problems correctly classifying documents from the intermediate class, misclassifying almost two thirds of these documents. The HAN and BiLSTM classifiers performed better, both misclassifying about one third of the documents from the intermediate class. Both classifiers had the least problems with documents from the advanced class, misclassifying approximately 15% of these documents.

Confusion matrices for the Newsela corpus (Figure 4) follow a similar pattern, even though the number of classes is much larger and classes are unbalanced. Unsurprisingly, no classifier predicted any documents to be in the two minority classes (10th and 11th grade) with very few training examples. The confusion matrix of the BERT classifier also clearly shows that this classifier has problems on this dataset, since the false predictions are more dispersed across classes than in the case of HAN and BiLSTM, which classified a large majority of misclassified instances into neighbouring classes. The most visible error made by BERT is misclassifying 50 documents from the 12th grade into non-neighbouring classes. On the other hand, the best performing HAN classifier misclassified only four examples from the 12th grade and altogether misclassified only eleven examples into non-neighbouring classes.

Confusion matrices for the Slovenian school books corpus (Figure 5) are similar, which is unsurprising given that all classifiers achieved similar performance on this dataset. The biggest spread of misclassified documents is visible for the classes in the middle of the readability range (from the 4th grade of primary school to the 1st grade of high school). Even though F-score results are relatively low on this dataset for all classifiers (the best F-score of 51.27% was achieved by BiLSTM), all confusion matrices clearly show that a majority of misclassified examples were put into classes close to the correct one, suggesting that classification approaches to readability prediction can also be reliably used for Slovenian.

Overall, the classification results suggest that neural networks are a viable option for supervised readability prediction. Our approach managed to outperform all standard machine learning classifiers that rely on extensive feature engineering (Xia, Kochmar, and Briscoe, 2016; Vajjala and Lucic, 2018) on both corpora for which comparisons are available.

6 Conclusion

We presented a set of novel approaches for determining readability of documents using deep neural networks. This is, to the best of our knowledge, the first attempt to leverage neural language models and neural network classifiers for readability prediction. The approaches are tested on a number of manually labeled English and Slovenian corpora. We improve the performance over current state-of-the-art approaches to readability prediction in both unsupervised and supervised settings.

The results suggest that unsupervised approaches to readability prediction that only take background knowledge and discourse cohesion into account cannot compete with the approaches based on shallow lexical sophistication indicators (e.g., sentence length, word length, etc.). However, combining the components of several readability indicators into the new RSRS (ranked sentence readability score) measure does improve the correlation with true readability scores. Additionally, the RSRS measure is adaptable, robust, and transferable across languages.

The functioning of the proposed RSRS measure can be customized and influenced by the choice of the training set. This is a desired property, since it enables personalization and localization of the readability measure according to the educational needs, language, and topic. The usability of this feature might be limited for under-resourced languages, since a sufficient number of documents needed to train a language model that can be used for readability prediction in a specific customized setting might not be available. On the other hand, our experiments on the Slovenian language show that a relatively small training corpus of 2.4 million words is sufficient for language models to outperform traditional readability measures.

The results of the unsupervised approach to readability prediction on the corpus of Slovenian school books are not entirely consistent with the results reported by the previous Slovenian readability study (Škvorc et al., 2018), where the authors reported that simple indicators of readability, such as average sentence length, performed quite well. Our results show that the average sentence length performs very competitively on English but ranks badly on Slovenian. This inconsistency in results might be explained by the difference in corpora used for the evaluation of our approaches. While Škvorc et al. (2018) conducted experiments on a corpus of magazines for different age groups (which we used for language model training), our experiments were conducted on a corpus of school books, which contains school books for sixteen distinct school subjects with very different topics ranging from literature, music and history to math, biology and chemistry. This might hint that the variance in genres and covered topics has an important effect on the ranking and performance of different readability measures. Further experiments on other Slovenian datasets, which we plan to conduct in the future, are required to confirm this hypothesis.

In the supervised approach to determining readability, we show that neural classifiers outperform state-of-the-art standard approaches on both corpora (WeeBit and OneStopEnglish) where comparison is available. However, the performance of different classifiers varies across corpora, which is especially true for the BERT classifier. We hypothesize that this is due to its language model pretraining with a focus on language understanding tasks, which makes the classifier sensitive to semantic information and therefore less appropriate for distinguishing between documents from different readability classes with similar meaning. More consistent behaviour is achieved by the HAN classifier, which manages to outperform the state-of-the-art approach proposed by Vajjala and Lucic (2018) on the OneStopEnglish corpus. Experiments also show that the attention-based HAN classifier might be more appropriate for readability classification than the BiLSTM classifier, most likely due to more comprehensive context information. Even though BiLSTM slightly outperforms HAN on two out of four corpora, it is surpassed by HAN by a large margin on the other two. These two corpora are OneStopEnglish and Newsela, where documents from different readability classes are semantically similar, which suggests that the HAN classifier might be better at leveraging syntactic and structural information and relies less on semantic differences.

The differences in performance between classifiers on different corpora suggest that tested classifiers take different types of information into account. Provided this hypothesis is correct, some gains in performance might be achieved if these classifiers are combined. We plan to test a neural ensemble approach for the task of predicting readability in the future.
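
One simple way to combine classifiers is soft voting, i.e. averaging the class probabilities that each model assigns to a document. The sketch below illustrates only this idea under stated assumptions; it is not the neural ensemble planned for future work. The function name `soft_vote` and the probability values are hypothetical.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities from several classifiers.

    prob_lists holds one probability vector per classifier for a single
    document; all vectors must cover the same classes in the same order.
    Returns (predicted_class_index, averaged_probabilities).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    averaged = [
        sum(probs[c] for probs in prob_lists) / n_models
        for c in range(n_classes)
    ]
    # Pick the class with the highest averaged probability.
    predicted = max(range(n_classes), key=averaged.__getitem__)
    return predicted, averaged


# Made-up outputs of three classifiers for one document and three classes.
bert_probs = [0.5, 0.3, 0.2]
han_probs = [0.2, 0.6, 0.2]
bilstm_probs = [0.3, 0.4, 0.3]

predicted, averaged = soft_vote([bert_probs, han_probs, bilstm_probs])
print(predicted, [round(p, 3) for p in averaged])
```

In this toy case the first model prefers class 0, but the averaged distribution favours class 1, which shows how models relying on different information can correct one another.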

A more detailed look into the confusion matrices of all classifiers on all corpora shows that the most common mistake all classifiers make is to misclassify a document into a neighbouring class. This makes our classification approaches to readability relatively informative and reliable even on the corpus of Slovenian school books, where the best F-score is relatively low compared to the very high results on the English corpora. The ordinal nature of readability classes will be further explored and exploited in future work, when supervised (ordinal) regression approaches for determining readability will be tested.
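
As an illustration of how the ordering of classes could be exploited, one common formulation of ordinal regression replaces a single K-way label with K-1 binary questions of the form "is the document harder than level t?". The sketch below shows only this label encoding and its decoding. It is an assumption about one possible setup, not the method that will be tested, and the helper names are hypothetical.

```python
def encode_ordinal(label, n_classes):
    """Turn class index `label` into n_classes - 1 cumulative binary targets.

    Target t is 1 when the document is harder than level t, else 0.
    """
    return [1 if label > t else 0 for t in range(n_classes - 1)]


def decode_ordinal(threshold_probs, cutoff=0.5):
    """Map predicted threshold probabilities back to a class index.

    The class index is the number of thresholds judged to be exceeded.
    """
    return sum(1 for p in threshold_probs if p > cutoff)


# Four readability levels, true level 2 (0-based).
targets = encode_ordinal(2, 4)
print(targets)

# Made-up model outputs for the three thresholds.
print(decode_ordinal([0.9, 0.7, 0.2]))
```

With this encoding, predicting a neighbouring class gets only one of the binary targets wrong, whereas predicting a distant class gets several wrong, so the training signal reflects how far off a prediction is.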

We also plan to test the cross-genre and cross-language transferability of the proposed supervised and unsupervised approaches. This requires new readability datasets for different languages and genres, which are currently rare or not publicly available. Such datasets might also open an opportunity to further improve the proposed unsupervised readability score.

The research was financially supported by the European social fund and Republic of Slovenia, Ministry of Education, Science and Sport through project Quality of Slovene textbooks (KaUČ). The work was supported by the Slovenian Research Agency (ARRS) core research programmes P6-0411 and P2-0103 and the project Terminology and knowledge frames across languages (J6-9372). This work has also received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825153 (EMBEDDIA). The results of this publication reflect only the authors’ views and the EC is not responsible for any use that may be made of the information it contains.


  • Anderson (1981) Anderson, Jonathan. 1981. Analysing the readability of english and non-english texts in the classroom with lix. In Seventh Australian Reading Association Conference, pages 1–12, ERIC.
  • Bahdanau, Cho, and Bengio (2014) Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Bai, Kolter, and Koltun (2018) Bai, Shaojie, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
  • Bengio, Simard, and Frasconi (1994) Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166.
  • Bird and Loper (2004) Bird, Steven and Edward Loper. 2004. Nltk: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, page 31, Association for Computational Linguistics.
  • Chen and Goodman (1999) Chen, Stanley F and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394.
  • Collins-Thompson and Callan (2005) Collins-Thompson, Kevyn and Jamie Callan. 2005. Predicting reading difficulty with statistical language models. Journal of the American Society for Information Science and Technology, 56(13):1448–1462.
  • Collobert et al. (2011) Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
  • Conneau et al. (2017) Conneau, Alexis, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
  • Conneau et al. (2016) Conneau, Alexis, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.
  • Crossley et al. (2017) Crossley, Scott A, Stephen Skalicky, Mihai Dascalu, Danielle S McNamara, and Kristopher Kyle. 2017. Predicting text comprehension, processing, and familiarity in adult readers: New approaches to readability formulas. Discourse Processes, 54(5-6):340–359.
  • Dale and Chall (1948) Dale, Edgar and Jeanne S Chall. 1948. A formula for predicting readability: Instructions. Educational research bulletin, pages 37–54.
  • Davison and Kantor (1982) Davison, Alice and Robert N Kantor. 1982. On the failure of readability formulas to define readable texts: A case study from adaptations. Reading research quarterly, pages 187–209.
  • Devlin et al. (2018) Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Council of Europe (2001) Council of Europe, Council for Cultural Co-operation. Education Committee. Modern Languages Division. 2001. Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press.
  • Goldberg and Orwant (2013) Goldberg, Yoav and Jon Orwant. 2013. A dataset of syntactic-ngrams over time from a very large corpus of english books. In Second Joint Conference on Lexical and Computational Semantics, pages 241–247.
  • Goodfellow, Bengio, and Courville (2016) Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
  • Gunning (1952) Gunning, Robert. 1952. The technique of clear writing. McGraw-Hill, New York.
  • Howard and Ruder (2018) Howard, Jeremy and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.
  • Kandel and Moles (1958) Kandel, Lilian and Abraham Moles. 1958. Application de l’indice de flesch à la langue française. Cahiers Etudes de Radio-Télévision, 19(1958):253–274.
  • Kim et al. (2016) Kim, Yoon, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In AAAI, pages 2741–2749.
  • Kincaid et al. (1975) Kincaid, J Peter, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Institute for Simulation and Training, University of Central Florida.
  • Kusner et al. (2015) Kusner, Matt, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.
  • Logar et al. (2012) Logar, Nataša, Miha Grčar, Marko Brakus, Tomaž Erjavec, Špela Arhar Holdt, Simon Krek, and Iztok Kosem. 2012. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Trojina, zavod za uporabno slovenistiko.
  • Mc Laughlin (1969) Mc Laughlin, G Harry. 1969. Smog grading - a new readability formula. Journal of reading, 12(8):639–646.
  • Mikolov et al. (2011) Mikolov, Tomáš, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan Černockỳ. 2011. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association.
  • Petersen and Ostendorf (2009) Petersen, Sarah E and Mari Ostendorf. 2009. A machine learning approach to reading level assessment. Computer speech & language, 23(1):89–106.
  • Rajpurkar et al. (2016) Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
  • Schwarm and Ostendorf (2005) Schwarm, Sarah E and Mari Ostendorf. 2005. Reading level assessment using support vector machines and statistical language models. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 523–530, Association for Computational Linguistics.
  • Si and Callan (2001) Si, Luo and Jamie Callan. 2001. A statistical model for scientific readability. In Proceedings of the tenth international conference on Information and knowledge management, pages 574–576, ACM.
  • Škvorc et al. (2018) Škvorc, Tadej, Simon Krek, Senja Pollak, Špela Arhar Holdt, and Marko Robnik-Šikonja. 2018. Evaluation of statistical readability measures on slovene texts. In Conference on Language Technologies and Digital Humanities, pages 240–247, Ljubljana University Press, Faculty of arts.
  • Smith and Senter (1967) Smith, Edgar A and R.J. Senter. 1967. Automated readability index. AMRL-TR. Aerospace Medical Research Laboratories (US), pages 1–14.
  • Tang, Qin, and Liu (2015) Tang, Duyu, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 1422–1432.
  • Tseng et al. (2019) Tseng, Hou-Chiang, Berlin Chen, Tao-Hsing Chang, and Yao-Ting Sung. 2019. Integrating lsa-based hierarchical conceptual space and machine learning methods for leveling the readability of domain-specific texts. Natural Language Engineering, 25(3):331–361.
  • Vajjala and Lucic (2018) Vajjala, Sowmya and Ivana Lucic. 2018. Onestopenglish corpus: A new corpus for automatic readability assessment and text simplification. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 297–304, Association for Computational Linguistics.
  • Vajjala and Meurers (2012) Vajjala, Sowmya and Detmar Meurers. 2012. On improving the accuracy of readability classification using insights from second language acquisition. In Proceedings of the seventh workshop on building educational applications using NLP, pages 163–173, Association for Computational Linguistics.
  • Vaswani et al. (2017) Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Xia, Kochmar, and Briscoe (2016) Xia, Menglin, Ekaterina Kochmar, and Ted Briscoe. 2016. Text readability assessment for second language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, pages 12–22.
  • Xu et al. (2015) Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057.
  • Xu, Callison-Burch, and Napoles (2015) Xu, Wei, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association of Computational Linguistics, 3(1):283–297.
  • Yang et al. (2016) Yang, Zichao, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.
  • Zhang, Zhao, and LeCun (2015) Zhang, Xiang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.
  • Zhu et al. (2015) Zhu, Yukun, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
  • Zwitter Vitez (2014) Zwitter Vitez, Ana. 2014. Ugotavljanje avtorstva besedil: primer "trenirkarjev". In Language technologies: Proceedings of the 17th International Multiconference Information Society - IS 2014, pages 131–134.