Readability assessment poses the task of identifying the appropriate reading level for text. Such labeling is useful for a variety of groups including learning readers and second language learners. Readability assessment systems generally involve analyzing a corpus of documents labeled by editors and authors for reader level. Traditionally, these documents are transformed into a number of linguistic features that are fed into simple models like SVMs and MLPs (Schwarm and Ostendorf, 2005; Vajjala and Meurers, 2012).
More recently, readability assessment models utilize deep neural networks and attention mechanisms (Martinc et al., 2019). While such models achieve state-of-the-art performance on readability assessment corpora, they struggle to generalize across corpora and fail to achieve perfect classification. Often, model performance is improved by gathering additional data. However, readability annotations are time-consuming and expensive given lengthy documents and the need for qualified annotators. A different approach to improving model performance involves fusing the traditional and modern paradigms of linguistic features and deep learning. By incorporating the inductive bias provided by linguistic features into deep learning models, we may be able to reduce the limitations posed by the small size of readability datasets.
In this paper, we evaluate the joint use of linguistic features and deep learning models. We achieve this fusion by simply taking the output of deep learning models as features themselves. Then, these outputs are joined with linguistic features to be further fed into some other model like an SVM. We select linguistic features based on a broad psycholinguistically-motivated composition by Vajjala Balakrishna (2015). Transformers and hierarchical attention networks were selected as the deep learning models because of their state-of-the-art performance in readability assessment. Models were evaluated on two of the largest available corpora for readability assessment: WeeBit and Newsela. We also evaluate with different sized training sets to investigate the use of linguistic features in data-poor contexts. Our results indicate that, given sufficient training data, the linguistic features do not provide a substantial benefit over deep learning methods.
The rest of this paper is organized as follows. Related research is described in section 2. Section 3 details our preprocessing, features, and model construction. Section 4 presents model evaluations on two corpora. Section 5 discusses the implications of our results.
We provide a publicly available version of the code used for our experiments (https://github.com/TovlyDeutsch/Linguistic-Features-for-Readability).
2 Related Work
Work on readability assessment has involved progress on three core components: corpora, features, and models. While early work utilized small corpora, limited feature sets, and simple models, modern research has experimented with a broad set of features and deep learning techniques.
Labeled corpora can be difficult to assemble given the time and qualifications needed to assign a text a readability level. The size of readability corpora expanded significantly with the introduction of the WeeklyReader corpus by Schwarm and Ostendorf (2005). Composed of articles from an educational magazine, the WeeklyReader corpus contains roughly 2,400 articles. The WeeklyReader corpus was then built upon by Vajjala and Meurers (2012) by adding data from the BBC Bitesize website to form the WeeBit corpus. This WeeBit corpus is larger, containing roughly 6,000 documents, while also spanning a greater range of readability levels. Within these corpora, topic and readability are highly correlated. Thus, Xia et al. (2016) constructed the Newsela corpus in which each article is represented at multiple reading levels thereby diminishing this correlation.
Early work on readability assessment, such as that of Flesch (1948), extracted simple textual features like character count. More recently, Schwarm and Ostendorf (2005) analyzed a broader set of features including out-of-vocabulary scores and syntactic features such as average parse tree height. Vajjala and Meurers (2012) assembled perhaps the broadest class of features. They incorporated measures shown by Lu (2010) to correlate well with second language acquisition measures, as well as psycholinguistically relevant features from the Celex Lexical database and MRC Psycholinguistic Database (Baayen et al., 1995; Wilson, 1988).
Traditional feature formulas, like the Flesch formula, relied on linear models. Later work progressed to more complex related models like SVMs (Schwarm and Ostendorf, 2005). Most recently, state-of-the-art performance has been achieved on readability assessment with deep neural networks incorporating attention mechanisms. These approaches ignore linguistic features entirely and instead feed the raw embeddings of input words, relying on the model itself to extract any relevant features. Specifically, Martinc et al. (2019) found that a pretrained transformer model achieved state-of-the-art performance on the WeeBit corpus while a hierarchical attention network (HAN) achieved state-of-the-art performance on the Newsela corpus.
Deep learning approaches generally exclude any specific linguistic features. In general, a “feature-less” approach is sensible given the hypothesis that, with enough data, training, and model complexity, a model should learn any linguistic features that researchers might attempt to precompute. However, precomputed linguistic features may be useful in data-poor contexts where data acquisition is expensive and error-prone. For this reason, in this paper we attempt to incorporate linguistic features with deep learning methods in order to improve readability assessment.
The WeeBit corpus was assembled by Vajjala and Meurers (2012) by combining documents from the WeeklyReader educational magazine and the BBC Bitesize educational website. They selected classes to assemble a broad range of readability levels intended for readers aged 7 to 16. To avoid classification bias, they undersampled classes in order to equalize the number of documents in each class to 625. We term this downsampled corpus “WeeBit downsampled”. Following the methodologies of Xia et al. (2016) and Martinc et al. (2019), we applied additional preprocessing to the WeeBit corpus in order to remove extraneous material.
The Newsela corpus (Xia et al., 2016) consists of 1,911 news articles each re-written up to 4 times in simplified manners for readers at different reading levels. This simplification process means that, for any given topic, there exist examples of material on that topic suited for multiple reading levels. This overlap in topic should make the corpus more challenging to label than the WeeBit corpus. In a similar manner to the WeeBit corpus, the Newsela corpus is labeled with grade levels ranging from grade 2 to grade 12. As with WeeBit, these labels can either be treated as classes or transformed into numeric labels for regression.
3.1.3 Labeling Approaches
Often, readability classes within a corpus are treated as unrelated. These approaches use raw labels as distinct unordered classes. However, readability labels are ordinal, ranging from lower to higher readability. Some work has addressed this issue, such as the readability models of Flor et al. (2013), which predict grade levels via linear regression. To test different approaches to acknowledging this ordinality, we devised three methods for labeling the documents: “classification”, “age regression”, and “ordered class regression”.
The classification approach uses the classes originally given. This approach does not suppose any ordinality of the classes. Avoiding such ordinality may be desirable for the sake of simplicity.
“Age regression” applies the mean of the age ranges given by the constituent datasets. For instance, in this approach Level 2 documents from Weekly Reader would be given the label of 7.5 as they are intended for readers of ages 7-8. The advantage of age regression over standard classification is that it provides more precise information about the magnitude of readability differences.
Finally, “ordered class regression” assigns the classes equidistant integers ordered by difficulty. The least difficult class would be labeled “0”, the second least difficult class would be labeled “1” and so on. As with age regression, this labeling results in a regression rather than classification problem. This method retains the advantage of age regression in demonstrating ordinality. However, ordered regression labeling removes information about the relative differences in difficulty between the classes, instead asserting that they are equidistant in difficulty. The motivation behind this loss of information is that such age differences between classes may not directly translate into differences of difficulty. For instance, the readability difference between documents intended for 7 or 8 year-olds may be much greater than between documents intended for 15 or 16 year-olds because reading development is likely accelerated in younger years.
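The three labeling schemes can be illustrated with a minimal sketch. Only the 7-8 age range for Level 2 comes from the text above; the other class names and age ranges, as well as the function names, are hypothetical placeholders chosen for illustration.

```python
# Hypothetical class-to-age-range mapping. Only "level2" (ages 7-8) is taken
# from the text; the remaining entries are illustrative assumptions.
AGE_RANGES = {
    "level2": (7, 8),
    "level3": (8, 9),
    "level4": (9, 10),
}


def classification_labels(classes):
    """Classification: keep the original classes as unordered labels."""
    return list(classes)


def age_regression_labels(classes, age_ranges):
    """Age regression: label each document with the mean of its age range."""
    return [sum(age_ranges[c]) / 2 for c in classes]


def ordered_class_labels(classes, age_ranges):
    """Ordered class regression: equidistant integers ordered by difficulty."""
    ordered = sorted(age_ranges, key=lambda c: sum(age_ranges[c]) / 2)
    rank = {c: i for i, c in enumerate(ordered)}
    return [float(rank[c]) for c in classes]


docs = ["level4", "level2", "level3"]
print(classification_labels(docs))
print(age_regression_labels(docs, AGE_RANGES))
print(ordered_class_labels(docs, AGE_RANGES))
```

Ordering the classes by mean age is one possible way to define difficulty order here; any externally supplied ordering would serve equally well.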
For final model inferences, we used the classification approach for comparison to previous work. For intermediary CNN models, all three approaches were tested. As the different approaches with CNN models produced insubstantial differences, other model types were restricted to the simple classification approach.
Motivated by the success in using linguistic features for modeling readability, we considered a large range of textual analyses relevant to readability. In addition to utilizing features posed in the existing readability research, we investigated formulating new features with a focus on syntactic ambiguity and syntactic diversity. These challenging aspects of language appeared to be underutilized in existing readability literature.
3.2.1 Existing Features
To capture a variety of features, we utilized existing linguistic feature computation software (available at https://bitbucket.org/nishkalavallabhi/complexity-features) developed by Vajjala Balakrishna (2015) based on 86 feature descriptions in existing readability literature. Given the large number of features, in this section we will focus on the categories of features and their psycholinguistic motivations (where available) and properties. The full list of features used can be found in appendix A.
The most basic features involve what Vajjala and Meurers (2012) refer to as “traditional features” for their use in long-standing readability formulae. They include characters per word, syllables per word, and traditional formulas based on such features like the Flesch-Kincaid formula (Kincaid et al., 1975).
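As a rough illustration of these traditional features, the sketch below computes characters per word, syllables per word, and the Flesch-Kincaid grade level (0.39 × words per sentence + 11.8 × syllables per word − 15.59). The syllable counter is a crude vowel-run heuristic assumed here for self-containment; it is not the method used by the feature software cited above, and the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Crude heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def traditional_features(sentences):
    """Compute simple surface features from a list of tokenized sentences."""
    words = [w for sent in sentences for w in sent]
    n_words = len(words)
    n_sentences = len(sentences)
    n_syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return {
        "chars_per_word": sum(len(w) for w in words) / n_words,
        "syllables_per_word": syllables_per_word,
        # Flesch-Kincaid grade level (Kincaid et al., 1975).
        "flesch_kincaid_grade": (
            0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
        ),
    }


features = traditional_features([["the", "cat", "sat"], ["it", "ran"]])
print(features)
```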
Another set of feature types consists of counts and ratios of part-of-speech tags, extracted using the Stanford parser (Klein and Manning, 2003). In addition to basic parts of speech like nouns, some features include phrase level constituent counts like noun phrases and verb phrases. All of these counts are normalized by either the number of word tokens or number of sentences to make them comparable across documents of differing lengths. These counts are not provided with any psycholinguistic motivation for their use; however, it is not an unreasonable hypothesis that the relative usage of these constituents varies across reading levels. Empirically, these features were shown to have some predictive power for readability. In addition to parts of speech counts, we also utilized word type counts as a simple baseline feature, that is, counting the number of instances of each possible word in the vocabulary. These counts are also divided by document length to generate proportions.
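The normalization described here can be sketched as follows. The POS tags are assumed to come from an external tagger or parser, and the lowercasing of word types is an assumption of this sketch rather than a detail stated in the paper; the function names are hypothetical.

```python
from collections import Counter


def word_type_proportions(tokens):
    """Count each word type and divide by document length."""
    if not tokens:
        return {}
    counts = Counter(token.lower() for token in tokens)  # lowercasing assumed
    return {word: count / len(tokens) for word, count in counts.items()}


def normalized_tag_counts(tags, n_sentences):
    """Normalize POS tag counts by number of tokens and by number of sentences."""
    counts = Counter(tags)
    n_tokens = len(tags)
    return {
        tag: {"per_token": count / n_tokens, "per_sentence": count / n_sentences}
        for tag, count in counts.items()
    }


tokens = ["The", "cat", "saw", "the", "dog"]
tags = ["DT", "NN", "VBD", "DT", "NN"]  # assumed tagger output
print(word_type_proportions(tokens))
print(normalized_tag_counts(tags, n_sentences=1))
```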
Becoming more abstract than parts of speech, some features count complex syntactic constituents like clauses and subordinated clauses. Specifically, Lu (2010) found ratios involving sentences, clauses, and t-units (defined by Vajjala and Meurers (2012) to be “one main clause plus any subordinate clause or non-clausal structure that is attached to or embedded in it”) that correlated with second language learners’ abilities to read a document. For many of the multi-word syntactic constituents previously described, such as noun phrases and clauses, features were also constructed of their mean lengths. Finally, properties of the syntactic trees themselves were analyzed such as their mean heights.
Moving beyond basic features from syntactic parses, Vajjala Balakrishna (2015) also incorporated “word characteristic” features from linguistic databases. A significant source was the Celex Lexical Database (Baayen et al., 1995) which “consists of information on the orthography, phonology, morphology, syntax and frequency for more than 50,000 English lemmas”. The database appears to have a focus on morphological data such as whether a word may be considered a loan word and whether it contains affixes. It also contains syntactic properties that may not be apparent from a syntactic parse, e.g. whether a noun is countable. The MRC Psycholinguistic Database (Wilson, 1988) was also used with a focus on its age of acquisition ratings for words, a clear indicator of the appropriateness of a document’s vocabulary.
3.2.2 Novel Syntactic Features
We investigated additional syntactic features that may be relevant for readability but whose qualities were not targeted by existing features. These features were used in tandem with the existing linguistic features described previously; future work could utilize these novel features independently to investigate their particular effect on readability information extraction. For generating syntactic parses, we used the PCFG (probabilistic context-free grammar) parser (Klein and Manning, 2003) from the Stanford Parser package.
Sentences can have multiple grammatical syntactic parses. Therefore, syntactic parsers produce multiple parses annotated with parse likelihood. It may seem sensible to use the number of parses generated as a measure of ambiguity. However, this measure is extremely sensitive to sentence length as longer sentences tend to have more possible syntactic parses. Instead, if this list of probabilities is viewed as a distribution, the standard deviation of this distribution is likely to correlate with perceptions of syntactic ambiguity.
The parse deviation, $PD_k(s)$, of sentence $s$ is the standard deviation of the distribution of the $k$ most probable parse log probabilities for $s$. If $s$ has fewer than $k$ valid parses, the distribution is taken from all the valid parses.
For large values of $k$, $PD_k(s)$ can be significantly sensitive to sentence length: longer sentences are likely to have more valid syntactic parses and thus create low probability tails that increase standard deviation. To reduce this sensitivity, an alternative involves measuring the difference between the largest and mean parse probability.
$PDM_k(s)$ is the difference between the largest parse log probability and the mean of the log probabilities of the $k$ most probable parses for a sentence $s$. If $s$ has fewer than $k$ valid parses, the mean is taken over all the valid parses.
As a compromise between parse investigation and the noise of implausible parses, we selected three instances of these measures, each with a fixed value of $k$, as features to use in the models of this paper.
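The two ambiguity measures defined above (the standard deviation of the top parse log probabilities, and the gap between the best log probability and their mean) can be sketched as below. The input is assumed to be a list of parse log probabilities already produced by a parser. The use of the population standard deviation is an assumption, since the text does not specify the variant, and the function names and example values are hypothetical.

```python
import statistics


def top_k(log_probs, k):
    """Return the k largest log probabilities (all of them if fewer than k)."""
    return sorted(log_probs, reverse=True)[:k]


def parse_deviation(log_probs, k):
    """Standard deviation of the k most probable parse log probabilities."""
    top = top_k(log_probs, k)
    # Population standard deviation is an assumption of this sketch.
    return statistics.pstdev(top) if top else 0.0


def parse_deviation_from_mean(log_probs, k):
    """Largest log probability minus the mean of the top-k log probabilities."""
    top = top_k(log_probs, k)
    return top[0] - statistics.fmean(top) if top else 0.0


# Illustrative log probabilities for one sentence (made-up values).
sentence_log_probs = [-10.0, -12.0, -14.0, -30.0]
print(parse_deviation(sentence_log_probs, k=3))
print(parse_deviation_from_mean(sentence_log_probs, k=3))
print(parse_deviation_from_mean(sentence_log_probs, k=10))
```

With `k=10`, only four parses are available, so both measures fall back to using all valid parses, matching the fallback rule stated in the definitions.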
To capture the grammatical makeup of a sentence or document, we can count the usage of each part of speech (“POS”), phrase, or clause. The counts can be collected into a distribution. Then, the standard deviation of this distribution, $POS_{dev}$, measures a text’s grammatical heterogeneity.
$POS_{dev}(d)$ is the standard deviation of the distribution of POS counts for document $d$.
Similarly, we may want to measure how this grammatical makeup differs from the composition of the document as a whole, a concept that might be termed syntactic uniqueness. To capture this concept, we measure the Kullback-Leibler divergence (Kullback and Leibler, 1951) between the sentence POS count distribution and the document POS count distribution.
Let $P(s_j)$ be the distribution of POS counts for sentence $s_j$ in document $d$. Let $Q$ be the distribution of POS counts for document $d$. Let $|d|$ be the number of sentences in $d$.
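A minimal sketch of both POS-based measures follows. It assumes the document is given as a list of sentences, each a list of POS tags from an external tagger. The aggregation of per-sentence divergences into a document score is taken here to be a simple mean over sentences, which is one plausible reading of the definitions above rather than a confirmed formula; the natural logarithm and population standard deviation are likewise assumptions, and all function names are hypothetical.

```python
import math
import statistics
from collections import Counter


def distribution(tags):
    """Turn a collection of POS tags into a probability distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


def pos_count_std(doc_tags):
    """Standard deviation of the POS counts of a document."""
    counts = Counter(tag for sent in doc_tags for tag in sent)
    return statistics.pstdev(list(counts.values()))


def kl_divergence(p, q):
    """KL(p || q); every tag in a sentence also occurs in its document."""
    return sum(pv * math.log(pv / q[tag]) for tag, pv in p.items())


def mean_sentence_kl(doc_tags):
    """Mean KL divergence between each sentence and the whole document."""
    q = distribution(tag for sent in doc_tags for tag in sent)
    sentences = [sent for sent in doc_tags if sent]
    total = sum(kl_divergence(distribution(sent), q) for sent in sentences)
    return total / len(sentences)


doc = [["NN", "NN"], ["VB", "VB"]]  # assumed tagger output
print(pos_count_std(doc))
print(mean_sentence_kl(doc))
```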
A large range of model complexities were evaluated in order to ascertain the performance improvements, or lack thereof, of additional model complexity. In this section we will describe the specific construction and usage of these models for the experiments conducted in this paper, ordered roughly by model complexity.
SVMs, Linear Models, and Logistic Regression. From the Scikit-Learn library, we also utilized the linear support vector classifier (an SVM with a linear kernel) and logistic regression classifier. As simplicity was the aim for these evaluations, no hyperparameter optimization was performed. The logistic regression classifier was trained using the stochastic average gradient descent (“sag”) optimizer.
The CNN models were implemented using the Keras (Chollet and others, 2015), TensorFlow (Abadi et al., 2015), and Magpie libraries.
The transformer (Vaswani et al., 2017) is a neural-network-based model that has achieved state-of-the-art results on a wide array of natural language tasks including readability assessment (Martinc et al., 2019). Transformers utilize the mechanism of attention which allows the model to attend to specific parts of the input when constructing the output. Although they are formulated as sequence-to-sequence models, they can be modified to complete a variety of NLP tasks by placing an additional linear layer at the end of the network and training that layer to produce the desired output. This approach often achieves state-of-the-art results when combined with pretraining. In this paper, we use the BERT (Devlin et al., 2019) transformer-based model that is pretrained on BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia. The model is then fine-tuned on a specific readability corpus such as WeeBit. The pretrained BERT model is sourced from the Huggingface transformers library (Wolf et al., 2019) and is composed of 12 hidden layers each of size 768 and 12 self-attention heads. The fine-tuning step utilizes an implementation by Martinc et al. (2019). Among the pretrained transformers in the Huggingface library, there are transformers that can accept sequences of size 128, 256, and 512. The 128 sized model was chosen based on the finding by Martinc et al. (2019) that it achieved the highest performance on the WeeBit and Newsela corpora. Documents that exceeded the input sequence size were truncated.
The Hierarchical attention network involves feeding the input through two bidirectional RNNs each accompanied by a separate attention mechanism. One attention mechanism attends to the different words within each sentence while the second mechanism attends to the sentences within the document. These hierarchical attention mechanisms are thought to better mimic the structure of documents and consequently produce superior classification results. The implementation of the model used in this paper is identical to the original architecture described by Yang et al. (2016) and was provided by the authors of Martinc et al. (2019) based on code by Nguyen (2020).
3.4 Incorporating Linguistic Features with Neural Models
The neural network models thus far described take either the raw text or word vector embeddings of the text as input. They make no use of linguistic features such as those described in section 3.2. We hypothesized that combining these linguistic features with the deep neural models may improve their performance on readability assessment. Although these models theoretically represent similar features to those prescribed by the linguistic features, we hypothesized that the amount of data and model complexity may be insufficient to capture them. This can be evidenced in certain models’ failure to generalize across readability corpora. Martinc et al. (2019) found that the BERT model performed well on the WeeBit corpus, achieving a weighted F1 score of 0.8401, but performed poorly on the Newsela corpus only achieving an F1 score of 0.5759. They posit that this disparity occurred “because BERT is pretrained as a language model, [therefore] it tends to rely more on semantic than structural differences during the classification phase and therefore performs better on problems with distinct semantic differences between readability classes”. Similarly, a HAN was able to achieve better performance than BERT on the Newsela corpus but performed substantially worse on the WeeBit corpus. Thus, under some evaluations the models have deficiencies and fail to generalize. Given these deficiencies, we hypothesized that the inductive bias provided by linguistic features may improve generalizability and overall model performance.
In order to weave together the linguistic features and neural models, we take the simple approach of using the single numerical output of a neural model as a feature itself, joined with linguistic features, and then fed into one of the simpler non-neural models such as SVMs. SVMs were chosen as the final classification model for their simplicity and frequent use in integrating numerical features. The output of the neural model could be any of the label approaches such as grade classes or age regressions described in section 3.1. While all these labeling approaches were tested for CNNs, insubstantial differences in final inferences led us to restrict intermediary results to simple classification for other model types.
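The fusion step amounts to prepending the neural model's single numerical output to each document's linguistic feature vector. The sketch below shows only that concatenation; the function name and the example values are hypothetical, and the resulting rows would then be passed to a downstream classifier such as an SVM.

```python
def fuse_features(neural_outputs, linguistic_features):
    """Join one neural-model output per document with its linguistic features.

    neural_outputs: one number per document (e.g. a predicted class or age).
    linguistic_features: one list of feature values per document.
    """
    if len(neural_outputs) != len(linguistic_features):
        raise ValueError("Expected one neural output per document.")
    return [
        [float(output)] + list(features)
        for output, features in zip(neural_outputs, linguistic_features)
    ]


# Made-up values: predicted classes from a neural model and two
# linguistic features per document.
neural_outputs = [2, 0]
linguistic_features = [[0.1, 4.2], [0.3, 3.9]]
fused = fuse_features(neural_outputs, linguistic_features)
print(fused)
```

Keeping the neural output as a single column means the downstream model can weigh it against each linguistic feature directly, which matches the simplicity the approach aims for.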
3.5 Training and Evaluation Details
All experiments involved 5-fold cross validation. All neural-network-based models were trained with the Adam optimizer (Kingma and Ba, 2015), with a separate learning rate for each of the CNN, HAN, and transformer. The HAN and CNN models were trained for 20 and 30 epochs, respectively. The transformer models were fine-tuned for 3 epochs.
All results are reported as either a weighted F1 or macro F1 score. To calculate weighted F1, first the F1 score is calculated for each class independently, as if each class were a case of binary classification. Then, these F1 scores are combined in a weighted mean in which each class is weighted by the number of samples in that class. Thus, the weighted F1 score treats each sample equally but prioritizes the most common classes. The macro F1 is similar to the weighted F1 score in that F1 scores are first calculated for each class independently. However, for the macro F1 score, the class F1 scores are combined in a mean without any weighting. Therefore, the macro F1 score treats each class equally but does not treat each sample equally, deprioritizing samples from large classes and prioritizing samples from small classes.
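The two averaging schemes can be made concrete with a small sketch. It computes per-class F1 in a one-vs-rest fashion and then combines the scores with and without support weighting. Restricting the classes to those that appear in the true labels is an assumption of this sketch, and the function names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """F1 for each class, treating that class as the positive label."""
    scores = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores weighted by the class sample counts."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(per_class_f1(y_true, y_pred))
print(macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred))
```

In this toy example the larger class has the higher F1, so the weighted score exceeds the macro score, illustrating how weighting favors common classes.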
In this section we report the experimental results of incorporating linguistic features into readability assessment models. The two corpora, WeeBit and Newsela, are analyzed individually and then compared. Our results demonstrate that, given sufficient data, linguistic features provide little to no benefit compared to independent deep learning models. While the corpus experiment results demonstrate a portion of the approaches tested, the full results are available in appendix B.
4.1 Newsela Experiments
For the Newsela corpus, while linguistic features were able to improve the performance of some models, the top performers did not utilize linguistic features. The results from the top performing models are presented in table 1.
|Features||Weighted F1|
|SVM with HAN and linguistic features||0.8014|
|SVM with HAN||0.7931|
|SVM with linguistic features and Flesch features||0.7694|
|SVM with transformer and linguistic features||0.7678|
|SVM with transformer, Flesch features, and linguistic features||0.7627|
|SVM with linguistic features||0.7582|
|SVM with CNN age regression and linguistic features||0.7281|
|SVM with CNN ordered classes regression and linguistic features||0.7231|
|SVM with transformer and Flesch features||0.7186|
While the HAN performance was not surpassed by models with linguistic features, the transformer models were. This improvement indicates that linguistic features capture readability information that transformers cannot capture or have insufficient data to learn. The outsize effect of adding the linguistic features to the transformer models, resulting in a weighted F1 score improvement of 0.22, may reveal what types of information they address. Martinc et al. (2019) hypothesize that a pretrained language model “tends to rely more on semantic than structural differences” indicating that these features are especially suited to providing non-semantic information such as syntactic qualities.
4.2 WeeBit Experiments
The WeeBit corpus was analyzed from two perspectives: the downsampled dataset and the full dataset. Raw results and model rankings were largely comparable between the two dataset sizes.
4.2.1 Downsampled WeeBit Experiments
As with the Newsela corpus, the downsampled WeeBit corpus demonstrates no gains from being analyzed with linguistic features. The best performing model, a transformer, did not utilize linguistic features. The results for some of the best performing models are shown in table 2.
|Features||Weighted F1|
|SVM with transformer, Flesch features, and linguistic features||0.8381|
|SVM with transformer and Flesch features||0.8359|
|SVM with transformer and linguistic features||0.8344|
|SVM with transformer||0.8343|
|Logistic regression classifier with word types, Flesch features, and linguistic features||0.8135|
|Logistic regression classifier with word types||0.7894|
|Logistic regression classifier with word types, word count, and Flesch features||0.7934|
|SVM with CNN classifier and linguistic features||0.7923|
|Logistic regression classifier with word types and word count||0.7908|
In contrast to the Newsela corpus, the word type models performed near the top on the WeeBit corpus, comparably to the transformer models. Word type models have no access to word order; thus, semantic and topic analysis form the core of what they can capture. Therefore, this result supports the hypothesis of Martinc et al. (2019) that the pretrained transformer is especially attentive to semantic content. This result also indicates that the word type features can provide a significant portion of the information needed for successful readability assessment.
The differing best performing model types between the two corpora are likely due to differing compositions. Unlike the Newsela corpus, the WeeBit corpus shows strong correlation between topic and difficulty. Extracting this topic and semantic content is thought to be a particular strength of the transformer (Martinc et al., 2019) leading to its improved results on this corpus.
4.2.2 Full WeeBit Experiments
All of the models were also tested on the full imbalanced WeeBit corpus, the top performing results of which are shown in table 3. Most performance figures increased modestly. However, these gains may not be seen if documents do not match the distribution of this imbalanced dataset. Additionally, the ranking of models between the downsampled and standard WeeBit corpora showed little change. Although the SVM with transformer and linguistic features performed better than the transformer alone, this difference is extremely small (0.0040 in weighted F1) and thus not likely to be statistically significant.
|Features||Weighted F1|
|SVM with transformer and linguistic features||0.8769|
|SVM with transformer and Flesch features||0.8746|
|SVM with transformer||0.8729|
|SVM with transformer, Flesch features, and linguistic features||0.8721|
4.3 Effects of Training Set Size
One hypothesis explaining the lack of effect of linguistic features is that models learn to extract those features given enough data. Thus, perhaps in more data-poor environments the linguistic features would prove more useful. To test this hypothesis, we evaluated two CNN-based models, one with linguistic features and one without, with various sized training subsets of the downsampled WeeBit corpus. The macro F1 at these various dataset sizes is shown in fig. 1. Across the trials at different training set sizes, the test set is held constant thereby isolating the impact of training set size.
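One way to set up such an experiment is sketched below: a test set is drawn once and held fixed, and training subsets of increasing size are taken from the remaining documents. The nesting of subsets (each smaller subset contained in the larger ones), the test fraction, the subset sizes, and the function names are all assumptions made for illustration, not details reported in the paper.

```python
import random


def split_fixed_test(n_docs, test_fraction=0.2, seed=0):
    """Shuffle document indices once and hold out a fixed test set."""
    rng = random.Random(seed)
    indices = list(range(n_docs))
    rng.shuffle(indices)
    n_test = int(n_docs * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train pool, test set)


def nested_training_subsets(train_pool, sizes):
    """Take prefixes of the shuffled training pool so subsets are nested."""
    return {n: train_pool[:n] for n in sizes if n <= len(train_pool)}


train_pool, test_set = split_fixed_test(n_docs=1000)
subsets = nested_training_subsets(train_pool, sizes=[50, 100, 200, 400, 800])
for size, subset in subsets.items():
    # A model would be trained on `subset` and always scored on `test_set`.
    print(size, len(subset), len(test_set))
```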
The hypothesis holds true for extremely small subsets of training data, those with fewer than 200 documents. Above this training set size, the addition of linguistic features results in insubstantial changes in performance. Thus, either the patterns exposed by the linguistic features are learnable with very little data or the patterns extracted by deep learning models differ significantly from the linguistic features. The latter appears more likely given that linguistic features are shown to improve performance for certain corpora (Newsela) and model types (transformers).
This result indicates that the use of linguistic features should be considered for small datasets. However, the dataset size at which those features lose utility is extremely small. Therefore, collecting additional data would often be more efficient than investing the time to incorporate linguistic features.
4.4 Effects of Linguistic Features
Overall, the failure of linguistic features to improve state-of-the-art deep learning models indicates that, given the available corpora, model complexity, and model structures, they do not add information over and beyond what the state-of-the-art models have already learned. However, in certain data-poor contexts, they can improve the performance of deep learning models. Similarly, with more diverse and more accurately and consistently labeled corpora, the linguistic features could prove more useful. It may be the case that the best performing models already achieve near the maximal possible performance on this corpus. The reason the maximal performance may be below a perfect score (an F1 score of 1) is disagreement and inconsistency in dataset labeling. Presumably the dataset was assessed by multiple labelers who may not have always agreed with one another or even with themselves. Thus, if either a new set of human labelers or the original labelers are tasked with labeling readability in this corpus, they may only achieve performance similar to the best performance seen in these experiments. Performing this human experiment would be a useful analysis of corpus validity and consistency. Similarly, a more diverse corpus (differing in length, topic, writing style, etc.) may prove more difficult for the models to label alone without additional training data; in this case, the linguistic features may prove more helpful in providing inductive bias.
Additionally, the lack of improvement from adding linguistic features indicates that deep learning models may already be representing those features. Future work could probe the models for different aspects of the linguistic features, thereby investigating what properties are most relevant for readability.
In this paper we explored the role of linguistic features in deep learning methods for readability assessment, and asked: can incorporating linguistic features improve state-of-the-art models? We constructed linguistic features focused on syntactic properties ignored by existing features. We incorporated these features into a variety of model types, both those commonly used in readability research and more modern deep learning methods. We evaluated these models on two distinct corpora that posed different challenges for readability assessment. Additional evaluations were performed with various training set sizes to explore the inductive bias provided by linguistic features. While linguistic features occasionally improved model performance, particularly at small training set sizes, these models did not achieve state-of-the-art performance.
Given that linguistic features did not generally improve deep learning models, these models may be already implicitly capturing the features that are useful for readability assessment. Thus, future work should investigate to what degree the models represent linguistic features, perhaps via probing methods.
Although this work supports omitting linguistic features from readability assessment, this conclusion is limited by the available corpora. Specifically, ambiguity in how the corpora were constructed limits our ability to measure label consistency and validity. The maximal possible performance may therefore already be achieved by state-of-the-art models. Future work should thus explore constructing and evaluating readability corpora with a rigorous, consistent methodology; such corpora may be assessed most effectively using linguistic features. For instance, label accuracy could be improved by averaging across multiple labelers.
Overall, linguistic features do not appear to be useful for readability assessment. While often used in traditional readability assessment models, these features generally fail to improve the performance of deep learning methods. Thus, this paper provides a starting point for understanding the qualities and abilities of deep learning models in comparison to linguistic features. Through this comparison, we can analyze what types of information these models are well-suited to learning.
- TensorFlow: large-scale machine learning on heterogeneous systems. Technical report External Links: Cited by: §3.3.
- The CELEX lexical database. Cited by: §2, §3.2.1.
- Keras. External Links: Cited by: §3.3.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Cited by: §3.3.
- A new readability yardstick. Journal of Applied Psychology 32 (3), pp. 221–233. External Links: Cited by: §2.
- Lexical tightness and text complexity. In Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility, Atlanta, Georgia, pp. 29–38. External Links: Cited by: §3.1.3.
- A practical guide to support vector classification. Technical report Cited by: §3.3.
- Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1746–1751. External Links: Cited by: §3.3.
- Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Technical report Defense Technical Information Center, Fort Belvoir, VA (en). External Links: Cited by: §3.2.1.
- Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Cited by: §3.5.
- Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - ACL ’03, Vol. 1, Sapporo, Japan, pp. 423–430 (en). External Links: Cited by: §3.2.1, §3.2.2.
- On Information and Sufficiency. The Annals of Mathematical Statistics 22 (1), pp. 79–86 (EN). External Links: Cited by: §3.2.2.
- Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods 44 (4), pp. 978–990. External Links: Cited by: Appendix A.
- Automatic analysis of syntactic complexity in second language writing. International Journal of Corpus Linguistics 15 (4), pp. 474–496 (English (US)). External Links: Cited by: §2, §3.2.1.
- Supervised and unsupervised neural approaches to text readability. Computing Research Repository arXiv:1503.06733. Note: version 2 External Links: Cited by: §1, §2, §3.1.1, §3.3, §3.3, §3.4, §4.1, §4.2.1, §4.2.1.
- Hierarchical Attention Networks for Document Classification. Note: original-date: 2019-01-31T18:56:40Z External Links: Cited by: §3.3.
- Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. External Links: Cited by: §3.3.
- Reading Level Assessment Using Support Vector Machines and Statistical Language Models. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, Stroudsburg, PA, USA, pp. 523–530. Note: event-place: Ann Arbor, Michigan External Links: Cited by: §1, §2, §2, §2.
- Analyzing Text Complexity and Text Simplification: Connecting Linguistics, Processing and Educational Applications. Dissertation, Universität Tübingen, (en). External Links: Cited by: §1, §3.2.1, §3.2.1.
- On Improving the Accuracy of Readability Classification using Insights from Second Language Acquisition. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, Montréal, Canada, pp. 163–173. External Links: Cited by: Appendix A, §1, §2, §2, §3.1.1, §3.2.1, footnote 3.
- Attention is all you need. In Advances in neural information processing systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Cited by: §3.3.
- MRC psycholinguistic database: Machine-usable dictionary, version 2.00. Behavior Research Methods, Instruments, & Computers 20 (1), pp. 6–10 (en). External Links: Cited by: §2, §3.2.1.
- HuggingFace’s transformers: state-of-the-art natural language processing. Technical report External Links: Cited by: §3.3.
- Text Readability Assessment for Second Language Learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, San Diego, CA, pp. 12–22. External Links: Cited by: §2, §3.1.1, §3.1.2.
- Hierarchical Attention Networks for Document Classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 1480–1489. External Links: Cited by: §3.3.
- Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. Computing Research Repository arXiv:1506.06724. External Links: Cited by: §3.3.
Appendix A Feature Definitions
For the following definitions, if a ratio is undefined (i.e., the denominator is zero), the result is treated as zero. Vajjala and Meurers (2012) define complex nominals to be: “a) nouns plus adjective, possessive, prepositional phrase, relative clause, participle or appositive, b) nominal clauses, c) gerunds and infinitives in subject positions.” Here “polysyllabic” means more than two syllables, and “long words” means words with seven or more characters. Descriptions of the norms of age of acquisition ratings can be found in Kuperman et al. (2012).
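The conventions stated above (undefined ratios treated as zero, and the thresholds for polysyllabic and long words) can be sketched as small helpers. This is an illustrative sketch only; the names `safe_ratio`, `is_polysyllabic`, and `is_long_word` are hypothetical, and syllable counting itself is assumed to be done elsewhere.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the ratio is undefined."""
    if denominator == 0:
        return 0.0
    return numerator / denominator


def is_polysyllabic(syllable_count):
    """A word is polysyllabic if it has more than two syllables."""
    return syllable_count > 2


def is_long_word(word):
    """A long word has seven or more characters."""
    return len(word) >= 7


# Example: "NPs per sentence" for a document with 12 NPs in 4 sentences,
# and for an empty document where the denominator is zero.
nps_per_sentence = safe_ratio(12, 4)
empty_document_value = safe_ratio(0, 0)
```

Treating undefined ratios as zero keeps every feature vector the same length, so documents with no t-units or no lexical words need no special handling downstream.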
|$PD_k(s)$||The parse deviation of sentence $s$ is the standard deviation of the distribution of the $k$ most probable parse log probabilities for $s$. If $s$ has fewer than $k$ valid parses, the distribution is taken from all the valid parses.|
|$PDM_k(s)$||The difference between the largest parse log probability and the mean of the log probabilities of the $k$ most probable parses for a sentence $s$. If $s$ has fewer than $k$ valid parses, the mean is taken over all the valid parses.|
|$POSD_{dev}(d)$||The standard deviation of the distribution of POS counts for document $d$.|
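|$POS_{div}(d)$||Let $P(s)$ be the distribution of POS counts for sentence $s$ in document $d$. Let $Q$ be the distribution of POS counts for document $d$. Let $\lvert d \rvert$ be the number of sentences in $d$. Then $POS_{div}(d) = \frac{1}{\lvert d \rvert} \sum_{s \in d} D_{KL}(P(s) \parallel Q)$.|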
|mean t-unit length||number of words / number of t-units|
|mean parse tree height per sentence||mean parse tree height / number of sentences|
|subtrees per sentence||number of subtrees / number of sentences|
|SBARs per sentence||number of SBARs / number of sentences|
|NPs per sentence||number of NPs / number of sentences|
|VPs per sentence||number of VPs / number of sentences|
|PPs per sentence||number of PPs / number of sentences|
|mean NP size||number of children of NPs / number of NPs|
|mean VP size||number of children of VPs / number of VPs|
|mean PP size||number of children of PPs / number of PPs|
|WHPs per sentence||number of wh-phrases / number of sentences|
|RRCs per sentence||number of reduced relative clauses / number of sentences|
|ConjPs per sentence||number of conjunction phrases / number of sentences|
|clauses per sentence||number of clauses / number of sentences|
|t-units per sentence||number of t-units / number of sentences|
|clauses per t-unit||number of clauses / number of t-units|
|complex t-unit ratio||number of t-units that contain a dependent clause / number of t-units|
|dependent clauses per clause||number of dependent clauses / number of clauses|
|dependent clauses per t-unit||number of dependent clauses / number of t-units|
|coordinate clauses per clause||number of coordinate clauses / number of clauses|
|coordinate clauses per t-unit||number of coordinate clauses / number of t-units|
|complex nominals per clause||number of complex nominals / number of clauses|
|complex nominals per t-unit||number of complex nominals / number of t-units|
|VPs per t-unit||number of VPs / number of t-units|
|nouns per word||number of nouns / number of words|
|proper nouns per word||number of proper nouns / number of words|
|pronouns per word||number of pronouns / number of words|
|conjunctions per word||number of conjunctions / number of words|
|adjectives per word||number of adjectives / number of words|
|verbs per word||number of verbs / number of words|
|adverbs per word||number of adverbs / number of words|
|modal verbs per word||number of modal verbs / number of words|
|prepositions per word||number of prepositions / number of words|
|interjections per word||number of interjections / number of words|
|personal pronouns per word||number of personal pronouns / number of words|
|wh-pronouns per word||number of wh-pronouns / number of words|
|lexical words per word||number of lexical words / number of words|
|function words per word||number of function words / number of words|
|determiners per word||number of determiners / number of words|
|VBs per word||number of base form verbs / number of words|
|VBDs per word||number of past tense verbs / number of words|
|VBGs per word||number of gerund or present participle verbs / number of words|
|VBNs per word||number of past participle verbs / number of words|
|VBPs per word||number of non-3rd person singular present verbs / number of words|
|VBZs per word||number of 3rd person singular present verbs / number of words|
|adverb variation||number of adverbs / number of lexical words|
|adjective variation||number of adjectives / number of lexical words|
|modifier variation||number of adjectives and adverbs / number of lexical words|
|noun variation||number of nouns / number of lexical words|
|verb variation-I||number of verbs / number of unique verbs|
|verb variation-II||number of verbs / number of lexical words|
|squared verb variation-I||$(\text{number of verbs})^2$ / number of unique verbs|
|corrected verb variation-I||number of verbs / $\sqrt{2 \times \text{number of unique verbs}}$|
|AoA Kuperman||Mean age of acquisition of words (Kuperman database)|
|AoA Kuperman lemmas||Mean age of acquisition of lemmas|
|AoA Bird lemmas||Mean age of acquisition of lemmas, Bird norm|
|AoA Bristol lemmas||Mean age of acquisition of lemmas, Bristol norm|
|AoA Cortese and Khanna lemmas||Mean age of acquisition of lemmas, Cortese and Khanna norm|
|MRC familiarity||Mean word familiarity rating|
|MRC concreteness||Mean word concreteness rating|
|MRC Imageability||Mean word imageability rating|
|MRC Colorado Meaningfulness||Mean word Colorado norms meaningfulness rating|
|MRC Paivio Meaningfulness||Mean word Paivio norms meaningfulness rating|
|MRC AoA||Mean age of acquisition of words (MRC database)|
|number of sentences||number of sentences|
|mean sentence length||number of words / number of sentences|
|number of characters||number of characters|
|number of syllables||number of syllables|
|Automated Readability Index|
|Coleman Liau Formula|
|FORCAST Readability Formula|
|LIX Readability Formula|
|type token ratio||number of word types / number of word tokens|
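|corrected type token ratio||number of word types / $\sqrt{2 \times \text{number of word tokens}}$|
|root type token ratio||number of word types / $\sqrt{\text{number of word tokens}}$|
|bilogarithmic type token ratio||$\log(\text{number of word types})$ / $\log(\text{number of word tokens})$|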
|measure of textual lexical diversity (MTLD)||see McCarthy and Jarvis, 2010|
|number of senses||total number of senses across all words / number of word tokens|
|hypernyms per word||number of hypernyms / number of word tokens|
|hyponyms per word||total number of hyponyms / number of word tokens|
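The parse-probability and POS-distribution features defined at the top of this appendix can be sketched in a few lines. This is a hedged illustration, not the implementation used in the experiments: all function names are hypothetical, the population standard deviation is assumed, the POS count deviation is assumed to be taken over per-tag counts, and the POS divergence is assumed to be the mean Kullback-Leibler divergence from each sentence's POS distribution to the document's. Parse log probabilities and POS tags are assumed to come from an external parser.

```python
import math
from collections import Counter


def _pstdev(values):
    """Population standard deviation (choice of population vs. sample is an assumption)."""
    values = list(values)
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def _top_k(log_probs, k):
    """The k largest parse log probabilities, or all of them if fewer than k exist."""
    return sorted(log_probs, reverse=True)[:k]


def parse_deviation(log_probs, k):
    """Standard deviation of the k most probable parse log probabilities."""
    return _pstdev(_top_k(log_probs, k))


def parse_deviation_from_mean(log_probs, k):
    """Largest parse log probability minus the mean of the top-k log probabilities."""
    top = _top_k(log_probs, k)
    if not top:
        return 0.0
    return top[0] - sum(top) / len(top)


def _distribution(tags):
    """Normalize POS tag counts into a probability distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


def pos_count_deviation(document):
    """Standard deviation of per-tag POS counts in a document (list of tag lists)."""
    counts = Counter(tag for sentence in document for tag in sentence)
    return _pstdev(counts.values())


def pos_divergence(document):
    """Mean KL divergence from each sentence's POS distribution to the document's."""
    sentences = [s for s in document if s]
    if not sentences:
        return 0.0
    q = _distribution(tag for s in sentences for tag in s)
    total = 0.0
    for sentence in sentences:
        p = _distribution(sentence)
        # Every tag in a sentence also occurs in the document, so q[tag] > 0.
        total += sum(pv * math.log(pv / q[tag]) for tag, pv in p.items())
    return total / len(sentences)


# Hypothetical inputs: three parse log probabilities and a two-sentence document.
example_log_probs = [-10.0, -12.0, -14.0]
example_document = [["NN"], ["VB"]]
```

Falling back to all valid parses when fewer than `k` exist follows the table's definitions; returning zero for a sentence with no valid parse is an additional assumption made here.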