1 Introduction
Text complexity is a concept inherently tied to the concept of readability. Readability, meanwhile, is commonly defined as "the sum total (including all the interactions) of all those elements within a given piece of printed material that affect the success a group of readers have with it" Dale1949TheReadability. That is, while readability is a function of both the text and a specific group of readers, text complexity is a function of the text alone, or of the text and a generalised group of readers. There are certainly differences in what makes text difficult to read for different readers; for instance, the difficulties a second language learner has might be very different from those of a reader with dyslexia or aphasia. By focusing on text complexity, we attempt to create a baseline model of complexity which can later be adapted to the needs of different reader groups.
Modern models of readability analysis often use classification algorithms such as SVMs Petersen2007NaturalEducation,Feng2010AAssessment,Falkenjack2013FeaturesText which give us an assessment of whether a text is easy-to-read or not. Such models can have very high accuracy; for instance, a model using 117 parameters from shallow, lexical, morphological and syntactic analyses achieves 98.9% accuracy Falkenjack2013FeaturesText. However, beyond the binary classification, these models do not tell us much about whether a given text is easier to read than another. In order to perform a more fine-grained prediction we normally need to train the models on a corpus of graded texts; for an overview of such methods see CollinsThompson2014ComputationalResearch.
There are also attempts to grade texts without an extensive corpus of graded texts Pitler2008RevisitingQuality,TanakaIshii2010SortingReadability. TanakaIshii2010SortingReadability present an approach which predicts relative difficulty by modelling pairwise comparisons between texts. A strength of this approach is that multiple corpora with text complexity annotated on different scales can be included in the same model. However, a downside is that modelling the full data set requires a number of pairwise comparisons that grows quadratically with the number of training examples.
Texts released by different publishers of materials for readers with varying reading proficiency are often measured on different scales. Some publishers simply use an "easy-to-read" label, others label material with an intended age group, and yet others use a qualitative scale of difficulty where materials are placed in ordered categories. All these scales, though containing varying numbers of categories, in some sense measure the same thing. Text complexity can thus be viewed as a shared latent variable underlying different measures of readability.
In this paper we propose a new statistical model, which we refer to as the MultiScale Probit, based on the well-established Ordered Probit model. The Ordered Probit, as well as the traditional binary Probit, assumes that the response variable is measured on a single ordinal scale, or is classified with a binary label. The MultiScale Probit is able to take sets of data where the response variable is measured on different scales and find the shared latent feature, even without a vignette or "Rosetta stone" translating between the scales.
This allows us to use multiple non-overlapping corpora with text complexity annotated on different scales and find the shared phenomenon of text complexity each annotation scale is based on. In other words, we can take a corpus with texts organised by target reader groups of different ages, a corpus with texts organised by degree of readability, and a corpus of easy-to-read texts, put them all in the same model, and estimate a shared model of text complexity.
In this study we apply this new model to both simulated and real data with promising results.
2 The MultiScale Probit model
In this section we will present our proposed model. For background we will start by introducing the original dichotomous Probit and the Ordered Probit before showing how this latter model can be generalised to multiple scales.
2.1 The Probit Model
The Probit model is a well-established statistical model for supervised statistical classification with some properties which make it especially suitable for Bayesian modelling McCulloch:00. The Probit model takes the following form for the i-th observation in the sample:

P(y_i = 1 | x_i) = Φ(α + x_i'β)   (1)

where Φ is the Cumulative Distribution Function (CDF) of the standard Normal distribution, y_i is the dependent variable, or label, α is the intercept, x_i is the vector of covariates, or features, on which y_i depends, and β is a vector of coefficients corresponding to x_i. In the text complexity case, y_i could be an indicator value indicating whether the text is easy-to-read or not, while x_i is a vector of text feature values. The CDF Φ can be replaced by any other CDF, for example the logistic CDF to get the Logit model, but the Probit model implies a characterisation where the underlying latent readability variable is normally distributed, which we will now explain in detail.
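The classification probability in (1) is straightforward to compute; the following is a minimal sketch (the function name `probit_prob` and all parameter values are our own illustrations, not the paper's implementation):

```python
import numpy as np
from scipy.stats import norm

def probit_prob(x, beta, alpha=0.0):
    """P(y = 1 | x) = Phi(alpha + x'beta) for a binary Probit model."""
    return norm.cdf(alpha + np.dot(x, beta))

# With a zero linear predictor the probability is exactly 0.5.
p = probit_prob(np.zeros(3), np.array([0.2, -0.1, 0.4]))
```

A larger linear predictor x'β pushes the probability towards 1, a smaller one towards 0, exactly as the CDF shape dictates.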
The Probit model can be interpreted as a latent variable model. A latent variable model assumes one or more unobserved, or latent, variables to be the drivers of the observed data. The following model is easily seen to be equivalent to the Probit model in (1):

y_i* = α + x_i'β + ε_i,   y_i = 1 if y_i* > 0 and y_i = 0 otherwise   (2)

where y_i* is the latent variable and ε_i ~ N(0, 1) is independent noise.
A slight reformulation of the model that is more suitable for our purposes is to not fix the threshold at 0 and model the intercept α, but rather to drop the intercept and model the threshold as τ = −α, giving the equivalent model

y_i* = x_i'β + ε_i,   y_i = 1 if y_i* > τ and y_i = 0 otherwise   (3)
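The equivalence between the threshold formulation in (3) and the Probit probability Φ(x'β − τ) can be checked by Monte Carlo simulation; a small sketch with made-up parameter values:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
beta = np.array([0.8, -0.5])
tau = 0.3                        # threshold (tau = -alpha)
x = np.array([1.0, 0.5])         # a single, fixed covariate vector

# Latent formulation: y* = x'beta + eps, y = 1 if y* > tau.
n = 200_000
y_star = x @ beta + rng.standard_normal(n)
p_simulated = np.mean(y_star > tau)

# Probit formulation: P(y = 1) = Phi(x'beta - tau).
p_analytic = norm.cdf(x @ beta - tau)
```

The two probabilities agree up to Monte Carlo error, illustrating that the latent-variable and CDF formulations describe the same model.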
The latent variable formulation allows us to view the Probit model as a linear regression on an unobserved, real-valued variable y* which underlies the assigned labels in the classification problem. The observed variable y_i is simply an indicator of whether y_i* is larger than the threshold τ; see the left part of Figure 1 for an illustration. This is particularly useful when different classes are defined by the degree of some linear property, as in the case of easy-to-read classification, where the underlying property is text complexity, which is now being indirectly modelled on an interval scale. Note also that if the relation between the features and the latent variable is expected to be nonlinear, we can always add polynomial or spline terms to the feature set friedman2001elements. The latent variable formulation gives an elegant interpretation of the Probit model, but also has very attractive computational properties. The main goal in Bayesian inference is the posterior distribution of β:
p(β | y, X) ∝ p(y | X, β) p(β)   (4)

where y = (y_1, …, y_n), X is the matrix of covariate vectors, p(y | X, β) is the likelihood function and p(β) is the prior distribution. This posterior distribution is mathematically intractable and the usual practice is to explore it by simulation. The latent variable formulation can here be used to obtain a very simple and effective so-called Gibbs sampling algorithm.
The Gibbs sampling algorithm is a type of Markov Chain Monte Carlo (MCMC) simulation used to sample from a multivariate probability distribution. The algorithm works iteratively by drawing a new value for each variable conditional on the most recent draws of all other variables. The trick here is to augment the observed data with the unobserved latent variable y*, and to sample from the joint posterior distribution of both the latent variable and β. Pseudo code for the algorithm is given in Algorithm 1; details are provided in the original article by Albert1993BayesianData.

2.2 The Ordered Probit model
The Ordered, or Ordinal, Probit model is an extension of the Probit model from a binary response to a response on the ordinal scale. The assumption of an ordinal response variable allows us to model more than two classes using a single latent variable by estimating a different threshold, τ_c, for each class. In essence, we model the probability of belonging to class c as

P(y_i = c) = Φ(τ_c − x_i'β) − Φ(τ_{c−1} − x_i'β)   (5)

where τ_c is the threshold for class c, and τ_0 = −∞ and τ_C = ∞.
The latent variable model in Equation (3) can be extended to C classes:

y_i* = x_i'β + ε_i,   y_i = c if τ_{c−1} < y_i* ≤ τ_c   (6)

In this case, the observed variable y_i is an indicator of which interval the latent variable falls within or, in other words, which ordinal class the observation belongs to. In the case C = 2, the Ordered Probit reduces to a regular binary Probit, which is illustrated in Figure 1.
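The class probabilities in (5) can be computed directly by padding the thresholds with ±∞; a small sketch (names and numbers are illustrative only):

```python
import numpy as np
from scipy.stats import norm

def ordered_probit_probs(x, beta, thresholds):
    """Class probabilities P(y = c) = Phi(tau_c - x'beta) - Phi(tau_{c-1} - x'beta),
    with tau_0 = -inf and tau_C = +inf supplied by padding."""
    tau = np.concatenate(([-np.inf], thresholds, [np.inf]))
    eta = np.dot(x, beta)
    return norm.cdf(tau[1:] - eta) - norm.cdf(tau[:-1] - eta)

probs = ordered_probit_probs(np.array([1.0, -0.5]), np.array([0.6, 0.4]),
                             thresholds=[-0.5, 0.5, 1.2])
```

Because the CDF terms telescope, the probabilities always sum to one regardless of the linear predictor.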
The joint posterior of β and τ in the Ordered Probit is given by

p(β, τ | y, X) ∝ p(y | X, β, τ) p(β) p(τ)   (7)

where p(τ) is the prior over the thresholds. As for the regular Probit, this posterior is largely intractable, though some point estimates can be approximated Cowles1996AcceleratingModels. Similar to the Gibbs sampling algorithm for the binary Probit in Algorithm 1, we can augment the data with the latent variable y* and explore the joint posterior of β, τ and y* by a three block Gibbs sampler. However, the full conditional posterior of τ is intractable, but can be sampled by adding a Metropolis-Hastings (MH) step to the Gibbs sampler. This was first presented in Albert1993BayesianData and later improved in Cowles1996AcceleratingModels by adding blocking and a Metropolis-Hastings-within-Gibbs step to handle the slow mixing of the original approach. A similar approach, again by Albert2001SequentialData, sidestepped the Metropolis-Hastings-within-Gibbs step of Cowles1996AcceleratingModels while retaining the same form of blocking. Here, we present an approach using Metropolis-Hastings-within-Gibbs based on the Cowles1996AcceleratingModels sampler. The Gibbs sampler is presented in Algorithm 2, with the necessary conditional posterior distributions obtained as follows.
For τ, the full conditional posterior is proportional to

p(τ | β, y) ∝ p(τ) ∏_i [Φ(τ_{y_i} − x_i'β) − Φ(τ_{y_i−1} − x_i'β)]   (8)

which is intractable due to the unknown proportionality constant. Cowles1996AcceleratingModels proposed a Metropolis-Hastings step using a truncated Normal (TN) proposal distribution with the values from the previous state in the chain as upper cutoff points.
For the latent response variable y_i*, the conditional posterior is a truncated Normal,

y_i* | β, τ, y_i ~ TN(x_i'β, 1) truncated to (τ_{y_i−1}, τ_{y_i}]   (9)

and the conditional posterior for β is that of a Gaussian linear regression with y* as response,

β | y*, X ~ N(m, B),  where B = (X'X + Σ_0^{-1})^{-1} and m = B(X'y* + Σ_0^{-1}μ_0)   (10)

with μ_0 and Σ_0 denoting the prior mean and covariance of β. That is, for each step in the chain, we draw β from the posterior of a linear regression model with the current values of the latent variable as response.
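As an illustration of the two tractable blocks of the sampler, the sketch below performs one data-augmentation draw of the latent responses and one conjugate draw of β, assuming a zero-mean spherical Normal prior; all names and values are hypothetical, and this is a simplified sketch rather than the paper's C++/R implementation:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)

def draw_latent(X, beta, y, thresholds):
    """Draw y*_i from N(x_i'beta, 1) truncated to (tau_{y_i - 1}, tau_{y_i}]."""
    tau = np.concatenate(([-np.inf], thresholds, [np.inf]))
    mu = X @ beta
    lo, hi = tau[y] - mu, tau[y + 1] - mu    # standardised bounds (sd = 1)
    return truncnorm.rvs(lo, hi, loc=mu, scale=1.0, size=mu.shape,
                         random_state=rng)

def draw_beta(X, y_star, prior_prec=1.0):
    """Conjugate update for beta under a N(0, prior_prec^{-1} I) prior."""
    V = np.linalg.inv(X.T @ X + prior_prec * np.eye(X.shape[1]))
    mean = V @ X.T @ y_star
    return rng.multivariate_normal(mean, V)

X = rng.standard_normal((50, 3))
y = rng.integers(0, 4, size=50)              # labels 0..3 on a 4-class scale
y_star = draw_latent(X, np.zeros(3), y, thresholds=[-1.0, 0.0, 1.0])
beta_draw = draw_beta(X, y_star)
```

Iterating these two draws, together with the MH step for the thresholds, yields the full three-block sampler.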
2.3 The MultiScale Probit
Our methodological contribution in this article stems from the observation that the latent variable formulation of the binary and Ordered Probit models opens up the possibility to learn about readability from multiple corpora that each use a different ordinal scale. The same underlying latent text complexity variable is assumed to drive all of the observed readability scores in the different corpora. To fix ideas, we can imagine a data set where 20 of the examples come from the binary easy/hard labelling in the left hand side of Figure 1 and the remaining examples come from the scale with three readability classes, easy/medium/hard, in the right part of Figure 1. Note that, for example, 'easy' may have a different meaning in the two scales, and this is something that we will learn from the data.
We propose an extension of the existing Probit framework, here referred to as the MultiScale Probit. Assume that a total of n examples are labelled on S different scales. Define a variable s_i, for i = 1, …, n, such that s_i = s means that response label y_i is measured on scale s. Also, let C_s denote the number of classes for scale s. Finally, define τ^(s) as the collection of thresholds for scale s. The MultiScale Probit is then defined as

P(y_i = c | s_i = s) = Φ(τ^(s)_c − x_i'β) − Φ(τ^(s)_{c−1} − x_i'β)   (11)

for c = 1, …, C_s, with τ^(s)_0 = −∞ and τ^(s)_{C_s} = ∞.
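A sketch of the defining computation: class probabilities under a shared β but scale-specific thresholds (the scale names and all numbers below are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

def multiscale_probs(x, beta, thresholds_by_scale, scale):
    """P(y = c | scale) under a shared beta and scale-specific thresholds."""
    tau = np.concatenate(([-np.inf], thresholds_by_scale[scale], [np.inf]))
    eta = np.dot(x, beta)
    return norm.cdf(tau[1:] - eta) - norm.cdf(tau[:-1] - eta)

# One shared beta; each scale carries its own threshold vector.
thresholds_by_scale = {
    "binary":  [0.0],               # 2 classes
    "ordinal": [-0.7, 0.2, 1.1],    # 4 classes
}
x, beta = np.array([0.3, -0.2]), np.array([1.0, 0.5])
p_bin = multiscale_probs(x, beta, thresholds_by_scale, "binary")
p_ord = multiscale_probs(x, beta, thresholds_by_scale, "ordinal")
```

The same text (same x'β) thus receives a coherent probability distribution on every scale, which is exactly what lets the scales be compared.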
The above formulation gives us the ability to fit a single latent variable with coefficients β to the observed data. It should now be clear why no intercept is included in the model. As β is shared among all response variables regardless of scale, an intercept coefficient would mean that some threshold would have to be locked down to 0. This would shift the threshold vectors for all other response variables with regard to that intercept, which seems counterintuitive; it would also mean that one response variable is treated differently from the others, making the model more complex.
The joint posterior for this Probit is very similar to the posterior for the Ordered Probit. As the different response variables are independent given β, the only difference is that we need to add a mapping from each observation to the set of thresholds corresponding to its scale, which we, as mentioned above, denote τ^(s):

p(β, τ^(1), …, τ^(S) | y, X) ∝ p(β) ∏_{s=1}^S p(τ^(s)) ∏_{i: s_i = s} P(y_i | x_i, β, τ^(s))   (12)

where S is the total number of scales. This posterior is as intractable as the posterior of the regular Ordered Probit, and again we apply the Gibbs sampling approach to simulate from the joint posterior. The full conditional posteriors for the three blocks are given below.
The full conditional posterior for τ^(s) is

p(τ^(s) | τ^(−s), β, y) ∝ p(τ^(s)) ∏_{i: s_i = s} [Φ(τ^(s)_{y_i} − x_i'β) − Φ(τ^(s)_{y_i−1} − x_i'β)]   (13)

where τ^(−s) contains all thresholds except those for scale s, and the product runs over all observations from scale s. This distribution is not of known form, and we sample from it with a Metropolis-Hastings-within-Gibbs step.
The conditional posterior for β only depends on y* and is thus the same as (10).
The full conditional for the latent variable is

y_i* | β, τ^(s_i), y_i ~ TN(x_i'β, 1) truncated to (τ^(s_i)_{y_i−1}, τ^(s_i)_{y_i}]   (14)
Turning these conditionals into a Gibbs sampler is very similar to the Ordered Probit case, and the sampler is illustrated in Algorithm 3.
Implementation
Our implementation of this Gibbs sampler was built using Armadillo Sanderson2016Armadillo:Algebra and wrapped in R RCoreTeam2018R:Computing using the Rcpp Eddelbuettel2011Rcpp:Integration and RcppArmadillo Eddelbuettel2014RcppArmadillo:Algebra libraries.
3 Evaluation procedure
3.1 Simulation setup
To evaluate the performance of the MultiScale Probit model in comparison to the binary and Ordered Probit models, we first inspect the model fit and prediction performance using simulated data. For this purpose we implemented Algorithm 4, which randomly generates data using the assumptions of the Probit model.
Using such simulated data we can examine how well the different versions of the model estimate the known β and τ for each simulated data set. By computing the root-mean-square error (RMSE) for each draw of β and τ, we can inspect the posterior distribution of these RMSEs and plot them to compare the performance of the established Probit models and our proposed MultiScale Probit. On real world text complexity data we evaluate the fit of a model by testing its predictive performance on a validation set.
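Algorithm 4 itself is not reproduced here, but data satisfying the (ordered) Probit assumptions can be generated along the following lines, together with the RMSE used for evaluation; the function names and parameter values are our own:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_probit_data(n, beta, thresholds):
    """Generate (X, y) under the ordered Probit assumptions:
    y* = X beta + eps with eps ~ N(0, 1), and y is the interval of y*
    among the (sorted) thresholds."""
    p = len(beta)
    X = rng.standard_normal((n, p))
    y_star = X @ beta + rng.standard_normal(n)
    y = np.searchsorted(thresholds, y_star)   # class index 0..C-1
    return X, y

def rmse(estimate, truth):
    return np.sqrt(np.mean((np.asarray(estimate) - np.asarray(truth)) ** 2))

X, y = simulate_probit_data(400, beta=np.array([0.5, -0.3, 0.8]),
                            thresholds=[-0.5, 0.5])
```

Repeating this generation with freshly drawn parameters gives the repeated data sets over which the RMSE distributions are compared.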
One problem with the text complexity data we have available is that, with regard to the established models, the data suffers from the p > n problem, i.e. the number of covariates is larger than the number of observations. The linear regression for the latent variable requires the solution of a system of equations which in the p > n case is underdetermined and thus has infinitely many solutions. The MultiScale Probit, being able to use all three corpora in a single model, does not suffer from this problem, which is one reason why we propose it.
We confront the problem by using a regularising prior on β. This so-called regularisation makes the regression problem tractable even in an underdetermined situation friedman2001elements. The prior variance, i.e. the degree of regularisation, can also be estimated in a hierarchical Bayesian approach, see Section 6.1.
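The effect of such a regularising Normal prior can be illustrated with the ridge-regression form of the posterior mean: the added prior precision term makes the normal equations solvable even when p > n. A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 48                 # p > n: plain least squares is underdetermined
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

lam = 1.0                     # regularisation strength (prior precision)
# Ridge solution (X'X + lam I)^{-1} X'y: the added lam*I makes the system
# full rank even though X'X is singular when p > n.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```

Without the lam*I term, `np.linalg.solve` would face a singular matrix here; with it, a unique (shrunken) solution always exists.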
Since the models are fit by simulating from the posterior distribution using MCMC, we can obtain the distribution of the evaluation metrics by computing them for each draw. This allows us to plot kernel density estimates of the posterior predictive distributions of some common summary statistics for classification and ranking. The kernel smoothing is only used to make the plots easier to interpret and does not impact the performance or the conclusions drawn.
We will show a few examples of in-sample performance, but as there is little difference between models in-sample, we will focus mainly on out-of-sample performance evaluated using a validation set.
Below, we present the summary statistics for which we compute the posterior predictive distributions.
3.2 Classification, the F-measure
Precision for a class c is the ratio of instances correctly classified as c to all instances classified as c:

Precision_c = |{i : ŷ_i = c and y_i = c}| / |{i : ŷ_i = c}|   (15)

Recall for the class c is the ratio of instances correctly classified as c to all instances of c in the data:

Recall_c = |{i : ŷ_i = c and y_i = c}| / |{i : y_i = c}|   (16)

The F1 measure VanRijsbergen1974FoundationEvaluation is a well-established evaluation metric for classification algorithms, consisting of the harmonic mean of precision and recall:

F1_c = 2 · Precision_c · Recall_c / (Precision_c + Recall_c)   (17)

In our multi-class context, F1 scores are computed for each class.
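The per-class precision, recall and F-measure can be sketched as follows (a toy example with invented labels, not the evaluation code used in the paper):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, cls):
    """Per-class precision, recall and F1 from true and predicted labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pred_c = y_pred == cls
    true_c = y_true == cls
    tp = np.sum(pred_c & true_c)
    precision = tp / max(pred_c.sum(), 1)
    recall = tp / max(true_c.sum(), 1)
    f1 = 0.0 if precision + recall == 0 else \
        2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Class 1: three predicted, two of them correct, both true instances found.
p, r, f1 = precision_recall_f1([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], cls=1)
```

Here precision is 2/3, recall is 1, and the harmonic mean pulls F1 down to 0.8, below the arithmetic mean of the two.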
3.3 Ranking, the Kendall Rank Correlation
Since the latent variable gives a near total ordering of all data points, the MultiScale Probit can be viewed not only as a model for classification but also for ranking. The quality of this ranking can be assessed using the Kendall Rank Correlation Coefficient, or Kendall's τ Kendall1955RankEd. As the reference will not be a total order, we use the version called τ_b, which makes adjustments for ties. Kendall's τ takes values on the interval [−1, 1], where 1 indicates perfect correlation, 0 indicates no correlation and −1 indicates an inverse correlation.
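In practice, a tie-adjusted Kendall correlation is readily available; for instance, scipy's `kendalltau` computes the tau-b variant by default. A small sketch with invented labels and latent scores:

```python
from scipy.stats import kendalltau

# scipy's kendalltau computes the tau-b variant, which adjusts for ties.
reference = [1, 1, 2, 2, 3, 4]               # ordinal labels with ties
ranking   = [0.1, 0.3, 0.2, 0.5, 0.9, 1.4]   # latent-variable scores

tau_b, _ = kendalltau(reference, ranking)
perfect, _ = kendalltau([1, 2, 3], [10, 20, 30])
```

A mostly concordant ranking gives a τ_b close to (but below) one, while a fully concordant one gives exactly one.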
3.4 A combined evaluation metric
We also compute the harmonic mean of the F1 score and Kendall's τ_b. The harmonic mean tends to put more weight on small outliers and less weight on large ones. The purpose of this measure is to mitigate any impact from trade-offs between single performance metrics.
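The combined metric is a plain harmonic mean of the two scores; a minimal sketch (the guard for non-positive inputs is our own simplification):

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores; dominated by the smaller value."""
    if a <= 0 or b <= 0:
        return 0.0
    return 2 * a * b / (a + b)

combined = harmonic_mean(0.5, 1.0)   # penalises the weaker of the two metrics
```

Note how 0.5 and 1.0 combine to 2/3 rather than the arithmetic 0.75: a model cannot compensate for poor classification with good ranking, or vice versa.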
4 Simulation experiments
The MultiScale Probit model is explored on real data in the next section, in an application to text complexity analysis. In this section, we investigate the performance of the model and its associated Bayesian inference machinery on simulated data. The first experiment simulates data sets from the same underlying latent variable distribution but uses three different sets of thresholds τ, hence producing three data sets on different scales. We compare the performance of the traditional Probit and Ordered Probit on each data set with the performance of the MultiScale Probit applied to all data simultaneously. The second experiment repeats Experiment 1, but in the p > n situation. In the following, we will refer to the binary and Ordered Probit models as single-scale Probits to differentiate them from our new MultiScale Probit.
4.1 Experiment 1: Simulated data, n > p
A data set consisting of three different subsets is randomly generated using the definitions of the Probit and Ordered Probit models, with the same β vector for each subset but three different threshold vectors τ, using Algorithm 4. The parameters of these data are displayed in Table 1. Note that the parameters for simulating Set 2 and Set 3 are exactly the same, and we thus expect similar results for single-scale models applied to these.
Parameter  Value(s)

No. of repetitions  500
Total number of draws  250 000 (product of the number of repetitions and the number of stored draws)

Data
No. of covariates (p)  48
No. of data points  400 per covariate vector, with at least 1 instance per class label
Number of class labels per scale  

MCMC hyperparameters
Burn-in phase  50 000 steps
Thinning  1 step in 100 is stored
No. of stored draws  500
MH proposal standard deviations (one per data set)  1.0, 0.3, 0.3; all empirically chosen to get a mean acceptance rate close to 0.234 as suggested by Roberts1997WeakAlgorithms
Prior mean of β  (0, …, 0)
These data are then fed into our Gibbs sampler, using the parameters shown in Table 1. Four different simulations are performed: one binary Probit simulation for Set 1, two Ordered Probit simulations for Sets 2 and 3, and one MultiScale Probit simulation using all three data sets at once. This procedure is repeated 500 times, with new data generated each time.
After running the Gibbs sampler on our 500 different data sets, we start by inspecting the distribution of the posterior root-mean-square error for β (RMSE_β) over all 250 000 draws made during the 500 repetitions. Plots for each data set are provided in Figure 2. In each subfigure, the posterior distribution of RMSE_β given a single-scale model, Probit or Ordered Probit, estimated using a single data set is compared to the posterior distribution of RMSE_β given all three data sets. The plots indicate that the error is smaller using the MultiScale Probit model. This is not surprising, as β is estimated using three times as much data in the MultiScale model compared to the single-scale models. Note that the MultiScale model distribution is exactly the same in all three subplots, but the scales of the graphs differ.
With regard to the root-mean-square error of τ (RMSE_τ), we expect a much more modest difference, as each τ is estimated with the same amount of data in both the single-scale and the MultiScale models. This prediction is borne out in Figure 3, where the very slight difference between the distributions can probably be explained as an effect of the slightly better estimate of β in the MultiScale model.
4.2 Experiment 2: Simulated data, p > n
The same experimental setup is used as in Experiment 1, but with parameters adapted to simulate a p > n situation. The parameters are chosen to resemble the conditions in the text complexity data. The full set of experimental parameters is listed in Table 2. In this setup, p = 48 and n = 40 per data set.
Parameter  Value(s)

No. of repetitions  500
Total number of draws  250 000 (product of the number of repetitions and the number of stored draws)

Data
No. of covariates (p)  48
No. of data points  40 per covariate vector, with at least 1 instance per class label
Number of class labels per scale  

MCMC hyperparameters
Burn-in phase  50 000 steps
Thinning  1 step in 100 is stored
No. of stored draws  500
MH proposal standard deviations (one per data set)  5.0, 1.9, 1.9; all empirically chosen to get a mean acceptance rate close to 0.234 as suggested by Roberts1997WeakAlgorithms
Prior mean of β  (0, …, 0)
We then start by inspecting the RMSE_β for all 250 000 draws and plot the distributions in Figure 4. The RMSE_β is much larger than in the n > p case, which is expected with only 1/10 of the amount of training data, but the MultiScale model still outperforms the single-scale models.
As the overlap is quite large, we would like to see whether the MultiScale model consistently outperforms the single-scale models on a majority of the simulated data sets. We can do this by computing the posterior mean RMSE_β for each model and each of the 500 simulated data sets, and then computing the ratio between the posterior means of the two models for each data set. The result is plotted in Figure 5, which indicates that the MultiScale model consistently outperforms the single-scale models.
Again, as we can see in Figure 6, there is no noticeable difference in RMSE_τ.
5 Application to text complexity analysis
In this section we illustrate the workings and predictive performance of the MultiScale Probit in an application to text complexity analysis or, as it is often referred to, readability analysis. Corpora relevant to text complexity analysis are usually organised by an approximate scale used by a specific publisher, such as a publisher of children's fiction with texts aimed at different age groups. In other cases, a corpus might consist of only easy-to-read (ETR) texts from a single source, such as an easy-to-read newspaper or news aggregator. These can be combined with a corpus containing similar texts but written for a more typical readership, such as a regular newspaper.
Data-driven modelling approaches have therefore been restricted to using a single corpus, corpora aggregated by lowest common denominator (e.g. easy-to-read vs regular text), or a manual relabelling with existing annotations as support. Our proposed MultiScale Probit model is an attempt to allow the use of all existing data, on potentially different scales, in a single model to learn about a single underlying latent readability factor.
It could be argued that the definition of text complexity varies somewhat between genres and domains, and for that reason we have decided to only include data from a single genre, fiction, in this experiment. However, see Section 6 for a proposed approach to integrating multiple domains in our model.
5.1 Feature set
We have used a subset of the 118 features covered in Falkenjack2013FeaturesText, discarding features with majority zero or constant values in any data set. We also removed features to ensure that the Pearson correlation between any pair of remaining features was below a fixed cutoff. This cutoff was selected as it provided a reasonable trade-off between the condition number of the data matrix and the number of included features (48). The included features and short descriptions of them are listed in Table 4.
5.2 Corpora
The text data comes from five different sources: three publishers of easy-to-read fiction with different text complexity labelling schemes, and two publishers of general fiction aimed at typical adult readers; see Table 3.
Publisher  Number of texts

Lättlästförlaget  14 easy-to-read
Legimus  11 aimed at 3–9 year olds, 7 aimed at 10–12 year olds, 5 aimed at 13–19 year olds
Hegas  3 very easy, 5 easy, 6 moderately easy
Norstedts  23 aimed at typical adult readers
Bonnier  129 aimed at typical adult readers
The data is organised into three sets: one binary set combining the ETR texts from Lättlästförlaget with a sample from Norstedts and Bonnier; one set combining the three levels of Legimus texts with a sample from Norstedts and Bonnier as a fourth, most complex, level; and one set following the same strategy with the Hegas texts. Each of these three sets represents a different scale of text complexity, two with 4 levels and one with 2 levels. We will refer to these three data sets as LL, Legimus and Hegas.
As in the case with simulated data, we want to estimate the performance of the models given different inputs. To evaluate the performance of the model and the estimation methodology, we generate 500 data sets by randomly splitting the data into training sets consisting of 2/3 of the data and validation sets containing the remaining 1/3.
5.3 Predictive performance
Each figure below contains three subfigures. Each subfigure contains either two distributions or a single distribution representing a comparison between two distributions. Where two distributions are plotted, one, coloured blue or yellow, represents the performance of a single-scale model, Probit or Ordered Probit, estimated using training data from a single corpus and evaluated using validation data from the same corpus. The other, coloured grey, represents the performance of the MultiScale Probit estimated using all three corpora but evaluated only on validation data from a single corpus. Where only a single distribution is plotted, the colours indicate which model performs better in that part of the distribution.
In Figure 7 we can see that, as with the simulated data, in-sample performance does not differ noticeably between the single-scale Probits and the MultiScale Probit. As discussed in Section 4, this is the expected behaviour, and we will not plot further in-sample performance for any metric.
Looking at out-of-sample classification performance in Figure 8, we see that the MultiScale Probit outperforms the single-scale models, albeit to a smaller extent than in the simulation experiments in Section 4. There is, however, large variability in F1 scores over the 500 generated test data sets, which makes it hard to accurately compare models based on Figure 8 alone.
Figure 9 instead depicts densities of the posterior mean differences between models, that is, the difference between the mean F1 scores for the models for each of the 500 training sets. This assesses whether one model consistently outperforms the other across all generated data sets. Figure 9 shows that the MultiScale model tends to outperform its single-scale counterpart on a majority of the data sets, in particular for the LL corpus, where it is better on 87% of the data sets.
Figures 10 and 11 show that the rankings from the MultiScale Probit clearly improve upon the rankings from the single-scale models. In particular, Figure 11 shows that the Kendall correlations from the MultiScale Probit are closer to one than those of the single-scale models in a clear majority of the 500 generated test data sets.
Figures 12 and 13 display the posterior distributions of the harmonic mean of the F1 score and the Kendall correlation.
Finally, we explore how the models perform with less training data by repeating the above experiments, but this time using only 1/3 of the data for training and evaluating on the remaining 2/3. The training-test split is again repeated 500 times. Figure 14 displays the posterior distributions of the harmonic mean of the F1 score and Kendall correlation for 500 different training sets using this setup, and Figure 15 shows the differences between the models for each data set. It is clear that the advantage of the MultiScale Probit increases with smaller training data sets.
There are two opposing factors determining the relative success of the MultiScale model: the advantage of pooling data over multiple corpora against the restriction of a single latent variable driving all corpora. In highly informative data sets, with many data points and low-dimensional feature sets, the benefits of data pooling may not outweigh the disadvantage of the single latent variable restriction, assuming the corpora do not fully satisfy the restriction.
5.4 Posterior analysis
In order to get the best possible posterior estimate, we ran 8 chains of the Gibbs sampler on all data from the three corpora and combined the resulting samples.
5.4.1 Posterior for β
As with all Bayesian regression models we can inspect the posterior distribution for each coefficient in order to reason about its influence on the latent variable. In this context, this is equivalent to reasoning about the influence of a linguistic feature on text complexity. In this case there are 48 covariates and we have selected a few illustrative examples.
The three covariates with the least posterior uncertainty are the frequency of relative/interrogative pronouns (pos_HP), for example vem (who), vad (what), and vilket (that/which), the ratio of words existing in any category of the SweVoc lexicon (ratioSweVocTotal), and the ratio of grammatical dependency relations where the dependent occurs after its head word (ratioRightDeps). The marginal posteriors for each of these are plotted in Figure 16. The frequency of relative/interrogative pronouns has a rather certain positive influence on text complexity; in this context, positive means that a higher frequency of relative/interrogative pronouns indicates a more complex text. The feature ratioSweVocTotal instead has a relatively strong negative influence on text complexity. That is, a larger proportion of words in the text belonging to a lexicon of common and "simple" words results in a lower text complexity value, that is, a less complex text.
These marginal posteriors can be contrasted with the three most uncertain marginal posteriors: the frequency of the infinitive object complement grammatical construct (dep_VO), the frequency of attitude adverbials (dep_MA), and the frequency of verbs with exactly 5 dependants (verbArity5). The marginal posteriors for each of these are plotted in Figure 17. These features are hence likely not informative about the complexity of the texts.
Briefly contrasting these results with previous research on the classification performance of linguistic features in the context of Support Vector Machines Falkenjack2013FeaturesText,Falkenjack2014ClassifyingParsing, we can see that our results are quite different. Falkenjack2014ClassifyingParsing found that the ratio of relative/interrogative pronouns performed barely better than chance on the task of classifying mixed-genre easy-to-read texts. The ratio of SweVoc words and the ratio of rightward dependencies were clearly better than chance but were not among the strongest predictors. It should be noted, however, that our feature set is a subset of the feature set used by Falkenjack2013FeaturesText and that we are comparing very different types of analyses using different data sets. Our results do agree with Falkenjack2013FeaturesText in that the rates of infinitive object complements and attitude adverbials are not particularly strong predictors of text complexity.
5.4.2 Posterior for τ
One of the strongest arguments for the MultiScale Probit model compared to single-scale Probit models is the ability to compare scales to each other. Figure 18 shows the marginal posteriors of the thresholds τ^(s) for all three scales, as estimated by the MultiScale Probit, plotted together.
We can see from the figure that the posterior modes of the highest threshold are similar for all three data sets. This is to be expected, as the texts constituting the most complex category for each scale all come from the corpus made up by combining the Norstedts and Bonnier corpora (see Table 3). This can be interpreted as creating a shared ceiling for the three scales. However, the thresholds are unevenly distributed on the parts of the scale estimated using different corpora. For instance, even though the Legimus and Hegas corpora each contain three categories (four when the Norstedts/Bonnier texts are added), the two scales appear more fine-grained on different intervals of the latent scale. This visualisation also illustrates how we can compute the probability distribution over, for example, the suitable Hegas category for a text from the LL corpus.
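The last step can be sketched as follows: under a Probit model, the probability of each category is the probability mass of the latent complexity value (plus standard normal noise) falling between consecutive thresholds. The latent value and threshold numbers below are placeholders for illustration, not estimates from the paper.

```python
import numpy as np
from scipy.stats import norm

def category_probabilities(z, thresholds):
    """P(category k | latent complexity z) under an ordered Probit.

    thresholds: sorted interior cut-points (gamma_1 < ... < gamma_{K-1});
    the outer bounds -inf and +inf are added implicitly.
    """
    cuts = np.concatenate(([-np.inf], thresholds, [np.inf]))
    cdf = norm.cdf(cuts - z)   # P(z + e < gamma_k) with e ~ N(0, 1)
    return np.diff(cdf)        # one probability per category

# Hypothetical latent value for an LL-corpus text and hypothetical
# Hegas thresholds (NOT the estimates reported in the paper):
probs = category_probabilities(z=0.8, thresholds=[0.0, 1.0, 2.0])
```

In the full model one would average these probabilities over posterior draws of the coefficients and thresholds rather than plug in point values.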
We can contrast this with each posterior estimated using separate-scale models, plotted in Figure 19. The posterior modes of the highest threshold for each scale no longer line up as well, i.e. there no longer seems to be a shared ceiling. The thresholds along the lower parts of the scale seem more evenly distributed, but we can no longer see that the scales are more fine-grained on different intervals, that the most complex Hegas category encompasses the two most complex Legimus categories, or that the least complex Legimus category is split between the two least complex Hegas categories.
6 Conclusion and Future Work
We have shown that the MultiScale Probit can be fitted to data with a shared latent variable measured on different scales, and that this new Probit outperforms the traditional binary Probit and the Ordered Probit in the majority of cases when data is sparse but multiple previously incompatible data sets are available. The model performs better than established Probit models with regard to both classification and ranking.
The multiscale assumption of a single latent variable driving all corpora imposes a restriction which has to be weighed against the advantage of pooling data. In situations where data is less scarce and predictive accuracy on specific scales is important, the MultiScale Probit might not measure up to a single-scale model. On the other hand, in the typical situation in practical work, where data are scarce and many features are used, the advantages of data pooling are obvious. In applications such as ours, where we explicitly model a generalisation of nominally equivalent scales, the slight averaging effect from pooling might even be viewed as an advantage. We also note that the assumption of a single latent readability factor makes the model highly interpretable, which is in itself a strong point for the proposed model.
All in all we find these results very promising. Below are some suggestions for issues for future research.
6.1 The problem
We fixed the prior precision of the regression coefficients for reasons of simplicity, but it is straightforward to treat the shrinkage parameter as an unknown parameter with a Gamma prior. The full conditional posterior of the shrinkage parameter then follows an inverse Gamma distribution, which is easy to sample from in a separate Gibbs update step.
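A minimal sketch of such an update step, assuming the coefficient prior is beta ~ N(0, lambda * I) with an inverse-Gamma prior on the variance lambda (equivalently, a Gamma prior on the precision 1/lambda); the hyperparameters a0 and b0 are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_shrinkage_variance(beta, a0=1.0, b0=1.0):
    """One Gibbs update for the prior variance lambda in beta ~ N(0, lambda * I).

    With an inverse-Gamma(a0, b0) prior on lambda, the full conditional is
    inverse-Gamma(a0 + p/2, b0 + beta'beta / 2); a draw is obtained as the
    reciprocal of a Gamma draw.
    """
    shape = a0 + 0.5 * len(beta)
    rate = b0 + 0.5 * float(np.dot(beta, beta))
    return 1.0 / rng.gamma(shape, 1.0 / rate)  # Generator.gamma takes a scale

lam = sample_shrinkage_variance(np.array([0.5, -0.2, 1.1]))
```

This draw would simply be interleaved with the existing updates for the coefficients, thresholds and latent variables in the Gibbs sampler.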
Another approach to the problem would be to use Bayesian variable selection in order to lower the number of covariates. George1993VariableSampling show how variable selection can be integrated into a Gibbs sampler for Bayesian linear regression. Since the update step for the regression coefficients in the MultiScale Probit is a simple linear regression update, it is straightforward to implement Bayesian variable selection and to sample a binary variable selection indicator for each feature jointly with the coefficients in the Gibbs sampler Smith1996NonparametricSelection.
6.2 The generality/specificity tradeoff
The version of the MultiScale Probit model presented here makes the assumption that the latent variable is exactly the same for each data set. However, it is not difficult to imagine ways to model scale-specific deviations from a mostly shared latent variable. For instance, scale-specific variable selection could be introduced into the model, where each coefficient of the latent variable is split into a shared and a scale-specific part. A prior would then be used to put as much of the effect as possible into the shared latent variable and only small deviations into the scale-specific parts. This can be combined with variable selection to learn whether a single latent variable is needed for each corpus; see Villani2012GeneralizedMixtures for a similar approach in a different context.
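As a sketch of how such a split could be formalised (the notation here is ours, not from the paper), the coefficient vector for scale $s$ could be written

$$\beta_s = \bar{\beta} + \delta_s, \qquad \delta_s \sim \mathrm{N}(0, \kappa^2 I),$$

where a small $\kappa^2$, or a spike-and-slab prior on the elements of $\delta_s$, shrinks the scale-specific deviations towards zero so that the shared part $\bar{\beta}$ carries most of the effect.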
6.3 Linguistic application
Our application to text complexity in Section 5 can certainly be extended by linguists in a number of interesting ways, and it will be interesting to see the model applied to other corpora or other situations with classification problems using data sets with different ordinal scales.
Appendix A Features
This appendix contains short descriptions of the features used in Section 5 as well as plots of the marginal posteriors for all coefficients in the model.
Feature descriptions
Feature  Description

ratioSweVocTotal  Total ratio of words from the SweVoc lexicon
ratioSweVocD  Ratio of words from the SweVoc D category (words for everyday use)
ratioSweVocH  Ratio of words from the SweVoc H category (other highly frequent words)

Part-of-Speech tag frequencies

pos_RG  Cardinal number
pos_HP  Interrogative/Relative Pronoun
pos_RO  Ordinal number
pos_MID  Minor delimiter
pos_HD  Interrogative/Relative Determiner
pos_KN  Conjunction
pos_HA  Interrogative/Relative Adverb
pos_PM  Proper Noun
pos_PS  Possessive
lexicalDensity  Ratio of nouns, verbs, adjectives and adverbs to all words

Dependency type tag frequencies

dep_VS  Infinitive subject complement
dep_VO  Infinitive object complement
dep_I.  Question mark
dep_RA  Place adverbial
dep_IF  Infinitive verb phrase minus infinitive marker
dep_MA  Attitude adverbial
dep_.F  Coordination at main clause level
dep_XX  Unclassifiable grammatical function
dep_IO  Indirect object
dep_IQ  Colon
dep_.A  Conjunctional adverbial
dep_IU  Exclamation mark
dep_AA  Other adverbial
dep_AG  Agent
dep_..  Coordinating conjunction
dep_CA  Contrastive adverbial
dep_FS  Dummy subject
dep_KA  Comparative adverbial
dep_XF  Fundament phrase
dep_FP  Free subjective predicative complement
dep_OA  Object adverbial
dep_TA  Time adverbial
dep_HD  Head
dep_DB  Doubled function
dep_SP  Subjective predicative complement
dep_OP  Object predicative
dep_OO  Direct object
dep_PL  Verb particle

Dependency structure features

ratioRightDeps  The ratio of dependency relations where the head word occurs after the dependent
verbArity0  The frequency of verbs with no dependents
verbArity1  The frequency of verbs with 1 dependent
verbArity2  The frequency of verbs with 2 dependents
verbArity3  The frequency of verbs with 3 dependents
verbArity5  The frequency of verbs with 5 dependents
verbArity6  The frequency of verbs with 6 dependents
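To make some of these features concrete, the following sketch computes lexicalDensity, ratioRightDeps, and the verbArity counts from a toy dependency-parsed sentence. The token representation (index, SUC-style part-of-speech tag, head index) is an assumption for illustration, not the format of the pipeline used in the paper.

```python
from collections import Counter

# Each token: (index, part-of-speech tag, head index); a head index of 0
# marks the root of the dependency tree.
tokens = [
    (1, "PN", 2),   # pronoun, depends on the verb
    (2, "VB", 0),   # finite verb, root
    (3, "NN", 2),   # noun, depends on the verb
]

# lexicalDensity: ratio of nouns, verbs, adjectives and adverbs to all words.
content_tags = {"NN", "VB", "JJ", "AB"}
lexical_density = sum(t[1] in content_tags for t in tokens) / len(tokens)

# ratioRightDeps: ratio of relations where the head occurs after the dependent.
deps = [t for t in tokens if t[2] != 0]
ratio_right_deps = sum(t[2] > t[0] for t in deps) / len(deps)

# verbArityK: frequency of verbs with exactly K dependents.
dependents = Counter(t[2] for t in deps)
verb_arity = Counter(dependents[t[0]] for t in tokens if t[1] == "VB")
```

In the paper's setting these counts would of course come from an automatic tagger and dependency parser rather than hand-written tuples.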