Modeling Text Complexity using a Multi-Scale Probit

11/12/2018 · Johan Falkenjack, et al. · Linköping University

We present a novel model for text complexity analysis which can be fitted to ordered categorical data measured on multiple scales, e.g. a corpus with binary responses mixed with a corpus with more than two ordered outcomes. The multiple scales are assumed to be driven by the same underlying latent variable describing the complexity of the text. We propose an easily implemented Gibbs sampler to sample from the posterior distribution by a direct extension of established data augmentation schemes. By being able to combine multiple corpora with different annotation schemes we can get around the common problem of having more text features than annotated documents, i.e. an example of the p>n problem. The predictive performance of the model is evaluated using both simulated and real world readability data with very promising results.


1 Introduction

Text complexity is a concept inherently tied to the concept of readability. Readability, meanwhile, is commonly defined as "the sum total (including all the interactions) of all those elements within a given piece of printed material that affect the success a group of readers have with it" Dale1949TheReadability. That is, while readability is a function of both the text and a specific group of readers, text complexity is a function only of the text, or of the text and a generalised group of readers. There are certainly differences in what makes text difficult to read for different readers; for instance, the difficulties a second language learner has might be very different from the difficulties of a reader with dyslexia or aphasia. By focusing on text complexity, we attempt to create a baseline model of complexity which can later be adapted to the needs of different reader groups.

Modern models of readability analysis often use classification algorithms such as SVM Petersen2007NaturalEducation,Feng2010AAssessment,Falkenjack2013FeaturesText, which give us an assessment of whether a text is easy-to-read or not. Such models can achieve very high accuracy; for instance, a model using 117 parameters from shallow, lexical, morphological and syntactic analyses achieves 98.9% accuracy Falkenjack2013FeaturesText. However, these models do not tell us much about whether a given text is easier to read than any other text, beyond the binary classification. In order to perform a more fine-grained prediction we normally need to train the models using a corpus of graded texts; for an overview of such methods see Collins-Thompson2014ComputationalResearch.

There are also attempts to grade texts without an extensive corpus of graded texts Pitler2008RevisitingQuality,Tanaka-Ishii2010SortingReadability. Tanaka-Ishii2010SortingReadability present an approach which predicts relative difficulty by modelling pair-wise comparisons between texts. Another strength of this approach is that multiple corpora with text complexity annotated using different scales can be included in the same model. However, a downside of this approach is the need to model all pairwise comparisons, which effectively squares the number of training examples required to model the full data set.

Texts released by different publishers of materials for readers with varying reading proficiency are often measured on different scales. Some publishers simply use an "easy-to-read" label, others label material with an intended age group, and others use a qualitative scale of difficulty where materials are placed in ordered categories. All these scales, though containing varying numbers of categories, in some sense measure the same thing. Text complexity can thus be viewed as a shared latent variable underlying different measures of readability.

In this paper we propose a new statistical model we refer to as the Multi-Scale Probit, based on the well established Ordered Probit model. The Ordered Probit, as well as the traditional binary Probit, assumes that the response variable is measured on a single ordinal scale, or is classified with a binary label. The Multi-Scale Probit is able to take sets of data where the response variable is measured on different scales and find the shared latent variable, even without a vignette or "Rosetta stone" translating between the scales.

This allows us to use multiple non-overlapping corpora with text complexity annotated on different scales and find the shared phenomenon of text complexity each annotation scale is based on. In other words, we can take a corpus with texts organised by target reader groups of different ages, a corpus with texts organised by degree of readability, and a corpus of easy-to-read texts, and put them all in the same model and estimate a shared model of text complexity.

In this study we apply this new model to both simulated and real data with promising results.

2 The Multi-Scale Probit model

In this section we present our proposed model. For background we start by introducing the original dichotomous Probit and the Ordered Probit before showing how this latter model can be generalised to multiple scales.

2.1 The Probit Model

The Probit model is a well established statistical model for supervised classification with properties that make it especially suitable for Bayesian modelling McCulloch:00. The Probit model takes the following form for the $i$th observation in the sample

$P(y_i = 1 \mid \mathbf{x}_i) = \Phi(\beta_0 + \mathbf{x}_i^\top \boldsymbol{\beta})$   (1)

where $\Phi$ is the Cumulative Distribution Function (CDF) of the standard Normal distribution, $y_i$ is the dependent variable, or label, $\beta_0$ is the intercept, $\mathbf{x}_i$ is the vector of covariates, or features, on which $y_i$ depends, and $\boldsymbol{\beta}$ is a vector of coefficients corresponding to $\mathbf{x}_i$. In the text complexity case, $y_i$ could be an indicator of whether the text is easy-to-read or not, while $\mathbf{x}_i$ is a vector of text feature values. The CDF $\Phi$ can be replaced by any other CDF, for example the logistic CDF to get the Logit model, but the Probit model implies a characterisation where the underlying latent readability variable is normally distributed, which we will now explain in detail.

The Probit model can be interpreted as a latent variable model. A latent variable model assumes one or more unobserved, or latent, variables to be the drivers of the observed data. The following model is easily seen to be equivalent to the Probit model in (1)

$z_i = \beta_0 + \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i, \qquad y_i = \mathbb{1}(z_i > 0), \qquad \epsilon_i \sim N(0, 1)$   (2)

where $z_i$ is the latent variable and $\epsilon_i$ is independent standard Normal noise.

A slight reformulation of the model that is more suitable for our purposes is to not fix the threshold at 0 and model the intercept $\beta_0$, but rather to drop the intercept and model the threshold as $\gamma = -\beta_0$, giving the equivalent model

$z_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i, \qquad y_i = \mathbb{1}(z_i > \gamma)$   (3)

The latent variable formulation allows us to view the Probit model as a linear regression on an unobserved, or latent, real valued variable $z$ which underlies the assigned labels in the classification problem. The observed variable $y_i$ is simply an indicator of whether $z_i$ is larger than the threshold $\gamma$; see the left part of Figure 1 for an illustration. This is particularly useful when different classes are defined by the degree of some linear property, as in the case of easy-to-read classification where the underlying property is text complexity, which is now indirectly modelled on an interval scale. Note also that if the relation between the features and the latent variable is expected to be nonlinear, we can always add polynomial or spline terms to the feature set friedman2001elements.

The latent variable formulation gives an elegant interpretation of the Probit model, but also has very attractive computational properties. The main goal in Bayesian inference is the posterior distribution of $\boldsymbol{\beta}$

$p(\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}) \, p(\boldsymbol{\beta})$   (4)

where $\mathbf{y} = (y_1, \ldots, y_n)^\top$, $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^\top$, $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta})$ is the likelihood function and $p(\boldsymbol{\beta})$ is the prior distribution. This posterior distribution is mathematically intractable and the usual practice is to explore it by simulation. The latent variable formulation can here be used to obtain a very simple and effective so-called Gibbs sampling algorithm.

The Gibbs sampling algorithm is a type of Markov Chain Monte Carlo (MCMC) simulation for sampling from a multivariate probability distribution. The algorithm works iteratively by drawing a new value for each variable conditional on the most recent draws of all other variables. The trick here is to augment the observed data with the unobserved latent variable $\mathbf{z} = (z_1, \ldots, z_n)^\top$, and to sample from the joint posterior distribution of both $\mathbf{z}$ and $\boldsymbol{\beta}$. Pseudo code for the algorithm is given in Algorithm 1; details are provided in the original article by Albert1993BayesianData.

Input: response labels $\mathbf{y}$, feature data $\mathbf{X}$, initial value $\boldsymbol{\beta}^{(0)}$, initial value $\mathbf{z}^{(0)}$, number of Gibbs iterations $N$.
for $j \leftarrow 1$ to $N$ do
       Draw $z_i \mid \boldsymbol{\beta}, y_i, \mathbf{x}_i$ for each observation from truncated normal (TN) distributions.
       Draw $\boldsymbol{\beta} \mid \mathbf{z}, \mathbf{X}$ using standard formulas for Bayesian Gaussian linear regression.
end for
Output: autocorrelated posterior draws $\boldsymbol{\beta}^{(1)}, \ldots, \boldsymbol{\beta}^{(N)}$ and $\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(N)}$.
Algorithm 1: The data augmented Gibbs sampler for the Probit model in (2) with an intercept.
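To make the algorithm concrete, the following is a minimal R sketch of Algorithm 1 (the paper's implementation is in C++ via Rcpp; all names here are ours). It assumes a $N(\boldsymbol{\mu}_0, \boldsymbol{\Omega}_0^{-1})$ prior on $\boldsymbol{\beta}$ and draws the truncated normals by the inverse-CDF method; an intercept can be included as a column of ones in X.

probit_gibbs <- function(y, X, N = 2000, mu0 = rep(0, ncol(X)),
                         Omega0 = diag(ncol(X))) {
  p <- ncol(X); n <- nrow(X)
  beta <- rep(0, p)                          # initial value beta^(0)
  draws <- matrix(NA_real_, N, p)
  for (j in 1:N) {
    ## Step 1: z_i | beta, y_i ~ N(x_i'beta, 1) truncated to (0, Inf)
    ## if y_i = 1 and to (-Inf, 0] if y_i = 0, via the inverse CDF.
    m <- as.vector(X %*% beta)
    lo <- ifelse(y == 1, pnorm(-m), 0)       # CDF value at the lower cut-off
    hi <- ifelse(y == 1, 1, pnorm(-m))       # CDF value at the upper cut-off
    z <- m + qnorm(lo + runif(n) * (hi - lo))
    ## Step 2: beta | z ~ N(mu_n, Omega_n^-1), the standard Bayesian
    ## linear regression posterior with unit noise variance.
    Omega_n <- crossprod(X) + Omega0
    mu_n <- solve(Omega_n, crossprod(X, z) + Omega0 %*% mu0)
    beta <- as.vector(mu_n + backsolve(chol(Omega_n), rnorm(p)))
    draws[j, ] <- beta
  }
  draws                                      # autocorrelated posterior draws
}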

2.2 The Ordered Probit model

The Ordered, or Ordinal, Probit model is an extension of the Probit model from a binary response to a response on the ordinal scale. The assumption of an ordinal response variable allows us to model more than two classes using a single latent variable by estimating different thresholds, $\gamma_c$, for each class. In essence, we model the probability of belonging to class $c$ as

$P(y_i = c \mid \mathbf{x}_i) = \Phi(\gamma_c - \mathbf{x}_i^\top \boldsymbol{\beta}) - \Phi(\gamma_{c-1} - \mathbf{x}_i^\top \boldsymbol{\beta})$   (5)

where $\gamma_c$ is the upper threshold for class $c$ and $\gamma_0 = -\infty < \gamma_1 < \cdots < \gamma_C = \infty$.

Figure 1: Latent variable representation of the Probit (left) and the Ordered Probit (right).

The latent variable model in Equation (3) can be extended to $C$ classes:

$z_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i, \qquad y_i = c \text{ if } \gamma_{c-1} < z_i \leq \gamma_c$   (6)

In this case, the observed variable $y_i$ is an indicator of which interval the latent variable $z_i$ falls within or, in other words, which ordinal class observation $i$ belongs to. In the case with $C = 2$, the Ordered Probit reduces to a regular binary Probit, which is illustrated in Figure 1.

The joint posterior of $\boldsymbol{\beta}$ and $\boldsymbol{\gamma} = (\gamma_1, \ldots, \gamma_{C-1})$ in the Ordered Probit is given by the following equation

$p(\boldsymbol{\beta}, \boldsymbol{\gamma} \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \boldsymbol{\gamma}) \, p(\boldsymbol{\beta}, \boldsymbol{\gamma})$   (7)

where $p(\boldsymbol{\beta}, \boldsymbol{\gamma})$ is the prior. As for the regular Probit, this posterior is largely intractable, though some point estimates can be approximated Cowles1996AcceleratingModels. Similar to the Gibbs sampling algorithm for the binary Probit in Algorithm 1, we can augment the data with the latent variable $\mathbf{z}$ and explore the joint posterior of $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$ and $\mathbf{z}$ by a three block Gibbs sampler. However, the full conditional posterior of $\boldsymbol{\gamma}$ is intractable, but it can be sampled by adding a Metropolis-Hastings (MH) step to the Gibbs sampler. This approach was first presented in Albert1993BayesianData and later improved in Cowles1996AcceleratingModels by adding blocking and a Metropolis-Hastings-within-Gibbs step to handle the slow mixing of the original approach. A similar approach, again by Albert2001SequentialData, sidestepped the Metropolis-Hastings-within-Gibbs step of Cowles1996AcceleratingModels while retaining the same form of blocking. Here, we present an approach using Metropolis-Hastings-within-Gibbs based on the Cowles1996AcceleratingModels sampler. The Gibbs sampler is presented in Algorithm 2, with the necessary conditional posterior distributions obtained as follows.

For $\boldsymbol{\gamma}$ the full conditional posterior is proportional to

$p(\boldsymbol{\gamma} \mid \boldsymbol{\beta}, \mathbf{y}) \propto p(\boldsymbol{\gamma}) \prod_{i=1}^{n} \left[ \Phi(\gamma_{y_i} - \mathbf{x}_i^\top \boldsymbol{\beta}) - \Phi(\gamma_{y_i - 1} - \mathbf{x}_i^\top \boldsymbol{\beta}) \right]$   (8)

which is intractable due to the unknown proportionality constant. Cowles1996AcceleratingModels proposed a Metropolis-Hastings step using a truncated Normal (TN) proposal distribution with the threshold values from the previous state in the chain as upper cut-off points.

For the latent response variable $z_i$, the conditional posterior is

$z_i \mid \boldsymbol{\beta}, \boldsymbol{\gamma}, y_i \sim TN_{(\gamma_{y_i - 1}, \gamma_{y_i}]}(\mathbf{x}_i^\top \boldsymbol{\beta}, 1)$   (9)

where $TN_{(a, b]}(\mu, 1)$ refers to the Normal distribution with mean $\mu = \mathbf{x}_i^\top \boldsymbol{\beta}$ and variance 1, truncated to the interval $(\gamma_{y_i - 1}, \gamma_{y_i}]$.
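Such truncated normal draws are simple to generate by the inverse-CDF method; a small R helper in the spirit of (9), with names of our own choosing, could look as follows.

rtnorm1 <- function(mean, a, b) {
  ## One draw from N(mean, 1) truncated to (a, b]; a = -Inf and b = Inf
  ## are allowed since pnorm(-Inf) = 0 and pnorm(Inf) = 1.
  mean + qnorm(runif(1, pnorm(a - mean), pnorm(b - mean)))
}
## e.g. z_i for an observation with label y_i = c:
## rtnorm1(sum(x_i * beta), gamma[c - 1], gamma[c])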

Lastly, given the proper conjugate prior $\boldsymbol{\beta} \sim N(\boldsymbol{\mu}_0, \boldsymbol{\Omega}_0^{-1})$, the conditional posterior of $\boldsymbol{\beta}$ is

$\boldsymbol{\beta} \mid \mathbf{z}, \mathbf{X} \sim N(\boldsymbol{\mu}_n, \boldsymbol{\Omega}_n^{-1}), \quad \boldsymbol{\Omega}_n = \mathbf{X}^\top \mathbf{X} + \boldsymbol{\Omega}_0, \quad \boldsymbol{\mu}_n = \boldsymbol{\Omega}_n^{-1}(\mathbf{X}^\top \mathbf{z} + \boldsymbol{\Omega}_0 \boldsymbol{\mu}_0)$   (10)

That is, for each step in the chain, we draw $\boldsymbol{\beta}$ from the posterior of a linear regression model with the current values of the latent variable as response.
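In R, this draw can be sketched as follows (a Cholesky-based sampler of our own naming, with the unit noise variance the Probit requires):

draw_beta <- function(z, X, mu0, Omega0) {
  Omega_n <- crossprod(X) + Omega0                    # X'X + Omega_0
  mu_n <- solve(Omega_n, crossprod(X, z) + Omega0 %*% mu0)
  ## If Omega_n = R'R (Cholesky), then mu_n + R^-1 e with e ~ N(0, I)
  ## has the required N(mu_n, Omega_n^-1) distribution.
  as.vector(mu_n + backsolve(chol(Omega_n), rnorm(ncol(X))))
}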

Input: response labels $\mathbf{y}$, feature data $\mathbf{X}$, initial values for $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$, the prior mean and precision of $\boldsymbol{\beta}$, $\boldsymbol{\mu}_0$ and $\boldsymbol{\Omega}_0$, a tuning parameter for the MH proposal distribution $\sigma_{MH}$, number of Gibbs iterations $N$.
for $j \leftarrow 1$ to $N$ do
       for $c \leftarrow 1$ to $C - 1$ do
             Simulate a proposal $\gamma_c^{\ast}$ from the truncated normal proposal distribution on the interval $(\gamma_{c-1}^{\ast}, \gamma_{c+1}^{(j-1)})$.
       end for
       Perform a Metropolis-Hastings accept/reject for $\boldsymbol{\gamma}^{\ast}$.
       for $i \leftarrow 1$ to $n$ do
             Simulate $z_i$ from the truncated normal distribution on the interval $(\gamma_{y_i - 1}, \gamma_{y_i}]$.
       end for
       Simulate $\boldsymbol{\beta}$ from the multivariate normal distribution in (10).
end for
Output: autocorrelated posterior draws $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$ and $\mathbf{z}$.
Algorithm 2: The Gibbs sampler in Cowles1996AcceleratingModels, adapted to the Ordered Probit model without intercept in (6).
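For intuition about the threshold update, the following R sketch implements the simpler original cut-point draw of Albert1993BayesianData, which conditions on the latent $\mathbf{z}$ and needs no MH step: given $\mathbf{z}$, each $\gamma_c$ is uniform between the largest $z_i$ in class $c$ and the smallest $z_i$ in class $c + 1$ (and within the neighbouring thresholds). Algorithm 2 replaces this with the Cowles MH step precisely because this update can mix slowly; names below are ours and a flat prior on $\boldsymbol{\gamma}$ is assumed.

draw_gamma_albert_chib <- function(z, y, gamma_old) {
  C <- length(gamma_old) + 1
  g <- gamma_old
  for (c in 1:(C - 1)) {
    lo <- max(z[y == c],     if (c > 1)     g[c - 1] else -Inf)
    hi <- min(z[y == c + 1], if (c < C - 1) g[c + 1] else  Inf)
    g[c] <- runif(1, lo, hi)   # uniform full conditional for gamma_c
  }
  g
}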

2.3 The Multi-Scale Probit

Our methodological contribution in this article stems from the observation that the latent variable formulation of the binary and Ordered Probit models opens up the possibility to learn about readability from multiple corpora that each use a different ordinal scale. The same underlying latent text complexity variable is assumed to drive all of the observed readability scores in the different corpora. To fix ideas, we can imagine a data set where 20 of the examples come from the binary easy/hard labelling in the left hand side of Figure 1 and the remaining examples come from the scale with three readability classes, easy/medium/hard, in the right part of Figure 1. Note that, for example, 'easy' may have a different meaning in the two scales; this is something we learn from the data.

We propose an extension of the existing Probit framework, here referred to as the Multi-Scale Probit. Assume that a total of $n$ examples are labelled on $S$ different scales. Define a variable $s_i \in \{1, \ldots, S\}$, for $i = 1, \ldots, n$, such that $s_i = s$ means that response label $y_i$ is measured on scale $s$. Also, let $C_s$ denote the number of classes for scale $s$. Finally, define $\boldsymbol{\gamma}^{(s)} = (\gamma_1^{(s)}, \ldots, \gamma_{C_s - 1}^{(s)})$ as the collection of thresholds for scale $s$. The Multi-Scale Probit is then defined as

$z_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i, \qquad y_i = c \text{ if } \gamma_{c-1}^{(s_i)} < z_i \leq \gamma_c^{(s_i)}$   (11)

for $i = 1, \ldots, n$.

The above formulation gives us the ability to fit a single latent variable $z$ with coefficients $\boldsymbol{\beta}$ to the observed data. It should now be clear why no intercept is included in the model. As $z$ is shared among all response variables regardless of scale, an intercept coefficient would mean that some threshold would have to be locked down to 0. This would then shift the threshold vectors $\boldsymbol{\gamma}^{(s)}$ for all other response variables with regard to that intercept, which seems counter-intuitive; it would also mean that one response variable is treated differently from the others, making the model more complex.

The joint posterior for this Probit is very similar to the posterior for the Ordered Probit. As the different $\boldsymbol{\gamma}^{(s)}$ are independent except through $\boldsymbol{\beta}$, the only difference is that we need to add a mapping from each $y_i$ to the set of thresholds corresponding to its scale, which we, as mentioned above, denote $\boldsymbol{\gamma}^{(s_i)}$:

$p(\boldsymbol{\beta}, \boldsymbol{\gamma}^{(1)}, \ldots, \boldsymbol{\gamma}^{(S)} \mid \mathbf{y}, \mathbf{X}) \propto p(\boldsymbol{\beta}) \prod_{s=1}^{S} p(\boldsymbol{\gamma}^{(s)}) \prod_{i : s_i = s} p(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}, \boldsymbol{\gamma}^{(s)})$   (12)

where $S$ is the total number of scales. This posterior is as intractable as the posterior of the regular Ordered Probit, and again we apply the Gibbs sampling approach to simulate from the joint posterior. The full conditional posteriors for the three blocks are given below.

The full conditional posterior for $\boldsymbol{\gamma}^{(s)}$ is

$p(\boldsymbol{\gamma}^{(s)} \mid \boldsymbol{\gamma}^{(-s)}, \boldsymbol{\beta}, \mathbf{y}) \propto p(\boldsymbol{\gamma}^{(s)}) \prod_{i : s_i = s} \left[ \Phi(\gamma_{y_i}^{(s)} - \mathbf{x}_i^\top \boldsymbol{\beta}) - \Phi(\gamma_{y_i - 1}^{(s)} - \mathbf{x}_i^\top \boldsymbol{\beta}) \right]$   (13)

where $\boldsymbol{\gamma}^{(-s)}$ contains all thresholds except those for scale $s$ (on which the conditional does not depend), and the product runs over all observations from scale $s$. This distribution is not of known form, and we sample it with a Metropolis-Hastings-within-Gibbs step.

The conditional posterior for $\boldsymbol{\beta}$ only depends on $\mathbf{z}$ and is thus the same as (10). The full conditional for the latent variable is

$z_i \mid \boldsymbol{\beta}, \boldsymbol{\gamma}^{(s_i)}, y_i \sim TN_{(\gamma_{y_i - 1}^{(s_i)}, \gamma_{y_i}^{(s_i)}]}(\mathbf{x}_i^\top \boldsymbol{\beta}, 1)$   (14)

Turning these conditionals into a Gibbs sampler is very similar to the Ordered Probit case and is illustrated in Algorithm 3.

Input: response labels $\mathbf{y}$, feature data $\mathbf{X}$, initial values for $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}^{(1)}, \ldots, \boldsymbol{\gamma}^{(S)}$, the prior mean and precision of $\boldsymbol{\beta}$, $\boldsymbol{\mu}_0$ and $\boldsymbol{\Omega}_0$, a vector of tuning parameters for the MH proposal distributions $\boldsymbol{\sigma}_{MH}$, number of Gibbs iterations $N$.
for $j \leftarrow 1$ to $N$ do
       for $s \leftarrow 1$ to $S$ do
             for $c \leftarrow 1$ to $C_s - 1$ do
                   Simulate a proposal $\gamma_c^{(s)\ast}$ from the truncated normal proposal distribution.
             end for
             Perform a Metropolis-Hastings accept/reject for $\boldsymbol{\gamma}^{(s)\ast}$.
       end for
       for $i \leftarrow 1$ to $n$ do
             Simulate $z_i$ from the truncated normal distribution in (14).
       end for
       Simulate $\boldsymbol{\beta}$ from the multivariate normal distribution in (10).
end for
Output: autocorrelated posterior draws $\boldsymbol{\beta}$, $\boldsymbol{\gamma}^{(1)}, \ldots, \boldsymbol{\gamma}^{(S)}$ and $\mathbf{z}$.
Algorithm 3: The Gibbs sampler for the Multi-Scale Probit model in (11).
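To show how the pieces fit together, here is a compact end-to-end R sketch of Algorithm 3. For brevity it substitutes the simpler Albert-Chib cut-point update for the paper's Metropolis-Hastings-within-Gibbs step (which mixes better), uses the ridge prior $\boldsymbol{\beta} \sim N(\mathbf{0}, \lambda^{-1}\mathbf{I})$ discussed in Section 3, and assumes every class on every scale is observed at least once; all names are ours, with y, X, s and C following the notation above.

multiscale_gibbs <- function(y, X, s, C, N = 2000, lambda = 1) {
  n <- nrow(X); p <- ncol(X); S <- length(C)
  beta <- rep(0, p)
  gamma <- lapply(C, function(Cs) seq(-1, 1, length.out = Cs - 1))
  z <- numeric(n)
  rtnorm <- function(m, a, b)                # vectorised TN draws as in (14)
    m + qnorm(runif(length(m), pnorm(a - m), pnorm(b - m)))
  draws <- vector("list", N)
  for (j in 1:N) {
    m <- as.vector(X %*% beta)
    for (sc in 1:S) {
      idx <- which(s == sc); ys <- y[idx]
      ## Latent z for scale sc from the truncated normals in (14).
      cuts <- c(-Inf, gamma[[sc]], Inf)
      z[idx] <- rtnorm(m[idx], cuts[ys], cuts[ys + 1])
      ## Thresholds for scale sc; simplified Albert-Chib update in place
      ## of the MH step used in Algorithm 3.
      zs <- z[idx]
      for (c in 1:(C[sc] - 1)) {
        lo <- max(zs[ys == c],     if (c > 1)         gamma[[sc]][c - 1] else -Inf)
        hi <- min(zs[ys == c + 1], if (c < C[sc] - 1) gamma[[sc]][c + 1] else  Inf)
        gamma[[sc]][c] <- runif(1, lo, hi)
      }
    }
    ## beta from the ridge posterior (10); well-defined even when p > n.
    Omega_n <- crossprod(X) + lambda * diag(p)
    mu_n <- solve(Omega_n, crossprod(X, z))
    beta <- as.vector(mu_n + backsolve(chol(Omega_n), rnorm(p)))
    draws[[j]] <- list(beta = beta, gamma = gamma)
  }
  draws
}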

Implementation

Our implementation of this Gibbs sampler was built using Armadillo Sanderson2016Armadillo:Algebra and wrapped in R RCoreTeam2018R:Computing using the Rcpp Eddelbuettel2011Rcpp:Integration and RcppArmadillo Eddelbuettel2014RcppArmadillo:Algebra libraries.

3 Evaluation procedure

3.1 Simulation setup

To evaluate the performance of the Multi-Scale Probit model in comparison to the binary and Ordered Probit models, we first inspect the model fit and prediction performance using simulated data. For this purpose we implemented Algorithm 4, which randomly generates data satisfying the assumptions of the Probit model.

Input: the number of different scales $S$, number of observations per scale $n$, number of covariates $p$, a vector of class label counts for each scale $(C_1, \ldots, C_S)$, smallest acceptable number of observations per class $m$.
Draw $\boldsymbol{\beta}$.
for $s \leftarrow 1$ to $S$ do
       Draw a training matrix $\mathbf{X}^{(s)}$ by drawing $\mathbf{x}_i$ $n$ times.
       repeat
             Draw $\boldsymbol{\gamma}^{(s)}$
             for $i \leftarrow 1$ to $n$ do
                   Draw $z_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \epsilon_i$
                   Compute $y_i$ by finding the interval corresponding to $z_i$ in $\boldsymbol{\gamma}^{(s)}$
             end for
       until $\mathbf{y}^{(s)}$ has at least $m$ instances of each class in $1, \ldots, C_s$;
end for
Output: covariate matrices $\mathbf{X}^{(s)}$, corresponding response vectors $\mathbf{y}^{(s)}$, latent variable vectors $\mathbf{z}^{(s)}$, $\boldsymbol{\beta}$, $\boldsymbol{\gamma}^{(1)}, \ldots, \boldsymbol{\gamma}^{(S)}$.
Algorithm 4: Algorithm to simulate data to test the models.
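A direct R transcription of Algorithm 4 might look as follows. The algorithm leaves the sampling distributions of $\boldsymbol{\beta}$, $\mathbf{x}_i$ and $\boldsymbol{\gamma}^{(s)}$ unspecified above, so the standard normal choices below are illustrative assumptions, as are all names.

simulate_scales <- function(S, n, p, C, m = 1) {
  beta <- rnorm(p)                            # shared coefficient vector
  sets <- vector("list", S)
  for (s in 1:S) {
    X <- matrix(rnorm(n * p), n, p)           # training matrix: n draws of x
    repeat {
      gamma <- sort(rnorm(C[s] - 1, sd = 2))  # ordered thresholds for scale s
      z <- as.vector(X %*% beta) + rnorm(n)   # latent variable
      y <- findInterval(z, gamma) + 1         # interval of z in gamma = class
      if (all(tabulate(y, C[s]) >= m)) break  # at least m instances per class
    }
    sets[[s]] <- list(X = X, y = y, z = z, gamma = gamma)
  }
  list(beta = beta, sets = sets)
}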

Using such simulated data we can examine how well the different versions of the model estimate the known $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}^{(s)}$ for each simulated data set. By computing the root-mean-square error (RMSE) for each draw of $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}^{(s)}$ we can inspect the posterior distribution of these RMSEs and plot them to compare the performance of the established Probit models and our proposed Multi-Scale Probit. On real world text complexity data we instead evaluate the fit of a model by testing its predictive performance on a validation set.

One problem with the text complexity data we have available is that, with regard to the established models, the data suffers from the $p > n$ problem, i.e. the number of covariates is larger than the number of observations. The linear regression for the latent variable requires the solution of a system of equations which in the $p > n$ case is under-determined and thus has infinitely many solutions. The Multi-Scale Probit, being able to use all three corpora in a single model, does not suffer from this problem, which is one reason why we propose it.

We confront the problem by using a regularising prior on $\boldsymbol{\beta}$. This so-called $L_2$ regularisation makes the regression problem tractable even in an under-determined situation friedman2001elements. The prior variance, i.e. the degree of regularisation, can also be estimated in a hierarchical Bayesian approach, see Section 6.1.

Since the models are fit by simulating from the posterior distribution using MCMC, we can obtain the distribution of the evaluation metrics by computing them for each draw. This allows us to plot kernel density estimates of the posterior predictive distributions of some common summary statistics for classification and ranking. The kernel smoothing is only used to make the plots easier to interpret and does not impact the performance or the conclusions drawn.

We will show a few examples of in-sample performance, but as there is little difference between models in in-sample performance we focus mainly on out-of-sample performance evaluated using a validation set.

Below, we present the summary statistics for which we compute the posterior predictive distributions.

3.2 Classification, the F-measure

Precision for a class is the ratio of instances correctly classified as to all instances classified as :

(15)

Recall for the class is the ratio of instances correctly classified as to all instances of in the data:

(16)

The F-measure VanRijsbergen1974FoundationEvaluation is a well established evaluation metric for classification algorithms consisting of the harmonic mean of

precision and recall. In our multi-class context scores are computed for each class.

(17)

3.3 Ranking, the Kendall Rank Correlation

Since the latent variable gives a near total ordering of all data points, the Multi-Scale Probit can be viewed as not only a model for classification but for ranking. The quality of this ranking can be assessed using the Kendall Rank Correlation Coefficient, or Kendall’s Kendall1955RankEd.. As the reference will not be a total order, we use a version called which makes adjustments for ties. Kendall’s takes values on the interval where indicates perfect correlation, indicates no correlation and indicates an inverse correlation.

3.4 A combined evaluation metric

We also compute the harmonic mean of the and

. The harmonic mean tends to put more weight on small outliers and less weight on large. The purpose of this measure is to mitigate any impact from trade-offs between single performance metrics.
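For concreteness, the three metrics can be computed per posterior draw with a few lines of R (our own implementations, written out for clarity):

f1_per_class <- function(truth, pred, classes = sort(unique(truth))) {
  sapply(classes, function(c) {
    tp   <- sum(pred == c & truth == c)
    prec <- tp / max(sum(pred == c), 1)    # (15); guard against 0/0
    rec  <- tp / max(sum(truth == c), 1)   # (16)
    if (prec + rec == 0) 0 else 2 * prec * rec / (prec + rec)  # (17)
  })
}

kendall_tau_b <- function(x, y) {          # O(n^2) pair scan, fine for small n
  n <- length(x); conc <- disc <- tx <- ty <- 0
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    dx <- sign(x[i] - x[j]); dy <- sign(y[i] - y[j])
    if      (dx == 0 && dy == 0) NULL      # tied in both rankings: skip
    else if (dx == 0)            tx <- tx + 1
    else if (dy == 0)            ty <- ty + 1
    else if (dx == dy)           conc <- conc + 1
    else                         disc <- disc + 1
  }
  (conc - disc) / sqrt((conc + disc + tx) * (conc + disc + ty))
}

harmonic_mean <- function(a, b) 2 * a * b / (a + b)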

4 Simulation experiments

The Multi-Scale Probit model is explored on real data in the next section in an application to text complexity analysis. In this section, we investigate the performance of the model and its associated Bayesian inference machinery on simulated data. The first experiment simulates data sets from the same underlying latent variable distribution but uses three different sets of thresholds $\boldsymbol{\gamma}^{(s)}$, hence producing three data sets on different scales. We compare the performance of the traditional Probit and Ordered Probit on each data set with the performance of the Multi-Scale Probit applied to all data simultaneously. The second experiment repeats Experiment 1, but for the $p > n$ situation. In the following, we refer to the binary and Ordered Probit models as single-scale Probits to differentiate them from our new Multi-Scale Probit.

4.1 Experiment 1: Simulated data, $n > p$

A data set consisting of three different subsets is randomly generated using the definitions of the Probit and Ordered Probit models, with the same $\boldsymbol{\beta}$ vector for each subset but three different threshold vectors $\boldsymbol{\gamma}^{(s)}$, using Algorithm 4. The parameters of these data are displayed in Table 1. Note that the parameters for simulating Set 2 and Set 3 are exactly the same, and we thus expect similar results for single-scale models applied to these.

Parameter: Value(s)
No. of repetitions: 500
Total number of draws: 250,000 (product of no. of repetitions and no. of stored draws)
Data
No. of covariates ($p$): 48
No. of data points: 400 per response vector, with at least 1 instance per class label
Number of class labels per scale:
MCMC hyper-parameters
Burn in phase: 50,000 steps
Thinning: 1 step in 100 is stored
No. of stored draws: 500
MH proposal tuning parameters (one per scale): 1.0, 0.3, 0.3, all empirically chosen to get a mean acceptance rate close to 0.234 as suggested by Roberts1997WeakAlgorithms
Prior mean $\boldsymbol{\mu}_0$: (0, …, 0)
Table 1: Experimental parameters for testing our model under $n > p$ conditions.

These data are then fed into our Gibbs sampler, using the parameters shown in Table 1. Four different simulations are performed: one binary Probit simulation for Set 1, two Ordered Probit simulations for Sets 2 and 3, and one Multi-Scale Probit simulation using all three data sets at once. This procedure is repeated 500 times using different values for $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}^{(s)}$.

After running the Gibbs sampler on our 500 different data sets, we start by inspecting the distribution of the posterior root-mean-square error for $\boldsymbol{\beta}$ (RMSE($\boldsymbol{\beta}$)) over all 250,000 draws made during the 500 repetitions. Plots for each data set are provided in Figure 2. In each sub-figure the posterior distribution of RMSE($\boldsymbol{\beta}$) from a single-scale model, Probit or Ordered Probit, estimated using a single data set is compared to the posterior distribution of RMSE($\boldsymbol{\beta}$) given all three data sets. The plots indicate that the error is smaller using the Multi-Scale Probit model. This is not surprising, as the value of $\boldsymbol{\beta}$ is estimated using three times as much data in the Multi-Scale model as in the single-scale models. Note that the Multi-Scale model distribution is exactly the same in all three sub-plots, but the scales of the graphs differ.

Figure 2: The posterior RMSE of $\boldsymbol{\beta}$ on the three scales (panels 1-3) for all 500 simulated data sets.

With regard to the root-mean-square error of the thresholds (RMSE($\boldsymbol{\gamma}$)) we expect a much more modest difference, as each $\boldsymbol{\gamma}^{(s)}$ is estimated with the same amount of data in both the single-scale and the Multi-Scale models. This prediction is borne out in Figure 3, where the very slight difference between the distributions can probably be explained as an effect of the slightly better estimate of $\boldsymbol{\beta}$ in the Multi-Scale model.

Figure 3: The posterior RMSE of $\boldsymbol{\gamma}$ on the three scales for all 500 simulated data sets.

4.2 Experiment 2: Simulated data, $p > n$

The same experimental set-up is used as in Experiment 1, but with parameters adapted to simulate a $p > n$ situation. The parameters are chosen to resemble the conditions in the text complexity data. The full set of experimental parameters is listed in Table 2. In this set-up $p = 48$ and $n = 40$ per scale.

Parameter: Value(s)
No. of repetitions: 500
Total number of draws: 250,000 (product of no. of repetitions and no. of stored draws)
Data
No. of covariates ($p$): 48
No. of data points: 40 per response vector, with at least 1 instance per class label
Number of class labels per scale:
MCMC hyper-parameters
Burn in phase: 50,000 steps
Thinning: 1 step in 100 is stored
No. of stored draws: 500
MH proposal tuning parameters (one per scale): 5.0, 1.9, 1.9, all empirically chosen to get a mean acceptance rate close to 0.234 as suggested by Roberts1997WeakAlgorithms
Prior mean $\boldsymbol{\mu}_0$: (0, …, 0)
Table 2: Experimental parameters for testing our model under $p > n$ conditions.

We then start by inspecting RMSE($\boldsymbol{\beta}$) for all 250,000 draws and plot the distributions in Figure 4. The RMSE is much larger than in the $n > p$ case, which is expected with only 1/10 of the amount of training data, but the Multi-Scale model still outperforms the single-scale models.

Figure 4: The posterior distributions of RMSE($\boldsymbol{\beta}$) on the three scales (panels 1-3) for all 500 simulated data sets.

As the overlap is quite large, we would like to see whether the Multi-Scale model consistently outperforms the single-scale models on a majority of the simulated data sets. We can do this by computing the posterior mean RMSE($\boldsymbol{\beta}$) for each model on each of the 500 simulated data sets, and then computing the ratio between the posterior means of the Multi-Scale and single-scale models for each data set. The result is plotted in Figure 5, which indicates that the Multi-Scale model consistently outperforms the single-scale models.

Figure 5: The posterior distributions of the mean RMSE($\boldsymbol{\beta}$) ratio between Multi-Scale and single-scale models on each of the three scales (panels 1-3) for all 500 simulated data sets.

Again, as we can see in Figure 6, there is no noticeable difference in RMSE($\boldsymbol{\gamma}$).

Figure 6: The posterior distributions of RMSE($\boldsymbol{\gamma}$) on the three scales for all 500 simulated data sets.

5 Application to text complexity analysis

In this section we illustrate the workings and predictive performance of the Multi-Scale Probit in an application to text complexity analysis or, as it is often referred to, readability analysis. Corpora relevant to text complexity analysis are usually organised by an approximate scale used by a specific publisher, such as a publisher of children's fiction with texts aimed at different age groups. In other cases, a corpus might consist of only easy-to-read (ETR) texts from a single source, such as an easy-to-read newspaper or news aggregator. These can be combined with a corpus containing similar texts but written for a more typical readership, such as a regular newspaper.

Data driven modelling approaches have therefore been restricted to using a single corpus, corpora aggregated by their lowest common denominator (e.g. easy-to-read vs regular text), or a manual re-labelling with existing annotations as support. Our proposed Multi-Scale Probit model is an attempt to allow all existing data on potentially different scales to be used in a single model to learn about a single underlying latent readability factor.

It could be argued that the definition of text complexity varies somewhat between genres and domains, and for that reason we have decided to only include data from a single genre, fiction, in this experiment. However, see Section 6 for a proposed approach to integrating multiple domains in our model.

5.1 Feature set

We have used a subset of the 118 features covered in Falkenjack2013FeaturesText, discarding features with majority zero or constant values in any data set. We also removed features to make sure that the Pearson correlation between any pair of remaining features was below a fixed cut-off. This cut-off was selected as it provided a reasonable trade-off between the condition number of the data matrix and the number of included features (48). The included features and short descriptions of these are listed in Table 4.
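The pairwise filtering step can be sketched in R as a greedy procedure (our own helper; the cut-off value 0.9 below is purely illustrative, not the value used in the experiments):

filter_correlated <- function(X, cutoff = 0.9) {
  keep <- seq_len(ncol(X))
  repeat {
    cm <- abs(cor(X[, keep, drop = FALSE])); diag(cm) <- 0
    worst <- which(cm == max(cm), arr.ind = TRUE)[1, ]   # most correlated pair
    if (cm[worst[1], worst[2]] < cutoff) break
    ## drop the member of the pair with the higher mean absolute correlation
    keep <- keep[-worst[which.max(rowMeans(cm)[worst])]]
  }
  keep   # indices of retained features
}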

5.2 Corpora

The text data come from five different sources: three publishers of easy-to-read fiction with different text complexity labelling schemes, and two publishers of general fiction aimed at typical adult readers; see Table 3.

Publisher Number of texts
Lättlästförlaget 14 easy-to-read
Legimus 11 aimed at 3-9 year olds, 7 aimed at 10-12 year olds, 5 aimed at 13-19 year olds
Hegas 3 very easy, 5 easy, 6 moderately easy
Norstedts 23 aimed at typical adult readers
Bonnier 129 aimed at typical adult readers
Table 3: The corpora used to evaluate the model with regards to text complexity.

The data is organised into three sets: one binary set combining the ETR texts from Lättlästförlaget with a sample from Norstedts and Bonnier; one set combining the three levels of Legimus texts with a sample from Norstedts and Bonnier as the fourth and most complex level; and one following the same strategy with the Hegas texts. Each of these three sets represents a different scale of text complexity, two with 4 levels and one with 2 levels. We refer to these three data sets as LL, Legimus and Hegas.

As in the case with simulated data, we want to estimate the performance of the models given different inputs. To evaluate the performance of the model and the estimation methodology we generate 500 data sets by randomly splitting the data into training sets consisting of 2/3 of the data and validation sets containing the remaining 1/3.

5.3 Predictive performance

Each figure below contains three sub-figures. Each sub-figure contains either two distributions or a single distribution representing a comparison between two distributions. In the case where two distributions are plotted, one distribution, coloured blue or yellow, represents the performance of a single-scale model, Probit or Ordinal Probit, estimated using training data from a single corpus and evaluated using validation data from the same corpus. The other distribution, coloured grey, represents the performance of the Multi-Scale Probit estimated using all three corpora but evaluated only using validation data from a single corpus. In the case where only a single distribution is plotted, the colours represent which model performs better in that part of the distribution.

Figure 7: The posterior distributions for in-sample $F_1$ scores on the text data (panels: LL, Legimus, Hegas), plotted per measurement scale.

In Figure 7 we can see that, as with the simulated data, in-sample performance does not differ noticeably between single-scale Probits and the Multi-Scale Probit. As discussed in Section 4, this is the expected behaviour, and we do not plot in-sample performance for any further metrics.

Figure 8: The posterior distributions for out-of-sample $F_1$ scores on the text data (panels: LL, Legimus, Hegas), plotted per measurement scale.

Looking at out-of-sample classification performance in Figure 8, we see that the Multi-Scale Probit outperforms the single-scale models, albeit to a smaller extent than in the simulation experiments in Section 4. There is, however, a large variability in $F_1$ scores over the 500 generated test data sets, which makes it hard to accurately compare models based only on Figure 8.

Figure 9 instead depicts densities of the posterior mean differences between models, that is, the difference between the mean $F_1$ scores of the models for each of the 500 training sets. This assesses whether one model consistently outperforms the other across all generated data sets. Figure 9 shows that the Multi-Scale model tends to outperform its single-scale counterpart on a majority of the data sets, in particular for the LL corpus where it is better on 87% of the data sets.

Figure 9: The posterior distributions for the difference in out-of-sample $F_1$ scores between single-scale and Multi-Scale models on the text data for the 500 different training sets (panels: LL, Legimus, Hegas), plotted per measurement scale.

Figure 10 and Figure 11 show that the rankings from the Multi-Scale Probit clearly improve upon the rankings from the single-scale models. In particular, Figure 11 shows that the Kendall correlations from the Multi-Scale Probit are closer to one than those of the single-scale models in a clear majority of the 500 generated test data sets.

Figure 10: The posterior distributions for out-of-sample Kendall correlations on the text data (panels: LL, Legimus, Hegas), plotted per measurement scale.
Figure 11: The posterior distributions for the difference in out-of-sample Kendall correlation between single-scale and Multi-Scale models on the text data for the 500 different training sets (panels: LL, Legimus, Hegas), plotted per measurement scale.

Figures 12 and 13 display the posterior distributions of the harmonic mean of $F_1$ scores and Kendall correlations.

Figure 12: The posterior distributions for the out-of-sample harmonic mean of $F_1$ and Kendall correlation on the text data (panels: LL, Legimus, Hegas), plotted per measurement scale.
Figure 13: The posterior distributions for the difference in out-of-sample harmonic mean of $F_1$ and Kendall correlation between single-scale and Multi-Scale models on the text data for the 500 different training sets (panels: LL, Legimus, Hegas), plotted per measurement scale.

Finally, we explore how the models perform with less training data by repeating the above experiments, but this time using only 1/3 of the data for training and evaluating on the remaining 2/3. The training-test split is again repeated 500 times. Figure 14 displays the posterior distributions of the harmonic mean of $F_1$ score and Kendall correlation for the 500 different training sets using this set-up, and Figure 15 shows the differences between the models for each data set. It is clear that the advantage of the Multi-Scale Probit increases with smaller training data sets.

There are two opposing factors that determine the relative success of the Multi-Scale model: the advantage of pooling the data over multiple corpora against the restriction to a single latent variable driving all corpora. In highly informative data sets with many data points and low-dimensional feature sets the benefits from data pooling may not outweigh the disadvantage of the single latent variable restriction, assuming that the corpora do not fully satisfy the restriction.

Figure 14: The posterior distributions for the out-of-sample harmonic mean of $F_1$ and Kendall correlation on the text data (panels: LL, Legimus, Hegas), plotted per measurement scale, using only 1/3 of the data for training.
Figure 15: The posterior distributions for the difference in out-of-sample harmonic mean of $F_1$ and Kendall correlation between single-scale and Multi-Scale models on the text data for the 500 different training sets (panels: LL, Legimus, Hegas), plotted per measurement scale, using only 1/3 of the data for training.

5.4 Posterior analysis

In order to get the best possible posterior estimate, we ran 8 chains of the Gibbs sampler on all data from the three corpora and combined the resulting samples.

5.4.1 Posterior for $\boldsymbol{\beta}$

As with all Bayesian regression models we can inspect the posterior distribution for each coefficient in order to reason about its influence on the latent variable. In this context, this is equivalent to reasoning about the influence of a linguistic feature on text complexity. In this case there are 48 covariates and we have selected a few illustrative examples.

The three covariates with the least uncertainty in the posterior are the frequency of relative/interrogative pronouns (pos_HP), for example vem (who), vad (what), and vilket (which); the ratio of words in any category of the SweVoc lexicon (ratioSweVocTotal); and the ratio of grammatical dependency relations where the dependent occurs after its head word (ratioRightDeps). The marginal posteriors for each of these are plotted in Figure 16.

Figure 16: The three marginal posteriors for coefficients in $\boldsymbol{\beta}$ with the least uncertainty: pos_HP, ratioSweVocTotal and ratioRightDeps.

The frequency of relative/interrogative pronouns has a rather certain positive influence on text complexity; in this context, positive means that a higher frequency of relative/interrogative pronouns indicates a more complex text. The feature ratioSweVocTotal instead has a relatively strong negative influence on text complexity. That is, a larger proportion of words belonging to a lexicon of common and "simple" words results in a lower text complexity value, i.e. a less complex text.

These marginal posteriors can be contrasted with the three most uncertain marginal posteriors. These are the frequency of the infinitive object complement grammatical construct (dep_VO), the frequency of attitude adverbials (dep_MA), and the frequency of verbs with exactly 5 dependents (verbArity5). The marginal posteriors for each of these are plotted in Figure 17. These features are hence likely not informative about the complexity of the texts.

Figure 17: The three marginal posteriors for coefficients in $\boldsymbol{\beta}$ with the most uncertainty: dep_VO, dep_MA and verbArity5.

Briefly contrasting these results with previous research on the classification performance of linguistic features in the context of Support Vector Machines Falkenjack2013FeaturesText, Falkenjack2014ClassifyingParsing, we can see that our results are quite different. Falkenjack2014ClassifyingParsing found that the ratio of relative/interrogative pronouns performed barely better than chance on the task of classifying mixed-genre easy-to-read texts. The ratio of SweVoc words and the ratio of rightward dependencies were clearly better than chance but were not among the strongest predictors. It should be noted, however, that our feature set is a subset of the feature set used by Falkenjack2013FeaturesText and that we are comparing very different types of analyses using different data sets. Our results do agree with Falkenjack2013FeaturesText regarding the rate of infinitive object complements and attitude adverbials not being particularly strong predictors of text complexity.

5.4.2 Posterior for $\boldsymbol{\gamma}$

One of the strongest arguments for the Multi-Scale Probit model compared to single-scale Probit models is the ability to compare scales to each other. In Figure 18 we can see the marginal posteriors of the thresholds $\boldsymbol{\gamma}^{(s)}$ for all three scales, as estimated by the Multi-Scale Probit, plotted together.

Figure 18: The marginal posterior distributions for all $\gamma$-values from the Multi-Scale Probit model.

We can see from the figure that the posterior modes of the highest threshold are similar for all three data sets. This is to be expected, as the texts constituting the most complex category for each scale all come from the corpus made up by combining the Norstedts and Bonnier corpora (see Table 3). This can be interpreted as creating a shared ceiling for the three scales. However, the thresholds are unevenly distributed on the parts of the scale estimated using different corpora: even though the Legimus and Hegas corpora each contain three categories (four when the Norstedts/Bonnier texts are added), the two scales appear fine grained on different intervals of the latent scale. This visualisation also illustrates how we can compute the probability distribution over, for example, the suitable Hegas category for a text from the LL corpus.
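Such a cross-scale translation is straightforward given the Gibbs output: average the Ordered Probit class probabilities (5) over the posterior draws. A sketch in R (names are ours; draws is assumed to be a list with one list(beta, gamma) element per retained draw, as produced by the sampler sketch in Section 2.3):

scale_category_probs <- function(x, s, draws) {
  probs <- sapply(draws, function(d) {
    cuts <- c(-Inf, d$gamma[[s]], Inf)
    m <- sum(x * d$beta)                       # draw-specific mean of z
    pnorm(cuts[-1] - m) - pnorm(cuts[-length(cuts)] - m)   # (5), per class
  })
  rowMeans(probs)    # posterior predictive probability of each category
}
## e.g. scale_category_probs(x_ll_text, s = 3, draws) for the Hegas scale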

We can contrast this with the posteriors estimated using separate single-scale models, which we plot in Figure 19. The posterior modes of the highest threshold for each scale no longer line up as well, i.e. there no longer seems to be a shared ceiling. The thresholds along the lower parts of the scale seem more evenly distributed, but we can no longer see that the scales are more fine grained on different intervals, that the most complex Hegas category encompasses the two most complex Legimus categories, or that the least complex Legimus category is split among the two least complex Hegas categories.

Figure 19: The marginal posterior distributions for all $\gamma$-values from separate single-scale Probit and Ordered Probit models.

6 Conclusion and Future Work

We have shown that the Multi-Scale Probit can be fitted to data with a shared latent variable measured on different scales and that this new Probit outperforms the traditional binary Probit and the Ordered Probit in the majority of cases when data is sparse but multiple previously incompatible data sets are available. The model performs better than established Probit models with regards to both classification and ranking.

The multi-scale assumption of a single latent variable driving all corpora imposes a restriction which will have to be weighed against the advantage of pooling data. In situations where data is less scarce and the predictive accuracy on specific scales is important the Multi-Scale Probit might not measure up to a single-scale model. On the other hand, in the typical situation in practical work when data are scarce and many features are used, the advantages of data pooling are obvious. In applications such as ours, where we are explicitly modeling a generalisation of nominally equivalent scales, the slight averaging effect from pooling might even be viewed as an advantage. We also note that the assumption of a single latent readability factor makes the model highly interpretable, which is in itself a strong point for the proposed model.

All in all we find these results very promising. Below are some suggestions for future research.

6.1 The $p > n$ problem

We fixed the prior precision of $\boldsymbol{\beta}$ for reasons of simplicity, but it is straightforward to treat the shrinkage parameter as an unknown parameter with a Gamma prior. The full conditional posterior of the shrinkage parameter then follows an inverse Gamma distribution, which is easy to sample from in a separate Gibbs update step.
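A minimal sketch of this extra Gibbs step in R, assuming the prior $\boldsymbol{\beta} \sim N(\mathbf{0}, \tau^2 \mathbf{I})$ with an inverse-Gamma(a0, b0) prior on the variance $\tau^2$ (the variance-scale counterpart of the prior mentioned above; names and hyper-parameters are ours):

draw_tau2 <- function(beta, a0 = 1, b0 = 1) {
  p <- length(beta)
  ## tau^2 | beta ~ Inverse-Gamma(a0 + p/2, b0 + sum(beta^2)/2),
  ## drawn here as the reciprocal of a Gamma variate.
  1 / rgamma(1, shape = a0 + p / 2, rate = b0 + sum(beta^2) / 2)
}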

Another approach to the $p > n$ problem would be to use Bayesian variable selection in order to lower the number of covariates. George1993VariableSampling indicate how variable selection can be integrated into a Gibbs sampler for Bayesian linear regression. Since the update step for $\boldsymbol{\beta}$ in the Multi-Scale Probit is a simple linear regression update, it is straightforward to implement Bayesian variable selection and to sample a binary variable selection indicator for each feature jointly with $\boldsymbol{\beta}$ in the Gibbs sampler Smith1996NonparametricSelection.

6.2 The generality/specificity trade-off

The version of the Multi-Scale Probit model presented here makes the assumption that the latent variable is exactly the same for each data set. However, it is not difficult to imagine ways to model scale specific deviations from a mostly shared latent variable. For instance, scale specific variable selection could be introduced into the model where each coefficient of the latent variable is split into a shared and a scale specific part. A prior would then be used to put as much of the effect as possible into the shared latent variable and only the small deviations into the scale specific parts. This can be combined with variable selection to learn if a single latent variable is needed for each corpus, see Villani2012GeneralizedMixtures for a similar approach in a different context.

6.3 Linguistic application

Our application to text complexity in Section 5 can certainly be extended by linguists in a number of interesting ways, and it will be interesting to see the model applied to other corpora or other situations with classification problems using data sets with different ordinal scales.

Appendix A Features

This appendix contains short descriptions of the features used in Section 5, as well as plots of the marginal posteriors for all coefficients in $\boldsymbol{\beta}$.

Feature descriptions

ratioSweVocTotal: Total ratio of words from the SweVoc lexicon
ratioSweVocD: Ratio of words from the SweVoc D category (words for everyday use)
ratioSweVocH: Ratio of words from the SweVoc H category (other highly frequent words)

Part-of-Speech tag frequencies
pos_RG: Cardinal number
pos_HP: Interrogative/Relative Pronoun
pos_RO: Ordinal number
pos_MID: Minor delimiter
pos_HD: Interrogative/Relative Determiner
pos_KN: Conjunction
pos_HA: Interrogative/Relative Adverb
pos_PM: Proper Noun
pos_PS: Possessive
lexicalDensity: Ratio of nouns, verbs, adjectives and adverbs to all words

Dependency type tag frequencies
dep_VS: Infinitive subject complement
dep_VO: Infinitive object complement
dep_I.: Question mark
dep_RA: Place adverbial
dep_IF: Infinitive verb phrase minus infinitive marker
dep_MA: Attitude adverbial
dep_.F: Coordination at main clause level
dep_XX: Unclassifiable grammatical function
dep_IO: Indirect object
dep_IQ: Colon
dep_.A: Conjunctional adverbial
dep_IU: Exclamation mark
dep_AA: Other adverbial
dep_AG: Agent
dep_..: Coordinating conjunction
dep_CA: Contrastive adverbial
dep_FS: Dummy subject
dep_KA: Comparative adverbial
dep_XF: Fundament phrase
dep_FP: Free subjective predicative complement
dep_OA: Object adverbial
dep_TA: Time adverbial
dep_HD: Head
dep_DB: Doubled function
dep_SP: Subjective predicative complement
dep_OP: Object predicative
dep_OO: Direct object
dep_PL: Verb particle

Dependency structure features
ratioRightDeps: The ratio of dependency relations where the head word occurs after the dependent
verbArity0: The frequency of verbs with no dependents
verbArity1: The frequency of verbs with 1 dependent
verbArity2: The frequency of verbs with 2 dependents
verbArity3: The frequency of verbs with 3 dependents
verbArity5: The frequency of verbs with 5 dependents
verbArity6: The frequency of verbs with 6 dependents

Table 4: The set of text based covariates.

Marginal posteriors for $\boldsymbol{\beta}$

Figure 20: Marginal posteriors for all 48 coefficients in $\boldsymbol{\beta}$, one panel per feature, ordered from the most uncertain (dep_VO) to the least uncertain (pos_HP).