Modeling Text Complexity using a Multi-Scale Probit

11/12/2018
by Johan Falkenjack, et al.

We present a novel model for text complexity analysis that can be fitted to ordered categorical data measured on multiple scales, e.g., a corpus with binary responses combined with a corpus with more than two ordered outcomes. The multiple scales are assumed to be driven by the same underlying latent variable describing the complexity of the text. We propose an easily implemented Gibbs sampler that draws from the posterior distribution through a direct extension of established data augmentation schemes. Because the model can combine multiple corpora with different annotation schemes, we can work around the common problem of having more text features than annotated documents, i.e., the p > n problem. The predictive performance of the model is evaluated on both simulated and real-world readability data, with very promising results.
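
To make the idea of several annotation scales sharing one latent complexity variable concrete, here is a minimal sketch of a data-augmentation Gibbs sampler in the spirit of the abstract. It is not the authors' implementation. It assumes a single text feature with a scalar coefficient `beta`, unit noise variance, a normal prior on `beta`, and cutpoints that are known and held fixed purely to keep the example short. The corpus names, cutpoint values, and function names are all hypothetical.

```python
import bisect
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Assumed, fixed cutpoints per annotation scale (illustrative values only).
CUTPOINTS = {
    "binary_corpus": [0.0],            # 2 ordered outcomes
    "three_level_corpus": [-0.5, 0.8], # 3 ordered outcomes
}

def sample_truncated_normal(mean, lower, upper, rng):
    """Draw from N(mean, 1) restricted to (lower, upper) via the inverse CDF."""
    p_lo = STD_NORMAL.cdf(lower - mean)
    p_hi = STD_NORMAL.cdf(upper - mean)
    u = rng.uniform(p_lo, p_hi)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return mean + STD_NORMAL.inv_cdf(u)

def simulate(n_per_corpus, true_beta, rng):
    """Generate (feature, label, scale) triples from one shared latent variable."""
    docs = []
    for scale, cuts in CUTPOINTS.items():
        for _ in range(n_per_corpus):
            x = rng.gauss(0.0, 1.0)
            z = true_beta * x + rng.gauss(0.0, 1.0)  # latent complexity
            y = bisect.bisect(cuts, z)               # number of cutpoints below z
            docs.append((x, y, scale))
    return docs

def gibbs(docs, n_iter, prior_var, rng):
    """Alternate between latent draws z | beta, y and the draw beta | z."""
    sxx = sum(x * x for x, _, _ in docs)
    post_var = 1.0 / (sxx + 1.0 / prior_var)
    beta, draws = 0.0, []
    for _ in range(n_iter):
        z = []
        for x, y, scale in docs:
            bounds = [-math.inf] + CUTPOINTS[scale] + [math.inf]
            # The observed label restricts z to the interval of its own scale.
            z.append(sample_truncated_normal(beta * x, bounds[y], bounds[y + 1], rng))
        post_mean = post_var * sum(d[0] * zi for d, zi in zip(docs, z))
        beta = rng.gauss(post_mean, math.sqrt(post_var))
        draws.append(beta)
    return draws

rng = random.Random(0)
docs = simulate(n_per_corpus=200, true_beta=1.5, rng=rng)
draws = gibbs(docs, n_iter=500, prior_var=10.0, rng=rng)
kept = draws[100:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(f"posterior mean of beta: {posterior_mean:.2f}")
```

Both corpora inform the same `beta` because every label, whatever its scale, only constrains the interval in which the shared latent variable must fall. A fuller treatment would use a coefficient vector and would also sample the cutpoints.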

research
06/11/2015

Sparse Partially Collapsed MCMC for Parallel Inference in Topic Models

Topic models, and more specifically the class of Latent Dirichlet Alloca...
research
04/13/2019

Pólygamma Data Augmentation to address Non-conjugacy in the Bayesian Estimation of Mixed Multinomial Logit Models

The standard Gibbs sampler of Mixed Multinomial Logit (MMNL) models invo...
research
08/30/2018

Modeling Empathy and Distress in Reaction to News Stories

Computational detection and understanding of empathy is an important fac...
research
02/12/2018

Augment and Reduce: Stochastic Inference for Large Categorical Distributions

Categorical distributions are ubiquitous in machine learning, e.g., in c...
research
10/13/2022

Probabilistic Missing Value Imputation for Mixed Categorical and Ordered Data

Many real-world datasets contain missing entries and mixed data types in...
research
09/17/2023

Gibbs Sampling using Anti-correlation Gaussian Data Augmentation, with Applications to L1-ball-type Models

L1-ball-type priors are a recent generalization of the spike-and-slab pr...
research
04/12/2016

Efficient Classification of Multi-Labelled Text Streams by Clashing

We present a method for the classification of multi-labelled text docume...
