A Minimum Message Length Criterion for Robust Linear Regression

02/09/2018 · Chi Kuen Wong, et al.

This paper applies the minimum message length principle to inference of linear regression models with Student-t errors. A new criterion for variable selection and parameter estimation in Student-t regression is proposed. By exploiting properties of the regression model, we derive a suitable non-informative proper uniform prior distribution for the regression coefficients that leads to a simple and easy-to-apply criterion. Our proposed criterion does not require specification of hyperparameters and is invariant under both full rank transformations of the design matrix and linear transformations of the outcomes. We compare the proposed criterion with several standard model selection criteria, such as the Akaike information criterion and the Bayesian information criterion, on simulations and real data with promising results.







1 Introduction

Consider a vector of n observations y = (y₁, …, yₙ)ᵀ generated by the linear regression model

y = β₀1ₙ + Xβ + ε,   (1)

where X is the n × p design matrix of explanatory variables with rows xᵢ ∈ ℝᵖ, β ∈ ℝᵖ is the vector of regression coefficients, β₀ is the intercept parameter, 1ₙ is a vector of 1s of length n, and ε = (ε₁, …, εₙ)ᵀ is a vector of random errors. In this paper, the random disturbances εᵢ are assumed to be independently and identically distributed as per a Student-t distribution with location zero, scale τ, and ν degrees of freedom. The Student-t linear regression model finds frequent application in modelling heavy-tailed data [1, 2]. Compared with the more popular Gaussian linear regression model, the Student-t linear regression model has been shown to have better performance in terms of reducing the negative effects caused by outliers [3, 4].

This paper investigates parameter estimation and variable selection for the Student-t linear regression model using the minimum message length (MML) principle [5] and is an extension of the MML model selection criterion described in [6]. The MML criterion is known to be closely related to Rissanen’s normalized maximum likelihood (NML) denoising approach [7] and shares a number of its attractive properties (e.g., consistency) [8]. These two criteria are virtually indistinguishable when the signal-to-noise ratio is moderate to large. In contrast, when the signal-to-noise ratio is small, the NML-based approaches are known to perform poorly; due to subtle differences between the complexity penalties of MML and NML, the MML approach is less prone to overfitting in these regimes.

While the MML criterion has been shown to perform well in a range of experiments using real and simulated data, it is limited to Gaussian linear regression models and is not invariant to translations of the targets y. In this manuscript, we generalize the MML approach to the Student-t linear regression model and derive a new model selection criterion that is invariant to linear transformations of both the explanatory variables and the targets. The latter invariance property ensures that the criterion does not depend on the particular choice of the measurement units in which the data are recorded.

2 Minimum Message Length

The minimum message length (MML) principle [5, 9, 10, 11] was introduced by C. S. Wallace and D. M. Boulton in the late 1960s. The underlying philosophy of the MML principle is that one can use data compression to learn structure from data. It is well known that if there is little structure (regularity) in the data, the probability of greatly compressing the data is vanishingly small (the no-hypercompression inequality [12]). This result also implies that models which attain high levels of compression are likely to have captured real attributes of the underlying data generating mechanism.

Imagine one wishes to encode the observed data into a message string (e.g., a binary string) that is uniquely decodable. In the MML framework, this message string is called an explanation and consists of two components, namely the assertion and the detail. The assertion encodes a statement describing a statistical model (i.e., model structure and parameters) and the detail encodes the data using the model specified in the assertion. Inference within the MML framework proceeds by finding the model that results in the greatest compression of the data; i.e., that has the shortest two-part message. While computing the exact MML codelength for a given problem is in general NP-hard [13], there exist a range of computationally tractable approximations for computing message lengths.

The most commonly used message length formula is the MML87 approximation [11]. Suppose we are given a vector of data y = (y₁, …, yₙ)ᵀ. Let p(y | θ) denote the set of probability density functions indexed by the d-dimensional parameter vector θ ∈ Θ, where Θ is the parameter space. The message assertion needs to name a specific member of the uncountable set Θ; however, from Shannon’s theory of information, codelengths can only be constructed over countable sets. The core of MML is to quantize the set Θ to a carefully chosen countable subset Θ* ⊂ Θ for which a code can be constructed. For each θⱼ ∈ Θ* there is an associated Voronoi cell Ωⱼ, called an uncertainty region, containing those distributions that are closest in some sense to θⱼ. In the derivation of the MML87 approximation formula, the coding probability of a member θⱼ ∈ Θ* is given by

q(θⱼ) = π(θⱼ) w(θⱼ),   (2)

where π(·) is the Bayesian prior density over the set Θ and w(θⱼ) is the volume of the Voronoi cell Ωⱼ. In the MML87 approximation, the volume is given by

w(θ) = (κ_d^d |J(θ)|)^(−1/2),   (3)


where |J(θ)| is the determinant of the Fisher information matrix and κ_d is the normalized second moment of an optimal d-dimensional quantizing lattice. The uncertainty region can be interpreted as a set of distributions that are considered indistinguishable on the basis of the observed data, and is related to the concept of distinguishable distributions in the MDL literature [14].


Recall that the MML message consists of two components, the assertion and the detail. From Shannon’s theory of information, the length of the assertion, which states the member θⱼ ∈ Θ* that will be used to compress the data, is given by −log q(θⱼ). The length of the detail, which encodes the data using the model named in the assertion, is given by −log p(y | θⱼ) + d/2 (see [5] for more details). Using the MML87 approximation, the codelength of data y and a model θ ∈ Θ is given by

I₈₇(y, θ) = −log π(θ) + (1/2) log |J(θ)| − log p(y | θ) + c(d),   (4)

where the constant c(d) = (d/2) log κ_d + d/2 can be approximated by ([5], pp. 257–258)

c(d) ≈ −(d/2) log(2π) + (1/2) log(dπ) − γ,   (5)

where γ ≈ 0.5772 is the Euler–Mascheroni constant (γ = −ψ(1), with ψ(·) the digamma function).
To estimate a model from data using MML, we find the estimate that minimizes the message length (4). The key idea behind the MML approach is to minimize the trade-off between the complexity of the model (measured by the assertion length) against the ability of the model to fit the observed data (measured by the detail length). Under suitable regularity conditions, the MML87 codelength is asymptotically equivalent to the well-known Bayesian information criterion [15], and the MML87 estimator is asymptotically equivalent to the maximum likelihood estimator.
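As a concrete illustration of the trade-off in (4), the sketch below computes the MML87 codelength for the toy problem of estimating a normal mean with known variance; the uniform prior range R, the noise level, and the data values are illustrative assumptions, not taken from the paper.

```python
import math

def mml87_codelength_normal_mean(y, mu, sigma=1.0, R=10.0):
    """MML87 codelength (nits) for a normal-mean model with known sigma,
    a uniform prior for mu on [-R, R], and d = 1 parameter; uses the exact
    one-dimensional lattice constant kappa_1 = 1/12 rather than (5)."""
    n = len(y)
    d = 1
    kappa_1 = 1.0 / 12.0
    prior = 1.0 / (2.0 * R)                  # uniform prior density
    fisher = n / sigma**2                    # Fisher information for mu
    # assertion: -log(prior) + (1/2) log J + (d/2) log kappa_d
    assertion = -math.log(prior) + 0.5 * math.log(fisher) + 0.5 * math.log(kappa_1)
    nll = sum(0.5 * math.log(2 * math.pi * sigma**2)
              + (yi - mu)**2 / (2 * sigma**2) for yi in y)
    detail = nll + d / 2.0                   # detail: -log p(y|theta) + d/2
    return assertion + detail

y = [0.9, 1.1, 1.0, 1.2, 0.8]
# the codelength is minimized near the sample mean (here exactly 1.0)
best = min((mml87_codelength_normal_mean(y, mu), mu)
           for mu in [0.0, 0.5, 1.0, 1.5, 2.0])
```

Only the detail term depends on μ here, so the minimizing value coincides with the maximum likelihood estimate; with richer models the assertion term also moves and creates the complexity trade-off described above.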

From (2), it is clear that MML uses the prior distribution π(·) to construct codewords for members of Θ* and is therefore a Bayesian procedure. The MML approach can be seen as performing maximum a posteriori (MAP) estimation over the quantized space Θ*, with the coding probability q(θ) playing the role of a discrete prior distribution and the codelength as a counterpart to the posterior probability. Every model is therefore given a posterior probability mass (i.e., a codelength), which allows for discrimination between models with different structures (e.g., non-nested models). A key difference between standard Bayesian estimators, such as the MAP and posterior mean estimators, and MML is that MML estimators are parameterization invariant; i.e., they provide the same inferences under one-to-one, continuous re-parameterizations of the parameter space.

2.1 A small sample message length approximation

The MML87 approximation relies on a set of regularity conditions on the likelihood function and the prior distribution [5]. An important assumption of MML87 is that the prior distribution is approximately uniform over the entire uncertainty region Ωⱼ. If this condition is violated, the MML87 approximation can break down and the coding probability of the assertion (2) can be greater than 1. One solution to this problem, as described in [5], is to use the small-sample message length approximation

I_ss(y, θ) = −log p(y | θ) + (1/2) log(1 + (π(θ) w(θ))⁻²) + d/2,   (6)

which smoothly bounds the coding probability of the assertion to be less than 1. The small-sample approximation retains all the attractive properties of the MML87 approximation while being robust to prior distributions that are highly peaked. The cost of introducing this additional robustness is that minimizing the small-sample codelength (6) is, in general, more difficult than minimizing the original MML87 codelength (4).
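The bounding behaviour can be pictured by replacing the raw coding probability q = π(θ)w(θ) with q/√(1 + q²), which agrees with q when q is small but can never reach 1; this is the general shape of the correction in (6), shown here as a minimal sketch.

```python
import math

def smooth_coding_prob(q):
    """Smoothly bounded coding probability: ~q for small q, always < 1.
    Equivalent to replacing -log q with (1/2) log(1 + q**-2) in (6)."""
    return q / math.sqrt(1.0 + q * q)

small = smooth_coding_prob(1e-4)   # essentially unchanged
large = smooth_coding_prob(50.0)   # clamped below 1
```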

3 MML Student-t linear regression

Given a vector of observations y ∈ ℝⁿ and a design matrix X_γ containing p predictors indexed by the model structure γ ∈ Γ, where Γ is a set of candidate models, the Student-t linear regression model in (1) can be written as:

yᵢ = β₀ + xᵢᵀβ + εᵢ,  i = 1, …, n,   (7)

where xᵢᵀ is the i-th row of the design matrix X_γ. In this paper we assume, without loss of generality, that the columns of the design matrix are standardized to have zero mean. In the remainder of this paper, the model structure index γ is omitted when clear from the context. Under the Student-t regression model, the probability density function of y is given by

p(y | β, β₀, τ, ν) = ∏ᵢ₌₁ⁿ Γ((ν+1)/2) / (Γ(ν/2) √(πντ)) · (1 + eᵢ²/(ντ))^(−(ν+1)/2),   (8)

where eᵢ = yᵢ − β₀ − xᵢᵀβ, τ > 0 is a scale parameter and ν > 0 is the degrees of freedom. The negative log-likelihood of y is

−log p(y | β, β₀, τ, ν) = (n/2) log(πντ) + n log[Γ(ν/2)/Γ((ν+1)/2)] + ((ν+1)/2) ∑ᵢ₌₁ⁿ log(1 + eᵢ²/(ντ)),   (9)

where e = (e₁, …, eₙ)ᵀ. In order to apply the MML87 approximation (4), we require the Fisher information matrix for this model. The Fisher information matrix for the Student-t regression model is block diagonal [4, 16],

J(β, β₀, τ) = diag( (ν+1)/((ν+3)τ) · XᵀX,  n(ν+1)/((ν+3)τ),  nν/(2(ν+3)τ²) ),

and the determinant of the Fisher information matrix is

|J(β, β₀, τ)| = ((ν+1)/(ν+3))^(p+1) (ν/(ν+3)) · n² |XᵀX| / (2τ^(p+3)).   (10)

Note that the determinant of the Fisher information matrix (10) for the Student-t regression model is equal to the determinant of the Fisher information for the Gaussian linear regression model,

|J_N(β, β₀, τ)| = n² |XᵀX| / (2τ^(p+3)),   (11)

scaled by the factor ((ν+1)/(ν+3))^(p+1) (ν/(ν+3)), which only depends on ν and p. When the degrees of freedom ν → ∞, the determinant of the Fisher information matrix (10) for the Student-t model reduces to that of the Gaussian model (11).
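The per-coefficient scaling of the Fisher information under Student-t errors rests on the standard result that the expected squared location score of a standard t distribution equals (ν + 1)/(ν + 3) [4, 16]. A quick numerical check of that constant by Simpson-rule integration:

```python
import math

def location_fisher_info(nu, L=50.0, n=100001):
    """Numerically integrate E[(d log f / d mu)^2] for a standard
    Student-t density; the closed form is (nu + 1) / (nu + 3)."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    h = 2 * L / (n - 1)
    total = 0.0
    for i in range(n):
        e = -L + i * h
        score = (nu + 1) * e / (nu + e * e)          # d/dmu log f at mu = 0
        dens = c * (1 + e * e / nu) ** (-(nu + 1) / 2)
        w = 1 if i in (0, n - 1) else (4 if i % 2 == 1 else 2)
        total += w * score * score * dens
    return total * h / 3

info5 = location_fisher_info(5.0)   # should be close to 6/8 = 0.75
```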

3.1 Prior distribution for the regression parameters

Recall from Section 2 that MML is a Bayesian principle and therefore requires the specification of prior distributions for all model parameters. Assuming that we do not have prior knowledge regarding the values of the coefficients, it is reasonable to give every subset of coefficients an equal prior probability. This may be done by placing an improper uniform prior over each coefficient βⱼ, i.e., π(βⱼ) ∝ 1. Using such an improper prior within the MML87 framework results in an infinite message length, and bounding of the parameter space is required to ensure propriety. There exist many different approaches to bounding the parameter space; for example, one could use a uniform prior over the hypercube [−c, c]ᵖ where c > 0. In this manuscript, we propose the following novel hyperellipsoid bounding for β:

βᵀ(XᵀX)β ≤ nK,   (12)

where K > 0 is a hyperparameter that controls the size of the bounding region; the choice of the hyperparameter is discussed in Section 3.3. The bounding hyperellipsoid uses intrinsic geometric properties of the model and has an important advantage over the conventional rectangular bounding. Our novel bounding depends on a single hyperparameter K which has a natural interpretation: by noting that the quantity βᵀ(XᵀX)β/n can be interpreted as the fitted sum-of-squares (per observation), the hyperparameter K is seen to directly control the maximum allowable signal strength (i.e., variance).

The set of coefficients that satisfies (12) is called the feasible parameter set Λ(K) and is aligned to the principal axes of the explanatory variables. Using a uniform prior over this feasible parameter set yields the following proper prior distribution for the regression coefficients:

π(β | K) = Γ(p/2 + 1) |XᵀX|^(1/2) / (π n K)^(p/2),  β ∈ Λ(K).   (13)
For the intercept parameter β₀, we use the uniform prior

π(β₀) ∝ 1,   (14)

and for the scale parameter τ, we use the scale-invariant prior distribution

π(τ) ∝ 1/τ.   (15)

The joint prior distribution for all parameters is

π(β, β₀, τ | K) = π(β | K) π(β₀) π(τ).   (16)

The prior distributions for the intercept (14) and scale (15) parameters are improper and also require a parameter space restriction. However, these parameters are common to all model structures, and therefore the particular choice of the bounding region does not affect the relative ordering of message lengths. Consequently, the choice of the bounding region for these parameters plays no role in model selection.

3.2 Message length for Student-t regression models

The choice of the prior distribution for the regression coefficients is potentially problematic when used in conjunction with the MML87 message length formula. Given a data set that is poorly explained by the predictors (i.e., a fitted signal-to-noise ratio close to zero), the MML87 estimate of the noise scale, τ̂, will be large. From (3), we have

w(β, β₀, τ) = κ_(p+2)^(−(p+2)/2) |J(β, β₀, τ)|^(−1/2) ∝ τ^((p+3)/2),   (17)

where w(·) is the volume of the uncertainty region used to code the parameter estimates. A large value of τ̂ results in a large volume of the uncertainty region. Recall from Section 2 that MML87 approximates the coding probability for the assertion (2) by the product of the prior density and the volume of the uncertainty region. When the volume of the uncertainty region exceeds the volume of the restricted parameter set Λ(K), the coding probability of the assertion will exceed one, which is clearly nonsensical. For any given value of K (i.e., the size of the restricted parameter set), there exists a τ large enough that this occurs.
One solution to this problem is to use the small-sample message length approximation formula (6) described in Section 2.1. The assertion codelength for β is obtained from the small-sample approximation, while the assertion codelength for β₀ and τ does not depend on K and can be safely obtained using the standard MML87 formula. Given a value of the hyperparameter K, the total message length of the data and the parameters is




and the quantization constants can be approximated by


A consequence of using the hyperellipsoidal bounding (12) is that the determinant |XᵀX|, which appears in both the prior distribution for β (13) and the Fisher information (10), cancels in the formula for the assertion.

3.3 Estimation of the hyperparameter

The message length (18) under the joint prior distribution (16) depends crucially on the value of the hyperparameter K. For the message to be uniquely decodable, the value of K must be encoded into the assertion component. The complete message length is

I(y, β, β₀, τ, K) = I(y, β, β₀, τ | K) + I(K),   (21)

where I(K) is the codelength for stating the hyperparameter, obtained from the asymptotic codelength approximation for coding continuous parameters [17, 15]; alternatively, a more accurate codelength could be obtained using the more complex hierarchical message length procedure described in [18, 19]. The MML principle advocates choosing the value of the hyperparameter which minimizes the message length; i.e.,

K̂ = argmin_K { I(y, β̂(K), β̂₀(K), τ̂(K), K) },   (22)


where (β̂(K), β̂₀(K), τ̂(K)) are the MML87 estimates that minimize (21) for a given value of K. As exact minimization of (22) is a difficult non-convex optimization problem, we use the following approximate procedure. For K sufficiently large, the estimates of (β, β₀, τ) which minimize the message length are equivalent to the maximum likelihood estimates, as neither the prior distribution nor the Fisher information depend on β. An approximate solution to (22) can be obtained by requiring that the region Λ(K) must include the maximum likelihood estimate; this is given by

K̂ = β̂_ML ᵀ (XᵀX) β̂_ML / n,   (23)

where β̂_ML is the maximum likelihood estimate of β.

3.4 MML estimates

The MML estimate is found by plugging the estimate (23) into (21) and minimizing the resultant message length with respect to (β, β₀, τ). Dropping terms that do not depend on the parameters (β, β₀, τ), the message length (21) is



There is no closed-form solution for the values of (β, β₀, τ) that minimize (21). A standard approach to computing the parameter estimates in a Student-t regression model is the expectation–maximization (EM) algorithm [4]; other methods include the expectation–conditional maximization algorithm [20] and bootstrapping approaches [21]. In this paper, we adapt the EM algorithm to find the MML estimates. It is well known that the Student-t distribution can be represented as a scale mixture of normal distributions [4]. We can write model (7) as:

yᵢ | uᵢ ~ N(β₀ + xᵢᵀβ, τ/uᵢ),  uᵢ ~ Gamma(ν/2, ν/2),  i = 1, …, n,   (25)


where u = (u₁, …, uₙ)ᵀ are the latent (unobserved) variables. Given u, we have the following property [4]:

y | u ~ N_n(β₀1ₙ + Xβ, τ W⁻¹),

where W = diag(u₁, …, uₙ). The latent variables are not directly observable and will be replaced by their conditional expectations in the E-step of the EM algorithm. The conditional expectation of uᵢ is

ûᵢ = E[uᵢ | yᵢ] = (ν + 1) / (ν + eᵢ²/τ).   (28)
Conditional on the latent variables u, the message length, up to terms that are independent of (β, β₀, τ), takes the form of a weighted Gaussian regression objective. The M-step in the EM algorithm involves minimizing this conditional message length with respect to (β, β₀, τ). The MML estimates β̂ and β̂₀ can be efficiently computed using weighted least squares regression; the objective is log-convex in τ, and the MML estimate τ̂ may be found by numerical optimization. In summary, the MML estimates are found by iterating the following steps until convergence:

  1. compute the weights ûᵢ using (28);

  2. update (β̂, β̂₀) using the weights ûᵢ;

  3. update τ̂ using the current weights ûᵢ and estimates (β̂, β̂₀).

We use the maximum likelihood estimates computed under the assumption of Gaussian errors as the initial values of (β̂, β̂₀, τ̂).
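The iteration above can be sketched for a single predictor as follows. This toy implementation uses the maximum-likelihood form of the M-step for brevity (the full MML M-step also carries the prior and Fisher-information terms), and the data are illustrative:

```python
import math

def student_t_em(x, y, nu=3.0, iters=100):
    """EM (iteratively reweighted least squares) for y_i = b0 + b1*x_i + e_i
    with Student-t errors; returns (b0, b1, tau). ML flavour of the M-step."""
    n = len(y)
    # initialise with ordinary least squares (Gaussian assumption)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    b0 = my - b1 * mx
    tau = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n
    for _ in range(iters):
        # E-step: conditional expectations of the latent scale variables
        w = [(nu + 1) / (nu + (yi - b0 - b1 * xi) ** 2 / tau)
             for xi, yi in zip(x, y)]
        # M-step: weighted least squares for (b0, b1), then update tau
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swy = sum(wi * yi for wi, yi in zip(w, y))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
        det = sw * swxx - swx * swx
        b0 = (swxx * swy - swx * swxy) / det
        b1 = (sw * swxy - swx * swy) / det
        tau = sum(wi * (yi - b0 - b1 * xi) ** 2
                  for wi, xi, yi in zip(w, x, y)) / n
    return b0, b1, tau

# line y = 2x + 1 with small alternating noise and one gross outlier
x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 + (0.1 if i % 2 == 0 else -0.1) for i, xi in enumerate(x)]
y[5] += 25.0
b0, b1, tau = student_t_em(x, y)
```

The gross outlier receives a near-zero weight after a few iterations, so the fitted line tracks the remaining points.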

3.5 Selection of the degrees of freedom

In order to compute the message length (21), we require a value for the degrees of freedom parameter ν. A common approach to estimating ν is to consider only a discrete set {ν₁, …, ν_m}, with νᵢ < νᵢ₊₁ for all i, of candidate values for ν, rather than treating ν as a continuous parameter. We propose to select the set such that it includes a range of distributions, from those with heavy tails (e.g., the Cauchy distribution when ν = 1) to those with light tails (e.g., the normal distribution when ν → ∞), where each distribution is equidistant from its neighbours. In this paper, the Jensen–Shannon (JS) divergence [22] is used to measure the distance between two distributions. Let P and Q denote two probability distributions. The Jensen–Shannon divergence from P to Q is given by

JS(P ‖ Q) = (1/2) KL(P ‖ M) + (1/2) KL(Q ‖ M),  M = (1/2)(P + Q),

where KL(P ‖ Q) is the Kullback–Leibler (KL) divergence [23] from P to Q, given by

KL(P ‖ Q) = ∫ p(x) log( p(x)/q(x) ) dx,

where p(·) and q(·) are the probability density functions of P and Q, respectively. Unlike the KL divergence, the JS divergence is a symmetric function of P and Q. Let T_ν denote a standard Student-t probability distribution parameterised by location zero, scale one and ν degrees of freedom. We aim to find a candidate set such that, for all i,

JS(T_νᵢ ‖ T_νᵢ₊₁) = c,

where c is some constant. In words, this implies that the JS divergence between any two neighbouring distributions T_νᵢ and T_νᵢ₊₁ is the same; given the two endpoints and the set size m, such a set can be found numerically.
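The equidistance condition can be checked numerically; the sketch below evaluates the JS divergence between two standard Student-t densities by Simpson-rule integration (the grid limits are illustrative choices):

```python
import math

def t_pdf(e, nu):
    """Standard Student-t density with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + e * e / nu) ** (-(nu + 1) / 2)

def js_divergence(f, g, L=40.0, n=60001):
    """JS(f, g) = 0.5*KL(f||m) + 0.5*KL(g||m) with m = (f+g)/2,
    computed on [-L, L] by Simpson's rule."""
    h = 2 * L / (n - 1)
    total = 0.0
    for i in range(n):
        e = -L + i * h
        fe, ge = f(e), g(e)
        m = 0.5 * (fe + ge)
        term = 0.0
        if fe > 0:
            term += 0.5 * fe * math.log(fe / m)
        if ge > 0:
            term += 0.5 * ge * math.log(ge / m)
        w = 1 if i in (0, n - 1) else (4 if i % 2 == 1 else 2)
        total += w * term
    return total * h / 3

d12 = js_divergence(lambda e: t_pdf(e, 1.0), lambda e: t_pdf(e, 2.0))
d21 = js_divergence(lambda e: t_pdf(e, 2.0), lambda e: t_pdf(e, 1.0))
```

Symmetry (d12 = d21) holds by construction, which is precisely why the JS divergence, unlike the KL divergence, gives a well-defined distance between neighbouring candidates.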

3.6 Coding the model index

When deriving the message length (18) in Section 3.2, we made the assumption that the model structure γ is known. For the message to be uniquely decodable, the choice of γ must be encoded into the assertion component; i.e.,

I(y, γ) = I(γ) + I(y, β, β₀, τ, K | γ),

where I(γ) is the length of the assertion that encodes the model structure γ. Let P denote the total number of columns (explanatory variables) of the full design matrix. We use the following prior distribution over the model structure:

π(γ) = π_p(p) / M_p,

where π_p(p) is a prior distribution over the number of explanatory variables p (out of P) in the model and

M_p = |{γ ∈ Γ : |γ| = p}|

is the total number of models in Γ with p explanatory variables.

In the case of nested model selection (e.g., selecting the order of a polynomial), the set Γ consists of nested candidate models γ₁ ⊂ γ₂ ⊂ ⋯ ⊂ γ_P. Therefore, the total number of models with p explanatory variables is M_p = 1 for all p. When no prior information regarding the size of the model is available, we choose the uniform prior π_p(p) = 1/(P + 1) and the assertion length is

I(γ) = log(P + 1).

In the more general case of all-subsets regression, the set Γ is the set of all subsets of the P explanatory variables, including the null model. In this setting, the number of models with p explanatory variables is M_p = C(P, p), and assuming that no prior information is available regarding the size of the model, we again choose the prior distribution π_p(p) = 1/(P + 1). This yields an assertion of length

I(γ) = log(P + 1) + log C(P, p).

As discussed in [24], this choice of prior distribution implies a marginal prior probability of inclusion of 1/2 for each explanatory variable. In addition, this prior distribution simultaneously accounts for multiple testing by automatically penalizing model sizes that have a large total number of models.
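Both assertion lengths are one-liners; a sketch in nits, with k denoting the number of predictors in the model:

```python
import math

def nested_assertion_length(P):
    """Nested case: a uniform prior over the P + 1 possible model sizes,
    with exactly one model per size."""
    return math.log(P + 1)

def all_subsets_assertion_length(k, P):
    """All-subsets case: uniform over sizes, then uniform over the
    C(P, k) subsets that contain k predictors."""
    return math.log(P + 1) + math.log(math.comb(P, k))

# sizes with many competing subsets are automatically penalized more
l_one = all_subsets_assertion_length(1, 20)
l_ten = all_subsets_assertion_length(10, 20)
```

The second term is largest for mid-sized models, which is the multiple-testing correction described above: a predictor must earn its place against every competing subset of the same size.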

4 Discussion

4.1 Behaviour of the dimensionality penalty

Using the estimate (23) of the hyperparameter K, we can write

nK̂ = β̂ᵀ(XᵀX)β̂ = n · τ̂ · SNR,

where τ̂ is the MML estimate of τ and SNR = β̂ᵀ(XᵀX)β̂/(nτ̂) is the empirical signal-to-noise ratio of the fitted Student-t regression model. For large SNR, the codelength of the assertion of the regression parameters is approximately

(p/2) log(n · SNR) + O(1),

which, for fixed n and p, depends only on the empirical SNR. This implies that the better the model fits the data, the larger the penalty for including additional explanatory variables. In contrast, when the empirical SNR is small (i.e., the model does not fit the data well), the penalty for including an additional explanatory variable is small. This property of the MML criterion is attractive, since the SNR is a fundamental characteristic of the fitted linear model and is invariant to model parameterization and linear transformations of the data. Therefore, the MML penalty automatically adapts to the amount of detectable signal in the data. The MML regression criterion is not the only model selection technique with this property; the MDL denoising criterion [25, p. 117] (see Section 4.3) and the recent Bayesian exponentially embedded criterion [26] both exhibit similar behaviour.

4.2 Gaussian linear regression

We can specialize the MML criterion for Student-t regression models to the important case of Gaussian linear regression by taking the limit of the message length (34) as ν → ∞; the resulting criterion (41) is

where ψ(·) is the digamma function. The hyperparameter K is estimated following the procedure in Section 3.3, and the codelength for the model structure is discussed in Section 3.6.

In comparison to the original MML criterion for Gaussian linear regression [6], the Gaussian specialisation of the proposed Student-t criterion (41) differs primarily in the selection of the hyperparameter (see Section 3.3). The original MML criterion uses a value of the hyperparameter which, for Gaussian models, guarantees that the feasible set includes the least squares estimates. However, the estimate K̂ proposed in this paper results in a feasible set that has a strictly smaller volume than that set and therefore results in a shorter codelength. As discussed in Section 2.1, the proposed K̂ necessitates the use of the small-sample MML approximation to avoid problems with the MML87 approximation.

4.3 Comparison with MDL denoising

An important consequence of using the hyperellipsoidal prior distribution for the regression parameters (13) is that the MML criterion for Gaussian linear regression (41) is very similar to the well-known normalized maximum likelihood (NML) regression criterion [7] when the signal to noise ratio is large. In the case of Gaussian linear regression, let

For large empirical SNR, we have

and the MML criterion for the Gaussian regression model (41), ignoring the codelength for the model structure I(γ), can be written as



and c₁ is a constant that does not depend on the sample size n and dimensionality p. This criterion has a similar form to the MDL denoising criterion for the Gaussian linear regression model [25, p. 117], which is given by


where c₂ is a constant independent of n and p. Using Stirling’s approximation, (43) can be written as


where c₃ is another constant that does not depend on n and p. Similarly, using Stirling’s approximation, the MML criterion (42) can be written as


We can observe that the MML criterion (45) is very similar to the MDL criterion (44). Importantly, the fitted signal-to-noise ratio of the model appears in the MDL penalty term. Supposing that n is large and p is fixed, the difference between the MML and MDL criteria, for the case of Gaussian noise, is given by


When the fitted SNR of the model is small, the penalty term in the MDL criterion is also small, and potentially negative, which can lead to substantial over-fitting [27]. In contrast, the proposed MML criterion, which makes use of the small-sample approximation (6), is more robust and does not result in negative codelengths for small SNR (see Sections 2.1 and 3.2). Furthermore, the MML criterion can easily be extended to other distributions, for example, the Student-t distribution discussed in this paper; such extensions are very difficult in the case of normalized maximum likelihood, due to the criterion’s dependence on closed-form expressions for the maximum likelihood estimates.

4.4 Model selection consistency

Recall that, for large empirical SNR, the codelength of the assertion for the regression parameters is given by


We make the assumption that the empirical covariance matrix of the predictors satisfies

(1/n) XᵀX → Σ₀ as n → ∞,   (48)

where Σ₀ is a positive definite matrix. As the sample size n → ∞, the estimates β̂ and τ̂ converge in probability to their true values, and using the assumption (48), it is straightforward to show that the assertion codelength (47) behaves as (p/2) log n + O(1).
For a large sample size n and a fixed number of predictors p, the MML criterion is equivalent to the well-known Bayesian information criterion [15] and is therefore model selection consistent, in the sense that as the sample size tends to infinity, the probability of selecting the true model tends to one. However, in contrast to BIC, the proposed MML criterion is also consistent when the sample size is fixed and the noise scale τ → 0; i.e., it possesses high-SNR consistency [28]. This is easily verified by noting that the criterion (47) satisfies the conditions of the corresponding theorem in [8].

4.5 Behaviour under different SNRs

The MML estimate of τ exhibits interesting behaviour that depends on the empirical signal strength β̂ᵀ(XᵀX)β̂. Recall from Section 3.4 that the MML estimate of τ is obtained by minimising (24) with respect to τ. For large signal strength, (24) reduces to


and the MML estimate, conditional on β̂ and β̂₀, is given by


In this setting, the MML estimate is equivalent to the maximum likelihood estimate of τ adjusted to take into account that we are fitting p predictors (and the intercept parameter) to the data. In contrast, when the signal strength is close to zero, (24) simplifies and the MML estimate reduces to


which is equivalent to the maximum likelihood estimate of τ adjusted only for the intercept parameter. The MML estimate therefore depends on the signal strength and automatically varies the degrees of freedom used to adjust the maximum likelihood estimate; when there is little signal, the predictors are effectively not used to fit the data and there is no need to adjust the maximum likelihood estimate for them. Alternatively, when there is strong signal, the MML estimate appropriately adjusts for all degrees of freedom, with a smooth transition between these two regimes.

5 Results

5.1 Simulation

To evaluate the performance of the proposed MML criterion we performed several simulation studies. We compared our new criterion against the Bayesian information criterion (BIC) [15] and the Akaike information criterion with a correction for finite sample sizes (AICc) [29]:

BIC = 2 · NLL + k log n,  AICc = 2 · NLL + 2k + 2k(k + 1)/(n − k − 1),

where NLL is the negative log-likelihood of the Student-t regression model, which is given by (9), evaluated at the maximum likelihood estimates, and k is the number of free parameters.
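Both comparison criteria are trivial to compute once the fitted negative log-likelihood is available; a sketch assuming the standard textbook forms:

```python
import math

def bic(nll, k, n):
    """BIC from the minimized negative log-likelihood, k parameters, n samples."""
    return 2.0 * nll + k * math.log(n)

def aicc(nll, k, n):
    """AIC with the standard finite-sample correction term."""
    return 2.0 * nll + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)

b = bic(100.0, 5, 120)
a = aicc(100.0, 5, 120)
```

As n grows with k fixed, the AICc penalty tends to the constant 2k while the BIC penalty grows like k log n, which is why the two criteria trade off over- and under-fitting differently in the experiments below.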

The simulation was performed as follows. We generated training samples of fixed size from a Student-t regression model with varying degrees of freedom ν. Model selection was performed using the aforementioned model selection criteria: (i) MML, (ii) BIC, and (iii) AICc. The degrees of freedom were estimated from a discrete candidate set that includes a range of distributions, from those with very heavy tails through to the Gaussian distribution (see Section 3.5). For each experiment, we computed the lasso regression path [30]. For each model structure in the path, the model coefficients were estimated using both MML and maximum likelihood. In the case of the MML criterion, the model structure was encoded using the nested prior specification (see Section 3.6). Each of the model selection criteria was then used to nominate a best model (i.e., the set of predictors and the degrees of freedom that yielded the lowest criterion score) from this path.

For each experiment, the design matrices for the training and testing samples were generated from a multivariate normal distribution with zero mean and a fixed covariance matrix Σ. The number of non-zero entries of β was chosen to give a dense model, a balanced model, and a sparse model. For each degree of sparsity, three different signal strengths were tested: (1) strong, (2) moderate, and (3) weak. For each choice of sparsity and signal strength, the degrees of freedom of the generating model were selected from the set {1, 5, ∞}. Each simulation experiment (combination of sparsity, signal strength and true degrees of freedom) was repeated over many independent replications.

True model / Sparsity / Signal — KL divergence (MML, BIC, AICc) — Prediction error (MML, BIC, AICc) — False positives (MML, BIC, AICc) — False negatives (MML, BIC, AICc)
t(1) Dense Weak 0.367 0.857 0.713 10.80 10.86 10.90 0 0 0 4.74 2.90 5.52
Moderate 0.458 0.906 0.905 10.72 10.56 10.80 0 0 0 3.09 0.80 2.70
Strong 0.385 0.922 0.932 13.15 13.16 13.20 0 0 0 0.63 0.30 0.42
Balanced Weak 0.292 0.586 0.435 21.65 21.74 21.72 3.52 3.13 2.04 2.81 3.47 4.51
Moderate 0.381 0.799 0.660 14.38 14.44 14.48 3.40 4.65 3.17 1.84 1.22 2.08
Strong 0.320 0.732 0.659 12.84 12.82 12.88 3.72 5.14 4.07 0.51 0.15 0.37
Sparse Weak 0.208 0.301 0.248 11.97 11.97 11.94 4.86 2.18 1.92 1.20 2.02 1.98
Moderate 0.272 0.485 0.383 9.619 9.676 9.678 5.75 4.56 3.24 0.54 0.92 1.21
Strong 0.207 0.430 0.358 8.379 8.435 8.419 4.52 5.55 4.64 0.15 0.06 0.16
t(5) Dense Weak 0.323 1.105 1.133 1.216 1.347 1.387 0 0 0 4.28 2.74 4.27
Moderate 0.287 1.040 1.057 1.168 1.306 1.319 0 0 0 0.57 0.26 0.47
Strong 0.271 1.085 1.085 1.181 1.317 1.317 0 0 0 0 0 0.03
Balanced Weak 0.282 0.492 0.405 1.186 1.231 1.224 1.17 1.59 1.08 1.85 1.87 2.33
Moderate 0.241 0.562 0.419 1.130 1.178 1.152 1.01 2.22 1.33 0.20 0.09 0.19
Strong 0.172 0.477 0.327 1.075 1.150 1.108 0.61 1.72 0.92 0 0 0
Sparse Weak 0.196 0.254 0.229 1.076 1.088 1.078 1.83 1.19 1.17 0.53 0.90 0.77
Moderate 0.131 0.233 0.176 1.030 1.049 1.037 0.69 1.24 0.89 0.01 0.01 0.01
Strong 0.094 0.136 0.142 1.003 1.017 1.022 0.17 0.54 0.61 0 0 0
t(∞) Dense Weak 0.443 1.060 0.984 1.106 1.132 1.195 0 0 0 2.95 1.45 3.07
Moderate 0.266 1.086 1.146 0.978 1.113 1.146 0 0 0 0.09 0.09 0.23
Strong 0.295 1.080 1.080 0.987 1.107 1.107 0 0 0 0 0 0
Balanced Weak 0.290 0.491 0.414 0.980 1.010 1.005 1.26 1.69 1.14 1.38 1.26 1.6
Moderate 0.228 0.506 0.390 0.927 0.979 0.950 1.07 1.87 1.36 0.08 0.03 0.05
Strong 0.152 0.444 0.285 0.890 0.956 0.917 0.47 1.63 0.75 0 0 0
Sparse Weak 0.172 0.193 0.179 0.901 0.903 0.900 1.62 1.10 1.17 0.39 0.58 0.50
Moderate 0.114 0.161 0.163 0.860 0.870 0.872 0.70 0.97 1.05 0 0 0
Strong 0.097 0.164 0.149 0.848 0.867 0.866 0.32 0.87 0.81 0 0 0
Table 1: Average empirical KL divergence, absolute prediction errors, and average numbers of false positives and false negatives for MML, BIC and AICc across the simulation experiments.
Marginal Probability 0.308 0.804 0.900 0.990 0.998 0.998 0.999 0.999 1.000 1.000 1.000 1.000 1.000
Table 2: Posterior marginal inclusion probabilities of the predictors in the Boston housing data.

The performance of each criterion was measured by: (i) the Kullback–Leibler (KL) divergence [23] from the selected model to the true generating model, (ii) the mean absolute prediction error, (iii) the number of predictors erroneously included in the model (false positives), and (iv) the number of associated predictors erroneously excluded from the model (false negatives). The results are presented in Table 1.

In terms of KL divergence, MML performed substantially better than BIC and AICc in all tests and obtained particularly good performance when the true generating model was dense. In general, MML also achieved the best performance in terms of prediction error, particularly when the signal strength was weak. In contrast, it is not clear which criterion has the overall best performance in terms of the number of false positives and false negatives. In general, the number of false positives decreases for all three criteria as the sparsity of the model increases, except when the true degrees of freedom is ν = 1. In this case, MML tends to overfit more than the other two criteria when the signal strength is weak or moderate. As discussed in Section 3.2, when the signal is weak, the MML criterion assigns a smaller penalty to each predictor in the model, and “false” predictors are therefore more likely to be included in the best model. Although this leads to some overfitting, it has little negative impact on prediction, as the signal being fitted is weak. In contrast, when the signal is strong, the MML criterion generally obtained the smallest number of false positives. The universally strong predictive performance of MML highlights the difference in aims between selecting the true model and making good predictions: even in the cases where BIC and AICc achieved fewer false positives and negatives, they usually performed worse in terms of both prediction metrics.

5.2 Real data

We applied our proposed MML model selection criterion to the Boston housing data. This data reports the prices of owner-occupied homes in the Boston area. It contains samples and predictors. The response variable is the median value of owner-occupied homes in thousands of dollars. The degrees of freedom parameter was estimated from the set . We performed an exhaustive search over the space of model structures (i.e., all-subset selection) and computed the message length for each model. The model with the shortest message length ( nits) included predictors, with degrees of freedom . In contrast, the best Gaussian model had a message length of nits, which is significantly longer; this strongly suggests that a heavy-tailed error model is appropriate for this data.

A particular advantage of MML is that codelengths can be interpreted as posterior probabilities, and this can be used to construct a posterior distribution over the space of model structures. Let Γ denote the model space and I(γ) the minimum message length achievable by model γ ∈ Γ. The posterior probability assigned to model γ by MML is

    p(γ | y) = exp(−I(γ)) / Σ_{γ′ ∈ Γ} exp(−I(γ′)),

where message lengths are measured in nits.
Using these posterior probabilities, we computed the marginal probability of inclusion for each predictor (see Table 2). We observe that out of the predictors had a marginal probability greater than . Only the predictor recording the proportion of non-retail business acres per town (indus), which had a marginal probability of , was deemed unassociated.
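This construction can be sketched numerically (function names are illustrative; models are represented as tuples of included predictor indices): message lengths in nits are converted into normalized posterior probabilities, which are then accumulated into per-predictor marginal inclusion probabilities.

```python
import math

def model_posteriors(lengths):
    """Map message lengths (in nits) to posterior model probabilities
    proportional to exp(-length), subtracting the minimum length first
    for numerical stability."""
    shortest = min(lengths.values())
    weights = {m: math.exp(-(l - shortest)) for m, l in lengths.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

def marginal_inclusion(posterior, num_predictors):
    """Sum posterior probability over all models containing each predictor."""
    probs = [0.0] * num_predictors
    for model, p in posterior.items():
        for j in model:
            probs[j] += p
    return probs
```

Because only differences in message length matter, a model whose length is one nit longer than the shortest receives exp(−1) ≈ 0.37 times the posterior weight of the best model.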

To test the predictive performance of the proposed MML criterion, we performed another set of simulations. The Boston housing data was divided randomly into training and testing sets of equal size. We performed model selection using the training samples and tested the predictive performance of the MML, BIC and AIC criteria on the testing samples. All three criteria selected for of the tests and never selected the Gaussian model. The prediction performance is summarised in Table 3. The models selected by MML attained, on average, a smaller negative log-likelihood for the testing data while AIC marginally outperformed MML in terms of absolute prediction error.

                    t(1)     t(1.9)   t(5)     t()
MML    NLL          2.9471   2.8892   2.9080   3.0494
       Pred. Error  3.3736   3.3572   3.3449   3.4869
BIC    NLL          2.9696   2.9096   2.9329   3.0723
       Pred. Error  3.4261   3.4082   3.4142   3.5571
AICc   NLL          2.9534   2.8894   2.9096   3.0484
       Pred. Error  3.3697   3.3388   3.3382   3.4777
Table 3: Average negative log-likelihood (NLL) and absolute prediction error for MML, BIC and AIC in the Boston housing data, obtained from 50 cross-validation experiments. The training and testing samples were both of size .
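The average NLL figures in Table 3 are per-observation negative log-likelihoods of the test residuals under the fitted Student-t error model; a sketch of that computation (argument names are our own) using only the standard library:

```python
import math

def student_t_avg_nll(residuals, scale, nu):
    """Average negative log-likelihood (in nits) of residuals under a
    Student-t density with location 0, the given scale, and nu degrees
    of freedom."""
    # Log normalizing constant of the Student-t density.
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi) - math.log(scale))
    nll = 0.0
    for r in residuals:
        nll -= log_c - (nu + 1) / 2 * math.log1p((r / scale) ** 2 / nu)
    return nll / len(residuals)
```

With nu = 1 this reduces to the Cauchy density, and as nu → ∞ it approaches the Gaussian NLL, so the comparison in Table 3 spans both heavy-tailed and near-Gaussian regimes.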


  • Fernández and Steel [1998] C. Fernández and M. F. Steel, “On Bayesian modeling of fat tails and skewness,” Journal of the American Statistical Association, vol. 93, no. 441, pp. 359–371, 1998.
  • Jacquier et al. [2002] E. Jacquier, N. G. Polson, and P. E. Rossi, “Bayesian analysis of stochastic volatility models,” Journal of Business & Economic Statistics, vol. 20, no. 1, pp. 69–87, 2002.
  • West [1984] M. West, “Outlier models and prior distributions in Bayesian linear regression,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 431–439, 1984.
  • Lange et al. [1989] K. L. Lange, R. J. Little, and J. M. Taylor, “Robust statistical modeling using the t distribution,” Journal of the American Statistical Association, vol. 84, no. 408, pp. 881–896, 1989.
  • Wallace [2005] C. S. Wallace, Statistical and inductive inference by minimum message length.   Springer Science & Business Media, 2005.
  • Schmidt and Makalic [2009] D. F. Schmidt and E. Makalic, “MML invariant linear regression,” in Australasian Joint Conference on Artificial Intelligence.   Springer, 2009, pp. 312–321.
  • Rissanen [2000] J. Rissanen, “MDL denoising,” IEEE Transactions on Information Theory, vol. 46, no. 7, pp. 2537–2543, 2000.
  • Schmidt and Makalic [2012] D. F. Schmidt and E. Makalic, “The consistency of MDL for linear regression models with increasing signal-to-noise ratio,” IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1508–1510, 2012.
  • Wallace and Boulton [1968] C. S. Wallace and D. M. Boulton, “An information measure for classification,” The Computer Journal, vol. 11, no. 2, pp. 185–194, 1968.
  • Wallace and Boulton [1975] ——, “An invariant Bayes method for point estimation,” Classification Society Bulletin, vol. 3, no. 3, pp. 11–34, 1975.
  • Wallace and Freeman [1987] C. S. Wallace and P. R. Freeman, “Estimation and inference by compact coding,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 240–265, 1987.
  • Barron [1985] A. R. Barron, “Logically smooth density estimation,” Ph.D. dissertation, 1985.
  • Farr and Wallace [2002] G. Farr and C. S. Wallace, “The complexity of strict minimum message length inference,” The Computer Journal, vol. 45, no. 3, pp. 285–292, 2002.
  • Grünwald et al. [2005] P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in minimum description length: Theory and applications.   MIT press, 2005.
  • Schwarz et al. [1978] G. Schwarz et al., “Estimating the dimension of a model,” The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
  • Fonseca et al. [2008] T. C. Fonseca, M. A. Ferreira, and H. S. Migon, “Objective Bayesian analysis for the Student-t regression model,” Biometrika, vol. 95, no. 2, pp. 325–333, 2008.
  • Rissanen [1978] J. Rissanen, “Modeling by shortest data description,” Automatica, vol. 14, no. 5, pp. 465–471, 1978.
  • Makalic and Schmidt [2009] E. Makalic and D. F. Schmidt, “Minimum message length shrinkage estimation,” Statistics & Probability Letters, vol. 79, no. 9, pp. 1155–1161, 2009.
  • Schmidt et al. [2016] D. F. Schmidt, E. Makalic, and J. L. Hopper, “Approximating message lengths of hierarchical Bayesian models using posterior sampling,” in Australasian Joint Conference on Artificial Intelligence.   Springer, 2016, pp. 482–494.
  • Liu and Rubin [1995] C. Liu and D. B. Rubin, “ML estimation of the t distribution using EM and its extensions, ECM and ECME,” Statistica Sinica, pp. 19–39, 1995.
  • Pianto [2010] D. M. Pianto, “A bootstrap estimator for the Student-t regression model,” 2010.
  • Lin [1991] J. Lin, “Divergence measures based on the Shannon entropy,” IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
  • Kullback and Leibler [1951] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
  • Scott et al. [2010] J. G. Scott, J. O. Berger et al., “Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem,” The Annals of Statistics, vol. 38, no. 5, pp. 2587–2619, 2010.
  • Rissanen [2007] J. Rissanen, Information and complexity in statistical modeling.   Springer Science & Business Media, 2007.
  • Zhu and Kay [2018] Z. Zhu and S. Kay, “On Bayesian exponentially embedded family for model order selection,” IEEE Transactions on Signal Processing, vol. 66, no. 4, pp. 933–943, 2018.
  • Roos et al. [2005] T. Roos, P. Myllymäki, H. Tirri et al., “On the behavior of MDL denoising,” AISTATS 2005, 2005.
  • Ding and Kay [2011] Q. Ding and S. Kay, “Inconsistency of the MDL: On the performance of model order selection criteria with increasing signal-to-noise ratio,” IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 1959–1969, 2011.
  • Hurvich and Tsai [1993] C. M. Hurvich and C.-L. Tsai, “A corrected Akaike information criterion for vector autoregressive model selection,” Journal of Time Series Analysis, vol. 14, no. 3, pp. 271–279, 1993.
  • Tibshirani [1996] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society (Series B), vol. 58, pp. 267–288, 1996.