1 Introduction
Consider a vector y = (y_1, …, y_n)' of n observations generated by the linear regression model

y = α 1_n + X β + ε,   (1)

where X = (x_1, …, x_p) is the n × p design matrix of explanatory variables with each x_j ∈ R^n, β ∈ R^p is the vector of regression coefficients, α ∈ R is the intercept parameter, 1_n is a vector of 1s of length n, and ε ∈ R^n is a vector of random errors. In this paper, the random disturbances ε_i are assumed to be independently and identically distributed as per a Student-t distribution with location zero, scale τ, and ν degrees of freedom. The Student-t linear regression model finds frequent application in modelling heavy-tailed data [1, 2]. Compared with the more popular Gaussian linear regression model, the Student-t
linear regression model has been shown to have better performance in terms of reducing the negative effects caused by outliers
[3, 4]. This paper investigates parameter estimation and variable selection for the Student-t linear regression model using the minimum message length (MML) principle [5] and is an extension of the MML model selection criterion described in [6]. The MML criterion is known to be closely related to Rissanen's normalized maximum likelihood (NML) denoising approach [7] and shares a number of its attractive properties (e.g., consistency) [8]. The two criteria are virtually indistinguishable when the signal-to-noise ratio is moderate to large. In contrast, when the signal-to-noise ratio is small, NML-based approaches are known to perform poorly; due to subtle differences between the complexity penalties of MML and NML, the MML approach is less prone to overfitting in these regimes.
While the MML criterion has been shown to perform well in a range of experiments using real and simulated data, it is limited to Gaussian linear regression models and is not invariant to translations of the target y. In this manuscript, we generalize the MML approach to the Student-t linear regression model and derive a new model selection criterion that is invariant to linear transformations of both the explanatory variables and the targets. The latter invariance property ensures that the criterion does not depend on the particular choice of measurement units in which the data are recorded.
2 Minimum Message Length
The minimum message length (MML) principle [5, 9, 10, 11]
was introduced by C. S. Wallace and D. M. Boulton in the late 1960s. The underlying philosophy of the MML principle is that one can use data compression to learn structure from data. It is well known that if there is little structure (regularity) in the data, the probability of greatly compressing the data is vanishingly small (the no-hypercompression inequality
[12]). This result also implies that models which attain high levels of compression are likely to have captured real attributes of the underlying data generating mechanism. Imagine one wishes to encode the observed data into a message string (e.g., a binary string) that is uniquely decodable. In the MML framework, this message string is called an explanation and consists of two components, namely the assertion and the detail. The assertion encodes a statement describing a statistical model (i.e., model structure and parameters) and the detail encodes the data using the model specified in the assertion. Inference within the MML framework proceeds by finding the model that results in the greatest compression of the data; i.e., the model with the shortest two-part message. While computing the exact MML codelength for a given problem is in general NP-hard [13], there exists a range of computationally tractable approximations for computing message lengths.
The most commonly used message length formula is the MML87 approximation [11]. Suppose we are given a vector of data y. Let f(y | θ)
denote the set of probability density functions indexed by the
p-dimensional parameter vector θ ∈ Θ, where Θ is the parameter space. The message assertion needs to name a specific member from the uncountable set Θ; however, from Shannon's theory of information, codelengths can only be constructed over countable sets. The core of MML is to quantize the set Θ to a carefully chosen countable subset Θ* for which a code can be constructed. For each θ ∈ Θ* there is an associated Voronoi cell, called an uncertainty region, containing those distributions that are closest in some sense to θ. In the derivation of the MML87 approximation formula, the coding probability of a member θ ∈ Θ* is given by

q(θ) = π(θ) w(θ),   (2)
where π(θ) is the Bayesian prior density over the set Θ and w(θ) is the volume of the Voronoi cell. In the MML87 approximation, the volume is given by
w(θ) = ( κ_p^p |J(θ)| )^(-1/2),   (3)

where |J(θ)| is the determinant of the Fisher information matrix and
κ_p is the normalized second moment of an optimal p-dimensional quantizing lattice. The uncertainty region can be interpreted as a set of distributions that are considered indistinguishable on the basis of the observed data and is related to the concept of distinguishable distributions in the MDL literature
[14]. Recall that the MML message consists of two components, the assertion and the detail. From Shannon's theory of information, the length of the assertion, which states the member θ ∈ Θ* that will be used to compress the data, is given by -log q(θ). The length of the detail, which encodes the data using the model named in the assertion, is given by -log f(y | θ) (see [5] for more details). Using the MML87 approximation, the codelength of data y and a model θ is given by

I(y, θ) = -log π(θ) + (1/2) log |J(θ)| + (p/2) log κ_p - log f(y | θ) + p/2,   (4)
where ψ(·) is the digamma function and the constant c_p = (p/2) log κ_p + p/2 can be approximated by ([5], pp. 257–258)

c_p ≈ -(p/2) log(2π) + (1/2) log(pπ) + ψ(1).   (5)
To estimate a model from data using MML, we find the estimate θ̂ that minimizes the message length (4). The key idea behind the MML approach is to balance the trade-off between the complexity of the model (measured by the assertion length) and the ability of the model to fit the observed data (measured by the detail length). Under suitable regularity conditions, the MML87 codelength is asymptotically equivalent to the well-known Bayesian information criterion [15], and the MML87 estimator is asymptotically equivalent to the maximum likelihood estimator.
From (2), it is clear that MML uses the prior distribution π(θ) to construct codewords for members of Θ* and is therefore a Bayesian procedure. The MML approach can be seen as performing maximum a posteriori (MAP) estimation over the quantized space Θ*, with the coding probability q(θ)
playing the role of a discrete prior distribution and the codelength acting as a counterpart to the posterior probability. Every model is therefore given a posterior probability mass (i.e., a codelength), which allows for discrimination between models with different structures (e.g., non-nested models). A key difference between MML and standard Bayesian estimators, such as the MAP and posterior mean estimators, is that MML estimators are parameterization invariant; i.e., they provide the same inferences under one-to-one, continuous reparameterizations of the parameter space.
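To make the assertion-detail trade-off concrete, the following sketch computes an MML87-style codelength for a deliberately simple model: n i.i.d. Gaussian observations with known variance and a uniform prior for the mean over a bounded interval. The model, prior bounds and data are illustrative choices of ours, not the criterion developed in this paper.

```python
import math

def mml87_codelength(data, mu, sigma=1.0, a=-10.0, b=10.0):
    """MML87-style codelength (in nits) of the data plus a Gaussian mean mu.

    Toy model: y_i ~ N(mu, sigma^2) with sigma known, uniform prior on [a, b].
    Assertion: -log prior + 0.5*log|J| + 0.5*log(kappa_1);
    detail: negative log-likelihood plus the constant p/2 with p = 1.
    """
    n = len(data)
    kappa_1 = 1.0 / 12.0                  # quantizing-lattice constant for p = 1
    fisher = n / sigma**2                 # Fisher information J(mu)
    assertion = math.log(b - a) + 0.5 * math.log(fisher) + 0.5 * math.log(kappa_1)
    nll = sum((y - mu)**2 for y in data) / (2 * sigma**2) \
          + (n / 2) * math.log(2 * math.pi * sigma**2)
    return assertion + nll + 0.5

data = [1.2, 0.8, 1.1, 0.9, 1.0]
mean = sum(data) / len(data)
```

Because the Fisher information n/σ² does not depend on μ here, the assertion length is constant and the codelength-minimizing estimate coincides with the sample mean; in richer models the two message components interact.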
2.1 A small sample message length approximation
The MML87 approximation relies on a set of regularity conditions on the likelihood function and the prior distribution [5]. An important assumption of MML87 is that the prior distribution π(θ) is approximately uniform over the entire uncertainty region. If this condition is violated, the MML87 approximation can break down and the coding probability of the assertion (2) can be greater than 1. One solution to this problem, as described in [5], is to use the small-sample message length approximation
(6) 
which smoothly bounds the coding probability of the assertion to be less than 1. The small-sample approximation retains all the attractive properties of the MML87 approximation while being robust to prior distributions that are highly peaked. The cost of this additional robustness is that minimizing the small-sample codelength (6) is, in general, more difficult than minimizing the original MML87 codelength (4).
3 MML Student-t linear regression
Given a vector of observations y ∈ R^n and a design matrix X_γ containing p predictors indexed by the model structure γ ∈ Γ, where Γ is a set of candidate models, the Student-t linear regression model in (1) can be written as:

y_i = α + x_i' β + ε_i,  i = 1, …, n,   (7)

where x_i is the i-th row of the design matrix X_γ. In this paper we assume, without loss of generality, that the columns of the design matrix are standardized to have zero mean. In the remainder of this paper, the model structure index γ is omitted when clear from the context. Under the Student-t regression model, the probability density function of y_i is given by
f(y_i | β, α, τ, ν) = ( Γ((ν+1)/2) / ( Γ(ν/2) (νπ)^(1/2) τ ) ) (1 + e_i²/(ν τ²))^(-(ν+1)/2),   (8)

where e_i = y_i - α - x_i'β, τ > 0 is a scale parameter and ν > 0 is the degrees of freedom. The negative log-likelihood of y is

-log f(y | β, α, τ, ν) = n log Γ(ν/2) - n log Γ((ν+1)/2) + (n/2) log(νπ) + n log τ + ((ν+1)/2) Σ_i log(1 + e_i²/(ν τ²)),   (9)

where the sum runs over i = 1, …, n. In order to apply the MML87 approximation (4), we require the Fisher information matrix for this model. The Fisher information matrix for the Student-t regression model is [4, 16]
J(β, α, τ) = (1/τ²) diag( ((ν+1)/(ν+3)) X'X, ((ν+1)/(ν+3)) n, (2ν/(ν+3)) n ),

where the three blocks correspond to β, α and τ respectively (the off-diagonal blocks vanish because the columns of X have zero mean), and the determinant of the Fisher information matrix is

|J(β, α, τ)| = ((ν+1)/(ν+3))^(p+1) (ν/(ν+3)) · 2 n² |X'X| / τ^(2(p+2)).   (10)

Note that the determinant of the Fisher information matrix (10) for the Student-t regression model is equal to the determinant of the Fisher information for the Gaussian linear regression model,

|J_N(β, α, τ)| = 2 n² |X'X| / τ^(2(p+2)),   (11)

scaled by the factor ((ν+1)/(ν+3))^(p+1) (ν/(ν+3)), which depends only on ν and p. When the degrees of freedom ν → ∞, the determinant of the Fisher information matrix (10) for the Student-t model reduces to that of the Gaussian model (11).
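The negative log-likelihood (9) is straightforward to evaluate numerically; the sketch below is a direct transcription (the function name and argument convention, with the fitted values α + x_i'β passed as a list, are our own).

```python
import math

def student_nll(y, mu, tau, nu):
    """Negative log-likelihood (9) of the Student-t regression model.

    y: observations; mu: fitted values alpha + x_i' beta; tau: scale; nu: d.o.f.
    """
    n = len(y)
    const = n * (math.lgamma(nu / 2) - math.lgamma((nu + 1) / 2)
                 + 0.5 * math.log(nu * math.pi) + math.log(tau))
    penal = 0.5 * (nu + 1) * sum(
        math.log(1 + ((yi - mi) / tau) ** 2 / nu) for yi, mi in zip(y, mu))
    return const + penal
```

For ν = 1 and τ = 1 this reduces to the Cauchy negative log-density, and for very large ν it approaches the Gaussian negative log-likelihood.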
3.1 Prior distribution for the regression parameters
Recall from Section 2 that MML is a Bayesian principle and therefore requires the specification of prior distributions for all model parameters. Assuming that we do not have prior knowledge regarding the values of the coefficients, it is reasonable to give every set of coefficient values
an equal prior probability. This may be done by placing an improper uniform prior over each coefficient
β_j, i.e., π(β_j) ∝ 1. Using such an improper prior within the MML87 framework results in an infinite message length, and bounding of the parameter space is required to ensure propriety. There exist many different approaches to bounding the parameter space; for example, one could use a uniform prior over a hypercube {β : |β_j| ≤ c, j = 1, …, p} for some c > 0. In this manuscript, we propose the following novel hyperellipsoidal bounding for β:
B(K) = {β ∈ R^p : β' X'X β ≤ n K},   (12)

where K > 0 is a hyperparameter that controls the size of the bounding region; the choice of the hyperparameter is discussed in Section 3.3. The bounding hyperellipsoid uses intrinsic geometric properties of the model and has an important advantage over the conventional rectangular bounding. Our novel bounding depends on a single hyperparameter K, which has a natural interpretation: by noting that the quantity β'X'Xβ/n can be interpreted as the fitted sum-of-squares per observation, the hyperparameter K
is seen to directly control the maximum allowable signal strength (i.e., variance).
The set of coefficients that satisfies (12) is called the feasible parameter set; it is aligned with the principal axes of the explanatory variables. Using a uniform prior over this feasible parameter set yields the following proper prior distribution for the regression coefficients:

π(β | K) = Γ(p/2 + 1) |X'X|^(1/2) / (π n K)^(p/2), for β in the feasible set.   (13)
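The normalizing constant of the uniform prior (13) is one over the volume of the feasible hyperellipsoid. Assuming the feasible set has the form {β : β'X'Xβ ≤ nK} described above, the standard ellipsoid volume formula gives the density directly; the sketch below (with an illustrative cofactor-expansion determinant, adequate for small p) computes it.

```python
import math

def det(m):
    """Determinant by cofactor expansion (fine for small matrices)."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def ellipsoid_prior_density(xtx, n, K):
    """Density of a uniform prior over {beta : beta' X'X beta <= n*K}.

    The volume of the hyperellipsoid {b : b'Ab <= c} is
    c^(p/2) * pi^(p/2) / (Gamma(p/2 + 1) * sqrt(det A)).
    """
    p = len(xtx)
    volume = ((n * K) ** (p / 2)) * (math.pi ** (p / 2)) \
             / (math.gamma(p / 2 + 1) * math.sqrt(det(xtx)))
    return 1.0 / volume
```

For p = 1 with X'X = 1 and nK = 1, the feasible set is the interval [-1, 1] and the density is 1/2, which matches the general formula.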
For the intercept parameter α, we use the uniform prior

π(α) ∝ 1,   (14)

and for the scale parameter τ, we use the scale-invariant prior distribution

π(τ) ∝ 1/τ.   (15)

The joint prior distribution for all parameters is

π(β, α, τ | K) = π(β | K) π(α) π(τ).   (16)
The prior distributions for the intercept (14) and scale (15) parameters are improper and also require a parameter space restriction. However, these parameters are common to all model structures, and therefore the particular choice of the bounding region does not affect the relative ordering of message lengths; the choice of the bounding regions for these parameters plays no role in model selection.
3.2 Message length for Student regression models
The choice of the prior distribution for the regression coefficients is potentially problematic when used in conjunction with the MML87 message length formula. Given a data set that is poorly explained by the predictors (i.e., a fitted signal-to-noise ratio close to zero), the MML87 estimate of the noise variance, τ̂², will be large. From (3), we have

w(θ̂) = ( κ_p^p |J(θ̂)| )^(-1/2),   (17)

where w(θ̂) is the volume of the uncertainty region used to code the parameter estimates. As the determinant of the Fisher information (10) is decreasing in τ, a large value of τ̂ results in a large volume of the uncertainty region. Recall from Section 2 that MML87 approximates the coding probability of the assertion (2) by the product of the prior density and the volume of the uncertainty region. When the volume of the uncertainty region exceeds the volume of the restricted parameter set, the coding probability of the assertion will exceed one, which is clearly nonsensical. For any given value of K (i.e., the size of the restricted parameter set), there exists a τ̂ large enough for this to occur.
One solution to this problem is to use the small-sample message length approximation formula (6) described in Section 2.1. The assertion codelength for β is obtained from the small-sample approximation, while the assertion codelengths for α and τ do not depend on K and can be safely obtained using the standard MML87 formula. Given a value of the hyperparameter K, the total message length of the data and the parameters is
(18) 
where
(19)  
and the quantization constants can be approximated by
(20) 
A consequence of using the hyperellipsoidal bounding (12) is that the term |X'X|^(1/2), which appears in both the prior distribution for β (13) and the Fisher information (10), cancels in the formula for the assertion.
3.3 Estimation of the hyperparameter
The message length (18) under the joint prior distribution (16) depends crucially on the value of the hyperparameter K. For the message to be uniquely decodable, the value of K must be encoded into the assertion component. The complete message length is
(21) 
where I(K) is the codelength for stating the hyperparameter K, obtained from the asymptotic codelength approximation for coding continuous parameters [17, 15]; alternatively, a more accurate codelength could be obtained using the more complex hierarchical message length procedure described in [18, 19]. The MML principle advocates choosing the value of the hyperparameter which minimizes the message length; i.e.,
(22) 
where θ̂(K) is the MML87 estimate that minimizes (21) for a given value of K. As exact minimization of (22) is a difficult nonconvex optimization problem, we use the following approximate procedure. For K sufficiently large, the estimates of (β, α, τ) which minimize the message length are equivalent to the maximum likelihood estimates, as neither the prior distribution nor the Fisher information depends on β within the feasible set. An approximate solution to (22) can be obtained by requiring that the feasible region include the maximum likelihood estimate; this is given by
K̂ = (1/n) β̂_ML' X'X β̂_ML,   (23)

where β̂_ML is the maximum likelihood estimate of β.
3.4 MML estimates
The MML estimates are found by plugging the estimate (23) into (21) and minimizing the resultant message length with respect to (β, α, τ). Dropping terms that do not depend on the parameters, the message length (21) is
(24) 
where
There is no closed-form solution for the values of (β, α, τ) that minimize (21). A standard approach to computing the parameter estimates in a Student-t
regression model is the expectation-maximization (EM) algorithm
[4]; other methods include the expectation-conditional maximization algorithm [20] and bootstrapping approaches [21]. In this paper, we adapt the EM algorithm to find the MML estimates. It is well known that the Student-t distribution can be represented as a scale mixture of normal distributions
[4]. We can write model (7) as:

y_i | u_i ~ N(α + x_i'β, τ²/u_i),   (25)

u_i ~ Gamma(ν/2, ν/2),   (26)

where u = (u_1, …, u_n) are the latent (unobserved) variables. Given y_i, we have the following property [4]:

u_i | y_i ~ Gamma((ν+1)/2, (ν + δ_i)/2),   (27)
where δ_i = e_i²/τ². The latent variables are not directly observable and are replaced by their conditional expectations in the E-step of the EM algorithm. The conditional expectation of u_i is

E[u_i | y_i] = (ν + 1) / (ν + δ_i).   (28)
Conditional on the latent variables u, the message length, up to terms that are independent of (β, α, τ), is given by
(29) 
The M-step in the EM algorithm involves solving
(30)  
(31) 
The MML estimates β̂ and α̂ can be efficiently computed using weighted least squares regression. The resulting objective is log-convex in τ, and the MML estimate of τ may be found by numerical optimization. In summary, the MML estimates are found by iterating the following steps until convergence:

compute the weights ŵ_i using (28);

update (β̂, α̂) by weighted least squares using the weights ŵ_i;

update τ̂ by numerical optimization using the current weights ŵ_i and estimates (β̂, α̂).
We use the maximum likelihood estimates under the assumption of Gaussian errors as the initial values of (β̂, α̂, τ̂).
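The iteration above can be sketched for the simplest case of a single predictor plus an intercept. Note that this toy version uses the plain maximum-likelihood M-step for τ² in place of the MML-adjusted update, so it illustrates only the mechanics of the latent-variable weighting:

```python
import math

def em_student_t(x, y, nu, iters=200):
    """EM for a one-predictor Student-t regression (illustrative sketch)."""
    n = len(y)
    # Initialise with ordinary (Gaussian) least squares.
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    beta = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    alpha = ybar - beta * xbar
    tau2 = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y)) / n
    for _ in range(iters):
        # E-step: conditional expectations of the latent scale variables (28).
        w = [(nu + 1) / (nu + (yi - alpha - beta * xi) ** 2 / tau2)
             for xi, yi in zip(x, y)]
        # M-step: weighted least squares for (alpha, beta), then tau^2.
        sw = sum(w)
        xw = sum(wi * xi for wi, xi in zip(w, x)) / sw
        yw = sum(wi * yi for wi, yi in zip(w, y)) / sw
        beta = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y)) \
               / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
        alpha = yw - beta * xw
        tau2 = sum(wi * (yi - alpha - beta * xi) ** 2
                   for wi, xi, yi in zip(w, x, y)) / n
    return alpha, beta, tau2

x = list(range(10))
# y follows y = 1 + 2x with small perturbations, plus one gross outlier at x = 9.
y = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8, 17.1, 49.0]
alpha, beta, tau2 = em_student_t(x, y, nu=2)
```

On data with a single gross outlier, the weights (28) shrink the outlier's influence and the fitted slope stays close to that of the clean observations, whereas ordinary least squares is pulled strongly towards the outlier.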
3.5 Selection of the degrees of freedom
In order to compute the message length (21), we require a value for the degrees of freedom parameter ν. A common approach to estimating ν is to consider only a discrete set {ν_1, …, ν_m} of candidate values, rather than treating ν as a continuous parameter. We propose to select this set
such that it includes a range of distributions, from those with heavy tails (e.g., the Cauchy distribution when
ν = 1) to those with light tails (e.g., the normal distribution when ν → ∞), where each distribution is equidistant from its neighbours. In this paper, the Jensen–Shannon (JS) divergence [22] is used to measure the distance between two distributions. Let P and Q denote two probability distributions. The Jensen–Shannon divergence between P and Q is given by

JS(P ‖ Q) = (1/2) KL(P ‖ M) + (1/2) KL(Q ‖ M),  where M = (1/2)(P + Q),   (32)
where KL(P ‖ Q) is the Kullback–Leibler (KL) divergence [23] from P to Q, given by

KL(P ‖ Q) = ∫ p(x) log ( p(x) / q(x) ) dx,   (33)
where p and q are the probability density functions of P and Q respectively. Unlike the KL divergence, the JS divergence is a symmetric function of P and Q. Let T_ν denote a standard Student-t probability distribution parameterised by location 0, scale 1, and ν degrees of freedom. We aim to find a candidate set {ν_1, …, ν_m} such that, for all j,

JS(T_{ν_j} ‖ T_{ν_{j+1}}) = c,

where c is some constant. In words, this implies that the JS divergence between any two neighbouring distributions T_{ν_j} and T_{ν_{j+1}} is the same. For example, when , the set that satisfies this condition is .
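The JS divergence between two Student-t densities has no closed form, but it is easily approximated by numerical integration; the candidate set can then be built by root-finding on ν so that successive divergences equal the chosen constant. A minimal sketch (trapezoidal rule on a truncated range; the helper names are ours):

```python
import math

def t_pdf(x, nu):
    """Standard Student-t density with nu degrees of freedom."""
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) \
        / math.sqrt(nu * math.pi)
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)

def js_divergence(nu1, nu2, lo=-40.0, hi=40.0, steps=20000):
    """JS divergence (32) between two standard t densities, approximated
    by the trapezoidal rule on the truncated range [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        p, q = t_pdf(x, nu1), t_pdf(x, nu2)
        m = 0.5 * (p + q)
        term = 0.5 * p * math.log(p / m) + 0.5 * q * math.log(q / m)
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoid end-point weights
        total += weight * term * h
    return total
```

The JS divergence is bounded above by log 2 nits, so a spacing constant c can sensibly be chosen as a small fraction of that bound.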
3.6 Coding the model index
When deriving the message length (18) in Section 3.2, we made the assumption that the model structure γ is known. For the message to be uniquely decodable, the choice of γ must be encoded into the assertion component; i.e.,

I(y, θ, γ) = I(γ) + I(y, θ | γ),   (34)
where I(γ) = -log p(γ) is the length of the assertion that encodes the model structure γ. Let q denote the total number of columns (explanatory variables) of X. We use the following prior distribution over the model structure:

p(γ) = p(k) / M_k,   (35)

where p(k) is a prior distribution over the number of explanatory variables k (out of q) in the model and

M_k = |{γ' ∈ Γ : |γ'| = k}|   (36)

is the total number of models in Γ with k explanatory variables.
In the case of nested model selection (e.g., selecting the order of a polynomial), the set Γ consists of nested candidate models, so that there is exactly one model with k explanatory variables for each k. When no prior information regarding the size of the model is available, we choose the uniform prior p(k) = 1/(q+1), k = 0, 1, …, q, and the assertion length is

I(γ) = log(q + 1).   (37)
In the more general case of all-subsets regression, the set Γ is the set of all 2^q subsets of the q explanatory variables, including the null model. In this setting, the number of models with k explanatory variables is the binomial coefficient C(q, k), and, assuming that no prior information is available regarding the size of the model, we again choose the uniform prior p(k) = 1/(q+1). This yields an assertion of length

I(γ) = log(q + 1) + log C(q, k).   (38)
As discussed in [24], this choice of prior distribution implies a marginal prior probability of inclusion of 1/2 for each explanatory variable. In addition, this prior distribution simultaneously accounts for multiple testing by automatically penalising model sizes for which the total number of models is large.
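The assertion lengths (37) and (38) are one-liners, and a useful sanity check is that the implied prior probabilities sum to one over the model space:

```python
import math

def structure_codelength(q, k, nested=False):
    """Codelength (nits) of the assertion for a model structure with k of q
    predictors, under the uniform prior on model size.

    Nested models: one model per size, so the length is log(q+1) as in (37).
    All-subsets: log(q+1) + log C(q, k) as in (38).
    """
    length = math.log(q + 1)
    if not nested:
        length += math.log(math.comb(q, k))
    return length
```

Summing exp(-length) over all C(q, k) subsets of each size k gives Σ_k 1/(q+1) = 1, confirming that the codelengths correspond to a proper prior.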
4 Discussion
4.1 Behaviour of the dimensionality penalty
Using the estimate (23) of the hyperparameter K, we can write

K̂ = τ̂² · SNR,   (39)

where τ̂ is the MML estimate of τ and SNR = β̂' X'X β̂ / (n τ̂²) is the empirical signal-to-noise ratio of the fitted Student-t regression model. For large SNR, the codelength of the assertion of the regression parameters behaves as

I(β) ≈ (p/2) log SNR + c(n, p),   (40)

where c(n, p) depends only on n and p; for fixed p and n, the assertion length therefore depends only on the empirical SNR. This implies that the better the model fits the data, the larger the penalty for including additional explanatory variables. In contrast, when the empirical SNR is small (i.e., the model does not fit the data well), the penalty for including an additional explanatory variable is small. This property of the MML criterion is attractive since the SNR is a fundamental characteristic of the fitted linear model and is invariant to model parameterization and linear transformations of the data. The MML penalty therefore automatically adapts to the amount of detectable signal in the data. The MML regression criterion is not the only model selection technique with this property; the MDL denoising criterion [25, p. 117] (see Section 4.3) and the recent Bayesian exponentially embedded criterion [26] both exhibit similar behaviour.
4.2 Gaussian linear regression
We can specialize the MML criterion for Student-t regression models to the important case of Gaussian linear regression by taking the limit of the message length (34) as ν → ∞; the resulting codelength is given by
(41)  
where ψ(·) is the digamma function. The hyperparameter K is estimated following the procedure in Section 3.3, and the codelength for the model structure I(γ) is discussed in Section 3.6.
In comparison to the original MML criterion for Gaussian linear regression [6], the Gaussian specialisation of the proposed Student-t criterion (41) differs primarily in the selection of the hyperparameter K (see Section 3.3). The original MML criterion uses a value of K which guarantees that the feasible set includes the least squares estimates. However, the estimate K̂ proposed in this paper results in a feasible set with a strictly smaller volume, and therefore a shorter codelength. As discussed in Section 2.1, the proposed K̂ necessitates the use of the small-sample MML approximation to avoid problems with the MML87 approximation.
4.3 Comparison with MDL denoising
An important consequence of using the hyperellipsoidal prior distribution for the regression parameters (13) is that the MML criterion for Gaussian linear regression (41) is very similar to the well-known normalized maximum likelihood (NML) regression criterion [7] when the signal-to-noise ratio is large. In the case of Gaussian linear regression, for large empirical SNR, the MML criterion (41), ignoring the codelength for the model structure I(γ), can be written as
(42)  
where
and c is a constant that does not depend on the sample size n or the dimensionality p. This criterion has a similar form to the MDL denoising criterion for the Gaussian linear regression model [25, p. 117], which is given by
(43)  
where c' is a constant independent of n and p. Using Stirling's approximation, (43) can be written as
(44)  
where c'' is another constant that does not depend on n and p. Similarly, using Stirling's approximation, the MML criterion (42) can be written as
(45)  
We can observe that the MML criterion (45) is very similar to the MDL criterion (44). Importantly, the fitted signal-to-noise ratio of the model appears in the MDL penalty term. Supposing that n is large, the difference between the MML and MDL criteria, for the case of Gaussian noise, is given by
(46) 
When the fitted SNR of the model is small, the penalty term in the MDL criterion is also small, and potentially negative, which can lead to substantial overfitting [27]. In contrast, the proposed MML criterion, which makes use of the small-sample approximation (6), is more robust and does not result in negative codelengths for small SNR (see Sections 2.1 and 3.2). Furthermore, the MML criterion can easily be extended to other distributions, for example, the Student-t distribution discussed in this paper; such extensions are very difficult in the case of normalized maximum likelihood, due to the dependence of the criterion on closed-form expressions for the maximum likelihood estimates.
4.4 Model selection consistency
Recall that, for large empirical SNR, the codelength of the assertion for the regression parameters is given by
(47) 
We make the assumption that the empirical covariance matrix of the predictors satisfies

(1/n) X'X → Q as n → ∞,   (48)
where Q is a positive definite matrix. As the sample size n → ∞, the estimates τ̂ and β̂ converge in probability to their true values and, using assumption (48), it is straightforward to show that
For large sample size n and a fixed number of predictors p, the MML criterion is equivalent to the well-known Bayesian information criterion [15] and is therefore model selection consistent, in the sense that the probability of selecting the true model tends to one as the sample size tends to infinity. However, in contrast to BIC, the proposed MML criterion is also consistent when the sample size is fixed and the noise variance τ² → 0; i.e., it possesses high-SNR consistency [28]. This is easily verified by noting that the criterion (47) satisfies the conditions of the corresponding theorem in [8].
4.5 Behaviour under different SNRs
The MML estimate of τ² exhibits interesting behaviour that depends on the empirical signal strength. Recall from Section 3.4 that the MML estimate of τ is obtained by minimising (24) with respect to (β, α, τ). For large signal strength, (24) reduces to
(49) 
and the MML estimate of τ², conditional on β̂ and α̂, is given by
(50) 
In this setting, the MML estimate is equivalent to the maximum likelihood estimate of τ², adjusted to take into account that we are fitting p predictors (and the intercept parameter) to the data. In contrast, when the signal strength is close to zero, (24) simplifies and the MML estimate reduces to
(51) 
which is equivalent to the maximum likelihood estimate of τ², adjusted only for the intercept parameter. The MML estimate thus depends on the signal strength and automatically varies the degrees of freedom used to adjust the maximum likelihood estimate: when there is little signal, the predictors are effectively not used to fit the data and there is no need to adjust the maximum likelihood estimate for them; when there is strong signal, the MML estimate appropriately adjusts for all degrees of freedom, with a smooth transition between these two regimes.
5 Results
5.1 Simulation
To evaluate the performance of the proposed MML criterion, we performed several simulation studies. We compared our new criterion against the Bayesian information criterion (BIC) [15] and the Akaike information criterion with a correction for finite sample sizes (AIC) [29], given by
where the negative log-likelihood of the Student-t regression model is given by (9), evaluated at the maximum likelihood estimates, and the penalty depends on the number of free parameters.
The simulation was performed as follows. We generated training samples from a Student-t regression model with varying degrees of freedom ν. Model selection was performed using the aforementioned model selection criteria: (i) MML, (ii) BIC, and (iii) AIC. The degrees of freedom were estimated from a discrete set constructed as in Section 3.5; this set includes a range of distributions, from those with very heavy tails through to the Gaussian distribution. For each experiment, we computed the lasso regression path
[30]. For each model structure in the path, the model coefficients were estimated using both MML and maximum likelihood. In the case of the MML criterion, the model structure was encoded using the nested prior specification (see Section 3.6). Each of the model selection criteria was then used to nominate a best model (i.e., the set of predictors and the degrees of freedom that yielded the lowest criterion score) from this path.

For each experiment, the design matrices for the training and testing samples were generated from a multivariate normal distribution with mean 0 and covariance matrix Σ. The number of nonzero entries of β was varied to produce a dense model, a balanced model and a sparse model. For each degree of sparsity, three different signal strengths were tested: strong, moderate, and weak. For each choice of sparsity and signal strength, the degrees of freedom of the generating model were selected from the set {1, 5, ∞} (see Table 1). Each simulation experiment (combination of sparsity, signal strength and true degrees of freedom) was repeated a large number of times.
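One replication of a design of this kind can be sketched as follows. We assume, for illustration, an AR(1)-type correlation ρ^|i-j| between the columns of the design matrix and draw the Student-t errors via the usual normal/chi-squared representation; all parameter values below are placeholders rather than those used in the paper.

```python
import math
import random

def simulate(n, p, rho, beta, nu, tau=1.0, seed=1):
    """Generate one data set: rows of X have an AR(1)-type rho^|i-j|
    correlation structure; errors are Student-t with nu degrees of freedom."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0, 1)]
        for _ in range(p - 1):
            # Stationary AR(1) step reproduces corr(x_i, x_j) = rho^|i-j|.
            row.append(rho * row[-1] + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1))
        # Student-t error: standard normal divided by sqrt(chi^2_nu / nu),
        # with chi^2_nu drawn as Gamma(nu/2, scale 2).
        chi2 = 2.0 * rng.gammavariate(nu / 2.0, 1.0)
        eps = tau * rng.gauss(0, 1) / math.sqrt(chi2 / nu)
        X.append(row)
        y.append(sum(b * xj for b, xj in zip(beta, row)) + eps)
    return X, y

X, y = simulate(n=100, p=4, rho=0.5, beta=[1.0, 0.0, 2.0, 0.0], nu=5)
```

The sparsity pattern of `beta` and the degrees of freedom `nu` are the quantities varied across the experimental cells of Table 1.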
True model  KL divergence  Prediction errors  False positive  False negative  

d.f.  Sparsity  Signal  MML  BIC  AIC  MML  BIC  AIC  MML  BIC  AIC  MML  BIC  AIC 
t(1)  Dense  Weak  0.367  0.857  0.713  10.80  10.86  10.90  0  0  0  4.74  2.90  5.52 
Moderate  0.458  0.906  0.905  10.72  10.56  10.80  0  0  0  3.09  0.80  2.70  
Strong  0.385  0.922  0.932  13.15  13.16  13.20  0  0  0  0.63  0.30  0.42  
Balanced  Weak  0.292  0.586  0.435  21.65  21.74  21.72  3.52  3.13  2.04  2.81  3.47  4.51  
Moderate  0.381  0.799  0.660  14.38  14.44  14.48  3.40  4.65  3.17  1.84  1.22  2.08  
Strong  0.320  0.732  0.659  12.84  12.82  12.88  3.72  5.14  4.07  0.51  0.15  0.37  
Sparse  Weak  0.208  0.301  0.248  11.97  11.97  11.94  4.86  2.18  1.92  1.20  2.02  1.98  
Moderate  0.272  0.485  0.383  9.619  9.676  9.678  5.75  4.56  3.24  0.54  0.92  1.21  
Strong  0.207  0.430  0.358  8.379  8.435  8.419  4.52  5.55  4.64  0.15  0.06  0.16  
t(5)  Dense  Weak  0.323  1.105  1.133  1.216  1.347  1.387  0  0  0  4.28  2.74  4.27 
Moderate  0.287  1.040  1.057  1.168  1.306  1.319  0  0  0  0.57  0.26  0.47  
Strong  0.271  1.085  1.085  1.181  1.317  1.317  0  0  0  0  0  0.03  
Balanced  Weak  0.282  0.492  0.405  1.186  1.231  1.224  1.17  1.59  1.08  1.85  1.87  2.33  
Moderate  0.241  0.562  0.419  1.130  1.178  1.152  1.01  2.22  1.33  0.20  0.09  0.19  
Strong  0.172  0.477  0.327  1.075  1.150  1.108  0.61  1.72  0.92  0  0  0  
Sparse  Weak  0.196  0.254  0.229  1.076  1.088  1.078  1.83  1.19  1.17  0.53  0.90  0.77  
Moderate  0.131  0.233  0.176  1.030  1.049  1.037  0.69  1.24  0.89  0.01  0.01  0.01  
Strong  0.094  0.136  0.142  1.003  1.017  1.022  0.17  0.54  0.61  0  0  0  
t()  Dense  Weak  0.443  1.060  0.984  1.106  1.132  1.195  0  0  0  2.95  1.45  3.07 
Moderate  0.266  1.086  1.146  0.978  1.113  1.146  0  0  0  0.09  0.09  0.23  
Strong  0.295  1.080  1.080  0.987  1.107  1.107  0  0  0  0  0  0  
Balanced  Weak  0.290  0.491  0.414  0.980  1.010  1.005  1.26  1.69  1.14  1.38  1.26  1.6  
Moderate  0.228  0.506  0.390  0.927  0.979  0.950  1.07  1.87  1.36  0.08  0.03  0.05  
Strong  0.152  0.444  0.285  0.890  0.956  0.917  0.47  1.63  0.75  0  0  0  
Sparse  Weak  0.172  0.193  0.179  0.901  0.903  0.900  1.62  1.10  1.17  0.39  0.58  0.50  
Moderate  0.114  0.161  0.163  0.860  0.870  0.872  0.70  0.97  1.05  0  0  0  
Strong  0.097  0.164  0.149  0.848  0.867  0.866  0.32  0.87  0.81  0  0  0 
INDUS  CHAS  AGE  ZN  CRIM  NOX  RAD  TAX  BLACK  DIS  LSTAT  PTRATIO  RM  
Marginal Probability  0.308  0.804  0.900  0.990  0.998  0.998  0.999  0.999  1.000  1.000  1.000  1.000  1.000 
The performance of each criterion was measured by: (i) the Kullback–Leibler (KL) divergence [23] from the selected model to the true generating model, (ii) the mean absolute prediction error, (iii) the number of predictors erroneously included in the model (false positives), and (iv) the number of associated predictors erroneously excluded from the model (false negatives). The results are presented in Table 1.
In terms of KL divergence, MML performed substantially better than BIC and AIC in all tests, and obtained particularly good performance when the true generating model was dense. In general, MML also achieved the best performance in terms of prediction error, particularly when the signal strength was weak. In contrast, it is not clear which criterion has the overall best performance in terms of the number of false positives and false negatives. For all three criteria, the number of false positives generally decreases as the sparsity of the model increases, except when the true degrees of freedom is ν = 1. In this case, MML tends to overfit more than the other two criteria when the signal strength is weak or moderate. As discussed in Section 4.1, when the signal is weak, the MML criterion assigns a smaller penalty to each predictor in the model, and "false" predictors are therefore more likely to be included in the best model. Although this leads to some overfitting, it has little negative impact on prediction as the signal being fitted is weak. In contrast, when the signal is strong, the MML criterion generally obtained the smallest number of false positives. The universally strong predictive performance of MML highlights the difference in aims between selecting the true model and making good predictions: even when BIC and AIC achieved fewer false positives and negatives, they usually performed worse in terms of both prediction metrics.
5.2 Real data
We applied our proposed MML model selection criterion to the Boston housing data. This dataset reports the prices of owner-occupied homes in the Boston area. It contains samples and predictors. The response variable is the median value of owner-occupied homes in s. The degrees of freedom was estimated from the set . We performed an exhaustive search over the space of model structures (i.e., all-subset selection) and computed the message length for each model. The model with the shortest message length ( nits) included predictors with degrees of freedom . In contrast, the best Gaussian model had a message length of nits, which is significantly longer; this strongly suggests that a heavy-tailed error model is appropriate for this data.

A particular advantage of MML is that codelengths can be interpreted as posterior probabilities, and this can be used to construct a posterior distribution over the space of model structures. Let $\Gamma$ denote the model space and $I(\gamma)$ the minimum message length achievable by model $\gamma \in \Gamma$. The posterior probability assigned to model $\gamma$ by MML is given by

$$ p(\gamma \mid \mathbf{y}) = \frac{\exp\{-I(\gamma)\}}{\sum_{\gamma' \in \Gamma} \exp\{-I(\gamma')\}}. \qquad (52) $$
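Computationally, (52) is a softmax over negative codelengths, and subtracting the minimum message length before exponentiating avoids underflow when the lengths are large. A minimal Python sketch (the function name is ours, not from the paper):

```python
import math

def model_posteriors(message_lengths):
    """Map message lengths I(gamma) (in nits) to posterior probabilities
    proportional to exp(-I(gamma)).  Shifting by the minimum length first
    keeps the exponentials in a numerically safe range."""
    shortest = min(message_lengths)
    weights = [math.exp(shortest - l) for l in message_lengths]
    total = sum(weights)
    return [w / total for w in weights]
```

By construction, the model with the shortest message length receives the largest posterior probability.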
Using these posterior probabilities, we computed the marginal probability of inclusion for each of the predictors (see Table 2). We observe that out of the predictors have marginal probability greater than . Only the predictor recording the proportion of non-retail business acres per town (indus), which had a marginal probability of , was deemed unassociated.
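The marginal probability of inclusion of a predictor is the total posterior mass of all model structures that contain it. A small illustrative sketch (hypothetical names; the posterior is keyed by tuples of included predictor indices):

```python
def marginal_inclusion(posterior, n_predictors):
    """Marginal probability of inclusion for each predictor, given a dict
    mapping model structures (tuples of included predictor indices) to
    their posterior probabilities."""
    return [sum(prob for gamma, prob in posterior.items() if j in gamma)
            for j in range(n_predictors)]
```

For example, with posterior {(): 0.1, (0,): 0.2, (1,): 0.3, (0, 1): 0.4}, predictor 0 has marginal probability 0.2 + 0.4 = 0.6 and predictor 1 has 0.3 + 0.4 = 0.7.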
To test the predictive performance of the proposed MML criterion, we performed another set of simulations. The Boston housing data was divided randomly into training and testing sets of equal size. We performed model selection using the training samples and tested the predictive performance of the MML, BIC and AIC criteria on the testing samples. All three criteria selected for of the tests and never selected the Gaussian model. The prediction performance is summarised in Table 3. The models selected by MML attained, on average, a smaller negative log-likelihood on the testing data, while AIC marginally outperformed MML in terms of absolute prediction error.
Table 3: Prediction performance on the Boston housing testing sets.

                      t(1)      t(1.9)    t(5)      t()
MML    NLL            2.9471    2.8892    2.9080    3.0494
       Pred. Error    3.3736    3.3572    3.3449    3.4869
BIC    NLL            2.9696    2.9096    2.9329    3.0723
       Pred. Error    3.4261    3.4082    3.4142    3.5571
AICc   NLL            2.9534    2.8894    2.9096    3.0484
       Pred. Error    3.3697    3.3388    3.3382    3.4777
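The negative log-likelihood (NLL) figures in Table 3 are averages of the Student-t log-density over the test targets, evaluated at the fitted means. A minimal sketch of that computation using only the standard library (function and argument names are ours, not from the paper):

```python
import math

def student_t_logpdf(y, mu, scale, nu):
    """Log-density of a Student-t distribution with location mu,
    scale > 0 and nu > 0 degrees of freedom, evaluated at y."""
    z = (y - mu) / scale
    return (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi) - math.log(scale)
            - (nu + 1.0) / 2.0 * math.log1p(z * z / nu))

def mean_nll(y_test, mu_test, scale, nu):
    """Average negative log-likelihood over a test set (smaller is better),
    where mu_test holds the fitted means for each test point."""
    return -sum(student_t_logpdf(y, m, scale, nu)
                for y, m in zip(y_test, mu_test)) / len(y_test)
```

As a sanity check, with nu = 1 (Cauchy), unit scale and mu = y, the per-observation NLL is log(pi).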
References

[1] C. Fernández and M. F. Steel, "On Bayesian modeling of fat tails and skewness," Journal of the American Statistical Association, vol. 93, no. 441, pp. 359–371, 1998.
[2] E. Jacquier, N. G. Polson, and P. E. Rossi, "Bayesian analysis of stochastic volatility models," Journal of Business & Economic Statistics, vol. 20, no. 1, pp. 69–87, 2002.
[3] M. West, "Outlier models and prior distributions in Bayesian linear regression," Journal of the Royal Statistical Society. Series B (Methodological), pp. 431–439, 1984.
[4] K. L. Lange, R. J. Little, and J. M. Taylor, "Robust statistical modeling using the t distribution," Journal of the American Statistical Association, vol. 84, no. 408, pp. 881–896, 1989.
[5] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length. Springer Science & Business Media, 2005.
[6] D. F. Schmidt and E. Makalic, "MML invariant linear regression," in Australasian Joint Conference on Artificial Intelligence. Springer, 2009, pp. 312–321.
[7] J. Rissanen, "MDL denoising," IEEE Transactions on Information Theory, vol. 46, no. 7, pp. 2537–2543, 2000.
[8] D. F. Schmidt and E. Makalic, "The consistency of MDL for linear regression models with increasing signal-to-noise ratio," IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1508–1510, 2012.
[9] C. S. Wallace and D. M. Boulton, "An information measure for classification," The Computer Journal, vol. 11, no. 2, pp. 185–194, 1968.
[10] C. S. Wallace and D. M. Boulton, "An invariant Bayes method for point estimation," Classification Society Bulletin, vol. 3, no. 3, pp. 11–34, 1975.
[11] C. S. Wallace and P. R. Freeman, "Estimation and inference by compact coding," Journal of the Royal Statistical Society. Series B (Methodological), pp. 240–265, 1987.
[12] A. R. Barron, "Logically smooth density estimation," Ph.D. dissertation, 1985.
[13] G. Farr and C. S. Wallace, "The complexity of strict minimum message length inference," The Computer Journal, vol. 45, no. 3, pp. 285–292, 2002.
[14] P. D. Grünwald, I. J. Myung, and M. A. Pitt, Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.
[15] G. Schwarz, "Estimating the dimension of a model," The Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[16] T. C. Fonseca, M. A. Ferreira, and H. S. Migon, "Objective Bayesian analysis for the Student-t regression model," Biometrika, vol. 95, no. 2, pp. 325–333, 2008.
[17] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[18] E. Makalic and D. F. Schmidt, "Minimum message length shrinkage estimation," Statistics & Probability Letters, vol. 79, no. 9, pp. 1155–1161, 2009.
[19] D. F. Schmidt, E. Makalic, and J. L. Hopper, "Approximating message lengths of hierarchical Bayesian models using posterior sampling," in Australasian Joint Conference on Artificial Intelligence. Springer, 2016, pp. 482–494.
[20] C. Liu and D. B. Rubin, "ML estimation of the t distribution using EM and its extensions, ECM and ECME," Statistica Sinica, pp. 19–39, 1995.
[21] D. M. Pianto, "A bootstrap estimator for the Student-t regression model," 2010.
[22] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
[23] S. Kullback and R. A. Leibler, "On information and sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
[24] J. G. Scott and J. O. Berger, "Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem," The Annals of Statistics, vol. 38, no. 5, pp. 2587–2619, 2010.
[25] J. Rissanen, Information and Complexity in Statistical Modeling. Springer Science & Business Media, 2007.
[26] Z. Zhu and S. Kay, "On Bayesian exponentially embedded family for model order selection," IEEE Transactions on Signal Processing, vol. 66, no. 4, pp. 933–943, 2018.
[27] T. Roos, P. Myllymäki, and H. Tirri, "On the behavior of MDL denoising," in AISTATS, 2005.
[28] Q. Ding and S. Kay, "Inconsistency of the MDL: On the performance of model order selection criteria with increasing signal-to-noise ratio," IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 1959–1969, 2011.
[29] C. M. Hurvich and C.-L. Tsai, "A corrected Akaike information criterion for vector autoregressive model selection," Journal of Time Series Analysis, vol. 14, no. 3, pp. 271–279, 1993.
[30] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society (Series B), vol. 58, pp. 267–288, 1996.