I Introduction
In a practical application of linear regression, in order to achieve the best prediction performance on new data, one must find a model with appropriate complexity, governed by its complexity parameters. One popular method to tune complexity parameters is cross validation [1], which uses a proportion of the available data for training while making use of all the data to assess performance. However, the number of training runs that must be performed is proportional to the number of data partitions, which is problematic if training is computationally expensive or the number of partitions must be set to a large value because data are scarce. Furthermore, if a model has several complexity parameters, then in the worst case searching over combinations of the complexity parameters requires a number of training runs that is exponential in the number of parameters.
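The cost argument above can be made concrete with a minimal sketch of S-fold cross validation for tuning a single ridge regularization parameter (the function name, candidate grid, and solver below are our own illustrative choices, not from the paper): every candidate value costs S training runs, so a grid over several complexity parameters multiplies quickly.

```python
import numpy as np

def s_fold_cv_ridge(X, y, lambdas, S=5):
    # S-fold cross validation: each candidate lambda costs S training runs.
    N = X.shape[0]
    folds = np.array_split(np.random.permutation(N), S)
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        err = 0.0
        for k in range(S):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(S) if j != k])
            # ridge regression solution on the training folds
            A = X[tr].T @ X[tr] + lam * np.eye(X.shape[1])
            w = np.linalg.solve(A, X[tr].T @ y[tr])
            err += np.mean((X[val] @ w - y[val]) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

With a grid of G candidate values, this performs S * G training runs, which is the overhead that the Bayesian treatment below avoids.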
As an alternative to cross validation, one can turn to a Bayesian treatment of linear regression, which introduces prior probability distributions over the weight parameters and the noise. After marginalizing out the weight parameters, the hyperparameters of the prior distributions can be estimated by maximizing the probability of the observations. Thus in Bayesian linear regression, only training data are needed to tune the hyperparameters, which correspond to complexity parameters. However, although a principled way to estimate hyperparameters is provided, the resulting Bayesian model is often difficult to solve, and in many cases stochastic or deterministic approximation techniques are required. A rare exception is Bayesian linear regression with independent Gaussian prior distributions of weight parameters and noise (abbreviated "BLRG"), in which the posterior distribution can be expressed in closed form and the model can then be inferred exactly by the expectation maximization (EM) algorithm.
An interesting question is whether we can find more general models that still keep a closed-form posterior distribution and therefore can be inferred exactly. In this paper, we give a positive answer. Observing the form of the distributions, the Gaussian distribution is a limiting case of the Student-t distribution as the degrees of freedom tend to infinity. In addition, from the perspective of nonextensive information theory [2], the Student-t and Gaussian distributions are both maximum Tsallis entropy distributions [3], of which the Gaussian distribution is the special case for entropy index $q = 1$. Based on these facts and the results from nonextensive information theory, we unify the inference process for the assumptions of Gaussian and Student-t distributions and propose a Bayesian linear regression model with Student-t assumptions ("BLRS"). The main contributions of this paper are:
By introducing relevant noise whose variance is linear in the norm of the weight parameters, we generalize the concept of conjugate prior to the Student-t distribution, i.e., under the relevant-noise setting, if the prior distribution of the weight parameters is a Student-t distribution, then the posterior distribution is also a Student-t distribution.

We prove that the maximum likelihood solution of the variance hyperparameters is independent of the degrees of freedom, and thus BLRS is equivalent to BLRG, which may be remarkable.

By applying the Tsallis divergence instead of the Kullback-Leibler (KL) divergence, EM is generalized to q-EM to make exact inference under the Student-t setting. Closed-form expressions are obtained in each iteration, which are nearly identical to those of the EM algorithm for BLRG.

By experiments on the task of predicting online news popularity, we show that BLRS and BLRG converge to the same result. Meanwhile, BLRS with finite constant degrees of freedom can converge faster to this common result on standard datasets. Therefore, the degrees of freedom can be seen as an accelerating factor for the iterations.
In addition, when preparing this paper, we found that, in terms of its general form, the EM variant in [4] is equivalent to the q-EM algorithm we propose in this paper. But they still differ. Firstly, q-EM is mainly designed to solve the corresponding BLRS model, while the algorithm of [4] attempts to improve EM without changing the model. Secondly, q-EM is derived from nonextensive information theory, while the algorithm of [4] is derived from information geometry. Thirdly, the q-EM algorithm is one part of our attempt to unify the treatment of the Gaussian and Student-t distributions, while the algorithm of [4] and its extensions are of independent interest.
II Notations and concepts from nonextensive information theory
Nonextensive information theory was proposed by Tsallis [5] and aims to model phenomena such as long-range interactions and multifractals [2]. It has recently been applied in machine learning [6], [7]. In this paper, it is our main motivation for generalizing BLRG. In this section, we briefly review some notation and concepts.

For convenience [2], one can define the following $q$-exponential and $q$-logarithm,
$$\exp_q(x) = [1 + (1-q)x]_+^{\frac{1}{1-q}}, \qquad \ln_q(x) = \frac{x^{1-q} - 1}{1-q},$$
where $[y]_+$ stands for $\max(y, 0)$ and $q \neq 1$. By definition, one has
$$\lim_{q \to 1} \exp_q(x) = \exp(x), \qquad \lim_{q \to 1} \ln_q(x) = \ln(x).$$
The $q$-exponential and $q$-logarithm have the following properties,
(1)  $\exp_q(x)\,\exp_q(y) = \exp_q\!\big(x + y + (1-q)xy\big)$,
(2)  $\ln_q(x/y) = y^{\,q-1}\big(\ln_q(x) - \ln_q(y)\big)$,
where (1) is used to formulate a generalized conjugate prior for the Student-t distribution and (2) is used to generalize the EM algorithm.
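These two identities are easy to check numerically. The sketch below (function names are our own) implements the standard Tsallis $q$-exponential and $q$-logarithm and verifies (1) and (2) at a sample point:

```python
import math

def exp_q(x, q):
    # Tsallis q-exponential: [1 + (1-q) x]_+^(1/(1-q)); exp(x) at q = 1
    if q == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return base ** (1.0 / (1.0 - q)) if base > 0 else 0.0

def ln_q(x, q):
    # Tsallis q-logarithm: (x^(1-q) - 1) / (1 - q); ln(x) at q = 1
    if q == 1.0:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

q, x, y = 1.5, 0.3, 0.7
# Property (1): exp_q(x) exp_q(y) = exp_q(x + y + (1-q) x y)
lhs1 = exp_q(x, q) * exp_q(y, q)
rhs1 = exp_q(x + y + (1.0 - q) * x * y, q)
# Property (2): ln_q(x / y) = y^(q-1) (ln_q(x) - ln_q(y))
lhs2 = ln_q(x / y, q)
rhs2 = y ** (q - 1.0) * (ln_q(x, q) - ln_q(y, q))
```

The extra cross term $(1-q)xy$ in (1) is exactly what vanishes in the limit $q \to 1$, recovering the ordinary product rule of exponentials.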
For normalized probability distributions $p$ and $p'$, from nonextensive information theory, the Tsallis divergence [8] is defined as
$$D_q(p \,\|\, p') = \frac{1}{1-q}\Big(1 - \int p(x)^q\, p'(x)^{1-q}\, dx\Big) = -\int p(x)\, \ln_q\frac{p'(x)}{p(x)}\, dx.$$
In the limit $q \to 1$, $D_q(p \,\|\, p')$ becomes $\int p(x) \ln\frac{p(x)}{p'(x)}\, dx$, which is the definition of the KL divergence.
For $q > 0$, $D_q(p \,\|\, p')$ is a special case of the f-divergence (see [8] and references therein), which has the following properties:

Convexity: $D_q(p \,\|\, p')$ is convex with respect to both $p$ and $p'$;

Strict positivity: $D_q(p \,\|\, p') \geq 0$, and $D_q(p \,\|\, p') = 0$ if and only if $p = p'$.
Because of these two useful properties, the value of $D_q(p \,\|\, p')$ can be used to measure the similarity between $p$ and $p'$. In practice, one can make $p$ as close to $p'$ as possible by minimizing $D_q(p \,\|\, p')$ with respect to $p$. In the q-EM algorithm, the Tsallis divergence is used to measure similarity and to alleviate the complexity of the mathematical expressions.
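For discrete distributions the divergence and its $q \to 1$ limit can be sketched as follows (a minimal implementation in our own notation):

```python
import numpy as np

def tsallis_div(p, pp, q):
    # Tsallis divergence D_q(p || p') for discrete distributions;
    # reduces to the KL divergence at q = 1.
    p, pp = np.asarray(p, float), np.asarray(pp, float)
    if q == 1.0:
        return float(np.sum(p * np.log(p / pp)))
    return float((1.0 - np.sum(p**q * pp**(1.0 - q))) / (1.0 - q))
```

A quick check of the stated properties: the value is nonnegative, vanishes only when the two distributions coincide, and varies continuously through $q = 1$, where it matches the KL divergence.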
III Bayesian linear regression with Gaussian assumptions
Given a set of $N$ pairs $\{(\mathbf{x}_n, t_n)\}_{n=1}^N$, let $\boldsymbol{\phi}(\cdot)$ be a group of fixed basis functions. Consider the linear model
$$t_n = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) + \epsilon_n,$$
where $\mathbf{w}$ is the parameter we need to estimate and $\epsilon_n$ is an additive noise. In Bayesian linear regression with Gaussian assumptions (BLRG), $\epsilon_n$ is assumed to be independent, zero-mean Gaussian noise,
(3) $p(\epsilon_n \mid \beta) = \mathcal{N}(\epsilon_n \mid 0, \beta^{-1}),$
where $\beta$ is the inverse variance. Meanwhile, $\mathbf{w}$ is assumed to be an independent zero-mean Gaussian distributed random variable, given by
$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}),$$
where $\alpha$ is the inverse variance of $\mathbf{w}$. Then the likelihood is
$$p(\mathbf{t} \mid \mathbf{w}, \beta) = \prod_{n=1}^N \mathcal{N}\big(t_n \mid \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}\big).$$
In this setting, by integrating out $\mathbf{w}$ and applying Bayes' theorem, one has
$$p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \mathbf{S}),$$
where
$$\mathbf{m} = \beta \mathbf{S} \boldsymbol{\Phi}^\top \mathbf{t}, \qquad \mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi}.$$
For fixed $\alpha, \beta$, the posterior mean $\mathbf{m}$ is equivalent to the solution of ridge regression [9] with regularization parameter $\alpha/\beta$. The hyperparameters $\alpha, \beta$ can be optimized by the maximum likelihood principle, i.e., by maximizing the evidence $p(\mathbf{t} \mid \alpha, \beta)$ with respect to $\alpha, \beta$. Gradient-based optimization methods such as the conjugate gradient method or quasi-Newton methods with an active-set strategy can be used to solve this problem under the nonnegativity constraints on $\alpha, \beta$. However, there exists an elegant and powerful algorithm, the expectation maximization (EM) algorithm, to address this problem. [10]
concluded that the EM algorithm has the advantages of reliable global convergence, low cost per iteration, economy of storage, and ease of programming. Consider a general joint distribution $p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})$, where $\mathbf{X}$ denotes the observations, $\mathbf{Z}$ the hidden variables, and $\boldsymbol{\theta}$ the parameters to be optimized. In order to optimize the evidence distribution $p(\mathbf{X} \mid \boldsymbol{\theta})$, the general EM algorithm is executed in Alg. 1.
Choose an initial setting for the parameters $\boldsymbol{\theta}^{\mathrm{old}}$;

E Step: Evaluate $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\mathrm{old}})$;

M Step: Evaluate $\boldsymbol{\theta}^{\mathrm{new}}$, given by
$$\boldsymbol{\theta}^{\mathrm{new}} = \arg\max_{\boldsymbol{\theta}}\ \mathbb{E}_{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\mathrm{old}})}\big[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})\big];$$

Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let $\boldsymbol{\theta}^{\mathrm{old}} \leftarrow \boldsymbol{\theta}^{\mathrm{new}}$ and return to step 2.
The concrete process of EM for BLRG is a special case of the q-EM iteration for BLRS and will be discussed in Section VI.
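For concreteness, the EM hyperparameter updates for BLRG can be sketched as follows. This is a minimal NumPy sketch in our own notation, following the standard textbook EM updates for evidence maximization (treating the weights as the hidden variables); it is a reference point, not the paper's exact algorithm:

```python
import numpy as np

def em_blrg(Phi, t, alpha=1.0, beta=1.0, iters=200):
    # EM for BLRG: alternate the Gaussian posterior over w (E step)
    # with closed-form updates of the precisions alpha, beta (M step).
    N, M = Phi.shape
    PhiT_Phi = Phi.T @ Phi
    for _ in range(iters):
        # E step: posterior over w is N(m, S)
        S = np.linalg.inv(alpha * np.eye(M) + beta * PhiT_Phi)
        m = beta * S @ Phi.T @ t
        # M step: maximize the expected complete-data log likelihood
        alpha = M / (m @ m + np.trace(S))
        resid = t - Phi @ m
        beta = N / (resid @ resid + np.trace(S @ PhiT_Phi))
    return alpha, beta, m, S
```

Each iteration costs one matrix inversion of size M; the reformulation in Section VI-B shows how to avoid repeating this cost.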
IV Generalized conjugate prior for Student-t distribution
A conjugate prior is an important concept in Bayesian inference. If the posterior distribution is in the same family as the prior probability distribution, the prior and the posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. The conjugate prior is often thought of as a particular characteristic of the exponential family [11]. Mathematically, this is because the Bayesian update is multiplicative: if the prior distribution or the likelihood does not belong to the exponential family, the product of the two distributions, i.e., the joint distribution, will in general contain cross terms, which makes the integral over the weight parameters intractable. To alleviate this complexity, in our Bayesian linear regression model we assume that not only the expectation of the likelihood but also its variance depends on the weight parameters.

Firstly, we assume that the weight parameters are distributed as a joint Student-t distribution
(4)  
Then we assume that the noise is distributed as a Student-t distribution whose degrees of freedom are greater than those of the prior and whose variance is influenced by the weight parameters through a product factor. Then the likelihood is
(5)  
Multiplying (4) by (5), the joint distribution is
(6)  
In addition,
where
(7)  
(8)  
(9) 
Then
(10)  
and
(11)  
where
(12) 
and the remaining parameter is given in (8).
Therefore, the posterior distribution is also a Student-t distribution with increased degrees of freedom, which exhibits the conjugate-prior property. Just as in the Gaussian case, all the distributions above have closed forms. In the limit of infinite degrees of freedom, all the distributions degrade to Gaussian distributions and thus the BLRG model is recovered.
In the above derivation, the product factor plays a key role in obtaining the closed-form distributions, which is the main difference from common model settings under Student-t assumptions [12]. We propose it according to (1), which is a generalization of the product property of the exponential. Combined with the relevant noise, the conjugate-prior property is generalized naturally to the Student-t distribution. Of course, the idea of relevant noise can also be applied to other kinds of distributions that do not belong to the exponential family, such as the generalized Pareto distribution, which is beyond the scope of this paper.
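The limiting behavior underlying the recovery of BLRG can be checked numerically. The sketch below (helper names are our own) compares the standardized univariate Student-t density with the standard Gaussian density and confirms that the gap shrinks as the degrees of freedom grow:

```python
import math

def student_t_pdf(x, nu):
    # standardized Student-t density with nu degrees of freedom
    c = math.exp(math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0))
    c /= math.sqrt(nu * math.pi)
    return c * (1.0 + x * x / nu) ** (-(nu + 1.0) / 2.0)

def gauss_pdf(x):
    # standard Gaussian density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def max_gap(nu, grid=(-3, -2, -1, 0, 1, 2, 3)):
    # largest pointwise deviation from the Gaussian on a small grid
    return max(abs(student_t_pdf(x, nu) - gauss_pdf(x)) for x in grid)
```

As the degrees of freedom increase, the heavy tails flatten out and the density converges pointwise to the Gaussian, which is the fact the whole construction exploits.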
V Maximum likelihood solution
In this section, we consider the determination of the hyperparameters of BLRS using maximum likelihood. According to the type-II maximum likelihood principle, one needs to maximize (10) with respect to the hyperparameters. Firstly, we assume that the eigenvalue decomposition of $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ is $\mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$, where $\mathbf{U}$ is an orthogonal matrix and $\boldsymbol{\Lambda}$ is a diagonal matrix with nonnegative diagonal elements. Then
In addition,
Consider the degrees of freedom fixed; the two independent variance hyperparameters remain to be optimized. Maximizing (10) with respect to them is equivalent to minimizing
(13)  
Setting the corresponding partial derivative to zero, one gets
(14) 
where the solution is determined by the data and the remaining hyperparameter. Substituting (14) into (13), we have
(15)  
For fixed degrees of freedom, minimizing (15) is equivalent to solving
(16)
where the solution is determined by the data. Combining (14) and (16), it is shown that whatever the degrees of freedom are, the maximum likelihood solution of the hyperparameters will be the same. In the Gaussian limit, by maximizing the corresponding evidence, (14) and (16) are also obtained. Therefore we have Theorem 1.
Theorem 1.
For any admissible degrees of freedom, the maximum likelihood solution of the variance hyperparameters maximizing the evidence distribution in (10) is determined only by the data and is not influenced by the degrees of freedom.
It is generally accepted that the Student-t distribution is a generalization of the Gaussian distribution. Thus one might believe that any conclusion about the Student-t distribution is some kind of generalization of the corresponding result for the Gaussian distribution. Concretely, the result about the Student-t distribution should depend on the degrees of freedom, and in the limit of infinite degrees of freedom it should degrade to the corresponding Gaussian result. But Theorem 1 shows that the maximum likelihood solution of the variance hyperparameters has no relation to the degrees of freedom. This has the following implications:

One pair of maximum likelihood hyperparameters corresponds to the evidence distribution with arbitrary degrees of freedom. Therefore, even if BLRG selects suitable hyperparameters and the resulting parameters fit the model well, we still cannot assert that the observed data were generated from an independent Gaussian prior with independent Gaussian noise.

If the weight parameters are only used for point estimation, the degrees of freedom can be fixed to an arbitrary positive value, or taken to infinity for computational convenience.
VI Expectation maximization
VI-A The general q-EM algorithm
In the general EM algorithm Alg. 1, the logarithm of the evidence is split into two terms involving KL divergences. In the q-EM algorithm, we generalize this property to the q-logarithm. From (2), one has
Then, combining this identity and introducing a variational distribution over the hidden variables, we have
(17)  
For $q \to 1$, (17) degrades to
(18)  
Just like the standard E-step and M-step [14], [12], but using the q-logarithm and the Tsallis divergence instead of the logarithm and the KL divergence: in the E-Step, (17) is optimized with respect to the variational distribution, given by
Then in the M-Step, (17) is optimized with respect to the parameters, given by
Corresponding to Alg. 1, the q-EM algorithm is summarized in Alg. 2.

Choose an initial setting for the parameters;

E Step: Evaluate the optimal variational distribution, obtained by minimizing the Tsallis divergence in (17);

M Step: Evaluate the new parameters, given by

Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let the old parameters be replaced by the new ones and return to step 2.
Comparing Alg. 1 and Alg. 2, the only difference is that the Tsallis divergence is used instead of the KL divergence in the E step.
Similar to EM [12], in each iteration the objective will not decrease. Therefore, q-EM is also a local optimizer.
VI-B Inference and optimization with Student-t assumptions
In this paper, we mainly use the posterior mean for point estimation, which is determined only by the data and the variance hyperparameters. Therefore, by Theorem 1, the degrees of freedom can be set to a fixed positive value, or to infinity, for computational convenience. In the following derivation, the degrees of freedom are set to a positive constant; the Gaussian case can be obtained by taking the limit.
Before the iteration, in Step 1, we specify initial values of the hyperparameters. Then we apply q-EM to optimize them.
Then in Step 2, by (11), the posterior is a Student-t distribution; thus evaluating it is equivalent to evaluating its distribution parameters given in (7), (8), (9) and (12).
Then in Step 3, set
which is well defined when the degrees of freedom are a positive constant; the Gaussian case is handled by taking the limit. Denote
where the normalizing constant is fixed for the given data. Then the resulting distribution is a normalized Student-t distribution with increased degrees of freedom. It is worth noting that the mean and covariance of this distribution both exist only when its degrees of freedom are greater than two. In addition,
(19)  
(20)  
(21) 
In (6), denote
By the definition of the Tsallis divergence, maximizing (17) with respect to the hyperparameters is equivalent to minimizing the objective given by
Denote
(22)  
Setting the partial derivatives with respect to the two hyperparameters to zero, one has
Therefore,
By the fact
after rearrangement, we have
(24)  
In (24), it is interesting to see that, given the E-step quantities, the final update formula for the hyperparameters is identical to the EM update for BLRG [12]. The only difference is in the E step, where the covariance in (20) carries a product factor, which tends to 1 in the limit of infinite degrees of freedom. Summarizing the above steps, we give the concrete q-EM iteration for BLRS in Alg. 3.

Choose an initial setting for the parameters;

E Step: Evaluate the posterior quantities in (19), (20) and (21);

M Step: Evaluate the new hyperparameters given by (24).

Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let the old parameters be replaced by the new ones and return to step 2.
In many regression tasks, the number of features is much smaller than the number of training samples. In order to reduce the computational complexity of each step as much as possible, we reformulate the steps in Alg. 3. Before the iteration, we compute the eigenvalue decomposition of $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ as
(25) 
where the time cost is and compute
(26) 
where the time costs are , , and respectively. Then in each iteration,
(27)  
where the time cost is . Then
(28)  
where based on the result (26), the time cost is . Then
(29)  
where based on the result (28), the time cost is . Then
(30)  
where based on the result (28), the time cost is . In addition
(31) 
where based on the result (26), the time cost is .
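Putting the precomputation together, the whole iteration can be sketched as follows. This is a NumPy sketch in our own notation, written for the Gaussian limit of the updates (by Theorem 1 the hyperparameter solution is the same): after the one-time eigenvalue decomposition, each iteration touches only length-M vectors.

```python
import numpy as np

def em_blrg_fast(Phi, t, alpha=1.0, beta=1.0, iters=200):
    # One-time O(N M^2 + M^3) precomputation, then cheap per-iteration updates.
    N, M = Phi.shape
    lam, U = np.linalg.eigh(Phi.T @ Phi)   # eigendecomposition of Phi^T Phi
    b = U.T @ (Phi.T @ t)                  # projected data vector
    tt = t @ t                             # ||t||^2
    for _ in range(iters):
        d = alpha + beta * lam             # posterior eigen-precisions, O(M)
        m_proj = beta * b / d              # posterior mean in the eigenbasis
        mm = m_proj @ m_proj               # ||m||^2
        trS = np.sum(1.0 / d)              # tr(S)
        # ||t - Phi m||^2 expanded via the precomputed quantities
        resid = tt - 2.0 * (m_proj @ b) + np.sum(lam * m_proj**2)
        trSPhiPhi = np.sum(lam / d)        # tr(S Phi^T Phi)
        alpha = M / (mm + trS)
        beta = N / (resid + trSPhiPhi)
    return alpha, beta, U @ m_proj
```

The design choice mirrors the reformulation above: the matrix inversion in each iteration is replaced by elementwise operations on the eigenvalues, so the per-iteration cost no longer depends on the number of samples.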