Bayesian linear regression with Student-t assumptions

04/15/2016 ∙ by Chaobing Song, et al.

As an automatic method of determining model complexity using the training data alone, Bayesian linear regression provides a principled way to select hyperparameters. However, one often needs approximate inference when the distributional assumptions go beyond the Gaussian. In this paper, we propose a Bayesian linear regression model with Student-t assumptions (BLRS), which can be inferred exactly. In this framework, both the conjugate prior and the expectation maximization (EM) algorithm are generalized. Meanwhile, we prove that the maximum likelihood solution is equivalent to that of standard Bayesian linear regression with Gaussian assumptions (BLRG). The q-EM algorithm for BLRS is nearly identical to the EM algorithm for BLRG. It is shown that q-EM for BLRS can converge faster than EM for BLRG on the task of predicting online news popularity.


I Introduction

In a practical application of linear regression, in order to achieve the best prediction performance on new data, one must find a model with appropriate complexity. One popular method to tune complexity parameters is cross validation [1], which uses a proportion of the available data for training while making use of all the data to assess performance. However, the number of training runs that must be performed is proportional to the number of folds, which is problematic if training is computationally expensive, or if the number of folds must be set to a large value because of the scarcity of data. Furthermore, if a model has several complexity parameters, then in the worst case, searching over combinations of complexity parameters needs a number of training runs that is exponential in the number of parameters.
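The run-count argument above can be made concrete; a minimal sketch (illustrative numbers, not from the paper) counting training runs for S-fold cross validation over a hyperparameter grid:

```python
# Counting training runs for S-fold cross validation over a grid of
# complexity parameters (illustrative numbers, not from the paper).
def cv_run_count(n_folds, grid_sizes):
    """One training run per fold per grid point."""
    n_grid_points = 1
    for g in grid_sizes:
        n_grid_points *= g
    return n_folds * n_grid_points

runs_1d = cv_run_count(5, [10])          # one parameter, 10 candidates: 50 runs
runs_3d = cv_run_count(5, [10, 10, 10])  # three parameters: 5000 runs
```

The total grows exponentially in the number of complexity parameters, which is the cost that a Bayesian treatment avoids.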

As an alternative to cross validation, one can turn to a Bayesian treatment of linear regression, which introduces prior probability distributions over the weight parameters and the noise. After marginalizing out the weight parameters, the hyperparameters of the prior distributions can be estimated by maximizing the probability of the observations. Thus, in Bayesian linear regression, only the training data is needed to tune the hyperparameters, which correspond to complexity parameters. However, although a principled way to estimate hyperparameters is provided, the resulting Bayesian model is often difficult to solve, and in many cases stochastic or deterministic approximation techniques are required. A rare exception is Bayesian linear regression with independent Gaussian prior distributions over the weight parameters and the noise (abbreviated as "BLRG"), in which the posterior distribution can be expressed in closed form and the model can be inferred exactly by the expectation maximization (EM) algorithm.

An interesting question is whether we can find more general models that still keep a closed-form posterior distribution and can therefore be inferred exactly. In this paper, we give a positive answer. Observing the form of the distribution, the Gaussian distribution is a limiting case of the Student-t distribution as the degrees of freedom tends to infinity. In addition, from the perspective of nonextensive information theory [2], the Student-t and Gaussian distributions are both maximum Tsallis entropy distributions [3], of which the Gaussian distribution is the special case as the entropy index tends to 1. Based on these facts and the results from nonextensive information theory, we unify the inference process for the Gaussian and Student-t assumptions and propose a Bayesian linear regression model with Student-t assumptions ("BLRS"). The main contributions of this paper are
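As background for the claim above, the Student-t density is exactly a q-Gaussian, i.e., a maximum Tsallis entropy distribution; the following is the standard correspondence (stated here as background, not in the paper's own parameterization):

```latex
% Student-t as a q-Gaussian (standard correspondence):
\mathrm{St}(x \mid \mu, \sigma^2, \nu)
  \propto \left[1 + \frac{(x-\mu)^2}{\nu \sigma^2}\right]^{-\frac{\nu+1}{2}}
  = \exp_q\!\bigl(-\beta_q (x-\mu)^2\bigr),
\qquad
q = \frac{\nu+3}{\nu+1},
\quad
\beta_q = \frac{\nu+1}{2\nu\sigma^2}.
% As \nu \to \infty, q \to 1, \exp_q \to \exp, and the density becomes Gaussian.
```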

  • By introducing relevant noise, whose variance is linear in the norm of the weight parameters, we generalize the concept of conjugate prior to the Student-t distribution, i.e., under the relevant-noise setting, if the prior distribution of the weight parameters is a Student-t distribution, the posterior distribution is also a Student-t distribution.

  • We prove that the maximum likelihood solution of the variance hyperparameters has no relation with the degrees of freedom, and thus BLRS is equivalent to BLRG, which may be remarkable.

  • By applying Tsallis divergence instead of Kullback-Leibler (KL) divergence, EM is generalized to q-EM to make exact inference under the Student-t setting. Closed-form expressions are obtained in each iteration, which are nearly identical to those of the EM algorithm for BLRG.

  • By experiments on the task of predicting online news popularity, we show that BLRS and BLRG converge to the same result. Meanwhile, BLRS with a finite degrees-of-freedom constant can converge to this common result faster than BLRG on standard datasets. Therefore, the constant can be seen as an accelerating factor for the iterations.

In addition, while preparing this paper, we found that in terms of its general form, the α-EM algorithm in [4] is equivalent to the q-EM algorithm we propose in this paper. But they still have differences. Firstly, q-EM is mainly specified to solve the corresponding BLRS model, while α-EM attempts to improve EM without changing the model. Secondly, q-EM is derived from nonextensive information theory, while α-EM is derived from information geometry. Thirdly, the q-EM algorithm is one part of our attempt to unify the treatment of the Gaussian and Student-t distributions, while α-EM and its extensions are of independent interest [4].

II Notations and concepts from nonextensive information theory

Nonextensive information theory was proposed by Tsallis [5] and aims to model phenomena such as long-range interactions and multifractals [2]. It has recently been applied in machine learning [6], [7]. In this paper, it is our main motivation for generalizing BLRG. In this section, we review some of its notations and concepts briefly.

For convenience [2], one can define the following q-logarithm and q-exponential,

ln_q(x) = (x^(1-q) - 1)/(1-q),  exp_q(x) = [1 + (1-q)x]_+^(1/(1-q)),

where [u]_+ stands for max(u, 0), x > 0 and q ≠ 1; as q → 1, they reduce to the ordinary logarithm and exponential. By this definition, one has exp_q(ln_q(x)) = x.

The q-exponential and q-logarithm have the following properties,

(1) exp_q(x) exp_q(y) = exp_q(x + y + (1-q)xy)
(2) ln_q(x/y) = y^(q-1) (ln_q(x) - ln_q(y))

where (1) is used to formulate a generalized conjugate prior for the Student-t distribution and (2) is used to generalize the EM algorithm.
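The q-deformed functions and their product and quotient properties can be checked numerically; a minimal sketch using the standard Tsallis definitions (assumed here to match the paper's conventions):

```python
# Numerical check (illustration) of the standard Tsallis q-logarithm and
# q-exponential and two of their identities.
import math

def ln_q(x, q):
    """q-logarithm: (x**(1-q) - 1)/(1-q); ordinary log as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x**(1.0 - q) - 1.0) / (1.0 - q)

def exp_q(x, q):
    """q-exponential: [1 + (1-q)x]_+ ** (1/(1-q)); ordinary exp as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.exp(x)
    base = 1.0 + (1.0 - q) * x
    return max(base, 0.0) ** (1.0 / (1.0 - q))

q, x, y = 0.7, 0.3, 0.5
# Inverse relation: exp_q(ln_q(x)) == x
inv_ok = abs(exp_q(ln_q(x, q), q) - x) < 1e-10
# Product rule: exp_q(x)*exp_q(y) == exp_q(x + y + (1-q)*x*y)
prod_ok = abs(exp_q(x, q) * exp_q(y, q)
              - exp_q(x + y + (1.0 - q) * x * y, q)) < 1e-10
# Quotient rule: ln_q(x/y) == y**(q-1) * (ln_q(x) - ln_q(y))
quot_ok = abs(ln_q(x / y, q)
              - y**(q - 1.0) * (ln_q(x, q) - ln_q(y, q))) < 1e-10
```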

For normalized probability distributions p and p' on the same space, Tsallis divergence [8] is defined from nonextensive information theory as

D_q(p ‖ p') = -∫ p(x) ln_q(p'(x)/p(x)) dx = (∫ p(x)^q p'(x)^(1-q) dx - 1)/(q - 1).

At q = 1, denote D_1(p ‖ p') = lim_{q→1} D_q(p ‖ p') = ∫ p(x) ln(p(x)/p'(x)) dx, which is the definition of KL divergence.

For q > 0, D_q(p ‖ p') is a special case of f-divergence (see [8] and references therein), which has the following properties:

  • Convexity: D_q(p ‖ p') is convex with respect to both p and p';

  • Strict positivity: D_q(p ‖ p') ≥ 0, and D_q(p ‖ p') = 0 if and only if p = p'.

Because of these two useful properties, the value of D_q(p ‖ p') can be used to measure the similarity between p and p'. In practice, one can make p' as close to p as possible by minimizing D_q(p ‖ p') with respect to p'. In the q-EM algorithm, Tsallis divergence is used to measure similarity and to alleviate the complexity of the mathematical expressions.
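A small numerical illustration (discrete case, standard definition assumed) of strict positivity and of the q → 1 limit recovering KL divergence:

```python
# Illustration: discrete Tsallis divergence and its q -> 1 limit (KL).
import math

def tsallis_div(p, pp, q):
    """D_q(p || p') = (sum_i p_i^q * p'_i^(1-q) - 1) / (q - 1)."""
    return (sum(pi**q * ppi**(1.0 - q) for pi, ppi in zip(p, pp)) - 1.0) / (q - 1.0)

def kl_div(p, pp):
    return sum(pi * math.log(pi / ppi) for pi, ppi in zip(p, pp))

p  = [0.2, 0.5, 0.3]
pp = [0.3, 0.4, 0.3]

self_div = tsallis_div(p, p, 0.5)    # zero iff the distributions coincide
cross_div = tsallis_div(p, pp, 0.5)  # strictly positive for p != p'
near_kl = tsallis_div(p, pp, 1.0 + 1e-7)  # q -> 1 approaches KL divergence
```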

III Bayesian linear regression with Gaussian assumptions

Given a set of N input-target pairs {(x_n, t_n)}, n = 1, …, N, let φ(·) = (φ_1(·), …, φ_M(·))^T be a group of fixed basis functions. Consider the linear model

t_n = w^T φ(x_n) + ε_n,

where w is the parameter vector we need to estimate and ε_n is an additive noise. In Bayesian linear regression with Gaussian assumptions (BLRG), ε_n is assumed to be independent, zero-mean Gaussian noise,

(3) p(ε_n) = N(ε_n | 0, β^(-1)),

where β is the inverse variance. Meanwhile, w is assumed to be an independent zero-mean Gaussian distributed random variable, given by

p(w | α) = N(w | 0, α^(-1) I),

where α is the inverse variance of w. Then the likelihood is

p(t | w, β) = ∏_{n=1}^{N} N(t_n | w^T φ(x_n), β^(-1)).

In this setting, by integrating out w and applying Bayes' theorem, one has the posterior

p(w | t, α, β) = N(w | m, S),

where

S^(-1) = α I + β Φ^T Φ,  m = β S Φ^T t,

with Φ the N × M design matrix whose rows are φ(x_n)^T. For fixed α and β, the posterior mean m is equivalent to the solution of ridge regression [9] with regularization parameter λ = α/β. The hyperparameters α, β can be optimized by the maximum likelihood principle, i.e., by maximizing the evidence p(t | α, β) with respect to α and β.
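The ridge equivalence can be verified numerically; a sketch assuming the standard BLRG posterior mean m = β(αI + βΦᵀΦ)⁻¹Φᵀt:

```python
# Check (illustration): BLRG posterior mean equals the ridge solution
# with regularization parameter lam = alpha / beta.
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 5
Phi = rng.normal(size=(N, M))          # design matrix of basis-function values
w_true = rng.normal(size=M)
t = Phi @ w_true + 0.1 * rng.normal(size=N)

alpha, beta = 2.0, 25.0                # prior and noise precisions
# Posterior: S^{-1} = alpha*I + beta*Phi^T Phi,  m = beta * S * Phi^T t
S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
m = beta * np.linalg.solve(S_inv, Phi.T @ t)

# Ridge regression: (lam*I + Phi^T Phi)^{-1} Phi^T t with lam = alpha/beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
max_diff = np.max(np.abs(m - w_ridge))
```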

Gradient-based optimization methods such as the conjugate gradient method or quasi-Newton methods with an active-set strategy can be used to solve the above problem under the nonnegativity constraints on α and β. However, there exists an elegant and powerful algorithm, the expectation maximization (EM) algorithm, to address this problem. [10] concluded that the EM algorithm has the advantages of reliable global convergence, low cost per iteration, economy of storage and ease of programming. Consider a general joint distribution p(X, Z | θ), where X is the observations, Z is the hidden variables and θ is the parameters to be optimized. In order to optimize the evidence p(X | θ), the general EM algorithm is executed in Alg. 1.

  1. Choose an initial setting for the parameters θ^old;

  2. E Step: Evaluate p(Z | X, θ^old);

  3. M Step: Evaluate θ^new, given by

     θ^new = argmax_θ Σ_Z p(Z | X, θ^old) ln p(X, Z | θ);

  4. Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let

     θ^old ← θ^new

     and return to step 2.

Algorithm 1 Expectation maximization

The concrete process of EM for BLRG is a special case of the q-EM iteration for BLRS and will be discussed in Section VI.
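As a concrete reference point, the textbook EM updates for the BLRG hyperparameters can be sketched as follows (these are the standard evidence-maximization updates from the literature, stated here as an assumption since the paper defers its own derivation to Section VI):

```python
# Illustration: textbook EM updates for the BLRG hyperparameters
# (alpha, beta), treating the weights w as the hidden variable.
import numpy as np

def em_blrg(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    N, M = Phi.shape
    for _ in range(n_iter):
        # E step: posterior p(w | t) = N(m, S)
        S = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m = beta * S @ Phi.T @ t
        # M step: standard updates for the precisions
        alpha = M / (m @ m + np.trace(S))
        resid = t - Phi @ m
        beta = N / (resid @ resid + np.trace(Phi @ S @ Phi.T))
    return alpha, beta, m

rng = np.random.default_rng(1)
N, M = 200, 4
Phi = rng.normal(size=(N, M))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
t = Phi @ w_true + 0.1 * rng.normal(size=N)  # true noise precision = 100

alpha_hat, beta_hat, m_hat = em_blrg(Phi, t)
```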

IV Generalized conjugate prior for Student-t distribution

The conjugate prior is an important concept in Bayesian inference. If the posterior distribution is in the same family as the prior probability distribution, the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. Conjugacy is often thought of as a particular characteristic of the exponential family [11]. Mathematically, this is because the Bayesian update is multiplicative: if the prior distribution or the likelihood does not belong to the exponential family, the product of the two distributions, i.e., the joint distribution, will in general contain cross terms, which makes the integral over the weight parameters intractable. To alleviate this complexity, in our Bayesian linear regression model, we assume that not only the expectation but also the variance of the likelihood depends on the weight parameters w.

Firstly, we assume that w is distributed as a joint Student-t distribution

(4)

Then we assume that the noise is distributed as

where the degrees of freedom of the noise is greater than that of w, and w influences the variance of the noise through a product factor. Then the likelihood is

(5)

Multiplying (4) by (5), the joint distribution is

(6)

In addition,

where

(7)
(8)
(9)

Then

(10)

and

(11)

where

(12)

and is given in (8).

Therefore, the posterior distribution is also a Student-t distribution with an increased degrees of freedom, which is exactly the conjugate prior property. Just as in the Gaussian case, all the distributions above have closed forms. In the limit of infinite degrees of freedom, all the distributions degrade to Gaussian distributions and the BLRG model is recovered.

In the above derivation, the product factor plays a key role in obtaining the closed-form distributions, which is the main difference from common model settings with Student-t assumptions [12]. We propose it according to (1), which is a generalization of the multiplicative property of the exponential. Combined with the relevant noise, the conjugate prior property generalizes naturally to the Student-t distribution. Of course, the idea of relevant noise can also be applied to other distributions that do not belong to the exponential family, such as the generalized Pareto distribution, which is beyond the scope of this paper.
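The role of property (1) can be sketched as follows, using the standard q-exponential identity (background notation, not the paper's exact parameterization):

```latex
% Product of q-exponentials is again a q-exponential:
\exp_q(a)\,\exp_q(b) = \exp_q\bigl(a + b + (1-q)\,a b\bigr).
% Hence, for a prior \propto \exp_q(-E_0(\mathbf{w})) and a likelihood
% \propto \exp_q(-E_1(\mathbf{w})), the joint is
% \propto \exp_q\bigl(-E_0 - E_1 + (1-q)\,E_0 E_1\bigr):
% if the cross term (1-q)\,E_0 E_1 is absorbed by the relevant-noise
% coupling, the posterior keeps the q-exponential (Student-t) form.
```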

V Maximum likelihood solution

In this section, we consider the determination of the model hyperparameters of BLRS using maximum likelihood. According to the type-II maximum likelihood principle, one needs to maximize (10) with respect to the variance hyperparameters. Firstly, we assume that the eigenvalue decomposition of Φ^T Φ is

Φ^T Φ = Q Λ Q^T,

where Q is an orthogonal matrix and Λ is a diagonal matrix with all diagonal elements λ_i ≥ 0. Then

Denote . Then

In addition,

Consider the data fixed. There are two independent hyperparameters to optimize. Maximizing (10) with respect to them is equivalent to minimizing

(13)

Setting the derivative to zero, one gets

(14)

where the solution is determined by the data. Substituting (14) into (13), we have

(15)

Then, minimizing (15) is equivalent to solving

(16)

where the solution is determined by the data. Combining (14) and (16), it is shown that whatever the degrees of freedom is, the maximum likelihood solution of the variance hyperparameters will be the same. In the Gaussian limit, by maximizing the BLRG evidence, (14) and (16) are also obtained. Therefore we have Theorem 1.

Theorem 1.

For any admissible degrees of freedom, the maximum likelihood solution of the variance hyperparameters obtained by maximizing the evidence distribution in (10) is determined only by the data and is not influenced by the degrees of freedom.

It is generally accepted that the Student-t distribution is a generalization of the Gaussian distribution. Thus one would believe that a conclusion about the Student-t distribution is some kind of generalization of the corresponding result for the Gaussian distribution. Concretely, the result for the Student-t distribution should depend on the degrees-of-freedom parameter, and in the limiting case the result should degrade to the corresponding Gaussian result. But Theorem 1 shows that the maximum likelihood solution of the variance hyperparameters has no relation with this distribution parameter. This has the following implications:

  • One pair of maximum likelihood solutions corresponds to the evidence distribution with an arbitrary degrees of freedom. Therefore, even if BLRG selects suitable hyperparameters and the resulting parameters fit the model well, we still cannot assert that the observed data were generated from an independent Gaussian distribution with independent Gaussian noise.

  • Because BLRG and BLRS are special cases of the Gaussian process and the t process respectively, and they have the same maximum likelihood solution of the variance hyperparameters, Theorem 1 explains the conclusion in [13] that "the t process is perhaps not as exciting as one might have hoped".

  • If the parameters are only used for point estimation, the degrees of freedom can be fixed to an arbitrary positive value, or taken to infinity, for computational convenience.
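A numerical sketch of the Gaussian-limit reading of Theorem 1: directly maximizing the standard BLRG log evidence (the infinite degrees-of-freedom case) yields hyperparameters determined by the data alone. The evidence formula used below is the textbook one; that it matches the paper's (10) in the limit is an assumption.

```python
# Illustration (Gaussian limit): maximize the standard BLRG log evidence
# ln p(t | alpha, beta) over a grid; the optimum is set by the data alone.
import numpy as np

rng = np.random.default_rng(2)
N, M = 100, 3
Phi = rng.normal(size=(N, M))
t = Phi @ np.array([0.5, -1.0, 2.0]) + 0.2 * rng.normal(size=N)

def log_evidence(alpha, beta):
    S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    m = beta * np.linalg.solve(S_inv, Phi.T @ t)
    E = 0.5 * beta * np.sum((t - Phi @ m)**2) + 0.5 * alpha * (m @ m)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta) - E
            - 0.5 * np.linalg.slogdet(S_inv)[1] - 0.5 * N * np.log(2 * np.pi))

# Coarse grid on the log scale; true noise precision is 1/0.2**2 = 25.
alphas = np.exp(np.linspace(-4, 4, 81))
betas = np.exp(np.linspace(-1, 6, 141))
scores = np.array([[log_evidence(a, b) for b in betas] for a in alphas])
ia, ib = np.unravel_index(np.argmax(scores), scores.shape)
alpha_ml, beta_ml = alphas[ia], betas[ib]
```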

VI q-Expectation maximization

VI-A The general q-EM algorithm

In the general EM algorithm Alg. 1, the logarithm of the evidence is split into two divergence terms. In the q-EM algorithm, we generalize this property to the q-logarithm, starting from (2). Combining this identity and introducing a variational distribution Q(Z), we have

(17)

where Q(Z) is normalized over the hidden variables Z. For q = 1, (17) degrades to the standard EM decomposition

(18) ln p(X | θ) = Σ_Z Q(Z) ln (p(X, Z | θ)/Q(Z)) + KL(Q(Z) ‖ p(Z | X, θ)).

Just like the standard EM steps [14], [12], but using the q-logarithm and Tsallis divergence instead of the logarithm and KL divergence, in the E-Step the bound is optimized with respect to the variational distribution Q(Z). Then in the M-Step, the bound is optimized with respect to the parameters θ. Corresponding to Alg. 1, the q-EM algorithm is summarized in Alg. 2.

  1. Choose an initial setting for the parameters θ^old;

  2. E Step: Evaluate the variational distribution Q(Z) that minimizes the Tsallis divergence to p(Z | X, θ^old);

  3. M Step: Evaluate θ^new by maximizing the resulting q-deformed lower bound with respect to θ;

  4. Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then let

     θ^old ← θ^new

     and return to step 2.

Algorithm 2 q-Expectation maximization

Comparing Alg. 1 and Alg. 2, it can be seen that the only difference is that Tsallis divergence is used instead of KL divergence.

Similar to EM [12], in each iteration the objective will not decrease. Therefore, q-EM also converges to a local optimum.

VI-B Inference and optimization with Student-t assumptions

In this paper, we mainly use the posterior mean for point estimation, which is determined only by the variance hyperparameters. Therefore, by Theorem 1, the degrees of freedom can be set to a fixed positive value, or taken to infinity, for computational convenience. In the following derivation, it is set to a positive constant; the Gaussian case can be obtained by taking the limit.

Before the iteration, in Step 1, we should specify initial values of the hyperparameters. Then we apply q-EM to optimize them.

Then in Step 2, by (11), the posterior is a Student-t distribution; thus evaluating it is equivalent to evaluating its distribution parameters given in (7), (8), (9) and (12).

Then in Step 3, we set the variational distribution accordingly and denote the resulting normalized Student-t distribution, whose degrees of freedom is increased relative to the prior. It is worth noting that the mean and covariance of this distribution exist only if the degrees of freedom is large enough (greater than one for the mean and greater than two for the covariance). In addition,

(19)
(20)
(21)

In (6), denote

By the definition of Tsallis divergence, maximizing the bound with respect to the hyperparameters is equivalent to minimizing the quantity given by

Denote

(22)

Setting the derivatives to zero, one has

Therefore,

By the fact

after rearrangement, we have

(24)

In (24), it is interesting to see that, given the E-step quantities, the final update formula for the hyperparameters is identical to the EM update for BLRG [12]. The only difference is in the E step, where the covariance in (20) carries a product factor, which tends to 1 in the Gaussian limit. Summarizing the above steps, we give the concrete q-EM iteration for BLRS in Alg. 3.

  1. Choose an initial setting for the parameters;

  2. E Step: Evaluate the posterior parameters by (7), (8), (9), (12), (19), (20);

  3. M Step: Evaluate the new hyperparameters given by (24);

  4. Check for convergence of either the log likelihood or the parameter values. If the convergence criterion is not satisfied, then update the parameters and return to step 2.

Algorithm 3 q-EM for BLRS

In many regression tasks, the number of features is much smaller than the number of training samples. In order to reduce the computational complexity of each step as much as possible, we reformulate the steps in Alg. 3. Before the iteration, we compute the eigenvalue decomposition of Φ^T Φ as

(25)

where the time cost is and compute

(26)

where the time costs are , , and respectively. Then in each iteration,

(27)

where the time cost is . Then

(28)

where based on the result (26), the time cost is . Then

(29)

where based on the result (28), the time cost is . Then

(30)

where based on the result (28), the time cost is . In addition

(31)

where based on the result (26), the time cost is .

Therefore, combining (27) and (29), the time cost of computing the quantity in (22) is ; combining (30) and (31), the time cost of computing is . For the BLRG case, we can neglect the computation of
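The reuse strategy above can be sketched, in the Gaussian-limit (BLRG) case, as follows; the quantities shown (trace, log-determinant, posterior mean) are standard, and it is an assumption that they correspond to the per-iteration terms in (27)-(31):

```python
# Illustration (BLRG / Gaussian-limit case): precompute the eigenvalue
# decomposition of Phi^T Phi once, then evaluate per-iteration
# quantities from the eigenvalues in O(M) time (plus O(M^2) for the mean).
import numpy as np

rng = np.random.default_rng(3)
N, M = 500, 6
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)

# One-off precomputation: O(N*M^2 + M^3)
lam, Q = np.linalg.eigh(Phi.T @ Phi)   # Phi^T Phi = Q diag(lam) Q^T
b = Q.T @ (Phi.T @ t)                  # rotated Phi^T t

alpha, beta = 0.5, 2.0                 # current hyperparameter values
d = alpha + beta * lam                 # eigenvalues of S^{-1}

trace_S = np.sum(1.0 / d)              # tr(S) in O(M)
logdet_S_inv = np.sum(np.log(d))       # ln |S^{-1}| in O(M)
m = Q @ (beta * b / d)                 # posterior mean in O(M^2)

# Direct (per-iteration O(M^3)) reference values for comparison:
S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
trace_ref = np.trace(np.linalg.inv(S_inv))
m_ref = beta * np.linalg.solve(S_inv, Phi.T @ t)
```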