Regression modelling with I-priors

07/30/2020 ∙ by Wicher Bergsma, et al.

We introduce the I-prior methodology as a unifying framework for estimating a variety of regression models, including varying coefficient, multilevel, longitudinal models, and models with functional covariates and responses. It can also be used for multi-class classification, with low or high dimensional covariates. The I-prior is generally defined as a maximum entropy prior. For a regression function, the I-prior is Gaussian with covariance kernel proportional to the Fisher information on the regression function, which is estimated by its posterior distribution under the I-prior. The I-prior has the intuitively appealing property that the more information is available on a linear functional of the regression function, the larger the prior variance, and the smaller the influence of the prior mean on the posterior distribution. Advantages compared to competing methods, such as Gaussian process regression or Tikhonov regularization, are ease of estimation and model comparison. In particular, we develop an EM algorithm with a simple E and M step for estimating hyperparameters, facilitating estimation for complex models. We also propose a novel parsimonious model formulation, requiring a single scale parameter for each (possibly multidimensional) covariate and no further parameters for interaction effects. This simplifies estimation because fewer hyperparameters need to be estimated, and also simplifies model comparison of models with the same covariates but different interaction effects; in this case, the model with the highest estimated likelihood can be selected. Using a number of widely analyzed real data sets we show that predictive performance of our methodology is competitive. An R-package implementing the methodology is available (Jamil, 2019).


1 Introduction

1.1 Outline

Consider a sample (y_1, x_1), …, (y_n, x_n), where y_i is a real-valued measurement on unit i, and x_i = (x_{i1}, …, x_{ip}) is a row vector of p covariates, where each x_{ik} belongs to some set X_k and may for example be real, categorical, multidimensional, or functional. To describe the dependence of the y_i on the x_i, we consider the regression model

y_i = f(x_i) + ε_i,  i = 1, …, n,  (1)

where f lies in a space F of functions. We assume the errors ε_1, …, ε_n have a multivariate normal distribution, i.e.,

(ε_1, …, ε_n)^T ∼ N_n(0, Ψ^{-1}),  (2)

where Ψ is an n × n positive definite precision matrix. Here, Ψ is taken to be known up to a low dimensional parameter, e.g., Ψ = ψ I_n (ψ > 0, I_n the n × n identity matrix), reflecting iid errors.

The function f is assumed to be partitioned into a sum of main effects and possible interactions. An example for p = 2 is

f(x_{i1}, x_{i2}) = α + f_1(x_{i1}) + f_2(x_{i2}) + f_{12}(x_{i1}, x_{i2}),  (3)

and for p = 3,

f(x_{i1}, x_{i2}, x_{i3}) = α + f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + f_{12}(x_{i1}, x_{i2}) + f_{13}(x_{i1}, x_{i3}) + f_{23}(x_{i2}, x_{i3}) + f_{123}(x_{i1}, x_{i2}, x_{i3}).

Here, each of the x_{ik} may be, for example, scalar, categorical, Euclidean, or functional. As we explain, this setup includes multilevel, varying coefficient, and longitudinal models.

The intercept α is constant, and we assume that the main effect functions f_k lie in a reproducing kernel Hilbert space (RKHS) F_k of functions over a covariate space X_k. The functions f_{12}, f_{13}, etc. describing interaction effects are assumed to lie in the tensor product space of the corresponding main effect function spaces. An RKHS F_k over X_k possesses a positive definite kernel h_k : X_k × X_k → ℝ. Let us give two simple examples where p = 2 and X_1 = X_2 = ℝ. If h_1(x, x') = h_2(x, x') = xx', it can easily be shown that (3) reduces to the form

f(x_{i1}, x_{i2}) = α + β_1 x_{i1} + β_2 x_{i2} + β_{12} x_{i1} x_{i2}

for some parameters β_1, β_2, β_{12}. With this f, (1) is a standard multiple regression model with an interaction. If h_1(x, x') = xx' and h_2(t, t') = ½(|t| + |t'| − |t − t'|) (the covariance kernel for a Brownian motion), we obtain

f(x_{i1}, x_{i2}) = α + β_1 x_{i1} + f_2(x_{i2}) + x_{i1} f_{12}(x_{i2}),

where f_2 and f_{12} are functions with a square integrable derivative and f_2(0) = f_{12}(0) = 0. The resulting regression model is called a varying coefficient model Hastie & Tibshirani (1993), where β_1 + f_{12}(x_{i2}) is the varying coefficient for x_{i1}. Note the methodology of this paper allows extensions to multidimensional x_{i1} and x_{i2}, and to additional covariates and interactions. In Section 3.4 further examples including multilevel models and functional response models are described.

Each kernel h_k is multiplied by a scale parameter λ_k, which may be negative. If one or more of the scale parameters are negative, the resulting kernel for the space F of regression functions is indefinite. The space F equipped with an indefinite kernel defines a reproducing kernel Krein space (RKKS). For the purposes of this paper, restricting the scale parameters to be positive would be an arbitrary restriction, and hence we adopt the RKKS framework. If p = 1, however, the model only has an intercept and a main effect, so there is only one scale parameter and the RKHS framework suffices, as described in bergsma20.

The I-prior for f in (1) subject to (2), where F is an RKKS, is Gaussian with covariance kernel the Fisher information for f. If F has reproducing kernel h, then the Fisher information between f(x) and f(x') is given as

I(f(x), f(x')) = Σ_{k=1}^n Σ_{l=1}^n ψ_{kl} h(x, x_k) h(x', x_l),

where ψ_{kl} is the (k, l) entry of Ψ. Hence, f follows an I-prior distribution if it can be written in the form

f(x) = f_0(x) + Σ_{i=1}^n h(x, x_i) w_i,  (4)

where f_0 is the prior mean (typically set to zero), and

(w_1, …, w_n)^T ∼ N_n(0, Ψ).  (5)

As we show, the I-prior has a maximum entropy interpretation. The I-prior methodology for the case p = 1, when as mentioned the RKHS framework suffices, was described in some detail by bergsma20.

An intuitively attractive property of the I-prior is that if much information about a linear functional of f (e.g., a regression coefficient) is available, its prior variance is large, and the data have a relatively large influence on the posterior, while if little information about a linear functional is available, the posterior will be largely determined by the prior mean, which serves as a ‘best guess’ of f. The I-prior methodology consists of estimating the regression function by its posterior distribution under the I-prior, where we take the posterior mean as the summary measure.
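To make the estimation step concrete, here is a minimal sketch (ours, not taken from the paper or from the iprior package) of the posterior mean computation at the observed design points under the representation (4) and (5). It assumes iid errors with a fixed precision ψ, prior mean f_0 = 0, and a precomputed, centred kernel matrix H with entries h(x_i, x_j); in the actual methodology ψ and the kernel scale parameters would be estimated, e.g. by the EM algorithm of Section 5. The function name iprior_posterior_mean is ours.

    import numpy as np

    def iprior_posterior_mean(H, y, psi=1.0):
        """Posterior mean of f at the observed points under an I-prior.

        Minimal sketch: iid errors with precision psi (Psi = psi * I_n),
        prior mean f_0 = 0, and H the n x n (centred) kernel matrix with
        entries h(x_i, x_j).  Under f = H w with w ~ N(0, psi * I) and
        y = f + eps, eps ~ N(0, (1/psi) * I), Gaussian conjugacy gives
            E[f | y] = H (H H + psi^(-2) I)^(-1) H y.
        """
        n = len(y)
        A = H @ H + psi ** -2 * np.eye(n)
        return H @ np.linalg.solve(A, H @ y)

    # Toy usage with the canonical (linear) kernel on simulated data; the
    # intercept is estimated here simply by the sample mean of y.
    rng = np.random.default_rng(0)
    x = rng.normal(size=20)
    y = 1.5 * x + rng.normal(scale=0.3, size=20)
    xc = x - x.mean()                 # centre the covariate
    H = np.outer(xc, xc)              # canonical kernel h(x, x') = x x'
    alpha = y.mean()
    f_hat = alpha + iprior_posterior_mean(H, y - alpha, psi=2.0)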

For simplicity, three main classes of RKHSs will be used in this paper, allowing linear and smooth effects of Euclidean and functional covariates on a response, as well as the incorporation of categorical covariates: the canonical RKHS, consisting of linear functions of the covariates; the fractional Brownian motion (FBM) RKHS, consisting of smooth functions of the covariates; and the canonical or the Pearson RKHS for nominal categorical covariates. The FBM RKHS has a smoothness parameter γ, called the Hurst coefficient.

Apart from the I-prior methodology, this paper introduces a second innovation, namely a parsimonious specification of models with interaction effects (see Section 3.3). This idea is very simple but, as far as we are aware, has not been adopted before, and it can also be applied in the context of Tikhonov regularization or Gaussian process regression. In particular, we only use a single scale parameter for each covariate, and no further parameters are needed for interaction effects. This is in contrast with the usual approach for regularization, Gaussian process regression, or random effects modelling, where separate scale parameters are assigned to each interaction effect. Our parsimonious approach greatly simplifies the estimation of models with interaction effects. In addition, it allows a semi-Bayes approach to the selection of interaction effects, potentially able to detect effects with smaller sample sizes than existing approaches. That is, model selection among models with the same main effects can be done simply by choosing the model with the highest estimated marginal likelihood. Examples are given in Sections 6.3 and 6.5.

1.2 Comparison with existing methods

The three most popular existing techniques for estimating f in (1) subject to (2) are maximum likelihood (ML, equivalent to generalized least squares), Tikhonov regularization, and Gaussian process regression (GPR). For these techniques the RKKS framework is not needed, and the assumption is typically made that F is an RKHS. (Note that for every RKKS there is a corresponding RKHS consisting of the same set of functions, and vice versa.) Maximum likelihood estimation is equivalent to minimizing the generalized least squares criterion

Σ_{i=1}^n Σ_{j=1}^n ψ_{ij} (y_i − f(x_i))(y_j − f(x_j))

in terms of f. It is only suitable if the dimension of F is small compared to the sample size n, and Tikhonov regularization and GPR are much more generally applicable. The Tikhonov regularizer minimizes the penalized generalized least squares functional

Σ_{i=1}^n Σ_{j=1}^n ψ_{ij} (y_i − f(x_i))(y_j − f(x_j)) + λ ‖f‖_F²,

where ‖f‖_F is the RKHS norm of f and λ a positive smoothing parameter. However, the Tikhonov regularizer has the drawback that it is inadmissible with respect to squared error loss Chakraborty & Panaretos (2019), that is, there exist other estimators which perform better for any true f. In GPR, on the other hand, the user defines a Gaussian process prior for f whose support is contained in F, and f is estimated by its posterior distribution. By Wald’s complete class theorem, such a GPR estimator is admissible. Note that, since the regression coefficients are assumed to have a multivariate normal distribution, random effects modelling as used in multilevel analysis is an instance of GPR. Note also that in the GPR framework, F need not be an RKHS; for example, vz08rates develop theoretical results for the case that F is a Banach space.
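For comparison, once the kernel matrix over the observed covariates is fixed, the Tikhonov regularizer above has a familiar closed form. The sketch below is ours (not from the paper); it assumes iid errors, Ψ = ψ I_n, and a user-chosen smoothing parameter λ, and uses the standard representer-theorem reduction to a finite-dimensional problem.

    import numpy as np

    def tikhonov_fit(H, y, lam=1.0, psi=1.0):
        """Tikhonov (penalized generalized least squares) fit at the data points.

        Sketch for iid errors (Psi = psi * I_n).  Writing f = H c (representer
        theorem) and minimising  psi * ||y - H c||^2 + lam * c' H c  gives
            f_hat = H (H + (lam / psi) * I)^(-1) y.
        Unlike the I-prior approach, lam is not estimated automatically and is
        typically chosen by cross-validation.
        """
        n = len(y)
        return H @ np.linalg.solve(H + (lam / psi) * np.eye(n), y)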

The I-prior methodology has some advantages compared to competing methods, in particular, it has the following properties:

  1. Since the support of the I-prior is contained in F, the posterior distribution of f under the I-prior is admissible under a broad range of loss functions.

  2. The I-prior is automatic, in the sense that once the kernel for F has been chosen, no further user input is needed.

  3. An EM algorithm with simple E and M steps for finding the maximum likelihood estimators of the scale parameters of the kernel is available (see Section 5).

The first property gives I-prior estimators an advantage over Tikhonov regularizers, which as mentioned before are inadmissible. The second property gives the I-prior methodology an advantage over GPR, which, in addition to a metric over the covariate space, requires the user to specify a prior. The third property gives the I-prior methodology an advantage over both Tikhonov regularization and GPR, for which no simple algorithms are available to estimate scale parameters; this is potentially problematic if there are multiple covariates.

For some commonly used models our methodology has some additional advantages. For example, in the standard approach to multilevel modelling, a multivariate normal distribution is assumed for the regression parameters, which can then be called random effects, and it is necessary to estimate a latent covariance matrix for these random effects. If there are more than a few covariates this can be problematic due to the positive definiteness constraint and the number of parameters to be estimated. In our approach, a smaller number of scale parameters needs to be estimated and no range restrictions need to be taken into account (see Sections 3.4.2 and 6.3 for more details).

Estimation of kernel hyperparameters (such as the length scale parameter for the squared exponential kernel) is not easier with I-priors than with other methods. However, for many practical applications of the I-prior methodology, it suffices to use for each covariate a kernel which does not have hyperparameters (see Section 2.2 and Table 1). When modelling smooth relations between a response and a covariate using I-priors, the one-dimensional or multidimensional Brownian motion RKHS, which consists of functions possessing a directional derivative, is particularly attractive. In particular, I-prior estimators for this RKHS essentially generalize the popular cubic spline smoothers in one dimension (see bergsma20 for a detailed discussion).

1.3 Relation with other work

The present paper complements bergsma20, which covers the case of a single, possibly multidimensional covariate. In this case there is a single scale parameter, and the RKHS framework suffices. In that paper, more details are given on the I-prior derivation, and a generalization of I-priors to a broad class of statistical models is given. The relation with competing methods is outlined, including g-priors, Jeffreys and reference priors, and Fisher kernels. A detailed comparison with Tikhonov regularization is given, with particular detail on the relation with cubic spline smoothing. It is explained in detail how I-priors work when the regression functions are linear, or when they are assumed to lie in the FBM RKHS, which is a particularly attractive RKHS for I-prior modelling.

jamil18 provides a number of extensions to the present methodology, including probit and logit models using a fully Bayes approach, Bayesian variable selection using I-priors, and Nyström approximations for speeding up the I-prior methodology. Furthermore, he contributed a user friendly R package, iprior Jamil (2019), further described in jb19.

ong04 previously used RKKSs in the context of regularization. In particular, they considered a regularization framework in which the usual squared RKHS penalty norm ‖f‖_F² is replaced by the RKKS indefinite inner product ⟨f, f⟩_F. As the latter may be negative, it does not make sense to minimize the “penalized” loss function, and instead they sought a saddle point. Their approach is very different from ours, firstly in that they considered very different RKKSs, and secondly in that, by constructing a Gaussian prior over the RKKS, the indefiniteness of the inner product becomes irrelevant in our approach.

1.4 Overview of paper

In Section 2, a summary of existing theory of RKHSs and RKKSs is given as needed for this paper. In Section 3, we describe the construction of RKKSs over product spaces. To illustrate their use in regression modelling, we describe how a number of well-known models, such as the varying intercept model, one-dimensional smoothing, and multidimensional (or functional) response models, can be described using the RKKS framework. In Section 4, the I-prior is defined and its representation (4) is derived for model (1) with multivariate normal errors. In Section 5, the EM algorithm for estimating scale parameters is described. In Section 6, we apply the I-prior methodology to a number of data examples in the respective areas of multilevel modelling, functional data analysis, classification and longitudinal data analysis, illustrating some possible advantages over existing techniques, and showing competitive predictive performance.

2 Function spaces with reproducing kernels

This section summarizes existing theory as needed for this paper. In Section 2.1 we give the definition and some well-known basic properties of RKHSs and RKKSs. Section 2.2 briefly lists the RKHSs used in this paper. These RKHSs are used as building blocks to construct RKKSs over product spaces, called ANOVA RKKSs, which is the topic of Section 3. The RKHSs we use in this paper will normally be centered, i.e., the functions in the RKHS have zero mean; this is formally described in Section 2.3.

2.1 Definitions and basic properties

The first comprehensive treatment of RKHSs was given by aronszajn50, and their usefulness for statistics was initially demonstrated by parzen61 and further developed by kw70bayes. Some more recent overviews of RKHS theory with a view to application in statistics and machine learning are wahba90, bt04, sc08 (Chapter 4) and hss08. schwartz64 developed a general theory of Hilbertian subspaces of topological vector spaces which includes the theory of RKKSs. The first applications of RKKSs to statistics and machine learning were given by ong04 and canu09. A recent technical survey of the theory of RKKSs is given by gheondea13. Below, we give a very brief overview of the theory as needed for this paper; more details can be found in the aforementioned literature.

We begin with the definition of the (possibly indefinite or negative definite) inner product.

Definition 1.

Let V be a vector space over the reals. A function ⟨·, ·⟩ : V × V → ℝ is called an inner product on V if, for all u, v, w ∈ V and a, b ∈ ℝ,

  • ⟨u, v⟩ = ⟨v, u⟩ (symmetry)

  • ⟨au + bv, w⟩ = a⟨u, w⟩ + b⟨v, w⟩ (linearity)

  • ⟨u, v⟩ = 0 for all v ∈ V implies u = 0 (nondegeneracy)

If ⟨u, u⟩ > 0 for all u ≠ 0, the inner product is called positive definite, and ‖u‖ = ⟨u, u⟩^{1/2} is called a norm on V. If ⟨u, u⟩ < 0 for all u ≠ 0, the inner product is called negative definite. An inner product which is neither positive definite nor negative definite is called indefinite.

Recall that a Hilbert space is a complete inner product space with a positive definite inner product. The more general notion of Krein space is defined as follows.

Definition 2.

A vector space V equipped with the inner product ⟨·, ·⟩_V is called a Krein space if there are two Hilbert spaces V₊ and V₋ spanning V such that

  • All f ∈ V can be decomposed as f = f₊ + f₋, where f₊ ∈ V₊ and f₋ ∈ V₋.

  • For all f, g ∈ V, ⟨f, g⟩_V = ⟨f₊, g₊⟩_{V₊} − ⟨f₋, g₋⟩_{V₋}.

Note that any Hilbert space V is a Krein space, which can be seen by taking V₊ = V and V₋ = {0}.

We next define the notion of a reproducing kernel:

Definition 3.

Let F be a Krein space of functions over a set X. A symmetric function h : X × X → ℝ is a reproducing kernel of F if and only if

  • h(·, x) ∈ F for all x ∈ X;

  • ⟨f, h(·, x)⟩_F = f(x) for all x ∈ X and f ∈ F.

A Hilbert space resp. Krein space is called a reproducing kernel Hilbert space (RKHS) resp. reproducing kernel Krein space (RKKS) if it possesses a reproducing kernel. Sometimes in this paper we will use the shorthand ‘kernel’ to refer to ‘reproducing kernel’.

A function h : X × X → ℝ is said to be positive definite on X if Σ_{i=1}^n Σ_{j=1}^n a_i a_j h(x_i, x_j) ≥ 0 for all scalars a_1, …, a_n and all x_1, …, x_n ∈ X. From the definition of positive definite inner products it follows that the reproducing kernel of an RKHS is symmetric and positive definite. The reproducing kernel of an RKKS can be shown to be the difference of two positive definite kernels, so it need not be positive definite. It follows that an inner product in an RKKS F is the difference of two positive definite inner products,

⟨f, g⟩_F = ⟨f₊, g₊⟩_{F₊} − ⟨f₋, g₋⟩_{F₋}.

We define the norm of f = f₊ + f₋ ∈ F as the norm in the corresponding RKHS, i.e.,

‖f‖_F² = ‖f₊‖_{F₊}² + ‖f₋‖_{F₋}².  (6)

The Moore-Aronszajn theorem states that every symmetric positive definite function defines a unique RKHS. Every RKKS also has a unique kernel, but a given kernel may have more than one RKKS associated with it (e.g., alpay91).

2.2 Some useful RKHSs

Below we describe some RKHSs that we will use in this paper. A summary is given in Table 1.

Domain | RKHS | Functions | Kernel | Centered kernel
Any set | Constant | Constant functions | 1 | -
Finite set M | Canonical | All functions on M | δ_{x,x'} | see Appendix B
Finite set M | Pearson | All zero mean functions on M | δ_{x,x'}/p_x − 1 | (already centered)
Hilbert space | Canonical | Continuous linear functions | ⟨x, x'⟩ | ⟨x − x̄, x' − x̄⟩
Euclidean space | Mahalanobis | Linear functions | x^T S^{-1} x' | (x − x̄)^T S^{-1} (x' − x̄)
ℝ | Brownian motion | Square integrable derivative | ½(|x| + |x'| − |x − x'|) | Eq. (8) (γ = 1/2)
Hilbert space | Brownian motion | Hölder, degree ≥ 1/2 | ½(‖x‖ + ‖x'‖ − ‖x − x'‖) | Eq. (8) (γ = 1/2)
Hilbert space | FBM-γ | Hölder, degree ≥ γ | ½(‖x‖^{2γ} + ‖x'‖^{2γ} − ‖x − x'‖^{2γ}) | Eq. (8)

Table 1: List of RKHSs. Here, δ_{x,x'} is the Kronecker delta, p_x is the proportion of the sample equal to x, x̄ is the sample mean, and S is the sample covariance matrix. Note that only the kernel for the FBM-γ RKHS has a hyperparameter (the Hurst coefficient γ).

2.2.1 RKHS of constant functions

The RKHS F_0 of constant functions over a set X has reproducing kernel given by h(x, x') = 1. For a constant function f with f(x) = c for all x ∈ X, ‖f‖_{F_0} = |c|. (The RKHS of constant functions will be an essential component in the construction of RKKSs over product spaces in Section 3.2.)

2.2.2 RKHSs over finite sets

Let M be a finite set. The canonical RKHS over M is the RKHS whose kernel is the Kronecker delta function, h(x, x') = δ_{x,x'}; it consists of the set of all functions f : M → ℝ, with squared norm

‖f‖² = Σ_{x∈M} f(x)².

Note that, viewing f as an |M|-dimensional vector, the canonical RKHS over M is just standard Euclidean space.

Alternatively, the Pearson RKHS over a finite probability space (M, P) is defined as the RKHS with reproducing kernel

h(x, x') = δ_{x,x'}/P(x) − 1,

and consists of all functions f : M → ℝ with Σ_{x∈M} P(x) f(x) = 0 and

‖f‖² = Σ_{x∈M} P(x) f(x)².  (7)

See jamil18 for a proof. Potential advantages of the Pearson RKHS compared to the canonical RKHS are that, due to the weighting with P(x), the norm of f is less sensitive to the collapsing of categories, and that categories x with zero probability mass do not contribute to the norm.
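As an illustration, the following sketch (ours, not from the paper or the iprior package) builds the Pearson kernel matrix for a categorical covariate, taking P to be the empirical distribution of the observed categories; the function name pearson_kernel is hypothetical.

    import numpy as np

    def pearson_kernel(x):
        """Pearson kernel matrix h(x, x') = delta_{x,x'} / p_x - 1.

        Sketch using the empirical proportions p_x of the observed categories
        as the probability measure P; rows and columns follow the order of
        the observations in x.
        """
        x = np.asarray(x)
        cats, counts = np.unique(x, return_counts=True)
        p = dict(zip(cats, counts / len(x)))
        return np.array([[(1.0 / p[a] if a == b else 0.0) - 1.0 for b in x]
                         for a in x])

    H = pearson_kernel(["a", "a", "b", "c", "b", "a"])
    print(np.allclose(H.sum(axis=1), 0.0))  # True: rows sum to zero (centred w.r.t. P)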

2.2.3 RKHSs over Hilbert spaces

Let X be a Hilbert space with inner product ⟨·, ·⟩_X. The canonical RKHS over X is defined as its continuous dual space, whose reproducing kernel is given by

h(x, x') = ⟨x, x'⟩_X.

Functions in this space are of the form f_β(x) = ⟨β, x⟩_X for β ∈ X, with norm ‖f_β‖ = ‖β‖_X.

A special case is the Mahalanobis RKHS, defined as the canonical RKHS over Euclidean space equipped with the Mahalanobis inner product; for a covariance matrix Σ, it is defined as

⟨x, x'⟩ = x^T Σ^{-1} x'.

The Brownian motion RKHS is defined as the RKHS over X whose reproducing kernel is the generalized Brownian motion covariance kernel

h(x, x') = ½ ( ‖x‖ + ‖x'‖ − ‖x − x'‖ ).

Functions in the Brownian motion RKHS are Hölder of degree at least 1/2 (see bergsma20 for a proof). In the simplest nontrivial case, X = ℝ, and the RKHS consists of functions with a square integrable derivative, whose norm is the norm of the derivative, i.e., every f can be written as f(x) = ∫₀ˣ g(t) dt for some square integrable g, and has norm ‖f‖ = ( ∫ g(t)² dt )^{1/2}.

The fractional Brownian motion (FBM) RKHS is the RKHS whose reproducing kernel is the generalized FBM covariance kernel

h_γ(x, x') = ½ ( ‖x‖^{2γ} + ‖x'‖^{2γ} − ‖x − x'‖^{2γ} ),  0 < γ < 1.

Functions in the FBM-γ RKHS are Hölder of degree at least γ (see bergsma20 for a proof).
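A direct transcription of the generalized FBM kernel into code (our sketch; the special case gamma = 0.5 recovers the Brownian motion kernel):

    import numpy as np

    def fbm_kernel(X, gamma=0.5):
        """Generalised fractional Brownian motion kernel matrix.

        h_gamma(x, x') = 0.5 * (||x||^(2g) + ||x'||^(2g) - ||x - x'||^(2g)),
        with g = gamma; gamma = 0.5 gives the Brownian motion kernel.
        X is an (n, d) array of covariate values (a 1-d array is treated as
        n scalar observations).
        """
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X[:, None]
        norms = np.linalg.norm(X, axis=1) ** (2 * gamma)
        dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) ** (2 * gamma)
        return 0.5 * (norms[:, None] + norms[None, :] - dists)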

2.3 Centering of an RKKS

We say a function space F over X is centered with respect to a data set x_1, …, x_n if

Σ_{i=1}^n f(x_i) = 0 for all f ∈ F.

It can be verified that an RKKS is centered if and only if its kernel is centered, in the sense that Σ_{i=1}^n h(x_i, x) = 0 for all x ∈ X. If h is a kernel, then h̃ defined as follows is centered:

h̃(x, x') = h(x, x') − n^{-1} Σ_{i=1}^n h(x, x_i) − n^{-1} Σ_{i=1}^n h(x_i, x') + n^{-2} Σ_{i=1}^n Σ_{j=1}^n h(x_i, x_j).

Table 1 gives a list of kernels discussed in Section 2.2 and their centered versions (see Appendix B for the derivation of the centered canonical kernel over a finite set). The centered FBM RKKS has kernel

h̃_γ(x, x') = −½ ( ‖x − x'‖^{2γ} − n^{-1} Σ_{i=1}^n ‖x − x_i‖^{2γ} − n^{-1} Σ_{i=1}^n ‖x' − x_i‖^{2γ} + n^{-2} Σ_{i=1}^n Σ_{j=1}^n ‖x_i − x_j‖^{2γ} ),  (8)

where the Brownian motion RKHS is obtained if γ = 1/2.
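In matrix form, centering a kernel over the observed data points amounts to pre- and post-multiplying the kernel matrix by the usual centering matrix; a small sketch of this (ours):

    import numpy as np

    def centre_kernel(H):
        """Empirically centre an n x n kernel matrix H evaluated at the data.

        Implements h~(x, x') = h(x, x') - mean_i h(x, x_i) - mean_i h(x_i, x')
                               + mean_{i,j} h(x_i, x_j),
        i.e. H~ = C H C with C = I - (1/n) * ones((n, n)).  After centring,
        every row and column of H~ sums to zero.
        """
        n = H.shape[0]
        C = np.eye(n) - np.ones((n, n)) / n
        return C @ H @ C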

3 Construction of RKKSs over product spaces

ANOVA constructions of RKKSs over product spaces are a natural tool for formulating regression models, and they generalize the ANOVA RKHSs which were introduced for this purpose by wahba90anova and gw93. In Section 3.2 we describe ANOVA RKKSs, an immediate extension of ANOVA RKHSs, which are needed in this paper. In Section 3.3, we describe what is, as far as we are aware, a novel approach to using scale parameters parsimoniously in the ANOVA construction. In Section 3.4 we show how the framework is useful in regression, as, for example, it can be used to easily formulate multilevel and varying coefficient models.

3.1 Illustrative example

As a very simple example, consider the set F of functions of the form

f(x_1, x_2) = β_1 x_1 + β_2 x_2 + β_{12} x_1 x_2,

equipped with the inner product of f and f̃ given by

⟨f, f̃⟩_F = λ_1 β_1 β̃_1 + λ_2 β_2 β̃_2 + λ_{12} β_{12} β̃_{12},

where the λs are real-valued scale parameters. If the λs are nonnegative, the inner product is positive definite and F is an ANOVA RKHS. If at least one of the λs is negative, the inner product is indefinite, and F is an ANOVA RKKS. If a regression function in F is estimated using the least squares method, the inner product is not needed. However, in high dimensions the least squares method leads to overfitting, and an inner product is needed to be able to estimate f.

3.2 ANOVA RKKSs

An ANOVA decomposition of a function f over a product space X = X_1 × ⋯ × X_p is given by

f(x_1, …, x_p) = α + Σ_k f_k(x_k) + Σ_{k<l} f_{kl}(x_k, x_l) + ⋯ + f_{1⋯p}(x_1, …, x_p),

where the components are orthogonal in some way. To formalize this, let us first define the tensor product of RKHSs. Let F_1 and F_2 be two RKHSs over X_1 resp. X_2. For f_1 ∈ F_1 and f_2 ∈ F_2, the tensor product f_1 ⊗ f_2 is defined by (f_1 ⊗ f_2)(x_1, x_2) = f_1(x_1) f_2(x_2). The tensor product of F_1 and F_2 is denoted as F_1 ⊗ F_2 and is defined as the closure of the span of the set of functions {f_1 ⊗ f_2 : f_1 ∈ F_1, f_2 ∈ F_2}, equipped with the inner product

⟨f_1 ⊗ f_2, g_1 ⊗ g_2⟩_{F_1 ⊗ F_2} = ⟨f_1, g_1⟩_{F_1} ⟨f_2, g_2⟩_{F_2}.

The tensor product of RKKSs is defined analogously, the closure being defined with respect to the corresponding positive definite inner product.

Let F_0 be the RKHS of constant functions over X with kernel h_0 ≡ 1, and let F_k be an RKKS over X_k with kernel h_k (k = 1, 2). An ANOVA RKKS over X is given as

F = F_0 ⊕ F_1 ⊕ F_2 ⊕ (F_1 ⊗ F_2),

with kernel given by

h((x_1, x_2), (x_1', x_2')) = 1 + h_1(x_1, x_1') + h_2(x_2, x_2') + h_1(x_1, x_1') h_2(x_2, x_2').

In this paper we assume the components F_1 and F_2 are centered relative to the data (Section 2.3). Hence, the f_k ∈ F_k and f_{12} ∈ F_1 ⊗ F_2 are orthogonal to the constant functions in the sense that

Σ_{i=1}^n f_k(x_{ik}) = 0 and Σ_{i=1}^n f_{12}(x_{i1}, x_2) = Σ_{i=1}^n f_{12}(x_1, x_{i2}) = 0

for any x_1 ∈ X_1 and x_2 ∈ X_2.

With p covariates, the ANOVA model with all interactions can be written succinctly as

F = ⊗_{k=1}^p (F_0 ⊕ F_k),  (9)

with reproducing kernel

h(x, x') = ∏_{k=1}^p (1 + h_k(x_k, x_k')).  (10)

More general ANOVA kernels are described in Appendix A.

3.3 Scale parameters for kernels

In practice, the length of a vector in an RKHS is measured on an arbitrary scale, and this can be taken into account by multiplying the kernel by a real-valued scale parameter which is to be estimated. In the ANOVA case, we can use component kernels λ_0 · 1, λ_1 h_1 and λ_2 h_2 for real-valued λ_0, λ_1 and λ_2. Substituting these into the product construction (10) gives an expression in which not all scale parameters are identified; after removing the redundancy we obtain the identified parameterization

h = λ_0 + λ_1 h_1 + λ_2 h_2 + λ_1 λ_2 h_1 h_2.  (11)

Typically, the kernels h_1 and h_2 will be positive definite, so that the corresponding F_1 and F_2 are RKHSs. Then if at least one of the lambda parameters is negative, a function space with (11) as its kernel will be an RKKS.

In the literature, a less parsimonious construction than (11) has been used, namely

h = λ_0 + λ_1 h_1 + λ_2 h_2 + λ_{12} h_1 h_2  (12)

(e.g., wahba90, Section 10.2, bt04, Section 10.2, gu13, Section 2.4.5). We refer to the corresponding RKKS as the extended ANOVA RKKS. Here, each of the four terms has a separate scale parameter, so this construction is less parsimonious than our approach, which only requires three scale parameters. For models with all interactions of p covariates, our approach has p + 1 scale parameters, while 2^p parameters are required if every interaction is assigned a separate parameter.
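To see the parsimonious construction at work, the sketch below (ours; it assumes two precomputed, centred component kernel matrices H1 and H2 and an explicit scale for the constant term) builds the kernel (11) from a single scale parameter per covariate and contrasts it with the extended version (12), in which the interaction receives its own parameter.

    import numpy as np

    def anova_kernel_parsimonious(H1, H2, lam1, lam2, lam0=1.0):
        """Parsimonious ANOVA kernel (11): one scale parameter per covariate.

        The interaction kernel H1 * H2 (elementwise product) is scaled by
        lam1 * lam2 rather than by a separate parameter, so with p covariates
        only p + 1 scale parameters are needed however many interactions the
        model contains.
        """
        return lam0 + lam1 * H1 + lam2 * H2 + lam1 * lam2 * (H1 * H2)

    def anova_kernel_extended(H1, H2, lam1, lam2, lam12, lam0=1.0):
        """Extended ANOVA kernel (12): a separate scale for every term."""
        return lam0 + lam1 * H1 + lam2 * H2 + lam12 * (H1 * H2)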

3.4 Application in regression modelling

We show how some well-known models which can be written in the form (1) can be formulated using the ANOVA function space construction. In Sections 3.4.1 and 3.4.2 we consider models with one and two covariates respectively (see also Tables 2 and 3). In Section 3.4.3 we consider functional responses, and in Section 3.4.4 we consider multi-class classification models.

3.4.1 Models with one covariate

Model: y_i = α + f_1(x_i) + ε_i, f_1 ∈ F_1, x_i ∈ X_1

X_1 | RKHS F_1 | Model name | Usual notation
A finite set | Pearson/Canonical | One-way ANOVA/Varying intercept model | y_{ij} = α + β_j + ε_{ij}
ℝ | Canonical | Simple regression | y_i = α + β x_i + ε_i
ℝ | FBM | Smoothing spline model | y_i = α + f(x_i) + ε_i, f smooth
An RKHS | Canonical | Functional linear regression | y_i = α + ∫ β(t) x_i(t) dt + ε_i
An RKHS | FBM | Smooth functional regression | y_i = α + f(x_i) + ε_i, f smooth

Table 2: Some models with one, possibly multidimensional covariate

We consider examples of regression models (1) where f is of the form (9) with p = 1. We can write

f(x_i) = α + f_1(x_i),  (13)

where f_1 is in the centered RKHS F_1 over a set X_1 with kernel h_1. Recall that the centering means Σ_{i=1}^n f_1(x_i) = 0 for all f_1 ∈ F_1.

If X_1 is a finite set, then (13) is known as a one-way ANOVA model or varying intercept model. The usual notation for this model is

y_{ij} = α + β_j + ε_{ij},  (14)

where y_{ij} is the i-th y-value for which x = j. Suitable RKHSs are the canonical RKHS, for which Σ_j β_j = 0, or the Pearson RKHS, for which Σ_j n_j β_j = 0, where n_j is the number of x_i's equal to j (Section 2.3).

If X_1 is a subset of a Hilbert space, a flexible range of models is obtained by taking F_1 to be the canonical RKHS, the Brownian motion RKHS, or more generally the FBM RKHS. In the first case, we obtain

f(x_i) = α + ⟨β, x_i⟩ for some β ∈ X_1.

If X_1 = ℝᵈ we obtain the special case

f(x_i) = α + β^T x_i,

or if X_1 is a space of square integrable functions we obtain the functional linear model (e.g., ymw05)

f(x_i) = α + ∫ β(t) x_i(t) dt.

A smooth dependence model is obtained if F_1 is the Brownian motion or FBM RKHS. A special case is the smoothing spline model

y_i = α + f_1(x_i) + ε_i

when X_1 ⊆ ℝ. However, there are many other potentially useful linear or smooth dependence models, such as the model where F_1 is a Brownian motion RKHS over a multidimensional or functional covariate space.

3.4.2 Models with two covariates

Model: y_i = α + f_1(x_{i1}) + f_2(x_{i2}) + f_{12}(x_{i1}, x_{i2}) + ε_i, f_k ∈ F_k, f_{12} ∈ F_1 ⊗ F_2

RKHS (first covariate) | RKHS (second covariate) | Model name
Pearson (finite set) | Canonical | Varying slope model
FBM | Canonical | Varying coefficient model
FBM | Canonical (finite set) | Multivariate regression
FBM | Canonical | Functional response model

Table 3: Some models with two, possibly multidimensional covariates

We next consider examples of regression models (1) where f is of the form (9) with p = 2. We can write

f(x_{i1}, x_{i2}) = α + f_1(x_{i1}) + f_2(x_{i2}) + f_{12}(x_{i1}, x_{i2}),  (15)

where f_k is in the centered RKHS F_k over a set X_k with kernel h_k (k = 1, 2), and f_{12} ∈ F_1 ⊗ F_2.

First consider the case that X_1 = ℝ and X_2 is a finite set of clusters. Taking F_1 to be the canonical RKHS yields the varying slope model, that is, for each element of X_2 we have a linear dependence model. The usual representation of this model is as a two-level regression model asserting that the y_{ij} depend both on the covariate x_{ij} and on the cluster j,

y_{ij} = β_{0j} + β_{1j} x_{ij} + ε_{ij},  (16)

where

(β_{0j}, β_{1j})^T ∼ N_2((β_0, β_1)^T, Σ_u).  (17)

Here, β_{0j} is called the intercept for cluster j and β_{1j} its slope.
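As a concrete instance of the two-covariate construction, the kernel for such a varying intercept and slope model can be assembled from a Pearson kernel on the cluster labels and a canonical (linear) kernel on the covariate. The sketch below is ours and uses the parsimonious scale parameterization of Section 3.3; the scale parameters lam1 and lam2 are taken as given (in practice they would be estimated, e.g. by the EM algorithm of Section 5).

    import numpy as np

    def varying_slope_kernel(cluster, x, lam1, lam2):
        """ANOVA kernel for a varying intercept/slope (two-level) model.

        cluster : array of cluster labels (categorical covariate, Pearson kernel)
        x       : array of real covariate values (canonical kernel, centred)
        Returns lam1*Hc + lam2*Hx + lam1*lam2*(Hc*Hx); the interaction term,
        which carries the cluster-specific slopes, needs no extra parameter.
        """
        cluster = np.asarray(cluster)
        cats, counts = np.unique(cluster, return_counts=True)
        p = dict(zip(cats, counts / len(cluster)))
        Hc = np.array([[(1.0 / p[a] if a == b else 0.0) - 1.0 for b in cluster]
                       for a in cluster])            # Pearson kernel on clusters
        xc = np.asarray(x, dtype=float) - np.mean(x)
        Hx = np.outer(xc, xc)                         # canonical (linear) kernel
        return lam1 * Hc + lam2 * Hx + lam1 * lam2 * (Hc * Hx)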

Next consider the case X_1 = X_2 = ℝ. The canonical RKHSs give the model

f(x_{i1}, x_{i2}) = α + β_1 x_{i1} + β_2 x_{i2} + β_{12} x_{i1} x_{i2}.

Note that this approach is suitable if the x_{i1} and x_{i2} are measured on different scales, such as height and weight. If they are measured on the same scale, e.g., height measured at two different time points, it may be better to consider the pairs (x_{i1}, x_{i2}) as a single covariate in ℝ² and use the approach in Section 3.4.1.

If F_1 is the canonical RKHS and F_2 the Brownian motion RKHS, we obtain

f(x_{i1}, x_{i2}) = α + β_1 x_{i1} + f_2(x_{i2}) + x_{i1} f_{12}(x_{i2}).

This has been called the varying coefficient model Hastie & Tibshirani (1993), where β_1 + f_{12}(x_{i2}) is the varying regression coefficient for the x_{i1}.

3.4.3 Functional response model

We now consider the case that, rather than scalars, the responses y_i are real-valued functions over a set T. A regression model can then be formulated as

y_i(t) = f(x_i, t) + ε_i(t),  t ∈ T,  (18)

where f lies in an ANOVA RKKS over X_1 × T.

If T is a finite set, y_i can be viewed as a vector in ℝ^{|T|}. Then if additionally the RKHS over T is the canonical RKHS, we obtain the usual multivariate regression model

y_{it} = α_t + f_t(x_i) + ε_{it},  t ∈ T,

where y_{it} denotes the t-th component of y_i and f_t(x_i) = f(x_i, t).

In practice, we do not observe y_i entirely but rather a finite set of evaluations at index points t_{i1}, …, t_{in_i} ∈ T, i.e., we observe y_i(t_{i1}), …, y_i(t_{in_i}). For example, in a repeated measurements setting, both the number of measurements and the times of measurement may be different for different units i. Then (18) becomes

y_i(t_{ij}) = f(x_i, t_{ij}) + ε_{ij},  j = 1, …, n_i.

Note that this can be viewed as an instance of model (1).

In Section 6.5 we apply this model, as well as an extension with an extra covariate, to a longitudinal data set, taking the FBM RKHS for the function space over T.

3.4.4 Multi-class classification

Consider a multi-class classification problem where, with C a finite set of classes, we have observations (x_i, c_i) for x_i ∈ X and c_i ∈ C. The aim is to find a prediction function to predict the class for a future observation x. We can use the present framework as follows. Let y_{ic} = 1 if c_i = c and let y_{ic} = 0 otherwise. We may now consider the model

y_{ic} = f(x_i, c) + ε_{ic},

where the f satisfy the restriction

Σ_{c∈C} f(x, c) = 1 for all x ∈ X.

Assuming the errors have mean zero, we then have Σ_{c∈C} E(y_{ic}) = 1, and we may consider a decomposition

f(x, c) = 1/|C| + f_2(c) + f_{12}(x, c),  (19)

subject to the identifying restrictions Σ_{c∈C} f_2(c) = 0, Σ_{c∈C} f_{12}(x, c) = 0 for all x ∈ X, and Σ_{i=1}^n f_{12}(x_i, c) = 0 for all c ∈ C. Hence, we may assume F_2 is the centered canonical RKHS over C, and F_1 is any appropriate RKKS over X. Note that a main effect for x is not needed in (19).

Of course, more than one covariate can be incorporated, for example with we may take