Links between frequentist and Bayesian approaches to smoothing were highlighted early on in the smoothing literature, and power much of the machinery that underlies the modern generalized additive modelling framework (implemented in software such as the R package mgcv), but they tend to be unknown or under appreciated. This article aims to highlight useful links between Bayesian and frequentist approaches to smoothing, and their practical applications (with a somewhat mgcv-centric viewpoint).READ FULL TEXT VIEW PDF
This paper is about Generalized Additive Models (GAMs; e.g., Wood, 2017)111 Well, it’s also about Generalized Additive Mixed Models (GAMMs) and
Generalized Linear Mixed Models (GLMMs) too.
Well, it’s also about Generalized Additive Mixed Models (GAMMs) and Generalized Linear Mixed Models (GLMMs) too.. That is, models of the form:
where and where () is the response and indicates an exponential family distribution with mean and scale parameter .
is a row vector of strictly parametric components (i.e., things that look like regular GLM components) andare their associated coefficients. The s are “smooth” functions of one or more of the covariates (, etc). Smooth terms are in turn constructed from sums of simple basis functions (e.g., DeBoor, 1978). In general for some smooth of covariate we have the following decomposition:
where are fixed basis functions and
are coefficients to be estimated (one can always augment the design matrix,, so the parametric and smooth components can be written as a matrix-coefficient-vector multiplication i.e., ). Smooth terms tend to be flexible so, to avoid overfitting, we penalize the flexibility of each smooth term according to its wiggliness. Generally such a penalty will be an integral (sometimes a sum) of integrated, squared derivatives of . This penalty can be written in the form where is a vector of coefficients, is a matrix of the fixed parts of the penalty (integrated, squared derivatives of the s; see examples below) and are smoothing parameters to be estimated that control the influence of the penalty ( is a vector of all the coefficients for all terms and the
s are padded to form a block diagonal matrix)222Note there are more general forms that are possible (where e.g., smoothing parameters are shared or transformed), but I restrict myself to this form here.. More than one term of the summation might correspond to a single in our model—i.e., there may be multiple penalty terms corresponding to a single smooth. We want to estimate:
where is the log-likelihood and there are smoothing parameters to estimate. Conditional on the smoothing parameters (), estimation of is relatively straight-forward and the problem can be attacked with penalized iteratively re-weighted least squares (PIRLS; e.g., Wood, 2017, section 6.1.1). However estimating both and is more complicated. Several strategies have been suggested, including: Akaike’s Information Criterion (AIC)/UnBiased Risk Estimator (UBRE)/Mallows’ (Craven and Wahba, 1978), Generalized Cross Validation (GCV; Wahba and Wold, 1975), REstricted Maximum Likelihood (REML; known as “generalized maximum likelihood” by Wahba 1985) and Marginal Likelihood (ML)333Backfitting (Hastie and Tibshirani, 1986) could be included in this list but doesn’t itself include a way of estimating smoothing parameters, though e.g., GCV could be used.. These methods essentially split into two classes: those that minimize prediction error (AIC, UBRE, Mallows’
and GCV) and likelihood-based approaches which cast the smooth functions as random effects and smoothing parameters as variance parameters (REML and ML)(Wood, 2011). This paper will only address REML estimation, which has been shown theoretically by Reiss and Ogden 2009 and in practice in Wood (2011) that GCV can overfit (undersmooth) at finite sample sizes and can be subject to issues with multiple optima (though GCV is still asymptotically optimal as ).
GAMs are often taught as “an extension of the linear model” – adding wiggles to make a (G)LM more flexible (often as a more principled step forward from adding polynomial terms into a linear model). When we talk about adding these smooth functions to our model, we tend to concentrate on equations like (1), looking at the mean effects of including smooths rather than thinking about the penalty. One might think of the penalty as a way of constraining our fit not to be too wiggly, ensuring that our model isn’t overfitting. There are two additional (non-exclusive) perspectives on the basis and penalty that will be useful to consider:
Basis functions are “solutions” to objective functions like (3); in that sense the basis functions are derived from the objective and are “optimal” given that choice of objective (i.e, the form of the ). We can therefore thing of the bases as consequences of the problem definition (specifically in terms of wigglyness and dependency in the data).
Ruppert et al. (2003) and Verbyla et al. 1999 (see references therein, in particular Speed (1991)) show that smooths can be formulated in a way such that they can be fitted using standard mixed model approaches. In that case we view the coefficients of the smooths () as random effects and the penalties are their associated (prior) precision matrices. Though this is primarily seen as a computational “trick”, the interpretation will be useful below.
Most often folks think of splines (DeBoor, 1978) when they hear “smooths”, but there are many possible model terms that can be specified as a basis with a (sum of) penalties. These include (Gaussian) Markov random fields (Rue and Held, 2005), Gaussian processes (Rasmussen and Williams, 2006), random effects (Speed, 1991) and varying coefficient models (Hastie and Tibshirani, 1993). Here I use the word smooth to include all these possible flexible model terms.
Bayesian interpretations of the smoothing process date back to at least Whittle (1958) (Silverman, 1985). Thinking about a single smooth in the model, which we’ll call , the approaches fall into two categories (Hastie and Tibshirani, 1990, section 3.6):
infinite-dimensional: is a sum of a random linear function and an integrated Wiener process, combining this with a likelihood for the data gives a posterior mean for , this is a smoothing spline with appropriate .
finite-dimensional: use a basis expansion as (2) and setup priors on the smoother’s coefficients (the ). In combination with a likelihood for the data, a normal prior on will lead to the posterior modes of coinciding with their maximum likelihood estimate.
The infinite-dimensional approach is problematic as a prior (or priors) must be set up over the complete space of possible smooth functions for each smooth term in the model; this is conceptually tricky (Wahba, 1983; Silverman, 1985)
. One can think of the infinite-dimensional problem proposed in 1. as having a finite-dimensional solution in 2. (a basis expansion); this is known as the “kernel property” in the support vector machine literature(Hastie et al., 2009). These differing Bayesian approaches give the same posterior mean (point estimate) but different posterior uncertainty (Hastie and Tibshirani, 1990) (see section 3.2 below).
Here we’ll think about 2. as the more practical alternative (and the one that has been explored more in the literature). From this we then have two choices as to how to do estimation:
“Full Bayes” (FB): in which we formulate a likelihood and attach priors to the smoothing parameters, as well as the model coefficients. This is easily achievable via mgcv::jagam (which implements translation between mgcv syntax and JAGS syntax; Wood 2016), greta444Using the same trick as jagam with Hamiltonian Monte Carlo to speed up sampling, Golding and Miller, https://greta-dev.github.io/greta/. or BayesX (Fahrmeir and Lang, 2001; Fahrmeir et al., 2004; Brezger et al., 2005).
“Empirical Bayes” (EB): in which we treat the smoothing parameters as fixed effects and estimate them from the data. This is what REML estimation does (Wood, 2011) and can be interpreted as a Best Linear Unbiased Predictor (BLUP; Speed, 1991). The REML criterion has exactly the same form as the Bayesian marginal likelihood, for computational efficiency the Laplace approximation is often used (which is exact in the Gaussian case).
We explore these two approaches below in the examples. Section 3.4 shows an example of using JAGS/jagam to fit a GAM.
Note that we can get to the Bayesian approach conveniently by exponentiating the penalized estimation problem in (3), which gives us:
If we setup a penalty for some smooth term in our model, what are we really telling the model to do? We are, in some sense saying “this is how we want this term to act”—we are specifying that observations which are close to each other in covariate space have similar values, the response varies smoothly and that the true function we seek to estimate is more likely to be smooth than wiggly (hence penalizing wigglyness). The Bayesian formulation allows us to be more explicit about these beliefs (Wood, 2017, section 5.8). In general if we want to fit a model there is no unique solution unless some restriction is put on the form of (Watson, 1984). Fahrmeir et al. 2010
expand on the idea of Bayesian regularisation and its interpretation: techniques such as ridge regression, lasso,regularization, elastic net, etc are conditionally Gaussian, in the sense they can be expressed as Gaussian priors with different variance specifications leading to rather different models.
Following on from the above, the priors for the coefficients of smooth components of the model are specified as , where is the pseudoinverse of the penalty matrix (Kimeldorf and Wahba, 1970; Wahba, 1978, 1990). A pseudoinverse is required as usually is rank-deficient. This due to some basis functions not having derivatives, so they don’t get penalized, we refer to these terms as “in the nullspace” of the penalty; in most 1D splines this is the slope and intercept. Basis functions in the nullspace of the penalty have infinite variance (improper uniform priors) (Wood, 2006); Hastie and Tibshirani (1990, section 3.6) state “one is assuming complete prior uncertainty about the linear and constant functions, and decreasing uncertainty about the higher order functions.” As the penalty will have larger values for more wiggly components (we penalize those more strongly) the corresponding variance component will have less uncertainty (as we don’t think that the function should be “too” wiggly), this makes explicit our belief that smooth functions are more likely than wiggly ones (Wood, 2006)
. If we opt for a fully Bayesian approach we put (“hyper”)priors on the smoothing parameters (which are effectively variance components) and use a diffuse but proper prior such as the gamma distribution. Specifying priors on the precision of components of “multilevel”/“hierarchical” models is a tricky business(Gelman, 2006; Simpson et al., 2017)555See also http://andrewgelman.com/2018/04/03/justify-my-love/., this is especially the case for smoothing parameters as the “true” values of the smoothing parameter(s) could be infinite (or at least numerically infinite for computing purposes), e.g., for components that are truely linear terms.
There are penalties (and hence priors) which lead to completely penalized basis functions (e.g., the P-spline approach in Lang and Brezger, 2004), in which case the prior for is proper. This can be done for any smooth by using the methods in section 3.3. It is also worth noting that identifiability constraints (Wood, 2017, section 5.4.1) that need to be imposed on the model may lead to proper priors, so these more complex methods may not be necessary in some situations (Marra and Wood, 2011).
When using splines we must also decide on knot placement/number and basis dimension ( above; these are usually linked). Exactly how this is done depends on the basis that is used, but it is clear that since involves basis functions (or at least their derivatives), so the number of basis functions (and/or number of knots) and knot placement will affect the posterior. Though the knot selection problem can be avoided via simple even grid spacing (which may be computationally demanding) or with eigen-based approaches like thin-plate regression splines (Wood, 2003) there are often reasons to involve humans in knot placement (or at least let them dictate a rule for knot placement). Using more basis functions/knots than “necessary” and equally spacing them allows wigglyness to be dictated by the smoothing parameter, rather than making results sensitive to multiple “parameters” in the model (Wood, 2017, section 5.9).
Turning the crank on the Bayesian sausage machine, we can get to the posterior marginal for : where for the Gaussian likelihood case and for the exponential family case we have , where is a variance parameter and is a scale parameter (Wood, 2017, section 6.10).
For FB, will include uncertainty about the smoothing parameter(s). For EB we only have information conditional on the value of the smoothing parameter(s). Wood et al. (2016) propose a correction to to account for the uncertainty in the smoothing parameter(s) using a Taylor expansion to approximate the extra variability, adapting the methods of Kass and Steffey 1989. These three versions of (FB, EB and EB corrected) are explored in section 3.4.
The Bayesian results above lead to some useful applications. Here I highlight a couple of the more commonly-used ones. We’ll investigate these properties using cubic regression splines (Wood, 2017, section 5.3.1). The penalty for this basis is:
where and are the first and last knots. This leads us to define the element of the penalty matrix, as:
The 10 basis functions are shown in the left panel of Figure 1 and the right shows the estimates (basis functions multiplied by their estimated coefficients and, in red, the final estimated smooth). Figure 2 shows the penalty matrix (left) and corresponding prior variance-covariance matrix (right) for such a model; both are highly structured. Note that these matrices are only , as in this case the linear term (flat line in the large right plot in figure 1) is unpenalized. The penalty shows the largest values (most yellow) on the diagonal, with smaller values off the diagonal and further from the diagonal (the smallest being the very dark blue off-diagonal band). This emphasizes the local nature of the basis (Hastie and Tibshirani, 1990, section 2.10).
Data used below were generated using the gamSim function in the package mgcv using the test functions from Gu and Wahba (1991), in each case noise was added.
Since we know the posterior distribution of the coefficients (in the EB case conditional on the smoothing parameter(s)), we can use this result to simulate from the posterior and look at possible smooths that the model can generate. This can be useful to visualize the uncertainty in the fitted smooth, but can also be useful for calculating uncertainty about summary statistics of predictions from the fitted model.
We can do this in general by following this algorithm (Wood, 2017, section 7.2.7):
Let be the number of samples to generate.
Form , the matrix that maps the model covariates to the linear predictor (the prediction equivalent of the design matrix).
For in :
Calculate the linear predictor .
Apply the inverse link function, , so .
Calculate and store the required summary over .
Perform inference on the summaries (e.g., calculating empirical variance, percentile intervals, etc).
These results can be used to calculate credible intervals for functions of the predictions(Wood, 2017, section 6.10). For example, this technique can be useful for calculating time series summaries (e.g., count in a given year) for complex spatiotemporal models (e.g., Augustin et al., 2009; Marra et al., 2012; Augustin et al., 2013). Example of simulated smooths are shown in figure 3.
It is useful to construct confidence intervals around the smooths in a model not only as a visual check but also for hypothesis testing. Using the results in section 2.2 for the posterior of , we can construct estimates of uncertainty about our fitted model. We can build pointwise “confidence intervals” (to use the scare quotes of Wahba 1990) as , where is our estimated smooth, is the variance of the smooth at point and is the usual appropriate value from a normal CDF. These intervals have good frequentist across-the-function properties: that is, a 95% credible interval has close to 95% coverage, when averaged over the whole function (there may be over and under coverage at the peaks and troughs of the function). Justification for these intervals was developed in Nychka (1988) and expanded to the GAM case in Marra and Wood (2012). The Bayesian perspective is important here as frequentist uncertainty measures will be subject to smoothing bias in the model terms (Wahba, 1990, chapter 5), the Bayesian formulation includes a term that accounts for this bias and hence the intervals have good coverage properties (Wood, 2017, section 6.10.1). Intervals for the cubic spline model fitted in the previous section are shown in figure 3. Hodges and Clayton (2011, section 4.1.1) has a good example of why taking a simplistic frequentist view of constructing the confidence intervals leads to incorrect inference.
Since these intervals have good coverage and tell us about the whole function (by the “across-the-function” property), we can use them to test the hypothesis —whether a term should be dropped from the model. Considering the block of which relates to a given smooth in the model, we can construct such a test. See Wood (2017, section 6.12.1) and references therein for more details.
An alternative method to hypothesis testing for term selection in the model is to use shrinkage/penalty-type methods to remove terms during fitting. Many approaches are possible (see Marra and Wood (2011) for some examples) but here I focus on two approaches implemented in mgcv.
As described in section 2.1, the prior placed on can be improper due to rank deficiency in , this is caused by some basis functions having no derivatives (of the order of the penalty) as as such they are not penalized. We can make our priors proper by simply adding an extra penalty term to the model for the nullspace components of each term (the “double penalty” approach of Marra and Wood (2011)), this is achieved by eigendecomposing the penalty matrix then forming the matrix where
is a matrix of eigenvectors corresponding to the zero entries on the diagonal of
(zero eigenvalues). This approach is implemented as theselect=TRUE option in mgcv::gam, and includes one additional smoothing parameter for each smooth term in the model. Alternatively one can form a basis where the nullspace terms have a shrinkage penalty applied to them by simply adding a small to the diagonal entries of so that the resulting penalty matrix is not rank-deficient (the “shrinkage” approach of Marra and Wood (2011); implemented as the ‘‘cs’’ and ‘‘ts’’ bases in mgcv).
These two approaches lead to rather different interpretations of how wigglyness should be penalized: the double penalty approach makes no assumption about how much to smooth the nullspace in the penalty relative to the other parts of the smooth (and determines this during fitting), the shrinkage approach however assumes that the nullspace should be penalized less than the other parts of the smooth. These are of course not the only options (simply those which are implemented in software) and there may be reasons for penalizing different parts of the nullspace in different ways (and to different degrees).
Looking at the cubic splines, if we add additional smooths of (nonsense) covariates to the model, using the either of these techniques should remove those terms during model fitting. Figure 4, bottom right panel shows the term being removed from the model using the cubic regression spline with shrinkage. The top right shows the result when shrinkage is not applied.
The adaptation of the model using the extra penalty described in Marra and Wood (2011) allows us to construct proper priors for any term in our model. We can then do fully Bayesian fitting via Gibbs sampling using JAGS (Wood, 2016). We can then compare the uncertainty estimates obtained using the empirical Bayes approach (implied flat prior on the smoothing parameters) and the fully Bayesian approach (proper gamma priors are put on the smoothing parameters). Note that this approach is relatively inefficient (sampling-wise) and much work has been done on efficient MCMC schemes (Fahrmeir and Lang, 2001; Brezger et al., 2005; Rue and Held, 2005). Alternatively one could use exactly the same approach with Hamiltonian Monte Carlo schemes, like those implemented in Stan (Carpenter et al., 2017) and greta. One of the big advantages of this type of approach is to build larger, more flexible, hierarchical models within the MCMC samplers provided by these general software frameworks.
As an example we set up our model as:
with the same data as the previous example (see figure 4). A cubic spline basis was again used with a multivariate normal prior on the s and a vague gamma prior on the smoothing parameters (of which there are four, two for each penalty), we can then use a Gibbs sampler to obtain posterior samples. When estimated from simulation (as here) includes a component accounting for uncertainty in the smoothing parameter (Wood, 2016). Plotting the matrices for empirical Bayes estimator (both uncorrected and corrected for smoothing parameter uncertainty) and for the fully Bayesian case, we see that the corrected estimates and fully Bayesian have greater uncertainty especially in the top left corner of the matrix. This corresponds to the smooth of , which contains no signal (as in the previous example). In the uncorrected case, we don’t take into account the uncertainty we have in the smoothing parameter, the model is certain that the term should be flat.
This article has highlighted the Bayesian interpretation of generalized additive models (specifically as implemented in mgcv), which are often thought of as a “frequentist only” method. These results and the connections between the two approaches have been known for some time (at least as far back as Whittle (1958)). Though the connections are old, recent work has shown that there are many aspects which are only just starting to be exploited by practitioners.
The random effects interpretation of the models described here is very useful, though there is considerable confusion over what “random effects” really mean (leading to at least one statistician (Gelman, 2005) to propose that the term be abolished). Hodges and Clayton 2011 characterize “old” and “new” random effects. The models discussed here are firmly in the “new” camp as, in their words: “the levels of the effect are not draws from a population because there is no population. The mathematical form of a random effect is used for convenience only” and that they are “formal devices to facilitate smoothing or shrinkage, interpreting those terms broadly”.
Computationally, the interpretation of REML estimation as either “classical” or Bayesian has been covered in the literature (Robinson, 1991), the equivalence of REML to marginalization of the fixed effects to give an “approximately Bayesian” estimate goes back to Harville 1974. Intuitively, REML samples from the prior implied by the penalty and evaluates the “average likelihood” given that draw—poor values of the smoothing parameter lead to curves too far from the data, either by being too smooth or not smooth enough (Wood et al., 2016, section 6.2.6).
Probably the closest relatives of the kind of models described above are those fitted using INLA (Rue et al., 2009; Lindgren et al., 2011). These are Bayesian additive models, restricted to the subset of models which are latent Gaussian. From Rue et al. (2009): “Latent Gaussian models are a subset of all Bayesian additive models with a structure additive predictor [(3
) above], namely those which assign a Gaussian prior to [their parameters] … the vector of hyperparameters which are not necessarily Gaussian”. This is exactly the case we have here. The major point of divergence between INLA and the GAMs specified here is that the structure of the prior variance-covariance matrices: those here do not lead to corresponding sparse precision/penalty matrices.
The MCMC approach starts to become more useful when smooth components are just one part of a complex model, especially those with large, hierarchical random effects structures (Wood, 2016). This ability to embed GAMs within larger models should facilitate some interesting developments for combining different data types for different models. For example, when building species distribution models of biological population, there are often many different data sources available, some include only whether the animal is present at a location, some if it is present or absent and others include a count of the animal and even an estimate of the probability that an animal is detected. One can imagine setting up a model for presence only data, which then gives a posterior estimate of the probability of presence, if the same spatial basis functions are used for a subsequent model built using the presence/absence data, the presence only parameter estimates can be used as a prior. The higher-quality data (in the sense of being more information rich, as well as potentially more reliable due to better field methodology) can be used to update the species distribution.
Bayesian interpretations of GAMs fitted “via frequentist methods” have been helpful to construct more reasonable estimates of uncertainty (including smoothing parameter uncertainty) and in order to understand how to construct confidence intervals that have good coverage properties (and what those properties really mean). These links can surely be used further to develop other new methodology and enhance our understanding of the models that we fit. In the author’s view, it is a shame that these conceptual links have not been better recognized and exploited further; even a very popular textbook (Ruppert et al., 2003) describes the mixed model representation of the GAM as a “convenient fiction” and consider the priors “reasonable (but not compelling).” Moving beyond mere computational convenience and harnessing the broader Bayesian framework implicit in this modelling strategy seems like a fertile ground for future work.
This work was funded by OPNAV N45 and the SURTASS LFA Settlement Agreement, and being managed by the U.S. Navy’s Living Marine Resources program under Contract No. N39430-17-C-1982. I would like to thank the following people for useful discussions/provocation: Jay ver Hoef, Steve Buckland, Len Thomas, David Borchers, Mark Bravington, Richard Glennie, Andrew Seaton, Daniel Simpson and Finn Lindgren.
An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach: Link between Gaussian Fields and Gaussian Markov Random Fields.Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498, Sept. 2011. ISSN 13697412. doi: 10.1111/j.1467-9868.2011.00777.x. URL http://doi.wiley.com/10.1111/j.1467-9868.2011.00777.x.
Gaussian Processes for Machine Learning. MIT Press, 2006.
Smoothing and interpolation by kriging and with splines.Journal of the International Association for Mathematical Geology, 16(6):601–615, Aug. 1984. ISSN 0020-5958, 1573-8868. doi: 10.1007/BF01029320. URL http://link.springer.com/10.1007/BF01029320.
On the Smoothing of Probability Density Functions.Journal of the Royal Statistical Society. Series B (Methodological), 20(2):334–343, 1958. URL http://www.jstor.org/stable/2983894.