Selective Inference for Additive and Linear Mixed Models

07/15/2020, by David Rügamer et al.


Abstract

This work addresses the problem of conducting valid inference for additive and linear mixed models after model selection. One possible solution to overcome overconfident inference results after model selection is selective inference, which constitutes a post-selection inference framework, yielding valid inference statements by conditioning on the selection event. We extend recent work on selective inference to the class of additive and linear mixed models for any type of model selection mechanism that can be reapplied to new data in a bootstrap-like manner. We investigate the properties of our proposal in simulation studies and apply the framework to a data set in monetary economics. Due to the generality of our proposed approach, it is particularly suitable for the given application for which the final additive mixed model is selected using a hierarchical selection procedure based on the conditional Akaike information criterion and involves varying data set sizes.

1 Introduction

In practice, model selection is often done prior to hypothesis testing, or tests themselves are used for model selection. Very often, inference results are used without adjusting for the preceding model selection, or iterative test procedures are based on uncorrected null distributions, thereby notably increasing the type I error (see, e.g., Fithian et al., 2014). For mixed models, model selection is done in a variety of ways, e.g., by using the conditional Akaike information criterion (cAIC, Greven and Kneib, 2010; Säfken et al., 2014, 2019) to select among different random effect structures, or based on tests (see, e.g., Kuznetsova et al., 2017). Any of these selection algorithms implies the necessity of adjusted inference post model selection to avoid overconfident inference results. In addition, automatic selection procedures that iteratively apply a statistical test for variable selection are based on an incorrect distributional assumption in all but the first step and may thus yield undesirable results, since the hypothesis tests of the different variables are not all affected to the same extent.

To overcome overconfident inference results after model selection, a variety of post-selection inference approaches exist. These adjust classical inference for the additional stochastic aspect of selecting the model and hence the tested hypothesis. Sparked by proposals such as Berk et al. (2013), many authors have addressed the problem of valid inference after model selection in different ways. Berk et al. (2013) establish an inference concept that is independent of the model selection procedure used and is often referred to as Post-Selection Inference (or PoSI for short). Another research direction in valid post-selection inference focuses on selective inference, a post-model selection framework that motivates a range of methods targeting specific selection algorithms, especially the Lasso (see, e.g., Fithian et al., 2014; Lee et al., 2016; Tibshirani et al., 2016). We focus here on the concept of selective inference. Selective inference allows for conducting valid inference conditional on a certain selection event by adjusting the distribution of widely used test statistics. Most of the proposed selective inference methods, however, only deal with the linear model.


Our contribution.

In this work we close an important gap in the application of linear mixed models (LMMs) and additive mixed models. In particular, we extend the selective inference framework to this class of models and show how to allow for any type of model selection mechanism that can be re-applied to new data. This allows us to develop valid inference post model selection for fixed and random effects as well as linear combinations of these effects in LMMs. Our work thereby contributes to and extends research on selective inference by a) incorporating random effects estimated with shrinkage, b) extending the ordinary least squares-type test statistics used in selective inference and providing results for post-selection inference based on marginal and conditional distributions in mixed models, c) proposing and evaluating different approaches to overcome the assumption of known error and random effect covariance matrices, d) transferring our approach to selective inference for additive (mixed) models and e) adopting recent ideas on Monte Carlo approximation to allow for inference in these models after any model selection criterion that can be stated as a deterministic function of the response. We additionally make available an R package selfmade (SELective inFerence for Mixed and Additive model Estimators) and all code for the simulations on GitHub (https://github.com/davidruegamer/selfmade).

In the following, we summarize the theory of selective inference for linear models in Section 2 and extend existing approaches for selective inference to mixed models in Section 3. We further derive a selective inference concept for additive models in Section 4 and present simulation results for the proposed methods in Section 5. We apply our results to world inflation data and compare them with existing approaches in Section 6, and summarize our concept in Section 7.

2 Selective Inference

We start by briefly summarizing the selective inference foundations in Loftus and Taylor (2015); Yang et al. (2016); Rügamer and Greven (2020) and describe different concepts for the linear model. We thereby introduce our notation, which is later extended to linear mixed models. Let $Y = (Y_1, \ldots, Y_n)^\top$ be a vector of $n$ independent random variables following $\mathcal{N}(\mu, \sigma^2 I_n)$ with mean vector $\mu$, variance $\sigma^2$, the $n$-dimensional identity matrix $I_n$ and $X$ a matrix of covariate columns $x_1, \ldots, x_p$. We are interested in estimating $\mu = E(Y)$, the conditional expectation of $Y$, which we model by a linear combination $X_{\mathcal{M}} \beta^{\mathcal{M}}$ of a subset $\mathcal{M} \in \mathcal{P}(\{1, \ldots, p\})$ of the data repository $X$, where $\mathcal{M}$ is a column indexing set and $\mathcal{P}(\cdot)$ denotes the power set of a set. One key aspect of selective inference is that no assumption on the true underlying mean structure is made. Instead, the true mean $\mu$ can have an arbitrary structure, can be potentially non-linear in observed covariates and may depend on unobserved covariates not contained in $X$. The inference goal after selection of some working structure $X_{\mathcal{M}}$ from the set of all potential effects based on $Y$ is to infer about $\beta^{\mathcal{M}} = (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top \mu$, the coefficients of the projection of $\mu$ onto the column space of $X_{\mathcal{M}}$. However, we would like to correct the resulting inference statements for the fact that we have selected the set of covariates $\mathcal{M}$ with some selection procedure $S$ depending on the data $Y$, i.e. $\mathcal{M} = S(y)$ for the observed realization $y$ of $Y$. In particular, we are usually interested in the effect of one specific direction of the projection, $\beta_j^{\mathcal{M}} = e_j^\top \beta^{\mathcal{M}}$, with $j$th unit vector $e_j$. We call this the $j$th selective direction effect, or SDE for short, in the following. SDEs and corresponding quantities are now discussed in more detail in the light of defined hypotheses and test statistics.
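As a minimal numerical illustration of this projection target (assuming the least squares form of $\beta^{\mathcal{M}}$ reconstructed above), the SDE for a selected design matrix can be computed directly; the toy data below are purely hypothetical:

```python
import numpy as np

# Hypothetical toy example: the "true" mean mu is non-linear in x1,
# but we project it onto the column space of the selected design X_M.
rng = np.random.default_rng(0)
n = 200
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
mu = np.sin(x1) + 0.5 * x2                       # arbitrary true mean structure
X_M = np.column_stack([np.ones(n), x1, x2])      # selected working design

beta_M, *_ = np.linalg.lstsq(X_M, mu, rcond=None)  # projection coefficients
sde_j = beta_M[1]                                # SDE for the column of x1
print(beta_M, sde_j)
```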

2.1 Test Statistic and Test Vector

We now define a general hypothesis and corresponding test statistic for either testing a single coefficient or testing a group of coefficients. When testing a group of coefficients we are interested in a projection $P_g \mu$ of $\mu$ onto a linear subspace, defined by a projection matrix $P_g$, and our null hypothesis can be stated as

$$ H_0: \|P_g \mu\|_2 = \delta_0, \qquad (1) $$

where $\delta_0$ is the value assumed under the null hypothesis and $\|\cdot\|_2$ is the Euclidean norm (Yang et al., 2016; Rügamer and Greven, 2020). We define the test statistic as

$$ T = \|P_g Y\|_2. \qquad (2) $$

We follow Loftus and Taylor (2015); Yang et al. (2016); Rügamer and Greven (2020) and define $P_g$ as the orthogonal projection onto the column space of $P^{\perp}_{\mathcal{M}\setminus g} X_g$, with $X_{\mathcal{M}\setminus g}$ denoting the selected design matrix without the column(s) corresponding to the $g$th selected (group) variable and $P^{\perp}_{\mathcal{M}\setminus g}$ corresponding to the projection onto the orthogonal complement of the subspace spanned by $X_{\mathcal{M}\setminus g}$. This corresponds to testing whether the $g$th group in $X_{\mathcal{M}}$ is correlated with $Y$ after correcting for all other covariates in the model $\mathcal{M}$.

When testing single coefficients, $X_g$ consists of a single column, $P_g Y$ is a vector and $T$ reduces to a scalar statistic. Alternatively, a signed and unscaled null hypothesis

$$ H_0: \beta_j^{\mathcal{M}} = \beta_0 \qquad (3) $$

and test statistic

$$ T = v^\top Y \qquad (4) $$

can be defined, where $v^\top Y$ corresponds to the $j$th selective direction effect in linear models, namely the $j$th component of the coefficient vector in the projection of $Y$ onto the column space of $X_{\mathcal{M}}$, with $v^\top = e_j^\top (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top$, which yields $E(v^\top Y) = \beta_j^{\mathcal{M}}$.

2.2 Inference Region

Given the direction(s) of interest $v$ or $P_g$, we are interested in the distribution of $v^\top Y$ or $\|P_g Y\|_2$ conditional on the selection event $\{S(Y) = \mathcal{M}\}$. For many selection procedures, the test statistic $T$ is restricted by the selection event to some set $\mathcal{T}$, i.e. $T$ has to lie in $\mathcal{T}$ conditional on $\{S(Y) = \mathcal{M}\}$. If $Y$ follows a Gaussian distribution and model selection is, e.g., performed by the Lasso, the restricted space can be described explicitly (see, e.g., Lee et al., 2016; Tibshirani et al., 2016): $Y$ turns out to lie in a polyhedron and $\mathcal{T}$ is an interval in this polyhedron. $T = v^\top Y$ thus follows a truncated normal distribution with support $\mathcal{T}$ when conditioning on the selection event. In other cases, the selection induces an inference region that cannot be characterized mathematically (Yang et al., 2016), or, while it can be characterized, calculating the resulting truncation for $T$ is not feasible (Rügamer and Greven, 2020). A possible solution for these cases is to numerically explore the region restricting the test statistic and its distribution using Monte Carlo sampling (Yang et al., 2016; Rügamer and Greven, 2020).
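For the analytically tractable case of a single truncation interval, a minimal sketch of the resulting truncated normal p-value might look as follows (a one-sided test against larger values is assumed, and the numerical stabilization a production implementation would need is omitted):

```python
import numpy as np
from scipy.stats import norm

def truncated_normal_pvalue(t_obs, mean0, sd, lower, upper):
    """One-sided p-value P(T >= t_obs | T in [lower, upper]) for a
    N(mean0, sd^2) test statistic truncated to [lower, upper]."""
    num = norm.cdf(upper, mean0, sd) - norm.cdf(t_obs, mean0, sd)
    den = norm.cdf(upper, mean0, sd) - norm.cdf(lower, mean0, sd)
    return num / den

# Illustrative call: observed statistic 2.1, null mean 0, sd 1, truncation [1.5, inf)
print(truncated_normal_pvalue(2.1, 0.0, 1.0, 1.5, np.inf))
```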

2.3 Monte Carlo Approximation

Monte Carlo approximation for selective inference has been proposed by several authors (see, e.g., Fithian et al., 2014; Yang et al., 2016; Rügamer and Greven, 2020). For the linear model with known variance $\sigma^2$, observe that we can decompose $Y$ as $Y = P_g Y + P_g^\perp Y$ with $P_g^\perp = I_n - P_g$ and $P_g Y = \|P_g Y\|_2 \, U$, where $U = P_g Y / \|P_g Y\|_2$ is the unit direction of the projection. Alternatively, for the unscaled version (4), we can decompose $Y$ as $Y = P_v Y + P_v^\perp Y$ with $P_v = v(v^\top v)^{-1} v^\top$ and $P_v^\perp = I_n - P_v$. When conditioning on the selection event $\{S(Y) = \mathcal{M}\}$, where $\mathcal{M} = S(y)$ for the given realization $y$, and when additionally conditioning on $P_g^\perp Y$, the contribution of the SDE can be quantified by the magnitude $\|P_g Y\|_2$ (Yang et al., 2016). When we additionally condition on $U$, the only variation of $Y$ left is in $T = \|P_g Y\|_2$. An important special case is given for $\delta_0 = 0$ in (1), i.e., when testing the significance of the SDE of a group. When conditioning on the selection event and $P_g^\perp Y$, we have $T$ restricted to $\mathcal{T}$. In this setting, Yang et al. (2016) allow for computation of p-values by deriving the (conditional) density of $T$ (conditional on the selection event, $P_g^\perp Y$ and $U$), and rewriting the p-value for the observed value $t_{obs}$ as a ratio of two expectations, which can be approximated numerically.

Along the same lines, Rügamer and Greven (2020) use this idea to calculate p-values and confidence intervals (CIs) for a normally distributed test statistic $T = v^\top Y$ with potentially multiple truncation limits. In this case, $T$ follows a normal distribution without restriction and, after selection, a truncated version of this normal distribution with the same moments but support $\mathcal{T}$. P-values and CIs can be calculated using

$$ \frac{E\big[\mathbb{1}\{T \geq t_{obs}\}\cdot\mathbb{1}\{T \in \mathcal{T}\}\big]}{E\big[\mathbb{1}\{T \in \mathcal{T}\}\big]}, \qquad (5) $$

corresponding to the selective p-value $P(T \geq t_{obs} \mid T \in \mathcal{T})$ for testing a certain null hypothesis or, by inverting the test, for CIs by searching for lower and upper interval bounds such that the corresponding selective survival probabilities equal $\alpha/2$ and $1-\alpha/2$, respectively.

By drawing samples $Y^{(b)}$, $b = 1, \ldots, B$, of $Y$ from the normal distribution (or samples of the test statistic from its unconditional distribution), we can check the congruency with the initial selection given by $\mathcal{M}$ by defining $Y^{(b)} = P_v^\perp y + v(v^\top v)^{-1} T^{(b)}$ (or the analogous reconstruction for the grouped effect test) and evaluating whether $S(Y^{(b)}) = \mathcal{M}$. Given enough samples, p-values can be approximated by replacing the expectations in (5) by sample averages. Since the survival function of $T$ is monotone in its mean, we can furthermore invert the hypothesis test to construct a selective confidence interval as proposed in Yang et al. (2016); Rügamer and Greven (2020). In cases where the observed test statistic lies in an area where the null distribution has little probability mass, empirical approximation of (5) might be difficult, or sampling might even yield only values outside the inference region of interest for a realistic number of samples. We therefore adapt and extend the approaches by Yang et al. (2016); Rügamer and Greven (2020) and use an importance sampler with a normal proposal distribution or a mixture of normal distributions with different locations.
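A minimal sketch of this sampling scheme for the single-direction statistic (4) could look as follows; the `selection` callback, its return value, the single-Gaussian proposal centered at the observed statistic, and all argument names are illustrative assumptions (the paper additionally uses mixture proposals, and the selfmade R package is the reference implementation):

```python
import numpy as np

def selective_pvalue_mc(y, v, selection, mu0, sigma2, n_samples=2000, rng=None):
    """Monte Carlo approximation of the selective p-value in (5):
    draw replicates of T = v'Y from a proposal distribution, map them back to
    response vectors that only vary in the direction v, keep replicates whose
    selection agrees with the original one, and compare them to the observed
    statistic. `selection` is any function that re-runs the model selection on
    a response vector and returns a hashable description of the selected model."""
    rng = np.random.default_rng(rng)
    v = np.asarray(v, dtype=float)
    t_obs = v @ y
    sd = np.sqrt(sigma2 * (v @ v))
    orig_model = selection(y)
    # component of y orthogonal to v (kept fixed across replicates)
    y_perp = y - v * (t_obs / (v @ v))
    # importance sampling: proposal centered at the observed statistic
    t_samp = rng.normal(loc=t_obs, scale=sd, size=n_samples)
    # importance weights: null density (mean mu0) over proposal density
    log_w = ((t_samp - t_obs) ** 2 - (t_samp - mu0) ** 2) / (2 * sd ** 2)
    w = np.exp(log_w - log_w.max())
    keep = np.array([selection(y_perp + v * (t / (v @ v))) == orig_model
                     for t in t_samp])
    num = np.sum(w * keep * (t_samp >= t_obs))
    den = np.sum(w * keep)
    return num / den if den > 0 else np.nan
```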

3 Selective Inference for Linear Mixed Models

For the analysis of longitudinal or clustered data, the linear mixed model is a natural choice. In addition to a wide range of linear mixed models (LMMs), any additive model incorporating a quadratic penalty can be framed as a mixed model (Wood, 2017), further extending the LMM model class. We use the following linear mixed model notation:

$$ Y = X\beta + Zu + \varepsilon \qquad (6) $$

with residual term $\varepsilon \sim \mathcal{N}(0, R)$ and random effects $u \sim \mathcal{N}(0, G)$, with corresponding covariance matrices $R$ and $G$, respectively. In the following, this setup is referred to as the working model as we do not necessarily assume the true data generating process to be of the form (6).

When adapting the principles of selective inference for LMMs, some of the prerequisites for the framework proposed by Lee et al. (2016); Yang et al. (2016); Rügamer and Greven (2018) remain the same, whereas several key aspects change and thus extensions of the previous works are required. First and foremost, LMMs come with different distributional viewpoints. Inference and model selection can be conducted based either on the marginal distribution

$$ Y \sim \mathcal{N}(X\beta, V) \qquad (7) $$

with $V = ZGZ^\top + R$, or based on the conditional distribution $Y \mid u \sim \mathcal{N}(X\beta + Zu, R)$. We here present two different ways to conduct selective inference for LMMs based on the marginal and on the conditional perspective.

For both perspectives and the corresponding proposed inference concepts, we first discuss the setup, defining assumptions as well as inference goals, and then determine the distribution and test statistic to be used for inference. In contrast to most of the literature on selective inference, neither approach is restricted to specific model selection criteria; both only require the model selection procedure to be repeatable such that it can be applied to newly created data in a bootstrap-like fashion. Our approach extends proposals in Yang et al. (2016); Rügamer and Greven (2020) for linear models. For the marginal perspective the focus lies on the selection of fixed effects, while the conditional perspective can also be used if random effect selection is of interest, e.g., when using the conditional Akaike information criterion (cAIC; see, e.g., Greven and Kneib, 2010) as in our accompanying application, or when the mixed model is used to determine the structure of the marginal covariance matrix.

3.1 Marginal Perspective

We first discuss the marginal perspective, which exhibits strong connections to selective inference concepts proposed for linear models (LMs). When the question of interest, and the focus of model selection, are the fixed effects of the LMM, the marginal distribution of $Y$ is typically used to conduct inference. We can then utilize the framework of Lee et al. (2016) and others, which provides selective inference statements in LMs for a normally distributed response with a potentially non-diagonal variance-covariance matrix $V$.

3.1.1 Setup, Assumptions and Inference Goal

We first assume a known covariance structure given by a corresponding mixed model with known random effects covariance $G$, random effects structure $Z$ and error covariance $R$. We will discuss unknown covariance structures in Section 3.3. Overall, assume that

$$ Y \sim \mathcal{N}(\mu, V = ZGZ^\top + R), \qquad (8) $$

which is implied by (7) as well as the working model as defined in (6). $\mu$ is allowed to have a flexible structure, potentially incorporating effects of unobserved covariates or non-linear effects of observed covariates. After selection of a linear fixed effects (working) structure $X_{\mathcal{M}}$ from the set of all potential fixed effects based on $Y$, our goal is to infer about the $j$th selective direction effect given the fixed covariance structure $V$:

$$ \beta_j^{\mathcal{M}} = e_j^\top (X_{\mathcal{M}}^\top V^{-1} X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top V^{-1} \mu. \qquad (9) $$

Alternatively, we may want to infer about a group of coefficients as introduced in Section 2.

3.1.2 Null Distribution and Test Statistic

In line with the previous works on selective inference in LMs, we first assume that $V$ is known. We are still interested in testing $H_0: \beta_j^{\mathcal{M}} = \beta_0$, but in addition to the test statistic $T_1 = v_1^\top Y$ with $v_1^\top = e_j^\top (X_{\mathcal{M}}^\top X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top$ used in LMs (see, e.g., Rügamer and Greven, 2018), with null distribution $\mathcal{N}(\beta_0, v_1^\top V v_1)$ in the setting of (8), we propose another test statistic $T_2 = v_2^\top Y$ with $v_2^\top = e_j^\top (X_{\mathcal{M}}^\top V^{-1} X_{\mathcal{M}})^{-1} X_{\mathcal{M}}^\top V^{-1}$ and null distribution $\mathcal{N}(\beta_0, v_2^\top V v_2)$. $T_2$ corresponds to the more efficient generalized least squares-type estimator in the presence of a non-diagonal variance-covariance matrix $V$ for $Y$. Like $T_1$, the test statistic $T_2$ has the desired property $E(T_2) = \beta_j^{\mathcal{M}}$ in the case in which the true mean $\mu$ is a linear combination of $X_{\mathcal{M}}$. Similarly to before, using this test statistic we can also decompose $Y$ into a component in the direction of $v_2$ and an orthogonal complement, $Y = P_{v_2} Y + (I_n - P_{v_2})Y$ with $P_{v_2} = v_2(v_2^\top v_2)^{-1} v_2^\top$, such that only the first component varies with $T_2$. This allows us to create new test statistics $T_2^{(b)}$ and corresponding response values $Y^{(b)}$ for the observed $y$ that only vary in the direction of interest defined by $v_2$. In contrast to the test vector proposed for LMs, $v_2$ in this case depends on $V$.
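For concreteness, a small numerical sketch of the two test vectors (using the OLS- and GLS-type forms reconstructed above, which may differ from the paper's exact notation) and their null variances under (8):

```python
import numpy as np

def marginal_test_vectors(X_M, V, j):
    """Test vectors for the j-th fixed effect under the marginal perspective:
    an OLS-type vector v1 and a GLS-type vector v2 (forms as reconstructed
    above; the paper's exact notation may differ). Returns each vector
    together with its null variance v' V v."""
    e_j = np.zeros(X_M.shape[1])
    e_j[j] = 1.0
    v1 = X_M @ np.linalg.solve(X_M.T @ X_M, e_j)        # OLS-type direction
    Vinv_X = np.linalg.solve(V, X_M)                    # V^{-1} X_M
    v2 = Vinv_X @ np.linalg.solve(X_M.T @ Vinv_X, e_j)  # GLS-type direction
    return (v1, v1 @ V @ v1), (v2, v2 @ V @ v2)
```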

3.2 Conditional Perspective

We now turn to the conditional linear mixed model perspective, where some assumptions can be relaxed, and which additionally allows for inference for additive models as presented in Section 4.

3.2.1 Setup, Assumptions and Inference Goal

Assume that the working model errors are Gaussian, and first relax the assumption of having a fixed random effect structure as well as a known random effects covariance structure that was used for the marginal perspective. We will use a working mixed model to model the (conditional) expectation $\mu = E(Y \mid u)$, where $\mu$ may or may not incorporate random effects. If $\mu$ is of the form assumed in (6) in Section 3.1.1, we can also derive a corresponding marginal model. In the case where $\mu$ incorporates random effects, the conditional approach is more suitable if the random effect structure is not known beforehand and, in particular, if the random effects are part of the model selection. We denote the set of selected random effects out of all potential random effect candidates by $Z_{\mathcal{S}}$ with corresponding index set $\mathcal{S}$. Let the goal of the analysis be to infer about a selective direction effect $\beta_j^{\mathcal{M}}$ or $u_k^{\mathcal{S}}$ of the true mean $\mu$ in the selected model, defined by the projection of $\mu$ onto the column space of the selected design $C = (X_{\mathcal{M}}, Z_{\mathcal{S}})$. To test some null hypothesis on selected fixed effects $\beta_j^{\mathcal{M}}$ or on selected random effects $u_k^{\mathcal{S}}$, we use the conditional distribution $Y \mid u$, the $j$th entry in $\beta^{\mathcal{M}}$ or analogously the $k$th entry in $u^{\mathcal{S}}$, and define the selection procedure to return a tuple of set indices $(\mathcal{M}, \mathcal{S})$ for fixed and random effects. Alternatively, we may want to test a group of coefficients or a certain linear combination

$$ H_0: \xi^\top (\beta^{\mathcal{M}\top}, u^{\mathcal{S}\top})^\top = \vartheta_0 \qquad (10) $$

for some row vector $\xi^\top$ and value $\vartheta_0$. In this case, we denote the corresponding effect by $\vartheta = \xi^\top (\beta^{\mathcal{M}\top}, u^{\mathcal{S}\top})^\top$. For simplicity, we state the proposal using $\beta_j^{\mathcal{M}}$ in the following, but it can be equally applied to the more general linear combination case, replacing $e_j^\top$ by $\xi^\top$.

3.2.2 Null Distribution and Test Statistic

We here relax the assumption of a known $V$, or known $G$ and $R$, and only assume that $R$ is known. This allows us to conduct inference also in the case where the linear predictor or the random effect structure of the mixed model is misspecified, whereas in the marginal setup, the assumption of a known $V$ requires the corresponding random effect structure to be correct. Starting from a working linear mixed model with known covariance $R$ of $\varepsilon$ and (working) covariance $\tilde{G}$ of the (working) random effects $u$, the fixed and random effects can be predicted using

$$ (\hat\beta^\top, \hat u^\top)^\top = (C^\top R^{-1} C + A)^{-1} C^\top R^{-1} Y, \quad C = (X_{\mathcal{M}}, Z_{\mathcal{S}}), \quad A = \mathrm{blockdiag}(0, \tilde{G}^{-1}), $$

the penalized least squares form of the mixed model equations. In contrast to the linear model and the marginal perspective in the previous subsection, we here distinguish between the direction $e_j$ defining the $j$th SDE and the test vector $v$ (or projection $P_g$) that is used to test the null hypothesis $H_0: \beta_j^{\mathcal{M}} = \beta_0$. The test statistic is $T = v^\top Y$ with $v^\top = e_j^\top (C^\top R^{-1} C + A)^{-1} C^\top R^{-1}$, and the conditional distribution of $T$ (without selection) is normal with mean $v^\top \mu$ and variance

$$ v^\top R\, v. \qquad (11) $$

For the null hypothesis $H_0: \beta_j^{\mathcal{M}} = \beta_0$, the null distribution of $T$ in general is $\mathcal{N}(\beta_0 + b_j, v^\top R v)$ with a bias term $b_j \neq 0$ if $A \neq 0$. That is, the test statistic is not unbiased for the respective SDE due to the shrinkage induced by the working covariance $\tilde{G}$. Since $R$ is assumed to be known for the moment (but cf. Section 3.3) and we use a known working covariance $\tilde{G}$ for $u$ in $v$, we can compute $T$ and derive its conditional distribution given $(I_n - P_v)Y$ and the selection event. The choice of $\tilde{G}$ affects the amount of shrinkage we use in the test statistic (and potentially the power of the test). In practice, we use the estimated covariance matrix $\hat{G}$. If we set $A = 0$, the test vector and the vector defining the SDE coincide. We compare both options ($A$ based on the working covariance vs. $A = 0$) in practice in Section 5.

As (11) does not account for the shrinkage bias in general, we suggest the use of a Bayesian covariance, which yields better coverage when used in confidence intervals (Nychka, 1988; Marra and Wood, 2012) and is thus also used in frequentist inference. We use the Bayesian covariance

$$ \Sigma_{\mathrm{Bayes}} = (C^\top R^{-1} C + A)^{-1} \qquad (12) $$

and allow to replace (11) by the corresponding entry $e_j^\top \Sigma_{\mathrm{Bayes}} e_j$ in our inference framework. Alternatively, when using the covariance (11), we rely on an asymptotic argument of the shrinkage effect decreasing with the sample size. We investigate the efficacy and impact of our approach in the simulation section.

This approach allows us to conduct inference in a similar manner as for the linear model. In particular, if the restriction of the space of $Y$ induced by the selection procedure can be described by affine inequalities as in Loftus and Taylor (2015), a corresponding truncated normal distribution for $T$ can be derived and inference can be conducted based on this distribution (see, e.g., Rügamer and Greven, 2018). Given the working random effect covariance $\tilde{G}$ and random effect structure $Z_{\mathcal{S}}$, the test vector $v$ is fixed and we can produce samples in the direction of $v$ in the same way as for the marginal perspective. We therefore define $P_v = v(v^\top v)^{-1} v^\top$ and decompose $Y$ into $P_v Y$ in the direction of $v$ and an orthogonal complement $(I_n - P_v)Y$. We then generate samples by drawing from a proposal distribution, checking for congruency with the original selection and reweighting the samples using importance weights to approximate the expectations in formula (5).

3.3 Dealing with Unknown Error Covariance

As a known error covariance structure is not a realistic assumption in practice, Rügamer and Greven (2018) provided results on the effect of plugging in different estimates for the true variance in LMs, yielding approximately valid inference results in almost all investigated scenarios. These findings, however, assume a diagonal error covariance matrix. We therefore extend their work and investigate conservative estimators as plug-in solutions for both proposed approaches in linear mixed models.

For the marginal approach our proposal is to use $\hat{V}_0$, a conservative estimator for $V$, which assumes that the random effect structure is fixed, i.e., not part of the model selection, and that the random effects covariance as well as the error variances are unknown but estimated. $\hat{V}_0$ is given by the variance-covariance estimator of the intercept model with the given random effects structure, which leaves as much variance in the response as possible unexplained by the mean model. This estimator for $V$ is also used in the newly proposed test statistic $T_2$, which depends on $V$. An alternative approach in the case of grouped data with balanced designs, which we do not pursue here, could estimate $V$ as a block-diagonal matrix with unstructured covariance blocks from an intercept-only model.

For the conditional approach we again note that the random effects structure is part of the working model and we use the estimated random effects covariance $\hat{G}$. We can therefore build on the results by Rügamer and Greven (2018), plugging in a variance estimator for the residual variance, and investigate the effect in the simulation studies. When using the Bayesian covariance (12), we replace the covariance entering $A$ with the corresponding working covariance from the definition of the test vector $v$.

4 Additive Models

As additive models can be estimated using a mixed model representation, our proposed approach also provides ways to conduct selective inference for additive (mixed) models when using the conditional perspective. For illustrative purposes assume that the model selection results in a simple additive (working) model of the form

$$ Y = f(x) + \varepsilon $$

with $\varepsilon \sim \mathcal{N}(0, D)$, diagonal matrix $D$, covariate $x = (x_1, \ldots, x_n)^\top$ and $f(x)$ denoting component-wise evaluation of $f$ on $x$. Inference statements for more complex models with additional linear, non-linear or random terms can be derived in the same manner by extending the following design matrices. We assume $D = \sigma^2 I_n$ with the $n$-dimensional identity matrix $I_n$. The non-linear function $f$ can be approximated using a spline basis-function representation with $k$ basis functions $B_1, \ldots, B_k$; for a given value $x_0$, this yields the design vector $b(x_0) = (B_1(x_0), \ldots, B_k(x_0))^\top$ and spline approximation $f(x_0) \approx b(x_0)^\top \delta$. The design matrix $B$ contains rows $b(x_i)^\top$, $i = 1, \ldots, n$. The coefficients $\delta$ can be estimated using least squares or, alternatively, using penalized least squares

$$ \hat{\delta} = \arg\min_{\delta}\; \|Y - B\delta\|_2^2 + \lambda\, \delta^\top P \delta \qquad (13) $$

to induce smoothness of the estimated function using some penalty matrix $P$ and smoothing parameter $\lambda$. This is commonly done by rewriting (13) (after a re-parameterization yielding $B\delta = X\beta + Zu$) as the estimation problem of a linear mixed model (6) with unpenalized part $X\beta$, penalized part $Zu$, and a random effects variance proportional to $\sigma^2/\lambda$ (for more details, see Ruppert et al., 2003).
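As an illustration of (13), a self-contained sketch with a simple truncated-power basis and a ridge-type penalty (a stand-in for the P-spline basis and difference penalty actually used in the paper):

```python
import numpy as np

def fit_penalized_spline(x, y, knots, lam):
    """Minimal sketch of the penalized least-squares fit in (13) with a
    truncated-power cubic basis and a ridge-type penalty on the truncated
    terms (a simple stand-in for the P-spline basis/penalty). Returns the
    basis matrix and the coefficient vector delta_hat."""
    B = np.column_stack([np.ones_like(x), x, x**2, x**3] +
                        [np.clip(x - k, 0, None) ** 3 for k in knots])
    P = np.diag([0.0] * 4 + [1.0] * len(knots))   # penalize only truncated terms
    delta_hat = np.linalg.solve(B.T @ B + lam * P, B.T @ y)
    return B, delta_hat

# usage sketch on simulated data
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-2, 2, 200))
y = np.sin(2 * x) + rng.normal(0, 0.3, 200)
B, d = fit_penalized_spline(x, y, knots=np.linspace(-1.5, 1.5, 10), lam=1.0)
f_hat = B @ d
```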

Model selection in this case is more appropriate when considered from a conditional perspective, as $f$ is represented in this approach using both fixed and random effects. The conditional distribution of $Y$ keeps $f$ fixed across observations in the mean structure and treats the random effects in the linear mixed model as just a mathematical tool (working model) to estimate $f$ with regularization (Greven and Kneib, 2010). We therefore follow the conditional inference approach as discussed in Section 3.2 for additive models. An estimate of $f$ for a certain value $x_0$ is given by $\hat{f}(x_0) = b(x_0)^\top \hat{\delta}$. A test vector can thus be defined as

$$ v_g^\top = b(x_0)^\top (C^\top C + A)^{-1} C^\top \qquad (14) $$

with $C$ the design matrix of the working model and $b(x_0)$ the corresponding vector of basis evaluations at $x_0$ (extended by zeros for all other model terms). Compared to Section 3.2, this replaces the $j$th unit vector $e_j$ with the linear combination inducing vector $b(x_0)$. As before, we use $T = v_g^\top Y$ as our test statistic. In the case of several non-linear functions $f_1, \ldots, f_q$, $A$ is defined as a block-diagonal matrix with blocks $\lambda_l P_l$, with smoothing parameter $\lambda_l$ and penalty matrix $P_l$ corresponding to the $l$th term $f_l$.

For point-wise confidence bands for non-linear functions $f_l$, we investigate two possible options. The first idea conditions on any random effects in the working model and uses the conditional variance $v_g^\top R\, v_g$ of the test statistic $T$, analogous to the linear case. An alternative way to construct confidence intervals uses the Bayesian covariance matrix

$$ \Sigma_{\mathrm{Bayes}} = \sigma^2 (C^\top C + A)^{-1} \qquad (15) $$

to correct for the shrinkage effect, which leads to biased estimates even if the model is correct. As for the linear (mixed) model, we use a Monte Carlo approximation for the p-value or confidence interval by generating samples from the null distribution (or proposal distribution when using importance sampling) with either the frequentist or the Bayesian covariance definition. We therefore define $Y^{(b)}$ as in Section 3.2 using $v_g$ instead of $v$, and check for congruency of the selection procedure on the new sample, $S(Y^{(b)})$, with the original selection.
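Continuing the sketch above, the pointwise test vector in (14) can be formed directly from the basis evaluation at the location of interest (again under the reconstructed forms, not necessarily the paper's exact notation):

```python
import numpy as np

def pointwise_test_vector(x0, C, A, basis):
    """Sketch of the test vector in (14) for one location x0:
    v_g' = b(x0)' (C'C + A)^{-1} C', where `basis` maps a scalar to the row of
    basis evaluations matching the columns of C and A is the block-diagonal
    penalty/shrinkage matrix. All names are illustrative."""
    b0 = basis(x0)
    v_g = C @ np.linalg.solve(C.T @ C + A, b0)   # v_g' y equals b(x0)' delta_hat
    return v_g

# hypothetical reuse of B, knots, lam from the previous sketch:
# basis = lambda x0: np.concatenate(([1.0, x0, x0**2, x0**3],
#                                    np.clip(x0 - knots, 0, None) ** 3))
# v = pointwise_test_vector(0.5, B, lam * np.diag([0]*4 + [1]*len(knots)), basis)
# f_hat_at_x0 = v @ y
```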

5 Simulation

We evaluate the proposed frameworks for linear mixed models and additive models in two simulation setups. For additive models we investigate the selective distribution of several fixed SDEs after selection via the conditional AIC (see, e.g., Greven and Kneib, 2010), whereas selection in the linear mixed model setup, under either the conditional or the marginal perspective, is performed using backward model selection based on significance tests using the package lmerTest (Kuznetsova et al., 2017).

5.1 Mixed Model

We first evaluate the proposed frameworks for linear mixed models, where the selection mechanism is a successive model reduction of fixed effects based on p-values using the package lmerTest (Kuznetsova et al., 2017), keeping the random effects structure fixed (random effects selection will be a focus in 5.2). The true data generating model is a linear mixed model with fixed effects for two signal covariates, random effects, three additional noise variables and residual errors, where the error variance is defined via the signal-to-noise ratio (SNR) with respect to the linear predictor vector of all observations in the data generating process. All covariates are drawn independently from a standard normal distribution, yielding a maximum empirical correlation of around 0.17. To investigate the impact of using the true (co-)variance or a plug-in estimator, we compare the true (co-)variance ("Truth"), the estimated (co-)variance of the chosen model ("Model Estimate"), the estimated residual variance in the marginal covariance of the corresponding intercept model with fixed random effect structure ("ICM") and the asymptotically conservative estimate by Tibshirani et al. (2018) ("Var(Y)"). As we keep the random effects fixed in our first simulation, we also investigate the ICM plug-in for the conditional perspective, where we use a model whose linear predictor consists of an intercept and the given random effects. Note that the derivation by Tibshirani et al. (2018) for the Var(Y) plug-in assumes homoscedastic, uncorrelated errors, which in our simulations only holds for the conditional perspective. We use as many simulation iterations as necessary to obtain a sufficient number of data sets with selection of the correct model or a supermodel thereof. We base p-values for each data set on a fixed number of importance samples.
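A purely illustrative sketch of such a data-generating process (group structure, effect sizes and variance components are placeholders, not the settings used in the paper, and the SNR definition follows the reconstruction above):

```python
import numpy as np

def simulate_lmm_data(n_groups=20, n_per_group=10, snr=4.0, seed=None):
    """Illustrative data-generating sketch for the mixed-model simulation:
    two signal covariates, three noise covariates, a random intercept and a
    random slope per group, and Gaussian errors whose variance is set via the
    signal-to-noise ratio. All numerical values are placeholders."""
    rng = np.random.default_rng(seed)
    n = n_groups * n_per_group
    group = np.repeat(np.arange(n_groups), n_per_group)
    X = rng.standard_normal((n, 5))              # columns 0-1 signal, 2-4 noise
    beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0])  # placeholder effect sizes
    u0 = rng.normal(0, 1.0, n_groups)            # random intercepts
    u1 = rng.normal(0, 0.5, n_groups)            # random slopes for X[:, 0]
    eta = X @ beta + u0[group] + u1[group] * X[:, 0]
    sigma = np.sqrt(np.var(eta) / snr)           # error sd from the SNR definition
    y = eta + rng.normal(0, sigma, n)
    return y, X, group
```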

In Figure 1 the observed pooled p-values for all noise variables (top row) and the two signal variables (bottom rows) for both the conditional and the marginal perspective (columns) are plotted against expected quantiles of the standard uniform distribution and compared to the naive p-values. Results indicate that naive p-values exhibit non-uniformity under the null. By contrast, selective p-values show similarly high power for the signal variables in all settings and uniformity under the null. The comparison of variance estimates indicates that while the true variance-covariance matrix yields well-calibrated inference, all estimates yield conservative inference. The inference for the model estimate is similar to that for the truth or even conservative here, while the conservative estimates "ICM" and "Var(Y)" are even more so. This finding encourages the use of the model estimate as a plug-in for the true variance. We also observe that there is no notable difference between using the marginal or the conditional perspective in the given settings.

Figure 1: Quantiles of the standard uniform distribution versus the observed p-values for noise variables (top row, pooled across the noise variables) as well as the two signal variables (bottom rows), each for different SNRs as well as for both mixed model perspectives (columns) and different variance estimates used in calculating the p-values (colours). To ensure that the null hypothesis is correct for the noise variables, data points in this plot are based on iterations where the true model was selected or a model in which the true model is nested. Solid lines represent selective p-values, dashed lines the corresponding naive p-values not accounting for selection.

For the same simulation setting, we additionally investigate the influence of the choice of the matrix $A$ for the conditional perspective discussed in Section 3.2.2. We use the same data generating process as before, but restrict the simulations to the model estimate plug-in for the variance and an SNR of 4. We compare the proposed working covariance against a zero matrix to assess the influence of shrinkage on the selective p-values. The corresponding results are given in Appendix 8.1. Results show uniformity under the null for both versions with little difference in their distributions for both signal and noise variables. We also did not observe a notable difference in power between the two approaches when reducing the covariate effects in the setting with a smaller SNR.

5.2 Additive Model Selection

We also evaluate our approach for additive models using the cAIC as selection criterion to select among different additive regression models. The true data generating model includes non-linear effects of two signal variables and no effects of two further noise variables. Half of the observations of the two signal variables as well as of the two noise variables are independently drawn from a standard normal distribution; we then add, for each of these, a corresponding observation with the negated value, so that the covariates sum to zero by construction. We thereby ensure that covariates with non-linear effects are not affected by sum-to-zero constraints and thus fix the locations where the functions cross zero. We select among five different regression models, where either all covariates are assumed to have a linear effect or one of four different subsets of the covariates is estimated as having non-linear effects. In each of the 500 simulation iterations, these five models are compared using the cAIC. For selected covariates chosen as having a non-linear effect, point-wise p-values for the function value are calculated at two specific locations per signal variable, one at which the respective function crosses zero (so that the null hypothesis is true) and one at which it does not. If a selected covariate is modeled as having a linear effect, the corresponding effect is tested against zero; this is done for the two noise variables, for which the null hypothesis is true. To investigate the impact of using different variances for sampling, we compare the usage of the true variance ("True"), the estimated variance of the chosen model ("Model Estimate") and the conservative estimate ("Var(Y)"). For all combinations, we also investigate the difference between using the Bayesian covariance as in (15) and the classical covariance.

In Figure 2 the observed p-values for the two noise variables combined (first row) and the signal variables at the two pre-defined locations (last four rows) are plotted against expected quantiles of the standard uniform distribution. Results indicate that all settings with the true or the estimated variance and the non-Bayesian covariance definition reveal uniform p-values under the null, both for the noise variables and at locations where the non-linear functions cross zero. The difference in power due to the error standard deviation becomes apparent when comparing rows two and four. Using the Bayesian covariance definition, p-values are uniform for the noise variables and tend to be conservative where the non-linear functions are truly zero, while not yielding larger power for the estimated or true variance. These results encourage the use of the classical covariance definition as well as the estimated variance as plug-in estimator. When using the conservative estimator Var(Y), results are only notably affected for the classical covariance definition, yielding more conservative results in all simulation settings.

Figure 2: Quantiles of the standard uniform distribution versus the observed p-values for noise variables (top row, pooled over both noise variables) as well as the two locations for the two signal variables (bottom four rows), each for different error standard deviations (sd, different linetypes) and either the Bayesian or classical covariance definition (colours). The usage of different estimates for the error variance is visualized in different columns.

6 Application to the Determinants of Inflation

We now apply our proposed framework to the field of monetary economics to analyse the determinants of inflation in a large country sample.

6.1 Introduction

The analysis of country-specific worldwide inflation rates (defined as the percentage changes in the consumer price index) has been the subject of many empirical investigations (see, e.g., Catão and Terrones, 2005; Calderón and Hebbel, 2008). Low and stable inflation rates are by now the established (main) goal of monetary policymakers around the world. In order to achieve this objective, a good understanding of the underlying inflation process is crucial for the effectiveness and efficiency of monetary policy. Economic theory proposes a variety of potential drivers of inflation. However, it remains an open question which economic theory and which corresponding variables provide the most persuasive explanation from an empirical point of view. This question has been addressed by Baumann et al. (2020) utilizing an additive mixed model (AMM) approach with an extensive two-stage model selection procedure based on the cAIC. First, Baumann et al. (2020) identified eight economic theories that are discussed in monetary economics as explanations of inflation (cf. Appendix 8.2 for a short overview). For each of these theories the corresponding variables could not be assigned unambiguously, because some uncertainty remains about the empirical variables that best approximate the variables motivated by economic theory. For example, the first theory comprises the Output Gap (%) variable and the Real GDP Growth (%) variable (cf. Table 3 in Appendix 8.3), where economic theory does not give a clear answer on which of these two variables represents the theory best from an empirical point of view. Consequently, several different sets of empirical variables are assigned to each economic theory. These compilations of variables in the various AMMs are purely based on economic theory. The model selection procedure of Baumann et al. (2020), described in Section 6.3, is rather atypical due to missing values of certain predictors, which prohibits a direct comparison of all AMMs at the same stage. The resulting use of different subsets of the data in the course of a hierarchical selection procedure makes it infeasible to provide an analytic form for the selection condition restricting the inference space of $Y$. Since the selection procedure is repeatable in a bootstrap-like manner, our proposed Monte Carlo based framework can handle this unusual selection procedure and is thus particularly suitable to derive inference statements for Baumann et al. (2020).

6.2 Data and Model Setup

The setup of Baumann et al. (2020) is as follows: 27 (metric and categorical) predictors and the dependent variable inflation (in percent), denoted by $Y$, are given for 124 countries and for consecutive years from 1997 to 2015. The response has been transformed with the natural logarithm after shifting its support to positive values to avoid numerical instabilities. Due to missing information for some variables, part of the data is generated by imputation.

Baumann et al. (2020) computed their results on the first of five imputations due to the lack of theoretical underpinning for averaging random effect predictions across multiple imputations, but checked for robustness and found stable results with respect to the selection event for all five imputations. Future research might strive to incorporate imputation uncertainty into selective inference statements.

The generic AMM used to explain $Y$ by a set of predictors is given in Equation (16). Each of the eight economic theories is represented by a set of candidate predictor compilations. Each compilation is composed of disjoint subsets of predictors with linear and non-linear effects, respectively, as well as pairs of variables with linear interaction effects and pairs of variables with non-linear interaction effects. Non-linear effects of predictors are estimated by univariate cubic P-splines (Eilers and Marx, 1996) with second-order difference penalties. Non-linear interaction effects of pairs of variables are modeled using penalized bivariate tensor-product splines. The assignment of variables to these four subsets can be found in Tables 3 and 4 in Appendix 8.3. Each model corresponding to one predictor compilation is of the following form:

$$ Y_{ct} = \beta_0 + \sum_{j \in \mathcal{L}} \beta_j x_{ctj} + \sum_{j \in \mathcal{N}} f_j(x_{ctj}) + \sum_{(j,j') \in \mathcal{I}_{lin}} \beta_{jj'}\, x_{ctj} x_{ctj'} + \sum_{(j,j') \in \mathcal{I}_{non}} f_{jj'}(x_{ctj}, x_{ctj'}) + u_{0c} + u_{1c}\, z_{ct} + \varepsilon_{ct} \qquad (16) $$

for country $c$ and year $t$, where a random intercept $u_{0c}$ and a random slope $u_{1c}$ with design vector $z = (z_{ct})$ and non-diagonal covariance are (always) included to capture the serial within-country correlation. Further, $\varepsilon \sim \mathcal{N}(0, D)$ is assumed, where $D$ is a diagonal matrix with potentially heterogeneous country-specific variances on its diagonal. In total, there are 90 such models across all predictor compilations comprised by the eight economic theories. For each theory there is one set of models which includes all corresponding candidate models.

6.3 Model Selection Procedure

The model selection procedure of Baumann et al. (2020) is as follows: at a first stage, a winner model with the lowest cAIC among the candidate models of each theory is selected per theory. At a second stage, these theory-specific winners are collected in a common set. Some predictors associated with three of the theories are not imputed, as these predictors are restricted in availability across time and/or countries, which makes a direct model comparison by means of the likelihood, and thus the cAIC, inadmissible. As a result, theory-specific winners whose predictor sets are only available for a subsample of the data are set aside to be compared later to the AMM with the lowest cAIC among the remaining winners. This second-stage winner has the lowest cAIC in the set of fully comparable models, and its cAIC is finally compared to each of the set-aside models on the corresponding data subsets to yield the overall winner. If the computation of any AMM on any subset of the data fails, this AMM is assigned the highest cAIC in the given comparison; this can happen in particular for complex models on smaller subsets of the data. First- and second-stage selection together constitute the selection mechanism. The overall winner represents the model with the highest empirical relevance for the application.
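In code, the overall structure of this hierarchical selection can be sketched as follows; all function and argument names are hypothetical, and the actual implementation (based on mgcv and cAIC4 in R) differs:

```python
def two_stage_caic_selection(candidates, fit_caic, full_data, restricted_data):
    """Sketch of the hierarchical cAIC selection described above (a hedged
    reconstruction, not the authors' implementation). `candidates` maps each
    theory to its list of candidate model specifications, `fit_caic(spec, data)`
    returns the cAIC of a fitted model (float('inf') if the fit fails), and
    `restricted_data` maps specifications that are only available on a data
    subsample to that subsample; all other specifications use `full_data`."""
    def data_for(spec):
        return restricted_data.get(spec, full_data)

    # stage 1: per theory, keep the model with the lowest cAIC
    winners = [min(specs, key=lambda s: fit_caic(s, data_for(s)))
               for specs in candidates.values()]

    # stage 2: winners estimable on the full data are compared directly ...
    comparable = [s for s in winners if s not in restricted_data]
    champion = min(comparable, key=lambda s: fit_caic(s, full_data))

    # ... and the remaining winners are compared to the current champion
    # pairwise, each comparison on the corresponding data subset
    for s in (w for w in winners if w in restricted_data):
        subset = restricted_data[s]
        if fit_caic(s, subset) < fit_caic(champion, subset):
            champion = s
    return champion
```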

Based on the described model selection procedure, Baumann et al. (2020) obtained, among other results, three theory-specific winner models that we consider in the following. Note that one of these models is estimated on a subsample of 80 countries and the other two on the full sample of 124 countries. We now provide inference statements for partial effect estimates within these three models, taking into account the model selection uncertainty from the first and the second stage. The inference statements condition on model selection in both stages to account for uncertainty in the empirical variables capturing each economic theory best.

6.4 Results

Figure 3: The left panel shows the partial effect estimate of Real GDP per capita (USD) and the right panel the partial effect estimate of Trade Openness (% of GDP). The standard pointwise 95% Bayesian confidence intervals are shaded and the pointwise selective 95% Bayesian confidence intervals are shown as dots, connected using a LOESS estimate indicated by solid lines.
Results for the first selected model.

We pick 12 locations on the estimated spline function that represents the partial effect estimate of the Real GDP per capita (USD) variable, labeled GDP pc (USD), on log-inflation. We provide 95% (point-wise) confidence intervals for each projection based on our proposed selective inference procedure. We compute the corresponding test vector, draw test statistics from a mixture of Gaussian distributions as proposal distribution in the importance sampling and compute inference results as described in Section 2.3. The mixture distribution with different expectations as proposal distribution is based on empirical observations and yields a substantially larger number of usable samples in this application.

Location (GDP pc, USD)   2324.2  3300.4  4538.7  4711.0  6199.9  7852.4  8430.3  8987.9  10999.6  13001.1  13470.4  14023.6
Mixture component 1      11%     2%      7%      14%     26%     27%     27%     27%     25%      6%       8%       4%
Mixture component 2      18%     6%      20%     22%     26%     26%     26%     26%     25%      19%      15%      25%
Mixture component 3      31%     30%     32%     30%     24%     25%     25%     25%     25%      33%      35%      35%
Mixture component 4      41%     62%     41%     34%     23%     22%     22%     22%     25%      42%      41%      36%
Total                    759     478     823     1027    1345    1386    1397    1401    1292     868      908      1045
p-value                  0.184   0.034   0.118   0.128   0.438   0.043   0.022   0.016   0.905    0.000    0.000    0.000

Table 1: Share of samples that have led to the original selection for each location, separated by the components of the Gaussian mixture proposal distribution (rows), with mean values specified as ratios of the observed test statistic. The percentages in each column add up to 100%, corresponding to the total number of retained samples given in the row Total. The last row contains the p-values that result from testing the null hypothesis at each location.

For the partial effect estimate in the first selected model, the standard 95% Bayesian confidence intervals and the (in part wider) selective confidence intervals for each projection are shown in the left panel of Figure 3. Corresponding p-values are given in the last row of Table 1. Details on the run-time of the computation are given in Appendix 8.2.

Results for the second selected model.

We pick 12 locations on the estimated spline function that represents the estimated partial effect of the variable Trade Openness (% of GDP), labeled Trade Open. (% GDP), on log-inflation. We compute selective 95% (point-wise) confidence intervals for each projection using importance sampling analogous to the procedure for the first selected model (but with fewer samples for computational reasons). Results are visualized in the right panel of Figure 3. Selective pointwise tests for a zero effect were significant at all tested locations besides 80 and 362.8, even after adjusting for model selection (cf. Table 2 in Appendix 8.3).

Results for the third selected model.

Effects of four variables are modelled through three separate model terms included in this selected model (cf. Table 4 in Appendix 8.3). Inference statements for the linear effect of Real GDP Growth (%) and the non-linear univariate effect of Credit (% of GDP) Growth can be obtained in the same manner as for the first two models. We here focus on the non-linear interaction effect of the two remaining variables, Energy Prices (USD) and Energy Rents (% of GDP), modelled using a penalized bivariate tensor-product spline. We conduct an overall test for the interaction surface, i.e. we test whether the interaction effect is zero for all pairs (Energy Prices (USD), Energy Rents (% of GDP)). We therefore use the selective significance test for groups of variables, applied to the group of spline coefficients. We again use importance sampling analogous to the first two models. Based on 80 samples that are consistent with the original model selection, we compute p-values. These suggest that the null hypothesis can be rejected at the pre-defined significance level (cf. Appendix 8.2 for details).

7 Discussion

In this work we discuss extensions of selective inference to linear mixed models and additive (mixed) models. We establish new test statistics, hypotheses of interest and working model assumptions, and introduce a conditional perspective for this working model. We show how recent proposals for selective inference in linear models can be transferred to these larger model classes and provide evidence of the validity of our approach using simulation studies. We also address the issue of the assumed known (co)variances and give practical advice on how to apply the proposed concepts in practice.

The application of the proposed approach to the determinants of inflation underlines its usefulness by allowing the computation of p-values and confidence intervals for additive models with non-i.i.d. errors after a non-trivial multi-stage selection procedure involving different model types, missing values and varying numbers of observations for the different model comparisons.

The proposed Monte Carlo approximation allows for arbitrary selection mechanisms as long as the selection procedure is deterministic on a given data set, yielding statistically valid inference. The approximation is, however, computationally expensive. In particular for additive models, where selective confidence intervals for non-linear functions require re-running the sampling procedure for each point on the function that is to be tested, future research is needed to circumvent this computational bottleneck.

References

  • Baumann et al. (2020) Baumann, P., Rossi, E. and Volkmann, A. (2020) What Drives Inflation and How: Evidence from Additive Mixed Models Selected by cAIC. arXiv e-prints arXiv:2006.06274.
  • Berk et al. (2013) Berk, R., Brown, L., Buja, A., Zhang, K., Zhao, L. et al. (2013) Valid post-selection inference. The Annals of Statistics, 41, 802–837.
  • Calderón and Hebbel (2008) Calderón, C. and Hebbel, K. S. (2008) What drives inflation in the world? Working Papers Central Bank of Chile 491, Central Bank of Chile.
  • Catão and Terrones (2005) Catão, L. A. and Terrones, M. E. (2005) Fiscal deficits and inflation. Journal of Monetary Economics, 52, 529–554.
  • Eilers and Marx (1996) Eilers, P. H. and Marx, B. D. (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 89–102.
  • Fithian et al. (2014) Fithian, W., Sun, D. and Taylor, J. (2014) Optimal Inference After Model Selection. arXiv e-prints arXiv:1410.2597.
  • Greven and Kneib (2010) Greven, S. and Kneib, T. (2010) On the behaviour of marginal and conditional AIC in linear mixed models. Biometrika, 97, 773–789.
  • Kuznetsova et al. (2017) Kuznetsova, A., Brockhoff, P. B. and Christensen, R. H. B. (2017) lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82, 1–26.
  • Lee et al. (2016) Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2016) Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44, 907–927.
  • Loftus and Taylor (2015) Loftus, J. R. and Taylor, J. E. (2015) Selective inference in regression models with groups of variables. arXiv e-prints arXiv:1511.01478.
  • Marra and Wood (2012) Marra, G. and Wood, S. N. (2012) Coverage properties of confidence intervals for generalized additive model components. Scandinavian Journal of Statistics, 39, 53–74.
  • Nychka (1988) Nychka, D. (1988) Bayesian confidence intervals for smoothing splines. Journal of the American Statistical Association, 83, 1134–1143.
  • Overholser and Xu (2014) Overholser, R. and Xu, R. (2014) Effective degrees of freedom and its application to conditional AIC for linear mixed-effects models with correlated error structures. Journal of Multivariate Analysis, 132, 160–170.
  • Rügamer and Greven (2018) Rügamer, D. and Greven, S. (2018) Selective inference after likelihood- or test-based model selection in linear models. Statistics & Probability Letters, 140, 7 – 12.
  • Rügamer and Greven (2020) — (2020) Inference for L2-Boosting. Statistics and Computing, 30, 279–289.
  • Ruppert et al. (2003) Ruppert, D., Wand, M. P. and Carroll, R. J. (2003) Semiparametric regression. Cambridge series in statistical and probabilistic mathematics. Cambridge and New York: Cambridge University Press.
  • Säfken et al. (2014) Säfken, B., Kneib, T., van Waveren, C. and Greven, S. (2014) A unifying approach to the estimation of the conditional Akaike information in generalized linear mixed models. Electronic Journal of Statistics, 8, 201–225.
  • Säfken et al. (2019) Säfken, B., Rügamer, D., Kneib, T. and Greven, S. (2019) Conditional Model Selection in Mixed-Effects Models with cAIC4. Journal of Statistical Software. To appear.
  • Tibshirani et al. (2018) Tibshirani, R. J., Rinaldo, A., Tibshirani, R. and Wasserman, L. (2018) Uniform asymptotic inference and the bootstrap after model selection. The Annals of Statistics, 46, 1255–1287.
  • Tibshirani et al. (2016) Tibshirani, R. J., Taylor, J., Lockhart, R. and Tibshirani, R. (2016) Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111, 600–620.
  • Wood (2011) Wood, S. N. (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B), 73, 3–36.
  • Wood (2013) — (2013) On p-values for smooth components of an extended generalized additive model. Biometrika, 100, 221–228.
  • Wood (2017) — (2017) Generalized Additive Models: An Introduction with R. CRC press.
  • Yang et al. (2016) Yang, F., Barber, R. F., Jain, P. and Lafferty, J. (2016) Selective inference for group-sparse linear models. In Advances in Neural Information Processing Systems, 2469–2477.

8 Appendix

8.1 Further Simulation Results

Figure 4: Comparison of expected quantiles of p-values vs. observed (selective) p-values when either using the working covariance ("Working Covariance") or setting the shrinkage matrix to zero ("Covariance w/o matrix A") in the definition of the test vector.

8.2 Further Details on Section 6

Theories of inflation.

The first theory regards the creation of monetary base, credit growth and the Phillips curve. The second theory comprises countries' institutional setup. The third accounts for a group of variables that represent monetary policy strategies, while the fourth focuses on public finances. The fifth theoretical explanation is related to globalization, the sixth takes into account demographic changes and the seventh the development of prices and rents of natural resources. The last theory considers the inertia of the inflation process.

Details on model fitting.

For model fitting we use the R package mgcv (Wood, 2011) and perform model selection via the R package cAIC4 (Säfken et al., 2019). To allow this, we extend the cAIC4 package to permit calculations for 1) mixed and additive models estimated by mgcv and 2) models beyond an i.i.d. error covariance, following the proposed extension of Overholser and Xu (2014). As Overholser and Xu (2014) do not take into account the estimation uncertainty of the error covariance, we implemented a working version that adds the number of unknown parameters in this covariance (asymptotically justified by Overholser and Xu (2014)) to the bias correction term of Säfken et al. (2019). Further research might strive for an analytic solution to the bias correction term.

Run-time to derive inference statements for the first selected model.

The run-time for the estimation of the 16 AMMs in the selection procedure, followed by 16 cAIC computations, for a single new response vector on a cluster of 34 cores (Intel(R) Xeon(R) CPU E3-1284L v4 @ 2.90GHz) was approximately 1:05 min. In order to obtain inference statements for a single location, the estimation of all AMMs and cAICs associated with each newly generated response vector, across all samples, summed to a total run-time of 26 h and 59 min.

Test of the bivariate interaction effect of Energy Prices (USD) and Energy Rents (% of GDP):

In the case of this bivariate interaction effect, the null hypothesis is formulated via the group projection from Section 2.1: the design matrix without the group comprises the evaluated spline basis functions associated with Credit (% GDP) Gr. and the columns associated with GDP Gr. (%), while the group itself comprises the evaluated basis functions of the bivariate tensor-product spline. We test the null hypothesis of a zero interaction surface against its one-sided alternative, corresponding to an overall test for the interaction surface, with the test statistic given by the norm of the projected response as in (2) and its observed realization computed from the data. As tests for groups of variables are currently only available for unpenalized effects, the degrees of freedom of our selective significance test were chosen according to the rank of the group design matrix, which might lead to a loss in power due to the penalization of the basis coefficients (cf. Wood, 2013). By means of the Monte Carlo approximation described in Section 2.3, we obtained 80 samples consistent with the original model selection, for which the resulting p-value leads to a rejection of the null hypothesis at the pre-defined significance level.