Conformal prediction for exponential families and generalized linear models

05/09/2019
by   Daniel J. Eck, et al.

Conformal prediction methods construct prediction regions for iid data that are valid in finite samples. Distribution-free conformal prediction methods have been proposed for regression. Generalized linear models (GLMs) are a widely used class of regression models, and researchers often seek predictions from fitted GLMs. We provide a parametric conformal prediction region for GLMs that possesses finite sample validity and is asymptotically of minimal length when the model is correctly specified. This parametric conformal prediction region is asymptotically minimal at the √(log(n)/n) rate when the dimension d of the predictor is one or two, and converges at the O{(log(n)/n)^(1/d)} rate when d > 2. We develop a novel concentration inequality for maximum likelihood estimation in exponential families that induces these convergence rates. We analyze prediction region coverage properties, large-sample efficiency, and robustness properties of four methods for constructing conformal prediction intervals for GLMs: fully nonparametric kernel-based conformal, residual based conformal, normalized residual based conformal, and parametric conformal which uses the assumed GLM density as a conformity measure. Extensive simulations compare these approaches to standard asymptotic prediction regions. The utility of the parametric conformal prediction region is demonstrated in an application to interval prediction of glycosylated hemoglobin levels, a blood measurement used to diagnose diabetes.


1 Introduction

Vovk et al. (2005) introduced conformal prediction to construct finite sample valid prediction regions for predictions from nearest-neighbor methods, support-vector machines, sequential classification, and linear and ridge regression problems. The goal in conformal prediction is to construct a prediction region $C^{(n)}$ from an iid random sample $Y_1, \ldots, Y_n \sim P$ such that the probability of a future observation $Y_{n+1}$ belonging to the region exceeds a desired coverage level (Shafer and Vovk, 2008). That is, given an error tolerance $\alpha \in (0, 1)$, we seek a set $C^{(n)}$ such that, for all distributions $P$,

$$P\{Y_{n+1} \in C^{(n)}\} \geq 1 - \alpha. \qquad (1)$$

Lei et al. (2013) extended the framework of Shafer and Vovk (2008) so that the prediction region is also asymptotically of minimal length. These prediction regions make use of a conformity measure, which measures the agreement of a point with a probability measure $P$. For example, when $P$ is continuous, the probability density function corresponding to $P$ is a conformity measure. Lei et al. (2013) proposed nonparametric kernel density estimation of this density to construct conformal prediction regions that are asymptotically minimal under smoothness conditions on the density. Beyond real-valued outcomes, conformal methods have been proposed for functional data (Lei et al., 2015), nonparametric regression (Lei and Wasserman, 2014), time series data (Chernozhukov et al., 2018b), regression problems (Lei et al., 2018), random effects (Dunn and Wasserman, 2018), causal inference (Chernozhukov et al., 2018a), and machine learning (Vovk et al., 2005; Gammerman and Vovk, 2007; Papadopoulos et al., 2011; Vovk, 2012; Burnaev and Vovk, 2014; Balasubramanian et al., 2014; Johansson et al., 2018; Wang et al., 2018). The finite sample validity property of conformal prediction regions has broad appeal in many scientific domains, including astrophysics (Ciollaro et al., 2018), medical applications (Lambrou et al., 2009; Devetyarov et al., 2012; Eklund et al., 2015; Bosc et al., 2019), genetics (Norinder et al., 2018b, a), and chemistry (Cortés-Ciriano and Bender, 2018; Ji et al., 2018; Svensson et al., 2018a, b; Toccaceli et al., 2020).

Generalized linear models (GLMs) are widely used regression models for outcomes that follow an exponential family distribution (McCullagh and Nelder, 1989). GLMs are popular in empirical research in the biomedical and social sciences; procedures for fitting GLMs are incorporated into every major statistical software package. Often predictive inference from a fitted GLM serves the primary scientific goals of the analysis. Point predictions from these models are usually combined with variance estimates from the bootstrap or delta method to construct prediction intervals. For example, interval predictions on the outcome scale may be constructed using the predict.glm function in R (R Core Team, 2019), which yields Wald-type intervals for either the mean response or its distribution.
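To make this concrete, the following R sketch constructs such a Wald-type interval for the mean response from a fitted GLM; the data, the Gamma family, and the log link are hypothetical choices used only for illustration.

```r
# Hypothetical data: a Gamma GLM with log link fit to (x, y) pairs.
set.seed(1)
dat <- data.frame(x = runif(100))
dat$y <- rgamma(100, shape = 2, rate = 2 / exp(1 + 2 * dat$x))
fit <- glm(y ~ x, family = Gamma(link = "log"), data = dat)

# Wald-type interval for the mean response at a new predictor value,
# built on the link scale and mapped back to the outcome scale.
new <- data.frame(x = 0.5)
pr  <- predict(fit, newdata = new, type = "link", se.fit = TRUE)
exp(pr$fit + c(-1, 1) * qnorm(0.975) * pr$se.fit)
```

Note that this interval quantifies uncertainty about the mean response, not about a future outcome, which motivates the distinction drawn in the next paragraph.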

In this paper, we are interested in prediction intervals for the outcome random variable itself, not just the mean. Minimal length prediction regions can be constructed directly using the assumed GLM, but nominal coverage for these regions is not guaranteed in finite samples, and asymptotic claims require the GLM to be correctly specified. In this article we show that conformal prediction regions achieve nominal coverage in finite samples for GLMs, even when the model is misspecified. Our parametric conformal prediction region estimated under the correct GLM with the link function specified is conditionally valid in local regions of the predictor space (Lei and Wasserman, 2014, pg. 72), and is asymptotically conditionally valid for point predictions. Furthermore, we show that when the predictor space is bounded, this parametric conformal region is asymptotically of minimal length. The rate of convergence is $\sqrt{\log(n)/n}$ when the dimension $d$ of the predictor space is one or two and is $\{\log(n)/n\}^{1/d}$ when $d > 2$, where the predictor region of interest shrinks at a suitable rate. Similar results are suggested by Dunn and Wasserman (2018) in a general context involving random effects, but specific convergence rates are not provided. The $\{\log(n)/n\}^{1/d}$ rate is a consequence of how quickly the predictor region of interest shrinks. This convergence rate is faster than the rates presented by Lei and Wasserman (2014) for nonparametric regression predictions.

In an extensive simulation study, we analyze marginal, local, and conditional prediction interval coverage, and large-sample efficiency, of four methods for constructing conformal prediction intervals from fitted GLMs: fully nonparametric kernel-based prediction (Lei and Wasserman, 2014), prediction based on residuals and locally weighted residuals (Lei et al., 2018), and prediction using the parametric density as the conformity measure. We find that when sample sizes are moderate – large enough to estimate regression coefficients reasonably precisely, but before asymptotic arguments guarantee good conditional coverage – conformal prediction methods outperform traditional methods in terms of finite sample marginal, local, and conditional coverage. Importantly, we find that parametric conformal prediction regions perform extremely well under mild model misspecification: they are calibrated to give valid finite sample coverage, and they are not too large for meaningful inference in this setting. We demonstrate the utility of conformal prediction methods for GLMs in an application to diabetes diagnosis via interval prediction of glycosylated hemoglobin levels of subjects participating in a community-based study (Willems et al., 1997).

2 Background

2.1 Conformal prediction for outcomes

The basic intuition underlying the conformal prediction method is that, given an independent sample $Y_1, \ldots, Y_n$ from a distribution $P$ defined on $\mathbb{R}$, the hypothesis that a candidate value $y$ is an observation from the same distribution, exchangeable with $Y_1, \ldots, Y_n$, is tested for each $y$, and then $C^{(n)}$ is created by inverting this family of tests (Lei et al., 2013; Vovk et al., 2005; Shafer and Vovk, 2008). More formally, suppose that $Y_1, \ldots, Y_n, Y_{n+1} \stackrel{\mathrm{iid}}{\sim} P$. Let $\hat{P}_{n+1}$ be the corresponding empirical distribution and note that $\hat{P}_{n+1}$ is symmetric in its arguments. Define

$$\sigma_i = \sigma(\hat{P}_{n+1}, Y_i), \qquad i = 1, \ldots, n+1,$$

where $\sigma(\cdot, \cdot)$ is a conformity measure which measures the "agreement" of a point with respect to a distribution (Lei et al., 2013). Informally, a conformity measure should be large when there is agreement between the point and the distribution. One important example of a conformity measure is the density function of $P$ (when it exists); this idea is used in Section 3.2. Exchangeability of $\sigma_1, \ldots, \sigma_{n+1}$ follows from exchangeability of the random variables $Y_1, \ldots, Y_{n+1}$. Define $\pi$ as the random variable

$$\pi = \frac{1}{n+1} \sum_{i=1}^{n+1} 1\{\sigma_i \leq \sigma_{n+1}\},$$

and, for a candidate value $y$, let $\pi(y)$ denote this random variable evaluated with $Y_{n+1} = y$, so that the conformity scores are computed from the augmented sample $Y_1, \ldots, Y_n, y$. The conformal prediction region for $Y_{n+1}$ is

$$C^{(n)}(\alpha) = \left\{ y : (n+1)\pi(y) \geq \lceil \alpha(n+1) \rceil \right\}, \qquad (2)$$

where $\alpha$ is the desired error tolerance. Then by construction,

$$P\left\{ (n+1)\pi(Y_{n+1}) \leq \lceil \alpha(n+1) \rceil - 1 \right\} \leq \alpha, \qquad (3)$$

which implies that $P\{Y_{n+1} \in C^{(n)}(\alpha)\} \geq 1 - \alpha$. Any conformity measure can be used to construct prediction regions with finite sample validity.
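As an illustration of the construction in (2), the following R sketch computes a conformal prediction region for an iid univariate sample using a kernel density estimate of the augmented sample as the conformity measure; the Gaussian kernel, bandwidth rule, and candidate grid are illustrative choices rather than part of the formal development.

```r
# Conformal prediction region for an iid univariate sample, using a kernel
# density estimate of the augmented sample as the conformity measure.
conformal_region <- function(y, y_grid, alpha = 0.1, h = bw.nrd(y)) {
  n <- length(y)
  keep <- sapply(y_grid, function(y0) {
    z <- c(y, y0)                                  # augmented sample
    sigma <- sapply(z, function(zi) mean(dnorm((z - zi) / h)) / h)
    pi_y0 <- mean(sigma <= sigma[n + 1])           # conformity rank of y0
    (n + 1) * pi_y0 >= ceiling(alpha * (n + 1))    # test inversion as in (2)
  })
  y_grid[keep]
}

set.seed(1)
y <- rgamma(100, shape = 2, rate = 1)
region <- conformal_region(y, y_grid = seq(0, 10, by = 0.05))
range(region)   # endpoints of the (possibly disjoint) 90% prediction region
```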

Lemma 1.

(Lei et al., 2013). Suppose $Y_1, \ldots, Y_n$ is an independent random sample from $P$. Then $P\{Y_{n+1} \in C^{(n)}(\alpha)\} \geq 1 - \alpha$ for all probability measures $P$, and hence $C^{(n)}(\alpha)$ is finite sample valid.

Figure 1 shows schematically how this conformal prediction region is constructed.

Figure 1: How conformal prediction regions are constructed. The left panel shows conformal prediction for a univariate outcome: the top row shows the data points and corresponding histogram; the middle row shows conformity scores for a nonparametric kernel smoother; the bottom row shows, for each candidate value, the proportion of conformity scores no larger than that value's score, and the resulting conformal prediction set. The right panel shows conformal prediction for a univariate regression with a predictor and an outcome: the top row shows the observed pairs and a regression estimate computed using the additional candidate point; the middle row shows the conformity scores obtained by using the estimated density as the conformity measure; the bottom row shows the proportion of density scores lower than the density score at each candidate value, and the resulting conformal prediction set.

2.2 Notions of finite sample validity in regression

Several notions of finite sample validity exist in the context of regression. Suppose that we have an iid sample $(X_1, Y_1), \ldots, (X_n, Y_n)$, where the predictor is $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}$ is the outcome. We will suppose that some of the components of $X$ are main effects and that the other terms in $X$ are functions of the main effects. We will suppose that $X$ has bounded support.

Definition 1 (marginal validity).

Let $\alpha$ be a desired error tolerance. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an iid sample from a continuous joint distribution $P$. The prediction region $C^{(n)}$ has finite sample marginal validity if

$$P\{Y_{n+1} \in C^{(n)}(X_{n+1})\} \geq 1 - \alpha,$$

where $C^{(n)}(x) = \{y : (x, y) \in C^{(n)}\}$.

Finite sample marginal validity is guaranteed for conformal prediction methods by construction. Finite sample marginal validity alone may not be desirable when variability of the outcome is not constant across the support of the predictor. A second notion of finite sample validity arises by considering coverage of the prediction region conditional on a particular value of the predictor.

Definition 2 (conditional validity).

Let $\alpha$ be a desired error tolerance. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an iid sample from a continuous joint distribution $P$. The prediction region $C^{(n)}$ has finite sample conditional validity at $x$ when

$$P\{Y_{n+1} \in C^{(n)}(x) \mid X_{n+1} = x\} \geq 1 - \alpha,$$

where $C^{(n)}(x) = \{y : (x, y) \in C^{(n)}\}$.

However, conditional validity at every $x$ is unattainable if we insist that $C^{(n)}$ is asymptotically of minimal length. Let $p(y \mid x)$ be the conditional density corresponding to the regression of $Y$ on $X$. Define $t_\alpha(x)$ to be the value such that $P\{p(Y \mid x) \geq t_\alpha(x) \mid X = x\} = 1 - \alpha$ and define $C^*_\alpha(x) = \{y : p(y \mid x) \geq t_\alpha(x)\}$ to be the optimal conditional prediction region (i.e., the minimal length prediction region). Let $\lambda$ be Lebesgue measure and let $\triangle$ denote the symmetric set difference operator. Then there does not exist a finite sample conditionally valid prediction region such that $\lambda\{C^{(n)}(x) \,\triangle\, C^*_\alpha(x)\} \to 0$ for all $x$, a result due to Lemma 1 of Lei and Wasserman (2014). A third notion of finite sample validity relaxes the stringent requirement of conditional validity at every possible $x$.

Definition 3 (local validity).

Let $\alpha$ be a desired error tolerance. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an iid sample from a continuous joint distribution $P$. Let $\mathcal{A} = \{A_k : k \geq 1\}$ be a partition of the support of $X$. We say that the prediction region $C^{(n)}$ has finite sample local validity with respect to $\mathcal{A}$ when

$$P\{Y_{n+1} \in C^{(n)}(X_{n+1}) \mid X_{n+1} \in A_k\} \geq 1 - \alpha$$

for all $k$, where $C^{(n)}(x) = \{y : (x, y) \in C^{(n)}\}$.

Local validity offers a bridge between marginal validity – which is inappropriate in the presence of heterogeneity – and conditional validity – which is unattainable when we require that the prediction interval is asymptotically minimal (Lei and Wasserman, 2014). The nonparametric conformal prediction region in Section 2.3.1 satisfies finite sample local validity and is asymptotically conditionally valid (Lei and Wasserman, 2014). In Section 3.2 we show that the parametric conformal prediction region also satisfies finite sample local validity and is asymptotically conditionally valid at a rate faster than its nonparametric counterpart.

2.3 Existing conformal prediction techniques in regression

For convenience in what follows, suppose the support of the predictors is $[0, 1]^d$. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an iid sample, where the joint distribution is continuous, the predictor takes values in $[0, 1]^d$, and some of its components are main effects. We will denote the true conditional density of the regression model as $p(y \mid x; \beta, \varphi)$, where $\beta$ is the regression coefficient vector and $\varphi$ is a vector of nuisance parameters. For each $x$, we define $t_\alpha(x)$ to be the point such that $P\{p(Y \mid x; \beta, \varphi) \geq t_\alpha(x) \mid X = x\} = 1 - \alpha$. The optimal or minimal length prediction region at $x$ is $C^*_\alpha(x) = \{y : p(y \mid x; \beta, \varphi) \geq t_\alpha(x)\}$.

2.3.1 The nonparametric kernel estimator

Lei and Wasserman (2014) proposed a nonparametric conformal prediction region for continuous outcomes that satisfies finite sample local validity. Partitioning of the predictor space is performed so that the nonparametric conformal prediction region within each element of the partition achieves finite sample marginal validity. Let $\mathcal{A} = \{A_k : k \geq 1\}$ be a partition of the support of the predictor variable $X$. The number of elements in this partition increases with the sample size at a rate that depends on the smoothness of the underlying density (Lei and Wasserman, 2014). A specific rate is discussed at the end of this Section.

When we restrict the partition to be formed of equilateral cubes, we let the widths $w_n$ of these cubes decrease with $n$ at a corresponding rate. Let $n_k$ denote the number of observations with $X_i \in A_k$ and let $K$ be a non-negative kernel function. The estimated local marginal density of $Y$ within $A_k$ is

$$\hat{p}_k(y) = \frac{1}{n_k h} \sum_{i : X_i \in A_k} K\!\left(\frac{y - Y_i}{h}\right),$$

where $h$ is the bandwidth. The density estimate given a new pair $(x, y)$ with $x \in A_k$ is obtained in the same way from the augmented local sample that includes the candidate value $y$. The local conformity rank $\pi_k(y)$ is then the proportion of points in the augmented local sample whose estimated density does not exceed that of the candidate $y$, and the conformal prediction band is

$$\hat{C}^{(n)}(x) = \left\{ y : (n_k + 1)\,\pi_k(y) \geq \lceil \alpha(n_k + 1) \rceil \right\}, \qquad (4)$$

for all $x \in A_k$. The intuition for constructing the conformal prediction region in this manner is that, asymptotically, the width $w_n$ shrinks so that the conditional distributions of $Y$ given any predictor value in $A_k$ become very similar to that of $Y$ given any other particular predictor value in $A_k$. Lei and Wasserman (2014) provided the rate of convergence at which this region is asymptotically of minimal length. In the Supplementary Materials we show that when the underlying model is a GLM, this nonparametric region is asymptotically of minimal length.
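A minimal sketch of this locally valid construction applies the same density-based procedure within the element of the partition containing the new predictor value; it reuses the conformal_region() function from the sketch in Section 2.1, and the fixed bin width below is illustrative rather than the width sequence prescribed by Lei and Wasserman (2014).

```r
# Locally valid conformal region: apply the density-based construction of
# Section 2.1 within the bin of the partition that contains the new x.
# Reuses conformal_region() from the earlier sketch; bin width is illustrative.
local_conformal <- function(x, y, x_new, y_grid, alpha = 0.1, width = 0.5) {
  bins <- seq(min(x), max(x) + width, by = width)
  k <- findInterval(x_new, bins)          # bin containing the new predictor
  in_bin <- findInterval(x, bins) == k
  conformal_region(y[in_bin], y_grid, alpha = alpha)
}
```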

2.3.2 Normalized residuals

Lei et al. (2018) proposed a prediction region obtained from conformal prediction applied to residuals, which is appropriate when errors are symmetric about the mean function $\mu(x)$. Lei et al. (2018, Section 5.2) also proposed an extension of their conformal prediction procedure that is appropriate when the errors about $\mu(x)$ exhibit heterogeneity but remain symmetric. This extension involves a dispersion function $\rho(x)$ that captures the changing variability across $x$ and is used to weight the residuals so that these weighted residuals have the same magnitude, on average, across $x$. The conformal procedure of Lei et al. (2018, Section 5.2), denoted as the "least squares locally weighted" method, proceeds as follows. When the mean regression estimator of $\mu(x)$ is a symmetric function of the data points, augment the original data with a new point $(x, y)$, and estimate the mean function $\hat{\mu}$ and the dispersion function $\hat{\rho}$ with respect to the augmented data. Define the normalized absolute residuals as

$$R_i(y) = \frac{|Y_i - \hat{\mu}(X_i)|}{\hat{\rho}(X_i)}$$

for $i = 1, \ldots, n+1$, where $(X_{n+1}, Y_{n+1}) = (x, y)$. The normalized absolute residual is an example of an anti-conformity measure, in which a low value indicates agreement between the point and the estimated regression function. As in Lei et al. (2018), we specify the dispersion function as the conditional mean absolute deviation of $Y - \mu(X)$ as a function of $X$. Let

$$\pi(y) = \frac{1}{n+1} \sum_{i=1}^{n+1} 1\{R_i(y) < R_{n+1}(y)\}$$

be the proportion of points with normalized residuals smaller than the proposed normalized residual $R_{n+1}(y)$. Define the least squares locally weighted (LSLW) conformal prediction region as

$$\hat{C}^{\mathrm{lslw}}(x) = \left\{ y : (n+1)\pi(y) \leq \lceil (1-\alpha)(n+1) \rceil \right\}.$$

The least squares (LS) conformal prediction region is constructed in the same manner with no weighting of the residuals, $\hat{\rho} \equiv 1$. These conformal prediction regions achieve finite sample marginal validity as a consequence of exchangeability of the data and symmetry of the estimators in their arguments (Lei et al., 2018).
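The least squares version can be sketched as follows; the cubic mean function is only an example, and the regression is refit to the augmented data for every candidate value so that the residuals remain exchangeable.

```r
# Least squares (residual) conformal region at a new predictor value x_new.
# The regression is refit to the augmented data for every candidate y.
ls_conformal <- function(x, y, x_new, y_grid, alpha = 0.1) {
  n <- length(y)
  keep <- sapply(y_grid, function(y0) {
    xa <- c(x, x_new); ya <- c(y, y0)
    fit <- lm(ya ~ poly(xa, 3))                 # example mean function
    r <- abs(ya - fitted(fit))                  # anti-conformity scores
    (n + 1) * mean(r < r[n + 1]) <= ceiling((1 - alpha) * (n + 1))
  })
  y_grid[keep]
}
```

The locally weighted version replaces the absolute residuals with residuals divided by an estimate of the conditional mean absolute deviation, also computed from the augmented data.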

3 Conformal prediction for generalized linear regression

We now introduce parametric conformal prediction regions for exponential family distributions and a class of GLMs with continuous outcomes. We show that the parametric conformal prediction region possesses the same finite sample local validity properties as its nonparametric counterpart, but its convergence to the true minimum length prediction region is faster. In the Supplementary Materials, we verify that the nonparametric conformal region of Lei and Wasserman (2014) is appropriate for the class of GLMs that we consider.

3.1 Exponential family distributions and generalized linear models

Let $\nu$ be a positive Borel measure on a finite-dimensional vector space $E$ that is absolutely continuous with respect to Lebesgue measure $\lambda$. The log Laplace transform of $\nu$ is the function $c : E^* \to \overline{\mathbb{R}}$ defined by

$$c(\theta) = \log \int_E e^{\langle x, \theta \rangle}\, \nu(dx), \qquad (5)$$

where $E^*$ is the dual space of $E$, $\langle \cdot, \cdot \rangle$ is the inner product which places $E$ and $E^*$ in duality, and $\overline{\mathbb{R}}$ is the extended real number system, which adds the values $-\infty$ and $+\infty$ to the real numbers (Rockafellar and Wets, 1998, Section 1.E). The effective domain of $c$ on $E^*$ is $\{\theta \in E^* : c(\theta) < \infty\}$. For every $\theta$ in the effective domain, the function defined by

$$f_\theta(x) = e^{\langle x, \theta \rangle - c(\theta)} \qquad (6)$$

is a probability density with respect to $\nu$. We refer to $x$ and $\theta$ as the canonical statistic and canonical parameter respectively. The set $\{f_\theta : \theta \in \Theta\}$, where $\Theta$ is any nonempty subset of the effective domain, is called an exponential family of densities with respect to $\nu$. This family is full if $\Theta$ equals the effective domain. We refer to $\nu$ as the generating measure of the exponential family. When $\lambda$ is taken to be the generating measure, the density (6) becomes

$$f_\theta(x) = h(x)\, e^{\langle x, \theta \rangle - c(\theta)}, \qquad (7)$$

where $h = d\nu/d\lambda$ is a parameter-free continuous function of the canonical statistic. We will only consider scalar outcome variables. However, a canonical statistic of dimension greater than one can occur in this case, the normal distribution being an example with canonical statistic $(y, y^2)$. Thus we will assume that the canonical statistic vector lies on a one-dimensional manifold of $E$.
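As a concrete illustration relevant to Sections 4 and 5, the Gamma distribution also has a two-dimensional canonical statistic for a scalar outcome. Writing the Gamma density with shape $a$ and rate $b$ in the form (7) gives

$$\frac{b^{a}}{\Gamma(a)}\, y^{a-1} e^{-b y} = y^{-1} \exp\left[ \big\langle (y, \log y),\, (-b, a) \big\rangle - \{\log \Gamma(a) - a \log b\} \right], \qquad y > 0,$$

so that the canonical statistic $(y, \log y)$ lies on a one-dimensional manifold, the canonical parameter is $(-b, a)$, and $h(y) = y^{-1}$.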

When the density (6) corresponds to a generalized linear regression model we will re-parameterize $\theta = \theta(x; \beta, \varphi)$, where $\theta(\cdot\,; \cdot, \cdot)$ is continuous in its arguments, the vectors $x$ and $\beta$ are predictors and regression coefficients respectively, and $\varphi$ is a vector of nuisance parameters. We specify that the base set (main effects) of predictors has bounded support. Here we assume, without loss of generality, that $g\{E(Y \mid x)\} = x^\top\beta$, where $g$ is a link function and $Y$ is the first component of the canonical statistic vector. As an example of this re-parameterization, consider the simple linear regression model with homoskedastic normal errors with variance $\sigma^2$, given by $Y \mid x \sim N(x^\top\beta, \sigma^2)$. In this example $\theta(x; \beta, \sigma^2) = (x^\top\beta/\sigma^2,\; -1/(2\sigma^2))^\top$, the canonical statistic is $(y, y^2)^\top$, and the nuisance parameter is $\varphi = \sigma^2$. The link function is taken to be the identity function and $E(Y \mid x) = x^\top\beta$.
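This canonical form can be verified by expanding the square in the exponent of the normal density:

$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - x^\top\beta)^2}{2\sigma^2} \right\}
= \exp\left[ \Big\langle (y,\, y^2),\, \Big(\frac{x^\top\beta}{\sigma^2},\, -\frac{1}{2\sigma^2}\Big) \Big\rangle
- \left\{ \frac{(x^\top\beta)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2) \right\} \right],$$

so that $c\{\theta(x; \beta, \sigma^2)\} = (x^\top\beta)^2/(2\sigma^2) + \log(2\pi\sigma^2)/2$ and $h(y) = 1$.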

The re-parameterized density corresponding to the generalized linear regression model is then

$$f(y \mid x; \beta, \varphi) = \exp\left[ \langle Y(y), \theta(x; \beta, \varphi) \rangle - c\{\theta(x; \beta, \varphi)\} \right], \qquad (8)$$

with respect to generating measure $\nu$, where $Y(y)$ denotes the canonical statistic vector. We will further assume that the support of the predictor is bounded and that the exponential family is full with parameter space given by

$$\Theta = \{ \theta : c(\theta) < \infty \}. \qquad (9)$$

All of the continuous exponential families implemented in the glm function in R have densities that can be parameterized as (8). Again consider the multiple linear regression model with homoskedastic normal errors with variance $\sigma^2$. Here we have $Y(y) = (y, y^2)^\top$ and $\theta(x; \beta, \sigma^2) = (x^\top\beta/\sigma^2, -1/(2\sigma^2))^\top$. When $\lambda$ is taken to be the generating measure, the density (8) becomes

$$f(y \mid x; \beta, \varphi) = h(y) \exp\left[ \langle Y(y), \theta(x; \beta, \varphi) \rangle - c\{\theta(x; \beta, \varphi)\} \right]. \qquad (10)$$

3.2 Parametric conformal prediction regions

We now assume that the link function of the generalized linear regression model is known, the model is correctly specified, and that $\hat{\beta}$ is a $\sqrt{n}$-consistent estimator of the regression coefficients $\beta$. Let $\mathcal{A} = \{A_k : k \geq 1\}$ be a partition of the support of the predictor variable in the form of equilateral cubes with sides of length proportional to $(\log n / n)^{1/2}$ when $d \leq 2$ and $(\log n / n)^{1/d}$ when $d > 2$. Let $p(y \mid x; \beta, \varphi)$ be the true conditional density function, let $p(y \mid x; \hat{\beta}, \hat{\varphi})$ be the estimated density fit using the original data, and define $p(y \mid x; \hat{\beta}_y, \hat{\varphi}_y)$ as the estimated density fit to the augmented data $(X_1, Y_1), \ldots, (X_n, Y_n), (x, y)$. We will use this estimated density as our conformity measure.

The local conformal prediction region is then, for $x \in A_k$,

$$\hat{C}^{(n)}(x) = \left\{ y : (n_k + 1)\,\pi_k(y) \geq \lceil \alpha(n_k + 1) \rceil \right\}, \qquad (11)$$

where $n_k$ is the number of observations with $X_i \in A_k$ and $\pi_k(y)$ is the proportion of points in the augmented local sample – the observations with predictors in $A_k$ together with the candidate pair $(x, y)$ – whose estimated conditional density value $p(Y_i \mid X_i; \hat{\beta}_y, \hat{\varphi}_y)$ does not exceed $p(y \mid x; \hat{\beta}_y, \hat{\varphi}_y)$. We show that the region in (11) has finite sample local validity and is asymptotically minimal for all generalized linear models that can be parameterized as (8). A schematic of how this conformal prediction region is constructed is displayed in the right-hand side of Figure 1.
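A simplified R sketch of the region in (11) is given below for a Gamma GLM with log link; it treats all of the data as a single element of the partition, uses a grid of candidate values in place of the line searches described in Section 6, and estimates the shape parameter by a moment-type estimator, so it illustrates the construction rather than reproducing the implementation provided in the accompanying conformal.glm package.

```r
# Parametric conformal region at x_new, using the fitted GLM density as the
# conformity measure; shown for a Gamma GLM with log link, one bin, and a
# grid of positive candidate values.
glm_conformal <- function(x, y, x_new, y_grid, alpha = 0.1) {
  n <- length(y)
  keep <- sapply(y_grid, function(y0) {
    xa <- c(x, x_new); ya <- c(y, y0)
    fit   <- glm(ya ~ xa, family = Gamma(link = "log"))   # refit to augmented data
    shape <- 1 / summary(fit)$dispersion                  # moment-type shape estimate
    dens  <- dgamma(ya, shape = shape, rate = shape / fitted(fit))
    pi_y0 <- mean(dens <= dens[n + 1])                    # conformity rank of y0
    (n + 1) * pi_y0 >= ceiling(alpha * (n + 1))
  })
  y_grid[keep]
}
```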

Lemma 2.

Let $\mathcal{A} = \{A_k : k \geq 1\}$ be a partition of the support of the predictor and let $\hat{C}^{(n)}$ be as defined in (11). Then $\hat{C}^{(n)}$ is finite sample locally valid with respect to $\mathcal{A}$.

Proof.

The proof follows from the proof of (Lei and Wasserman, 2014, Proposition 2). Fix an element $A_k$ of the partition and let $(X_i, Y_i)$, $i = 1, \ldots, n$, be the original sample. Let $(X_{n+1}, Y_{n+1})$ be another independent draw from the same distribution and condition on the event $X_{n+1} \in A_k$. Then, conditioning on the number of observations falling in $A_k$, the conformity scores of the points with predictors in $A_k$ together with that of $(X_{n+1}, Y_{n+1})$ form an exchangeable sequence. ∎

We now establish that the parametric conformal prediction region is asymptotically of minimal length over the support of the predictor.

Assumption 1 (Assumption 1 (a) of Lei and Wasserman (2014)). Let $\mathcal{X}$ be the support of $X$ and assume that $\mathcal{X}$ is compact. The marginal density $p_X$ of $X$ satisfies $p_X(x) \geq b_1 > 0$ for all $x \in \mathcal{X}$.

Theorem 1.

Let $(X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, Y_{n+1})$ be an independent and identically distributed sample of random variables with conditional density (10) and parameter space (9). Assume that the predictor support is bounded and that the canonical statistic vector lies on a one-dimensional manifold. Suppose that Assumption 1 holds. Let $x$ be in the support of the predictor and let $\hat{C}^{(n)}(x)$ be the prediction band given by (11). Then, for a given $\alpha$, there exists a numerical constant $c$ such that, with probability tending to one,

$$\lambda\{\hat{C}^{(n)}(x) \,\triangle\, C^*_\alpha(x)\} \leq c\, r_n, \qquad (12)$$

where $r_n = (\log n / n)^{1/2}$ when $d \leq 2$ and $r_n = (\log n / n)^{1/d}$ when $d > 2$.

The proof of Theorem 1 is given in the Appendix. The rates $(\log n/n)^{1/2}$ and $(\log n/n)^{1/d}$ are appreciably faster than that of the nonparametric conformal prediction band, whose convergence rate is limited by nonparametric density estimation even when the underlying model is an exponential family. The difference between the convergence speed of the parametric and nonparametric conformal prediction regions originates from the differences in the convergence rates of MLEs and nonparametric estimators and from the speed at which the bin widths shrink. The following Lemma governs how fast the bins can shrink while still ensuring that each element of the partition contains sufficiently many sample points.

Lemma 3 (Lemma 9 in Lei and Wasserman (2014)).

Under Assumption 1, there exist constants $c_1, c_2 > 0$ such that, with probability at least $1 - c_1 \exp(-c_2\, n\, w_n^d)$, every element of the partition contains a number of sample points proportional to $n\, w_n^d$, with $b_1$ and $\mathcal{X}$ defined in Assumption 1 governing the proportionality constant and $w_n$ denoting the common side length of the cubes in the partition.

The key term in Lemma 3 is the $n\, w_n^d$ term appearing in the exponent. The proof of Theorem 1 reveals that the rate of convergence of the parametric conformal prediction region is limited by Lemma 3. This is not the case for the nonparametric conformal prediction region, where the limiting factor is the rate of convergence of the kernel density estimator.

4 Simulation study

We consider three simulation settings to evaluate the performance of conformal prediction regions under correct model specification and model misspecification. These simulation settings are:

  • Setting A: Gamma regression. Data are generated from a Gamma regression model, and the parametric conformal and highest density prediction regions are correctly specified. A cubic regression model is assumed for the least squares and least squares locally weighted conformal prediction regions.

  • Setting B: Gamma regression. Data are generated from a Gamma regression model, and a cubic regression model with homoskedastic normal errors is assumed for the highest density prediction region and the misspecified parametric, least squares, and least squares locally weighted conformal prediction regions.

  • Setting C: Simple linear regression with normal errors with constant variance, considered across a range of sample sizes. In this setting the regression model is correctly specified for the highest density prediction region and the parametric, least squares, and least squares locally weighted conformal prediction regions.

Following the bin width asymptotics of Lei and Wasserman (2014), the number of bins used to form the parametric and nonparametric conformal prediction regions is two or three, depending on the sample size. All simulations correspond to univariate regressions. Figure 2 shows the four example conformal regions (LS, LSLW, nonparametric, and GLM parametric) for simulation settings A, B, and C.
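For concreteness, data in the spirit of settings A and B can be generated along the following lines; the uniform predictor distribution, the mean function, and the shape value shown are placeholders rather than the values used in the reported simulations.

```r
# Illustrative data generation for a Gamma regression simulation.
# The mean function and shape value are placeholders.
set.seed(13)
n <- 200; shape <- 2
x  <- runif(n)                              # placeholder predictor distribution
mu <- exp(1 + 3 * x)                        # placeholder mean function
y  <- rgamma(n, shape = shape, rate = shape / mu)
```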

Figure 2: Illustration of conformal prediction regions. The top, middle, and bottom rows correspond to simulation setting A with shape parameter equal to 1, simulation setting B with shape parameter equal to 10, and simulation setting C respectively. The first column shows the parametric conformal prediction region which is misspecified in row 2, the second column the least squares locally weighted conformal prediction region, the third column the nonparametric conformal prediction region, and the fourth column the least squares conformal prediction region.

Several diagnostic measures are used to compare conformal prediction regions. These diagnostic measures compare prediction regions by their prediction error, volume, and coverage properties. Our prediction error diagnostic metric is the average of the squared distances from observations outside of the prediction region to the closest boundary of the prediction region, averaged over all observations. An observation that falls within the prediction region has an error of 0. More formally, when the region at $X_i$ is an interval with lower boundary $l(X_i)$ and upper boundary $u(X_i)$, this prediction error metric is

$$\frac{1}{n} \sum_{i=1}^{n} \left[ \{l(X_i) - Y_i\}_+^2 + \{Y_i - u(X_i)\}_+^2 \right], \qquad \{a\}_+ = \max(a, 0),$$

where $l$ and $u$ are, respectively, the lower and upper boundaries of the possibly disjoint intervals forming the prediction region, and the distance to the nearest interval is used when the region is a union of disjoint intervals.

The volume of each prediction region will be estimated by the average of the upper boundary minus the lower boundary across the observed $X_i$, written as

$$\frac{1}{n} \sum_{i=1}^{n} \{u(X_i) - l(X_i)\}.$$

To assess finite sample marginal validity we calculate the proportion of responses that fall within the prediction region. To assess finite sample local validity with respect to binning we first bin the predictor data and then, for each bin, we calculate the proportion of responses that fall within the prediction region. The same procedure is used to assess finite sample conditional validity, but we use a much finer binning regime than was used to assess finite sample local validity.
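These diagnostics are straightforward to compute once the region boundaries at each observed predictor value are available; the sketch below assumes, for simplicity, that the region at each $X_i$ is a single interval with hypothetical endpoint vectors lo and up.

```r
# Diagnostics for a prediction region given, for each observation i, interval
# endpoints lo[i] and up[i] (hypothetical; assumes a single interval per x).
pred_error <- function(y, lo, up) {
  mean(pmax(lo - y, 0)^2 + pmax(y - up, 0)^2)   # squared distance to region, 0 if inside
}
volume   <- function(lo, up) mean(up - lo)
coverage <- function(y, lo, up) mean(y >= lo & y <= up)

# Local coverage: coverage computed within bins of the predictor.
local_coverage <- function(x, y, lo, up, nbins = 3) {
  bins <- cut(x, breaks = nbins)
  tapply(y >= lo & y <= up, bins, mean)
}
```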

Figure 3: Area, prediction error, and bin-wise coverage for the parametric, nonparametric, least squares, and least squares locally weighted conformal prediction regions and the highest density prediction region for Gamma GLM regression. Simulation setting A is shown at left, and setting B at right. The lines depicted in both plots are averages over 250 samples at each shape parameter value.

In our simulations, we find that the parametric conformal prediction region performs well even when the model is moderately misspecified. By construction, this region, along with the nonparametric conformal prediction region, maintains finite sample local validity with respect to binning, as seen in the bottom row of both plots in Figure 3 and in Table 1. However, the parametric and nonparametric conformal prediction regions have different shapes conditional on the predictor, as seen in Figure 2, and give different prediction errors, as seen in the top row of both plots in Figure 3 and in Table 1. The parametric conformal prediction region adapts naturally to the data when the model is correctly specified or when modest deviations from the specified model are present. Large deviations from the specified model are not handled well, as seen in the top row of the right hand side of Figure 3, although the parametric conformal prediction region does give nominal marginal and local coverage with respect to binning in finite samples, as seen in the bottom row of the right hand side of Figure 3. On the other hand, the nonparametric conformal prediction region does not adapt well to data obtained from a Gamma regression model or to data obtained from a linear regression model with a steep mean function.

Parametric Nonparametric LS LSLW HD
Diagnostic conformal conformal conformal conformal region
marginal coverage
local coverage
local coverage
area
prediction error
marginal coverage
local coverage
local coverage
local coverage
area
prediction error
marginal coverage
local coverage
local coverage
local coverage
area
prediction error
Table 1: Local and marginal coverage, prediction region area, and prediction error for conformal prediction regions for linear regression models with normal errors and constant variance (simulation setting C). The values in this table are the average of 50 samples.

The least squares conformal prediction region obtains marginal validity (Lei et al., 2018) but performs poorly when deviations about the estimated mean function are not symmetric, not constant, or both. When heterogeneity is present, the least squares conformal prediction region exhibits under-coverage in regions where variability about the mean function is large and over-coverage in regions where variability about the mean function is small. Clear evidence of these features is seen in Figures 2 and 3. This conformal prediction region is very sensitive to model misspecification. The least squares locally weighted conformal prediction region also obtains marginal validity (Lei et al., 2018, Section 5.2); it is far less sensitive to model misspecification than the least squares conformal prediction region and performs well under modest model misspecification. However, the least squares locally weighted conformal prediction region is not appropriate when deviations about an estimated mean function are clearly not symmetric, as evidenced in the top row of Figure 2. Results from additional simulations corresponding to settings A and B are provided in the Supplementary Materials. The findings from these additional simulations are consistent with the conclusions of the simulations presented in this Section.

5 Predicting the risk of diabetes

Figure 4: Conformal prediction regions for glycosylated hemoglobin projected onto the age and weight predictor axes. Upper and lower bounds of the conformal prediction region are loess smoothed for visual appearance.

Diabetes is a group of metabolic diseases associated with long-term damage, dysfunction, and failure of different organs, especially the eyes, kidneys, nerves, heart, and blood vessels (American Diabetes Association, 2010). In 2017 approximately 5 million adult deaths worldwide were attributable to diabetes, and global healthcare expenditures on people with diabetes were estimated at USD 850 billion (Cho et al., 2018). Diabetes remains undiagnosed for an estimated 30% of the people who have the disease (Heikes et al., 2008). One way to address the problem of undiagnosed diabetes is to develop simple, inexpensive diagnostic tools that can identify people who are at high risk of pre-diabetes or diabetes using only readily-available clinical or demographic information (Heikes et al., 2008).

We examine the influence of several variables on blood sugar, or glycosylated hemoglobin percentage (also known as HbA1c), an important risk factor for diabetes. A glycosylated hemoglobin value of 6.5% can be used as a cutoff for a positive diagnosis of diabetes (World Health Organization, 2011). We predict an individual's glycosylated hemoglobin from their height, weight, age, and gender, all of which are easy to measure, inexpensive, and do not require any laboratory testing. The data in this analysis come from a population-based sample of 403 rural African-Americans in Virginia (Willems et al., 1997), taken from the faraway R package (Faraway, 2016). We considered a gamma regression model that only includes linear terms for each covariate, a linear regression model with homoskedastic normal errors and the same linear terms for each covariate, and a linear regression model with homoskedastic normal errors that also includes quadratic terms for each covariate. Of the models considered, the gamma regression model fit the data best: it had the lowest AIC value, and it gave the best predictive performance in the sense of having the lowest sum of squared prediction errors. That being said, we do not know the data generating process.
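The model comparison can be reproduced along the following lines; the variable names correspond to the diabetes data in the faraway package, while the log link for the Gamma fit is an illustrative choice.

```r
# Candidate models for glycosylated hemoglobin (glyhb) in the faraway data.
# The Gamma link function shown here is illustrative.
library(faraway)
dat <- na.omit(diabetes[, c("glyhb", "height", "weight", "age", "gender")])

m_gamma <- glm(glyhb ~ height + weight + age + gender,
               family = Gamma(link = "log"), data = dat)
m_lin   <- lm(glyhb ~ height + weight + age + gender, data = dat)
m_quad  <- lm(glyhb ~ poly(height, 2) + poly(weight, 2) + poly(age, 2) + gender,
              data = dat)
AIC(m_gamma, m_lin, m_quad)   # the gamma model was preferred on AIC in the analysis above
```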

Based on these covariates, conformal prediction regions provide finite sample valid prediction regions for glycosylated hemoglobin that may be useful for diagnosing diabetes in this study population. Four conformal prediction regions are considered for predicting glycosylated hemoglobin percentage: the parametric conformal prediction region with a Gamma model fit, the parametric conformal prediction region with a Gaussian model fit, the least squares conformal prediction region, and the least squares conformal prediction region with local weighting. All conformal prediction regions correspond to models that only include linear terms for each of the covariates. The Gamma and Gaussian parametric conformal prediction regions were computed with binning across the binary gender factor variable; that is, the predictor space is partitioned by gender. However, no additional binning structure within the levels of gender was employed.

Gamma Gaussian LSLW LS
conformal conformal conformal conformal
marginal coverage 0.909 0.906 0.906 0.909
volume 7.349 7.730 8.983 7.601
pred error 0.931 0.849 0.744 0.883
avg. cond. coverage 0.874 0.849 0.889 0.860
Table 2: Diagnostics for prediction regions. Marginal coverage, prediction region volume, prediction error, and average conditional coverage are reported for each conformal prediction region.

Diagnostics for the four conformal prediction regions are depicted in Table 2. The error tolerance for all prediction regions was set at 0.1. We see that all conformal prediction regions maintain their advertised finite sample marginal validity for predictions of glycosylated hemoglobin. These prediction regions provide a balance between size, prediction error, and average conditional coverage (the average of the coverage probabilities taken over small subregions of the predictor space). Plots of the three dimensional conformal prediction regions are displayed in Figure 4. These plots are projections of each conformal prediction region onto the age (in years), weight (in pounds), and glycosylated hemoglobin percentage three dimensional space.

6 Discussion

The finite sample validity properties of conformal prediction regions have been verified in broader methodological contexts, including support vector machines, ridge regression, nearest neighbor regression, neural networks, and decision trees (Vovk et al., 2005; Gammerman and Vovk, 2007; Papadopoulos et al., 2011; Vovk, 2012). While any conformity measure will achieve finite sample validity, a careful choice may make the returned conformal prediction regions more useful in applications (Papadopoulos et al., 2011). The developments of conformal prediction in the machine learning literature show how empirically successful prediction methods in machine learning can be hedged to give valid predictions in finite samples (Gammerman and Vovk, 2007). The developments of conformal prediction in the statistics literature show that specifying the conformity measure to incorporate knowledge about the data generating process can lead to conformal prediction regions that are also asymptotically of minimal length (Lei and Wasserman, 2014; Dunn and Wasserman, 2018). Our parametric conformal prediction region for GLM regression falls within this line of research, which shows that when uncertainty about point predictions is considered, regression modeling provides smaller prediction regions when regions are required to give valid finite sample coverage. This of course depends on the ability to specify a useful model, which is not a trivial task in full generality.

Because GLMs are widely used by empiricists conducting regression analyses, parametric conformal prediction for GLM regression may offer an appealing compromise for the applied researcher: when the GLM is correctly specified, conformal prediction regions are asymptotically minimal, finite sample local and marginal validity holds, and the rate of convergence is fast; when the GLM is incorrectly specified, asymptotic minimality is not guaranteed, but local finite sample validity still holds by construction. Researchers who currently use GLMs to compute prediction intervals for the mean regression function may be able to easily integrate conformal prediction into their data analysis workflow, since specification and fitting of the GLM are unchanged. A software package that accompanies this paper implements the parametric and nonparametric conformal prediction regions (Eck, 2018).

The robustness properties of conformal prediction come at a substantial computational cost, as noted in Vovk (2012). Two line searches are performed to determine the boundaries of the possibly disjoint conformal prediction region at every point at which a prediction region is desired. The conformity scores have to be recomputed with respect to the augmented data at every iteration of these line searches; for the parametric conformal prediction region, this involves refitting the GLM to the augmented data at every iteration of the line search. Furthermore, when sample sizes are very large, conformal prediction may not offer much additional benefit beyond parametric conditional high density prediction regions for GLMs. However, when sample sizes are moderate, conformal prediction may substantially outperform traditional methods in terms of finite sample marginal, local, and conditional coverage.

The asymptotic optimality properties of parametric conformal prediction regions follow from a novel concentration inequality for maximum likelihood estimation in exponential families (Theorem 2 in the Appendix). We expect that this result can be extended to estimation in a broader class of models, with convergence rates determined by the convergence rate of the estimator, provided that the score function corresponding to the data generating process obeys sub-exponential tail behavior. Dunn and Wasserman (2018) show that finite sample validity holds in the presence of random effects. We expect that the asymptotic optimality properties of parametric conformal prediction regions can be extended to their settings and to the class of generalized linear mixed models.

Supplementary Materials: Additional simulation results are available in the accompanying Supplementary Materials document. An accompanying R package is available at https://github.com/DEck13/conformal.glm (Eck, 2018). A technical report that includes the data and all of the code necessary to reproduce the findings, tables, and figures in this manuscript is available at https://github.com/DEck13/conformal.glm/tree/master/techreport.

Acknowledgments: We are grateful to Peter M. Aronow, Karl Oskar Ekvall, Jing Lei, Aaron J. Molstad, Molly Offer-Westort, Cyrus Samii, Fredrik Sävje, and Larry Wasserman for helpful comments. This work was supported by NIH grant NICHD 1DP2HD091799-01.

Appendix

The proof of Theorem 1 requires a new concentration inequality for MLEs of exponential family parameters. Our result is a generalization of Theorem 3.3 in Miao (2010), which holds only for sub-Gaussian random variables, to sub-exponential random variables, provided that the underlying density is an exponential family. Our extension is possible because the score function is conveniently a mean zero sub-exponential random variable, namely the original random variable with its mean subtracted. This circumvents the need for optimal transport theory (Bobkov and Götze, 1999; Ledoux, 1999, 2001; Djellout et al., 2004; Villani, 2008), which was used to prove Theorem 3.3 in Miao (2010).

Concentration results for MLEs of exponential families

Let $\|\cdot\|$ denote the Euclidean norm of a vector or the absolute value of a scalar. The following definition and two lemmas are taken from Wainwright (2019, Chapter 2).

Definition 4.

A mean zero random variable $X$ is said to be sub-exponential with parameters $(\nu, b)$ if $E[e^{\lambda X}] \leq e^{\nu^2 \lambda^2 / 2}$ for all $|\lambda| < 1/b$.

Lemma 4.

For a mean zero random variable $X$, the following are equivalent: (a) $X$ is sub-exponential with parameters $(\nu, b)$ for some $\nu, b > 0$; (b) there is a positive number $c_0$ such that $E[e^{\lambda X}] < \infty$ for all $|\lambda| \leq c_0$; (c) there are constants $c_1, c_2 > 0$ such that $P(|X| \geq t) \leq c_1 e^{-c_2 t}$ for all $t > 0$.

Lemma 5.

Let $X_1, \ldots, X_n$ be independent mean zero sub-exponential random variables with parameters $(\nu_i, b_i)$, $i = 1, \ldots, n$. Then $\sum_{i=1}^{n} X_i$ is sub-exponential with parameters $(\nu_*, b_*)$, where

$$\nu_* = \left( \sum_{i=1}^{n} \nu_i^2 \right)^{1/2} \quad \text{and} \quad b_* = \max_{1 \leq i \leq n} b_i. \qquad (13)$$
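For reference, these parameters yield the standard two-sided sub-exponential tail bound (Wainwright, 2019, Chapter 2):

$$P\left( \Big| \sum_{i=1}^{n} X_i \Big| \geq t \right) \leq 2 \exp\left\{ - \min\left( \frac{t^2}{2\nu_*^2},\; \frac{t}{2 b_*} \right) \right\}, \qquad t \geq 0.$$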

With these tools, we prove our concentration inequality for exponential families.

Theorem 2.

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an independent and identically distributed sample of random variables such that the outcome has conditional density (8) with parameter space (9). Let $\beta$ denote the true regression coefficient vector and let $\hat{\beta}$ be the maximum likelihood estimator of $\beta$. Then for any sufficiently small $\varepsilon > 0$, there exists a numerical constant $c > 0$ such that, for all $n$ large enough,

$$P\left( \|\hat{\beta} - \beta\| \geq \varepsilon \right) \leq e^{-c\, n\, \varepsilon^2}. \qquad (14)$$
Proof.

The log likelihood of the conditional density (8) for our random sample is

Let . The gradient of is then

where

From the mean-value theorem, there exists an intermediate value lying between the maximum likelihood estimator and the true parameter vector which satisfies

Rearranging the above yields

where the inverse exists almost surely for $n$ sufficiently large. We see that

where the matrix norm appearing above is the norm induced by the vector norm $\|\cdot\|$ and

Choose some $\varepsilon > 0$. Then, for $n$ sufficiently large,

where the second inequality follows from the strong law of large numbers. We can now choose a constant such that, for $n$ sufficiently large, we have

where the first inequality follows from the strong law of large numbers applied to the corresponding component of the score, and the second inequality follows from subadditivity of probability together with the fact that a sum of terms can be at least as large as a given number only if at least one term is at least as large as that number divided by the number of terms in the sum. From properties of exponential families and conditional expectation we have that