1 Introduction
Vovk et al. (2005)
introduced conformal prediction to construct finite sample valid prediction regions for predictions from nearest-neighbor methods, support-vector machines, sequential classification, and both linear and ridge regression. The goal in conformal prediction is to construct a prediction region
from an iid random sample , , , such that the probability of a future observation
belonging to the region exceeds a desired coverage level (Shafer and Vovk, 2008). That is, given , we seek a set such that, for ,
(1)
Lei et al. (2013) extended the framework of Shafer and Vovk (2008) to provide a framework for which the prediction region is also asymptotically of minimal length. These prediction regions make use of a conformity measure which measures the agreement of a point with a probability measure . For example, when
, the probability density function
corresponding to is a conformity measure . Lei et al. (2013)proposed nonparametric kernel density estimation of
to construct conformal prediction regions that are asymptotically minimal under smoothness conditions on . Beyond realvalued outcomes, conformal methods have been proposed for functional data (Lei et al., 2015), nonparametric regression (Lei and Wasserman, 2014), time series data (Chernozhukov et al., 2018b), regression problems (Lei et al., 2018), random effects (Dunn and Wasserman, 2018), causal inference (Chernozhukov et al., 2018a), and machine learning
(Vovk et al., 2005; Gammerman and Vovk, 2007; Papadopoulos et al., 2011; Vovk, 2012; Burnaev and Vovk, 2014; Balasubramanian et al., 2014; Johansson et al., 2018; Wang et al., 2018). The finite sample validity property of conformal prediction regions has broad appeal in many scientific domains, including astrophysics (Ciollaro et al., 2018), medical applications (Lambrou et al., 2009; Devetyarov et al., 2012; Eklund et al., 2015; Bosc et al., 2019), genetics (Norinder et al., 2018b, a), and chemistry (Cortés-Ciriano and Bender, 2018; Ji et al., 2018; Svensson et al., 2018a, b; Toccaceli et al., 2020).

Generalized linear models (GLMs) are widely used regression models for outcomes that follow an exponential family distribution (McCullagh and Nelder, 1989). GLMs are popular in empirical research in the biomedical and social sciences; procedures for fitting GLMs are incorporated into every major statistical software package. Often predictive inference from a fitted GLM serves the primary scientific goals of the analysis. Point predictions from these models are usually combined with variance estimates from the bootstrap or delta method to construct prediction intervals. For example, interval predictions on the outcome scale may be constructed using the
predict.glm function in R (R Core Team, 2019) to construct Wald-type intervals for either the mean response or its distribution.

In this paper, we are interested in prediction intervals for the outcome random variable itself, not just the mean. Minimal length prediction regions can be constructed directly from the assumed GLM, but nominal coverage for these regions is not guaranteed in finite samples, and asymptotic claims require the GLM to be correctly specified. In this article we show that conformal prediction regions achieve nominal coverage in finite samples for GLMs, even when the model is misspecified. Our parametric conformal prediction region, estimated under the correct GLM with the link function specified, is conditionally valid in local regions of the predictor space
(Lei and Wasserman, 2014, pg. 72), and is asymptotically conditionally valid for point predictions. Furthermore, we show that when the predictor space is bounded, this parametric conformal region is asymptotically of minimal length. The rate of convergence is when the dimension of the predictor space is and is when , where the predictor region of interest shrinks at a suitable rate. Similar results are suggested by Dunn and Wasserman (2018) in a general context involving random effects, but specific convergence rates are not provided. The rate is a consequence of how quickly the predictor region of interest shrinks. This convergence rate is faster than that presented by Lei and Wasserman (2014), which gave rates of for nonparametric regression predictions.

In an extensive simulation study, we analyze the marginal, local, and conditional prediction interval coverage and the large-sample efficiency of four methods for constructing conformal prediction intervals from fitted GLMs: fully nonparametric kernel-based prediction (Lei and Wasserman, 2014), prediction from residuals and locally weighted residuals (Lei et al., 2018), and prediction using the parametric density as conformity measure. We find that when sample sizes are moderate – large enough to estimate regression coefficients reasonably precisely, but before asymptotic arguments guarantee good conditional coverage – conformal prediction methods outperform traditional methods in terms of finite sample marginal, local, and conditional coverage. Importantly, we find that parametric conformal prediction regions perform extremely well under mild model misspecification: they are calibrated to give valid finite sample coverage, and they are not too large for meaningful inference in this setting. We demonstrate the utility of conformal prediction methods for GLMs in an application to diabetes diagnosis via interval prediction of glycosylated hemoglobin levels of subjects participating in a community-based study (Willems et al., 1997).
2 Background
2.1 Conformal prediction for outcomes
The basic intuition underlying the conformal prediction method is that, given an independent sample from distribution defined on , the iid hypothesis that , , , using observation , , , is tested for each , and then is created by inverting this test (Lei et al., 2013; Vovk et al., 2005; Shafer and Vovk, 2008). More formally, suppose that , , , . Let be the corresponding empirical distribution and note that is symmetric in its arguments. Define
where , is a conformity measure which measures the “agreement” of a point with respect to a distribution (Lei et al., 2013). Informally, a conformity measure should be large when there is agreement between and . One important example of a conformity measure is the density function of (when it exists); this idea is used in Section 3.2. Exchangeability of follows from exchangeability of the random variables . Define as the random variable evaluated at . The conformal prediction region for is
(2) 
where Then by construction,
(3) 
which implies that . Any conformity measure can be used to construct prediction regions with finite sample validity.
Lemma 1.
(Lei et al., 2013). Suppose , , , is an independent random sample from . Then , for all probability measures , and hence is valid.
Figure 1 shows schematically how this conformal prediction region is constructed.
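To make the test-inversion recipe above concrete, it can be sketched in a few lines of Python (the software accompanying this paper is in R; this sketch is only illustrative, and the kernel bandwidth, candidate grid, and simulated sample are all arbitrary choices, not from the paper):

```python
import numpy as np

def kde_conformity(aug, h=0.5):
    """Kernel density estimate of the augmented sample, evaluated at every
    point of that sample; the density of the empirical distribution is one
    valid choice of conformity measure (Section 2.1)."""
    diffs = (aug[:, None] - aug[None, :]) / h
    return np.exp(-0.5 * diffs**2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))

def conformal_pvalue(sample, y, conformity):
    """Augment the sample with the candidate y, score every point against the
    augmented empirical distribution, and return the proportion of points
    (y included) that conform no better than y does."""
    aug = np.append(sample, y)
    scores = conformity(aug)
    return np.mean(scores <= scores[-1])

def conformal_region(sample, grid, conformity, alpha=0.1):
    """Grid approximation of the conformal region (2): keep every candidate
    whose conformal p-value exceeds alpha."""
    return [y for y in grid if conformal_pvalue(sample, y, conformity) > alpha]

rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=50)
region = conformal_region(z, np.linspace(-5.0, 5.0, 201), kde_conformity)
```

By Lemma 1 the resulting region has marginal coverage at least the nominal level regardless of the conformity measure; the kernel density choice here only affects the region's size.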
2.2 Notions of finite sample validity in regression
Several notions of finite sample validity exist in the context of regression. Suppose that we have an iid sample , , , where the predictor is and is the outcome. We will suppose that of the ’s are main effects, where , and the other terms in are functions of the main effects. We will suppose that has support .
Definition 1 (marginal validity).
Let be a desired error tolerance. Let , , , be an iid sample from a continuous distribution . The prediction region has finite sample marginal validity if
where .
Finite sample marginal validity is guaranteed for conformal prediction methods by construction. Finite sample marginal validity alone may not be desirable when variability of the outcome is not constant across the support of the predictor. A second notion of finite sample validity arises by considering coverage of the prediction region conditional on a particular value of the predictor.
Definition 2 (conditional validity).
Let be a desired error tolerance. Let , , , be an iid sample from a continuous distribution . The prediction region has finite sample conditional validity at when
where .
However, conditional validity at every is unattainable if we insist that is asymptotically of minimal length. Let be the conditional density corresponding to the regression of on . Define to be the value such that and define to be the optimal conditional prediction region (i.e., the minimal length prediction region). Let be Lebesgue measure and let denote the symmetric set difference operator. Then there does not exist a finite sample conditionally valid prediction region such that holds, a result due to Lemma 1 of Lei and Wasserman (2014). A third notion of finite sample validity relaxes the stringent requirement of conditional validity at every possible .
Definition 3 (local validity).
Let be a desired error tolerance. Let , , , be an iid sample from a continuous distribution . Let be a partition of . We say that the prediction region has finite sample local validity when
for all where .
Local validity offers a bridge between marginal validity – which is inappropriate in the presence of heterogeneity – and conditional validity – which is unattainable when we require that the prediction interval is asymptotically minimal (Lei and Wasserman, 2014). The nonparametric conformal prediction region in Section 2.3.1 satisfies finite sample local validity and is asymptotically conditionally valid (Lei and Wasserman, 2014). In Section 3.2 we show that the parametric conformal prediction region also satisfies finite sample local validity and is asymptotically conditionally valid at a rate faster than its nonparametric counterpart.
2.3 Existing conformal prediction techniques in regression
For convenience in what follows, suppose the support of the predictors is . Let and let , , , be an iid sample where is a continuous distribution, , with main effects. We will denote the true conditional density of the regression model as where is the regression coefficient vector and is a vector of nuisance parameters. For each , we define to be the point such that . The optimal or minimal length prediction region at is .
2.3.1 The nonparametric kernel estimator
Lei and Wasserman (2014) proposed a nonparametric conformal prediction region for continuous outcomes that satisfies finite sample local validity. Partitioning of the predictor space is performed so that the nonparametric conformal prediction region within each partition achieves finite sample marginal validity. Let
be a partition of the support of the predictor variable
. The number of elements in this partition increases at rate , where is a rate that depends on the smoothness of the underlying density (Lei and Wasserman, 2014). A specific rate is discussed at the end of this section. When we restrict the partition to be formed of equilateral cubes, we let the widths of these cubes decrease at rate . Let . Let be a nonnegative kernel function. The estimated local marginal density of is
where is the bandwidth. The density estimate given a new pair , is
The local conformity rank is then
and the conformal prediction band is
(4) 
for all . The intuition for constructing the conformal prediction region in this manner is that, asymptotically, the width shrinks so that the conditional distributions for all become very similar to that of for any particular . Lei and Wasserman (2014) provided the rate of convergence at which is asymptotically of minimal length. In the Supplementary Materials we show that when the underlying model is a GLM, is asymptotically of minimal length at rate .
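A minimal sketch of this binning construction, assuming for illustration a univariate predictor, equal-width bins, and a Gaussian kernel on the outcome within each bin (the bin count, bandwidth, and data are hypothetical choices, not the paper's tuning):

```python
import numpy as np

def local_conformal_bands(x, y, n_bins, y_grid, alpha=0.1, h=0.3):
    """Per-bin full conformal bands: within each bin of the predictor, the
    conformity measure is a kernel density estimate of the outcome on the
    augmented sample (normalizing constants dropped, since only ranks matter)."""
    edges = np.linspace(x.min(), x.max() + 1e-9, n_bins + 1)
    bands = []
    for k in range(n_bins):
        yk = y[(x >= edges[k]) & (x < edges[k + 1])]
        band = []
        for cand in y_grid:
            aug = np.append(yk, cand)
            d = (aug[:, None] - aug[None, :]) / h
            dens = np.exp(-0.5 * d**2).mean(axis=1)  # unnormalized KDE
            if np.mean(dens <= dens[-1]) > alpha:    # local conformity rank
                band.append(cand)
        bands.append(band)
    return bands

# Heteroskedastic example: the outcome spread grows with x, so the band in
# the high-x bin should come out wider than the band in the low-x bin.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 400)
y = rng.normal(0.0, 0.2 + x)
bands = local_conformal_bands(x, y, 2, np.linspace(-4.0, 4.0, 161))
```

Because each bin runs its own conformal test, the construction inherits finite sample validity bin by bin, which is exactly the local validity of Definition 3.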
2.3.2 Normalized residuals
Lei et al. (2018) proposed a prediction region obtained from conformal prediction for residuals, which is appropriate when errors are symmetric about the mean function . Lei et al. (2018, Section 5.2) also proposed an extension of their conformal prediction procedure that is appropriate when the errors about exhibit heterogeneity but remain symmetric. This extension involves a dispersion function that captures the changing variability across and is used to weight the residuals so that these weighted residuals have the same magnitude, on average, across . The conformal procedure of Lei et al. (2018, Section 5.2), denoted the “least squares locally weighted” method, proceeds as follows. When the mean regression estimator of is a symmetric function of the data points, augment the original data with a new point , and estimate the mean function and the dispersion function with respect to the augmented data. Define the normalized absolute residuals as
for . The normalized absolute residual is an example of an anti-conformity measure, in which a low value of indicates agreement between and the estimated regression function. As in Lei et al. (2018), we specify the dispersion function as the conditional mean absolute deviation of as a function of . Let
be the proportion of points with normalized residuals smaller than the proposed normalized residual . Define the least squares locally weighted (LSLW) conformal prediction regions as
The least squares (LS) conformal prediction region is constructed as in with no weighting with respect to residuals, . These conformal prediction regions achieve finite sample marginal validity as a consequence of exchangeability of the data and symmetry of in its arguments (Lei et al., 2018).
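The LSLW procedure can likewise be sketched; here both the mean and the dispersion function are fit by least squares on the augmented data (the linear dispersion model, the candidate grid, and the simulated data are illustrative assumptions, not the paper's choices):

```python
import numpy as np

def lslw_interval(x, y, x0, y_grid, alpha=0.1):
    """Least squares locally weighted conformal interval at x0: for each
    candidate y0, refit the mean and dispersion on the augmented data and
    keep y0 unless its normalized absolute residual ranks in the top alpha
    fraction."""
    keep = []
    for y0 in y_grid:
        X = np.column_stack([np.ones(len(x) + 1), np.append(x, x0)])
        ya = np.append(y, y0)
        beta, *_ = np.linalg.lstsq(X, ya, rcond=None)       # mean fit
        absres = np.abs(ya - X @ beta)
        gamma, *_ = np.linalg.lstsq(X, absres, rcond=None)  # dispersion fit
        r = absres / np.maximum(X @ gamma, 1e-6)            # normalized residuals
        if np.mean(r >= r[-1]) > alpha:
            keep.append(y0)
    return min(keep), max(keep)

# Heteroskedastic linear data: the LSLW interval should widen with x.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 300)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1 + x)
lo1, hi1 = lslw_interval(x, y, 0.1, np.linspace(-3.0, 7.0, 101))
lo2, hi2 = lslw_interval(x, y, 0.9, np.linspace(-3.0, 7.0, 101))
```

Dropping the dispersion fit (setting the weights to one) recovers the unweighted least squares (LS) region, which is why the LS region over- and under-covers under the heterogeneity this example builds in.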
3 Conformal prediction for generalized linear regression
We now introduce parametric conformal prediction regions for exponential family distributions and a class of GLMs with continuous outcomes. We show that the parametric conformal prediction region possesses the same finite sample local validity properties as its nonparametric counterpart, but its convergence to the true minimum length prediction region is faster. In the Supplementary Materials, we verify that the nonparametric conformal region of Lei and Wasserman (2014) is appropriate for the class of GLMs that we consider.
3.1 Exponential family distributions and generalized linear models
Let be a positive Borel measure on a finitedimensional vector space that is absolutely continuous with respect to Lebesgue measure . The log Laplace transform of is the function defined by
(5) 
where is the dual space of , is the inner product which places and in duality, and is the extended real number system, which adds the values and to the real numbers (Rockafellar and Wets, 1998, Section 1.E). We will let . The effective domain of on is . For every , the function defined by
(6) 
is a probability density with respect to . We refer to and as the canonical statistic and canonical parameter respectively. The set where is any nonempty subset of , is called an exponential family of densities with respect to . This family is full if . We refer to as the generating measure of the exponential family. When is taken to be the generating measure, the density (6) becomes,
(7) 
where is a parameter-free continuous function of the canonical statistic. We will only consider scalar outcome variables. However, can occur in this case, the normal distribution being an example with canonical statistic . Thus we will assume that the canonical statistic vector lies on a one-dimensional manifold of .

When the density (6) corresponds to a generalized linear regression model, we will reparameterize , where is continuous in both arguments, the vectors and are predictors and regression coefficients respectively, and is a vector of nuisance parameters. We specify that the base set (main effects) of predictors has support , . Here we assume, without loss of generality, that where is a link function and is the first component of the canonical statistic vector. As an example of this reparameterization, consider the simple linear regression model with homoskedastic normal errors with variance given by . In this example , , and . The link function is taken to be the identity function and .

The reparameterized density corresponding to the generalized linear regression model is then
(8) 
with respect to generating measure . We will further assume that is bounded and that the exponential family is full with parameter space given by,
(9) 
All of the continuous exponential families implemented in the glm function in R have densities that can be parameterized as (8). Again consider the multiple linear regression model with homoskedastic normal errors with variance given by . Here we have and . When is taken to be the generating measure, the density (8) becomes
(10) 
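For the normal-error example, the reparameterized exponential family form can be written out explicitly. Completing the square in the normal density (a standard computation, consistent with the forms (8) and (10)) gives

```latex
f(y \mid x, \beta, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\left\{ \frac{x^\top\beta}{\sigma^2}\, y
      \;-\; \frac{1}{2\sigma^2}\, y^2
      \;-\; \frac{(x^\top\beta)^2}{2\sigma^2} \right\},
```

so the canonical statistic vector is $(y, y^2)$, which lies on a one-dimensional manifold as assumed, and the canonical parameter is $\big(x^\top\beta/\sigma^2,\; -1/(2\sigma^2)\big)$ with nuisance parameter $\sigma^2$.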
3.2 Parametric conformal prediction regions
We now assume that the link function of the generalized linear regression model is known, the model is correctly specified, and that is a consistent estimator of the regression coefficients . Let be a partition of the support of the predictor variable in the form of equilateral cubes with sides of length when and when . Let be the true conditional density function and let be the estimated density fit using the original data and define as the estimated density fit to the augmented data . We will use this estimated density as our conformity measure.
The local conformal prediction region is then
(11) 
where for . We show that the region in (11) has finite sample local validity and is asymptotically minimal for all generalized linear models that can be parameterized as (8). A schematic of how this conformal prediction region is constructed is displayed in the right-hand side of Figure 1.
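A sketch of the parametric conformal region for the simplest case, the Gaussian linear model with a single bin (the paper partitions the predictor space; binning is omitted here for brevity, and the grid and simulated data are illustrative assumptions):

```python
import numpy as np

def parametric_conformal(x, y, x0, y_grid, alpha=0.1):
    """Parametric conformal region in the spirit of (11), Gaussian case:
    the conformity measure is the estimated conditional density, refit on
    the augmented data (x0, y0) for every candidate y0."""
    keep = []
    for y0 in y_grid:
        X = np.column_stack([np.ones(len(x) + 1), np.append(x, x0)])
        ya = np.append(y, y0)
        beta, *_ = np.linalg.lstsq(X, ya, rcond=None)
        resid = ya - X @ beta
        sigma2 = resid @ resid / len(ya)          # MLE of the nuisance parameter
        dens = np.exp(-resid**2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
        if np.mean(dens <= dens[-1]) > alpha:     # conformity rank of y0
            keep.append(y0)
    return min(keep), max(keep)

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, 200)
lo, hi = parametric_conformal(x, y, 0.5, np.linspace(-2.0, 6.0, 161))
```

Because the estimated density is symmetric here, the region is an interval; for skewed families such as the Gamma, the same recipe yields asymmetric, possibly disjoint, regions.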
Lemma 2.
Let be a partition of and let be as defined in (11). Then is finite sample locally valid for .
Proof.
The proof follows from the proof of Proposition 2 in Lei and Wasserman (2014). Fix and let , , , . Let , be another independent sample. Let and for all . Then, conditioning on and , , , the sequence , , , is exchangeable. ∎
We now establish that the parametric conformal prediction region is asymptotically of minimum length over .
Assumption 1 (Assumption 1 (a) of Lei and Wasserman (2014)). Let be the support of and assume that . The marginal density of satisfies for all .
Theorem 1.
Let , , , , , , , be an independent and identically distributed sample of random variables with conditional density (10) and parameter space (9). Assume that and that the canonical statistic vector lies on a one-dimensional manifold. Suppose that Assumption 1 holds. Let and be the prediction band given by (11). Then, for a given , there exists a numerical constant such that
(12) 
where .
The proof of Theorem 1 is given in the Appendix. The rates and are appreciably faster than that of the nonparametric conformal prediction band, which has a convergence rate of when the underlying model is an exponential family. The difference in convergence speed between the parametric and nonparametric conformal prediction regions originates from the different convergence rates of maximum likelihood estimators and nonparametric techniques, and from the speed at which the bin widths shrink. The following lemma governs how fast the bins can shrink while still maintaining that .
Lemma 3 (Lemma 9 in Lei and Wasserman (2014)).
Under Assumption 1, there exist constants and such that
with and defined in Assumption 1.
The key term in Lemma 3 is the term appearing in the exponent. The proof of Theorem 1 reveals that the rate of convergence of the parametric conformal prediction region is limited by Lemma 3. This is not the case for the nonparametric conformal prediction region where the limiting factor is the rate of convergence of the kernel density estimator.
4 Simulation study
We consider three simulation settings to evaluate the performance of conformal prediction regions under correct model specification and model misspecification. These simulation settings are:

Gamma regression with and . Data are generated from a Gamma regression model, and the parametric conformal and highest density prediction regions are correctly specified. A cubic regression model is assumed for the least squares and least squares locally weighted conformal prediction regions.

Gamma regression with and . Data are generated from a Gamma regression model, and a cubic regression model with homoskedastic normal errors is assumed for the highest density prediction region and the misspecified parametric, least squares, and least squares locally weighted conformal prediction regions.

Simple linear regression with , and normal errors with constant variance . Results are considered for sample sizes . In this setting the regression model is correctly specified for the highest density prediction region and the parametric, least squares, and least squares locally weighted conformal prediction regions.
Following the bin width asymptotics of Lei and Wasserman (2014), the number of bins used to form the parametric and nonparametric conformal prediction regions is 2 when and 3 when . All simulations correspond to univariate regressions, and the predictor variables were generated as . Figure 2 shows the four example conformal regions (LS, LSLW, nonparametric, and GLM parametric) for simulation settings A, B, and C.
Several diagnostic measures are used to compare conformal prediction regions by their prediction error, volume, and coverage properties. Our prediction error metric is the squared distance from each observation outside the prediction region to the closest boundary of the region, averaged over all observations; an observation that falls within the prediction region contributes an error of 0. More formally, this prediction error metric is
where and are, respectively, the lower and upper boundaries of possible disjoint intervals forming the prediction region.
The volume of each prediction region will be estimated by the average of the upper boundary minus the lower boundary across observed , written as
To assess finite sample marginal validity we calculate the proportion of responses that fall within the prediction region. To assess finite sample local validity with respect to binning, we first bin the predictor data and then, for each bin, calculate the proportion of responses that fall within the prediction region. The same procedure is used to assess finite sample conditional validity, but with a much finer binning regime than that used to assess local validity.
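For interval-shaped regions, the diagnostics just described can be computed directly; a sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def diagnostics(y, lower, upper):
    """Empirical coverage, average volume, and the prediction error metric
    of Section 4: the mean squared distance from each non-covered
    observation to the nearest boundary of its prediction interval."""
    inside = (y >= lower) & (y <= upper)
    dist = np.where(inside, 0.0,
                    np.minimum(np.abs(y - lower), np.abs(y - upper)))
    return inside.mean(), (upper - lower).mean(), np.mean(dist**2)

# Tiny worked example: only the first of three outcomes is covered.
cov, vol, err = diagnostics(np.array([0.0, 2.0, 5.0]),
                            np.full(3, -1.0), np.full(3, 1.0))
```

In the worked example the coverage is 1/3, the volume is 2, and the prediction error is (0 + 1² + 4²)/3 = 17/3.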
In our simulations, we find that the parametric conformal prediction region performs well even when the model is moderately misspecified. By construction, this region, along with the nonparametric conformal prediction region, maintains finite sample local validity with respect to binning, as seen in the bottom row of both plots in Figure 3 and in Table 1. However, the parametric and nonparametric conformal prediction regions have different shapes conditional on , as seen in Figure 2, and give different prediction errors, as seen in the top row of both plots in Figure 3 and in Table 1. The parametric conformal prediction region adapts naturally to the data when the model is correctly specified or when modest deviations from the specified model are present. Large deviations from the specified model are not handled well, as seen in the top row of the right-hand side of Figure 3, although the parametric conformal prediction region does give nominal marginal and local coverage with respect to binning in finite samples, as seen in the bottom row of the right-hand side of Figure 3. On the other hand, the nonparametric conformal prediction region does not adapt well to data obtained from a Gamma regression model or to data obtained from a linear regression model with a steep mean function.
[Table 1: marginal coverage, local coverage, area, and prediction error for the parametric conformal, nonparametric conformal, LS conformal, LSLW conformal, and HD prediction regions across simulation settings A, B, and C; numerical entries not recovered.]
The least squares conformal prediction region attains marginal validity (Lei et al., 2018) but performs poorly when deviations about the estimated mean function are not symmetric, not constant, or both. When heterogeneity is present, the least squares conformal prediction region exhibits undercoverage in regions where variability about the mean function is large and overcoverage in regions where variability is small. Clear evidence of these features is seen in Figures 2 and 3. This conformal prediction region is very sensitive to model misspecification. The least squares locally weighted conformal prediction region also attains marginal validity (Lei et al., 2018, Section 5.2); it is far less sensitive to model misspecification than the least squares conformal prediction region and performs well under modest model misspecification. However, the least squares locally weighted conformal prediction region is not appropriate when deviations about an estimated mean function are clearly not symmetric, as evidenced in the top row of Figure 2. Results from additional simulations corresponding to settings A and B are provided in the Supplementary Materials. The findings from these additional simulations are consistent with the conclusions presented in this section.
5 Predicting the risk of diabetes
Diabetes is a group of metabolic diseases associated with long-term damage, dysfunction, and failure of different organs, especially the eyes, kidneys, nerves, heart, and blood vessels (American Diabetes Association, 2010). In 2017, approximately 5 million adult deaths worldwide were attributable to diabetes; global healthcare expenditures on people with diabetes are estimated at USD 850 billion (Cho et al., 2018). Diabetes remains undiagnosed for an estimated 30% of the people who have the disease (Heikes et al., 2008). One way to address the problem of undiagnosed diabetes is to develop simple, inexpensive diagnostic tools that can identify people who are at high risk of prediabetes or diabetes using only readily available clinical or demographic information (Heikes et al., 2008).
We examine the influence of several variables on blood sugar, or glycosylated hemoglobin percentage (also known as HbA1c), an important risk factor for diabetes. A glycosylated hemoglobin value of 6.5% can be used as a cutoff for positive diagnosis of diabetes (World Health Organization, 2011). We predict an individual's glycosylated hemoglobin from their height, weight, age, and gender, all of which are easy to measure, inexpensive, and do not require any laboratory testing. The data in this analysis come from a population-based sample of 403 rural African-Americans in Virginia (Willems et al., 1997), taken from the faraway R package (Faraway, 2016). We considered a gamma regression model that only includes linear terms for each covariate, a linear regression model with homoskedastic normal errors and the same linear terms for each covariate, and a linear regression model with homoskedastic normal errors that also includes quadratic terms for each covariate. Of the models considered, the gamma regression model fit the data best, having the lowest AIC value, and gave the best predictive performance, having the lowest sum of squared prediction errors. That being said, we do not know the data generating process.
Based on these covariates, conformal prediction regions provide finite sample valid prediction regions for glycosylated hemoglobin that may be useful for diagnosing diabetes in this study population. Four conformal prediction regions are considered for predicting glycosylated hemoglobin percentage: the parametric conformal prediction region with a Gamma model fit, the parametric conformal prediction region with a Gaussian model fit, the least squares conformal prediction region, and the least squares conformal prediction region with local weighting. All conformal prediction regions correspond to models that only include linear terms for each of the covariates. The Gamma and Gaussian parametric conformal prediction regions were computed with binning across the binary gender variable; that is, the predictor space is partitioned by gender. No additional binning structure within the levels of gender was employed.
                     Gamma      Gaussian   LSLW       LS
                     conformal  conformal  conformal  conformal
marginal coverage    0.909      0.906      0.906      0.909
volume               7.349      7.730      8.983      7.601
pred error           0.931      0.849      0.744      0.883
avg. cond. coverage  0.874      0.849      0.889      0.860
Diagnostics for the four conformal prediction regions are presented in Table 2. The error tolerance for all prediction regions was set at . We see that all conformal prediction regions maintain their advertised finite sample marginal validity for the predictions of glycosylated hemoglobin. These prediction regions provide a balance between size, prediction error, and average conditional coverage (the average of the coverage probabilities taken over small subregions of the predictor space). Plots of the three-dimensional conformal prediction regions are displayed in Figure 4. These plots are projections of each conformal prediction region onto the three-dimensional space of age (in years), weight (in pounds), and glycosylated hemoglobin percentage.
6 Discussion
The finite sample validity properties of conformal prediction regions have been verified in broader methodological contexts, including support vector machines, ridge regression, nearest neighbor regression, neural networks, and decision trees
(Vovk et al., 2005; Gammerman and Vovk, 2007; Papadopoulos et al., 2011; Vovk, 2012). While any conformity measure will achieve finite sample validity, a careful choice may make the returned conformal prediction regions more useful in applications (Papadopoulos et al., 2011). The developments of conformal prediction in the machine learning literature show how empirically successful prediction methods can be hedged to give valid predictions in finite samples (Gammerman and Vovk, 2007). The developments of conformal prediction in the statistics literature show that specifying the conformity measure to incorporate knowledge about the data generating process can lead to conformal prediction regions that are also asymptotically of minimal length (Lei and Wasserman, 2014; Dunn and Wasserman, 2018). Our parametric conformal prediction region for GLM regression falls within this line of research, which shows that, when prediction regions are required to give valid finite sample coverage, incorporating a regression model yields smaller regions. This is of course dependent on the ability to specify a useful model, which is not a trivial task in full generality.

Because GLMs are widely used by empiricists conducting regression analyses, parametric conformal prediction for GLM regression may offer an appealing compromise for the applied researcher: when the GLM is correctly specified, conformal prediction regions are asymptotically minimal, finite sample local and marginal validity holds, and the rate of convergence is fast; when the GLM is incorrectly specified, asymptotic minimality is not guaranteed, but local finite sample validity still holds by construction.
Researchers who currently use GLMs to compute prediction intervals for the mean regression function may be able to easily integrate conformal prediction into their data analysis workflow, since specification and fitting of the GLM is unchanged. A software package that accompanies this paper implements the parametric and nonparametric conformal prediction regions (Eck, 2018).
The robustness properties of conformal prediction come at a substantial computational cost, as noted in Vovk (2012). Two line searches are performed to determine the boundaries of the possibly disjoint conformal prediction region at every point at which a prediction region is desired. The conformity scores must be recomputed with respect to the augmented data at every iteration of these line searches; for the parametric conformal prediction region, this means refitting the GLM to the augmented data at every iteration. Furthermore, when sample sizes are very large, conformal prediction may not offer much additional benefit beyond parametric conditional high density prediction regions for GLMs. However, when sample sizes are moderate, conformal prediction may substantially outperform traditional methods in terms of finite sample marginal, local, and conditional coverage.
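The line searches just described can be sketched abstractly. Given a conformal p-value function and a point known to lie inside the region, each boundary can be located by expansion followed by bisection (this assumes the region is an interval around that point; the function names and the toy p-value are illustrative, not from the paper):

```python
import math

def upper_boundary(pvalue, inside, step=1.0, tol=1e-3, alpha=0.1):
    """Line search for the upper boundary of an interval-shaped conformal
    region: expand until a point outside the region is found, then bisect.
    In the parametric method every pvalue() call refits the model on
    augmented data, which is what makes the procedure expensive."""
    lo, hi = inside, inside + step
    while pvalue(hi) > alpha:      # expand until we leave the region
        lo, hi = hi, hi + step
    while hi - lo > tol:           # bisect to locate the boundary
        mid = 0.5 * (lo + hi)
        if pvalue(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy p-value whose region is exactly [-q, q] for the standard normal:
# pv(y) = 2 * (1 - Phi(|y|)), so pv(y) > 0.1 iff |y| < 1.6449.
pv = lambda y: 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(y) / math.sqrt(2.0))))
b = upper_boundary(pv, 0.0)
```

The lower boundary is found the same way with the search direction reversed; a disjoint region requires scanning for additional sign changes of the p-value.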
The asymptotic optimality properties of parametric conformal prediction regions follow from a novel concentration inequality for maximum likelihood estimation in exponential families (Theorem 2 in the Appendix). We expect that this result can be extended to include estimation in a broader class of models with convergence rates determined by the convergence rate of the estimator, provided that the score function corresponding to the data generating process obeys subexponential tail behavior. Dunn and Wasserman (2018)
show that finite sample validity holds in the presence of random effects. We expect that asymptotic optimality properties for parametric conformal prediction regions can be extended to their settings and to the class of generalized linear mixed models.
Supplementary Materials: Additional simulation results are available in the accompanying Supplementary Materials document. An accompanying R package is available at https://github.com/DEck13/conformal.glm (Eck, 2018). A technical report that includes the data and all of the code necessary to reproduce the findings, tables, and figures in this manuscript is available at https://github.com/DEck13/conformal.glm/tree/master/techreport.
Acknowledgments: We are grateful to Peter M. Aronow, Karl Oskar Ekvall, Jing Lei, Aaron J. Molstad, Molly Offer-Westort, Cyrus Samii, Fredrik Sävje, and Larry Wasserman for helpful comments. This work was supported by NIH grant NICHD 1DP2HD09179901.
Appendix
The proof of Theorem 1 requires a new concentration inequality for MLEs of exponential family parameters. Our result is a generalization of Theorem 3.3 in Miao (2010) to subexponential random variables with powers of replacing , provided that the underlying density is an exponential family. Theorem 3.3 in Miao (2010) only holds for sub-Gaussian random variables. Our extension is possible because the score function is conveniently a mean zero subexponential random variable, namely the original random variable with its mean subtracted. This circumvents the need for optimal transport theory (Bobkov and Götze, 1999; Ledoux, 1999, 2001; Djellout et al., 2004; Villani, 2008), which was used to prove Theorem 3.3 in Miao (2010).
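The claim that the score is a centered version of the response can be checked directly for a natural exponential family (a one-parameter sketch in our notation):

```latex
% Natural exponential family with natural parameter \theta:
%   f(y; \theta) = h(y)\exp\{\theta y - c(\theta)\}.
\frac{\partial}{\partial \theta} \log f(y; \theta)
  = y - c'(\theta)
  = y - E_{\theta}(Y),
\qquad \text{since } c'(\theta) = E_{\theta}(Y).
```

Hence the score inherits the tail behavior of $Y$ itself; when $Y$ has subexponential tails, so does the score.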
Concentration results for MLEs of exponential families
Let $\|\cdot\|$ denote the norm of a vector or the absolute value of a scalar. The following definition and two lemmas are taken from Wainwright (2019, Chapter 2).
Definition 4.
A mean zero random variable $X$ is said to be subexponential with parameters $(\nu, \alpha)$ if $E\left[e^{\lambda X}\right] \leq e^{\nu^2 \lambda^2 / 2}$ for all $|\lambda| < 1/\alpha$.
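A standard concrete example (ours, following the $\chi^2$ example in Wainwright, 2019): if $Z \sim N(0,1)$, then $X = Z^2 - 1$ is mean zero and subexponential but not sub-Gaussian, since

```latex
E\left[e^{\lambda (Z^2 - 1)}\right]
  = \frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}}, \qquad \lambda < \tfrac{1}{2},
\qquad\text{and}\qquad
\frac{e^{-\lambda}}{\sqrt{1 - 2\lambda}} \leq e^{2\lambda^2}
  \quad\text{for } |\lambda| < \tfrac{1}{4},
```

so that $X$ is subexponential with parameters $(\nu, \alpha) = (2, 4)$.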
Lemma 4.
For a mean zero random variable $X$, the following are equivalent: (a) $X$ is subexponential with parameters $(\nu, \alpha)$ for some $\nu, \alpha > 0$; (b) there is a positive number $c_0$ such that $E\left[e^{\lambda X}\right] < \infty$ for all $|\lambda| \leq c_0$; (c) there are positive constants $c_1, c_2$ such that $P(|X| \geq t) \leq c_1 e^{-c_2 t}$ for all $t > 0$.
Lemma 5.
Let $X_1, \ldots, X_n$ be independent mean zero subexponential random variables with parameters $(\nu_i, \alpha_i)$, $i = 1, \ldots, n$. Then $\sum_{i=1}^{n} X_i$ is subexponential with parameters $(\nu_*, \alpha_*)$, where $\alpha_* = \max_{1 \leq i \leq n} \alpha_i$ and
$$\nu_* = \left(\sum_{i=1}^{n} \nu_i^2\right)^{1/2}. \tag{13}$$
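As an illustration of Lemma 5 (our example, not from the text), take $n = 20$ independent centered Exp(1) variables. A standard calculation, which we assume here, shows each is subexponential with parameters $(\sqrt{2}, 2)$, so the lemma gives parameters for the sum, and the resulting Bernstein-type tail bound can be checked by a quick Monte Carlo:

```python
import math
import random

random.seed(1)

# Assumption (a standard calculation, not from the text): a centered
# Exp(1) variable E - 1 is subexponential with parameters (sqrt(2), 2).
n = 20
nu_i, alpha_i = math.sqrt(2.0), 2.0

# Lemma 5: the sum is subexponential with these parameters.
nu_star = math.sqrt(n * nu_i ** 2)
alpha_star = alpha_i

# Bernstein-type tail bound implied by subexponentiality:
# P(S >= t) <= exp(-min(t^2 / (2 nu*^2), t / (2 alpha*))).
t = 10.0
bound = math.exp(-min(t ** 2 / (2 * nu_star ** 2), t / (2 * alpha_star)))

# Monte Carlo estimate of P(S >= t) for S = sum_i (E_i - 1), E_i ~ Exp(1).
trials = 20000
hits = 0
for _ in range(trials):
    s = sum(random.expovariate(1.0) for _ in range(n)) - n
    if s >= t:
        hits += 1
empirical = hits / trials

print(f"bound = {bound:.4f}, empirical = {empirical:.4f}")
```

The empirical tail probability sits well below the bound, as it should; the bound is loose but dimension-free in the number of summands beyond the parameter aggregation of Lemma 5.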
With these tools, we prove our concentration inequality for exponential families.
Theorem 2.
Proof.
The log likelihood of the conditional density (8) for our random sample is
Let . The gradient of is then
where
From the mean value theorem, there exists some intermediate point, lying between the maximum likelihood estimator and the true parameter value, which satisfies
Rearranging the above yields
where the inverse exists almost surely when . We see that
where is the induced norm for a matrix with and
Choose some and let . Then for sufficiently large,
where the second inequality follows from the strong law of large numbers with respect to
. We can now choose some such that, for sufficiently large, we have
where the first inequality follows from the strong law of large numbers with respect to , the term is the th component of , and the second inequality follows from subadditivity of probability and the fact that a sum of elements can be greater than or equal to a number only if at least one term in the sum is greater than or equal to that number divided by the number of elements in the sum. From properties of exponential families and conditional expectation we have that