The Importance of Discussing Assumptions when Teaching Bootstrapping

Bootstrapping and other resampling methods are progressively appearing in the textbooks and curricula of courses that introduce undergraduate students to statistical methods. Though simple bootstrap-based inferential methods may have more relaxed assumptions than their traditional counterparts, they are not quite assumption-free. Students and instructors of these courses need to be aware of differences in the performance of these methods when their assumptions are or are not met. This article details some of the assumptions that the simple bootstrap relies on when used for uncertainty quantification and hypothesis testing. We emphasize the importance of these assumptions by using simulations to investigate the performance of these methods when they are or are not met. We also discuss software options for introducing undergraduate students to these bootstrap methods, including a newly developed package.


1 Introduction

Bootstrapping is a computer-based method introduced by bradleybootstrap as a technique for estimating the standard deviation of a sample statistic. In general, the term "bootstrap sampling" refers to the process of randomly sampling with replacement from the original sample. This process is taken to be analogous to sampling from the entire population and, as noted by tibshirani1993introduction, the bootstrap estimate of standard error is always available, regardless of the complexity of the original estimator.

Since its introduction, bootstrap methods have gained popularity (see horowitz2019bootstrap, utzet2021some) and found use in a variety of diverse applications such as linear regression (see eck2018bootstrapping, pelawa2021bootstrapping) and bootstrap aggregated neural networks (see khaouane2017modeling, osuolale2018exergetic). The growth of statistical computing has also led to the bootstrap appearing more regularly in courses which introduce undergraduate students to statistical methods, with examples including courses taught at Stanford University (STAT 191 - Introduction to Applied Statistics, https://explorecourses.stanford.edu), The Pennsylvania State University (STAT 200 - Elementary Statistics, https://online.stat.psu.edu/stat200/), Oregon State University (STAT 351/352 - Introduction to Statistical Methods I & II, https://stat.oregonstate.edu/content/yearly-courses), and Montana State University (STAT 216 - Introduction to Statistics, https://math.montana.edu/courses/s216/). Textbooks about, or which feature, the bootstrapping method thus range from the seminal graduate-level text by tibshirani1993introduction to intro-level texts (e.g., field2012discovering, moderndive, lock2020statistics).

When deciding what specifics to teach about the bootstrap, instructors can either equip their students with a useful tool or unknowingly leave them with a false sense of assurance. Unfortunately, the latter occurs far too frequently, fueling debates on whether the bootstrap should be taught, how the bootstrap is taught, what aspects to teach, and more. Pedagogical discussions on how to teach the bootstrap include those given by hesterberg2015teachers and hayden2019questionable.

In this article, we highlight the importance of discussing the assumptions behind simple bootstrap hypothesis tests and confidence intervals. By assumptions, we mean the suppositions under which the theoretical details of these methods are derived, namely those having to do with pivotal quantities. Our focus is on the studentized, basic, and percentile bootstrap intervals and their corresponding hypothesis tests. We choose this focus because these methods, or methods related to them, are often taught in undergraduate introductory statistics courses, such as those previously noted.

In Section 2, we discuss the benefits of and current issues pertaining to teaching statistical computing and the bootstrap, as found in the literature on statistics education. In Section 3, we discuss the theoretical details of the bootstrap in order to clearly point out the assumptions behind these methods. In Section 4, we use simulation to evaluate the performance of these methods when their assumptions are or are not met. In Section 5, we discuss tools that can be used when teaching the bootstrap, including a new R package that emphasizes communicating assumptions. We finish in Section 6 with some concluding remarks.

2 Benefits of Teaching Statistical Computing and Bootstrapping

According to the Guidelines for Assessment and Instruction in Statistics Education (GAISE), students in introductory statistics courses should, “Demonstrate an understanding of, and ability to use, basic ideas of statistical inference, both hypothesis tests and interval estimation, in a variety of settings” (gaise, p. 8). This implies that students should be able to recognize when a particular statistical method detracts from the quality of their analysis. Furthermore, upon realizing this, they should be able to pull an alternative method from their knowledge base and apply it appropriately.

Depending on learning goals and student backgrounds, including topics that incorporate statistical computing in a course, such as resampling, randomization, or simulation, can help students achieve these objectives. For example, wood2005role notes that through general simulation methods students are able to “actively and intelligently” apply the methods they are taught to solve problems of current concern. Also, tintle2012retention found that the use of a randomization-based curriculum led to a higher retention of concepts, after four months, than using the consensus curriculum based on agresti2018statistics. Simulation methods are also incorporated by learnsci in their discussed “practicing connections” approach to building an introductory statistics course. Their approach was found to make students capable of applying previously learned material in new and more sophisticated contexts.

The idea that statistical computing should be taught more, in order to better equip students for present-day workforce expectations, runs beneath much of the literature on statistics in the undergraduate curriculum and, in general, statistics education. Besides the aforementioned literature, various articles in the collection compiled by horton2015teaching express this idea. Technically, the more statistical methods a student is introduced to, the better equipped they should be to meet the GAISE guideline discussed earlier and to tackle real-world data challenges. In reality though, as students learn more statistical methods, discerning which one is appropriate to use becomes harder, especially if students are not clearly taught how to check whether a method is appropriate for their data.

For example, many incorrect or unfounded claims have been made about simple bootstrap intervals, making it hard to know when their use is appropriate. These claims were investigated in greater detail by hayden2019questionable for the percentile bootstrap interval, which we define in Section 3, and the bootstrap interval that uses twice the standard deviation of the bootstrap distribution as the margin of error. In that article, the claim that these bootstrap intervals have fewer or no underlying assumptions than their traditional counterparts was shown to be false.

It was also found that these bootstrap intervals do not actually perform better when normality and large sample size conditions are not met. Their supposed simplicity was said to be the result of a failure to communicate their assumptions as clearly as those of the traditional methods. Introducing these intervals to students, before appropriate scenarios for their use are better established, was discouraged. In hesterberg2015teachers, issues with the percentile and basic bootstrap intervals were also discussed and use of the studentized bootstrap interval (called the bootstrap t-interval there) was said to be preferable.

Despite these issues, it was noted by lock2008introducing that students’ understanding of confidence intervals and statistical inference rests greatly on their understanding of sampling distributions. They suggested that teaching bootstrapping allows students to make inferences on non-conventional parameters and that their understanding of concepts like sampling distributions can be fortified through the use of simulations and bootstrapping. Indeed, when the form of the standard error of an estimator cannot be derived using statistical theory or it depends on unknown parameters and/or the true distribution of the estimator is completely unknown, the bootstrap can be useful, provided that its own assumptions are met.

howington2017teach notes that, though its desirability as a measure of center when the data are skewed is often mentioned, corresponding inferential methods for the median are rarely taught in introductory statistics courses. Suggested methods for teaching confidence intervals on the median included use of the bootstrap. The use of simulation-based inferential methods was also discussed by gehrke2021statistics, where the incorporation of methods, such as the bootstrap, in their improved curriculum helped students to more clearly explain p-values and confidence intervals and to understand the limitations of statistics as it pertains to describing the real world.

Given its pedagogical and methodological benefits, eliminating the bootstrap from the undergraduate statistics curriculum may not be the best solution to the issues surrounding it. An alternative would be to better understand the assumptions behind these methods and how they perform when these are not met, so that this can be communicated to students. This can lead to students learning and applying them more carefully.

In the next section, we discuss some of the theoretical underpinnings of the basic, studentized, and percentile bootstrap intervals, as well as their corresponding bootstrap hypothesis tests. Specifically, our intention is to show that these methods rely on assumptions concerning pivotal quantities. For a more rigorous and expansive discussion on the theory behind the bootstrap, we refer readers to athreya2006measure.

3 General Assumptions for Simple Applications of Bootstrapping

In order to make an inference on a population parameter, θ, we begin by taking an independent and identically distributed sample of size n, x, from the population of interest. This sample should be taken in such a way that it captures all of the information in the population about θ. We denote an estimate for θ based on the original observed data as ^θ(x). If this is calculated with a bootstrap sample we use ^θ(x∗). If it is based on the not yet observed data we use ^θ(X), where X denotes the unobserved data vector. The estimate, ^θ(x), should summarize the information about θ that is contained in the observed data. For example, if θ is the population mean, then ^θ(x) may be the observed sample mean, ¯x.

However, we often desire more information than that contained in ^θ(x) alone. Options for gathering this include confidence intervals and hypothesis tests. Many methods exist for constructing confidence intervals and conducting hypothesis tests, such as t- and z-methods for the mean and jackknife or permutation approaches. When the parameter of interest is one which does not have an established method, or the data do not meet the conditions for using traditional methods, alternative methods can be used. These alternative methods will likely have their own assumptions, and these should also be checked. If they are reasonable, then the alternative method can be used. One such alternative is the bootstrap, whose details and assumptions we discuss in this section. Specifically, we highlight the dependence of these intervals on pivotal quantities.

The concept of bootstrapping through simple random resampling is as follows: obtain B samples, x∗1, …, x∗B, each of size n, by resampling from the original sample, x, with replacement, and calculate their corresponding statistics, ^θ(x∗1), …, ^θ(x∗B). These bootstrap statistics make up the bootstrap distribution. Though the underlying concepts of this bootstrapping method may seem straightforward, there are many details that users should be aware of when applying it for interval estimation and hypothesis testing.
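The resampling step can be sketched in code. This is a minimal Python/NumPy illustration of the idea (the article itself discusses software separately, including an R package); the function and variable names here are ours:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def bootstrap_distribution(x, stat, B=999):
    """Resample x with replacement B times; return the B bootstrap statistics."""
    n = len(x)
    return np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])

x = rng.exponential(scale=2.0, size=50)          # an illustrative sample
boot_means = bootstrap_distribution(x, np.mean)  # the bootstrap distribution of the mean
```

Any statistic can be passed as `stat`, which is part of the appeal of the bootstrap for non-conventional parameters.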

3.1 Interval estimation

The bootstrap distribution can be used as an estimate of the sampling distribution, which provides a means for quantifying the uncertainty in an estimate. The basic, percentile, and studentized bootstrap intervals each use the bootstrap distribution in this manner but have different underlying assumptions, most of which pertain to the shifted or studentized sampling distribution. The details we discuss next will be helpful for readers who desire to become familiar with these bootstrap intervals as they are presented by davison1997bootstrap and tibshirani1993introduction. Our explanation is not exhaustive, however, so readers who desire a more in-depth understanding of these methods and their assumptions should consult those texts directly. Those who are already familiar with these methods can skip to the summary of their form and assumptions given in Table 1.

Let α denote the significance level or desired Type I error rate. In order to construct a 100(1−α)% confidence interval for θ, we may employ a pivotal quantity - a quantity whose distribution does not depend on any unknown parameters. When this quantity is a function of the parameter and estimate, the quantiles of its distribution can be used to construct confidence intervals for the parameter.

Denote the α/2 and 1−α/2 quantiles of the distribution of ^θ(X)−θ as aα/2 and a1−α/2, and suppose that this quantity is pivotal. In general, when we refer to the p quantile of the distribution of ^θ(X)−θ (or its shifted or scaled versions), we are referring to the value, ap, for which P(^θ(X)−θ≤ap)=p. If aα/2 and a1−α/2 are known, then

 1−α=P(aα/2≤^θ(X)−θ≤a1−α/2)=P(^θ(X)−a1−α/2≤θ≤^θ(X)−aα/2)

and a 100(1−α)% equi-tailed interval for θ, provided the expression exists, is

 (^θ(x)−a1−α/2,^θ(x)−aα/2). (1)

If the distribution of ^θ(X)−θ is unknown, the problem becomes one of estimating aα/2 and a1−α/2. Using the bootstrap distribution, one may estimate these quantiles in a variety of ways. For convenience, we discuss estimation in terms of the general p quantile, ap.

Basic bootstrap interval (the base case): This interval is obtained by estimating ap with the (B+1)p-th smallest value of the distribution of ^θ(x∗)−^θ(x). For example, if B=999 and α=0.05 then

 (B+1)(α/2)=(999+1)0.025=1000∗0.025=25

and similarly, (B+1)(1−α/2)=975. Thus, the 25th smallest and 975th smallest values of the distribution of ^θ(x∗)−^θ(x), denoted as a∗(25) and a∗(975), would be used to estimate aα/2 and a1−α/2, respectively. Note that the 25th smallest value is less than the 975th smallest value, so the upper bound will be greater than the lower bound since we subtract off a smaller number.[1]

[1] In this article, we assume that (B+1)(α/2) and (B+1)(1−α/2) are integers. If they are not, then the procedure outlined by tibshirani1993introduction can be used. Define k as the largest integer that is ≤ (B+1)(α/2). Then the α/2 and 1−α/2 quantiles are defined as the k-th largest and (B+1−k)-th largest values of the distribution of interest, respectively.

Using this estimate, the expression in (1) becomes

 (^θ(x)−a∗((B+1)(1−α/2)),^θ(x)−a∗((B+1)(α/2))).

Let r((B+1)p) and r∗((B+1)p) be the (B+1)p quantiles of the distributions of ^θ(X) and ^θ(x∗), respectively, then note that

 a((B+1)p)=r((B+1)p)−θanda∗((B+1)p)=r∗((B+1)p)−^θ(x).

Therefore, the bounds of the interval can be further simplified to

 ^θ(x)−a∗((B+1)(1−α/2))=2^θ(x)−r∗((B+1)(1−α/2))and^θ(x)−a∗((B+1)(α/2))=2^θ(x)−r∗((B+1)(α/2)).

The final form of the basic bootstrap interval is then

 (2^θ(x)−r∗((B+1)(1−α/2)),2^θ(x)−r∗((B+1)(α/2))).

If there are any constraints on the value of θ, the bounds of this interval may not meet these constraints. That is, this interval can contain values that are not plausible for the population parameter, such as values below 0 or above 1 in an interval for the population proportion (see Section 4). The accuracy of this interval depends on how well the distribution of ^θ(x∗)−^θ(x) conforms to that of ^θ(X)−θ. If the latter does not depend on any unknown parameters, then ^θ(X)−θ is actually a pivotal quantity and conformity can be expected.
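The final form of the basic interval can be sketched in code. This is a minimal Python/NumPy illustration under the conventions above, not the article's implementation; it assumes (B+1)(α/2) and (B+1)(1−α/2) are integers:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def basic_interval(x, stat, B=999, alpha=0.05):
    """Basic bootstrap interval: (2*theta_hat - r*_(975th), 2*theta_hat - r*_(25th))
    when B = 999 and alpha = 0.05."""
    n = len(x)
    theta_hat = stat(x)
    # Sorted bootstrap statistics r*_(1) <= ... <= r*_(B).
    r_star = np.sort([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    lo_idx = round((B + 1) * (alpha / 2)) - 1       # 25th smallest (0-based index 24)
    hi_idx = round((B + 1) * (1 - alpha / 2)) - 1   # 975th smallest (0-based index 974)
    return 2 * theta_hat - r_star[hi_idx], 2 * theta_hat - r_star[lo_idx]

x = rng.normal(loc=5.0, size=40)
lower, upper = basic_interval(x, np.mean)
```

Because the bounds are reflections of the bootstrap quantiles about ^θ(x), they can indeed fall outside a constrained parameter space, as noted above.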

Percentile bootstrap interval (the symmetric case): If the distribution of ^θ(X)−θ is symmetric about zero, then aα/2=−a1−α/2 and a1−α/2=−aα/2. Therefore, we can rewrite (1) as

 (^θ(x)+aα/2,^θ(x)+a1−α/2).

Upon estimating these quantiles with the appropriate order statistics from the bootstrap distribution we obtain

 (^θ(x)+a∗((B+1)(α/2)),^θ(% x)+a∗((B+1)(1−α/2))).

Observe that ^θ(x)+a∗((B+1)p)=r∗((B+1)p), so instead we can write

 ^θ(x)+a∗((B+1)(α/2))=r∗((B+1)(α/2))and^θ(x)+a∗((B+1)(1−α/2))=r∗((B+1)(1−α/2)).

The final form of the percentile bootstrap interval is then

 (r∗((B+1)(α/2)),r∗((B+1)(1−α/2))).

Some positive aspects of the percentile interval, which are noted by carpenter2000bootstrap, are that it is simple and transformation respecting. Its accuracy also depends on how well the distribution of ^θ(x∗)−^θ(x) agrees with that of ^θ(X)−θ, both of which should be symmetric by the stated assumption.
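In code, the percentile interval simply reads off order statistics of the bootstrap distribution. The following Python/NumPy sketch (our own naming, same integer-quantile assumption as before) makes the contrast with the basic interval clear:

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def percentile_interval(x, stat, B=999, alpha=0.05):
    """Percentile bootstrap interval: the (B+1)(alpha/2)-th and (B+1)(1-alpha/2)-th
    smallest bootstrap statistics are themselves the interval bounds."""
    n = len(x)
    r_star = np.sort([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    lo_idx = round((B + 1) * (alpha / 2)) - 1
    hi_idx = round((B + 1) * (1 - alpha / 2)) - 1
    return r_star[lo_idx], r_star[hi_idx]

x = rng.normal(loc=5.0, size=40)
lower, upper = percentile_interval(x, np.median)  # works for non-conventional statistics too
```

Since the bounds are bootstrap statistics, they automatically respect any constraints on the parameter, unlike the basic interval.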

It is noted by tibshirani1993introduction that neither the percentile nor basic bootstrap intervals “work well in general”. Specifically, when the quantity ^θ(X)−θ is not pivotal, which is commonly the case, they give low accuracy. For the percentile interval, a suggested improvement is the bias-corrected and accelerated bootstrap interval (BCa), which rectifies the bias in ^θ(x). Its details are discussed in Chapter 14 of tibshirani1993introduction. These details are more intricate and complex than those of the percentile and basic intervals and, depending on the students’ mathematical backgrounds, they may be outside the scope of an undergraduate introductory statistical methods course.

Studentized bootstrap interval (the studentized case): Under some circumstances, such as when n is large, the distribution of (^θ(X)−θ)/^SE(^θ(X)) is asymptotically Normal with mean 0 and variance 1, where ^SE(^θ(X)) denotes an estimate for the standard error of ^θ(X). This provides another option for estimating aα/2 and a1−α/2: namely, with ^SE(^θ(x))zα/2 and ^SE(^θ(x))z1−α/2, respectively, where zp denotes the p quantile of the standard Normal distribution.

For finite samples, however, this is only an approximation. In the case of the sample mean, a better approximation may be obtained by using the quantiles of a t distribution, which accounts for estimating the standard error. In this case, aα/2 and a1−α/2 are estimated with ^SE(^θ(x))tα/2 and ^SE(^θ(x))t1−α/2, respectively.

The studentized bootstrap interval, also known as the bootstrap t-interval, further replaces these quantiles with a bootstrap approximation. Rather than using a z- or t-table, the studentized bootstrap interval uses “bootstrap tables” which are fit for the specific data set observed. This adjusts for skewness in the underlying population and other errors that can arise when ^θ(x) is not the sample mean.

The values aα/2 and a1−α/2 are estimated with ^SE(^θ(x))z∗((B+1)(α/2)) and ^SE(^θ(x))z∗((B+1)(1−α/2)), where z∗((B+1)p) is the (B+1)p-th smallest value of the distribution of z∗=(^θ(x∗)−^θ(x))/^SE(^θ(x∗)) and ^SE(^θ(x∗)) is an observed estimate of the standard error of ^θ(x∗). Substituting these bootstrap estimates leads to an interval whose final form is

 (^θ(x)−^SE(^θ(x))∗z∗((B+1)(1−α/2)),  ^θ(x)−^SE(^θ(x))∗z∗((B+1)(α/2))).

Though the Central Limit Theorem (CLT) gives a formula for the standard error of the mean, there are many statistics which do not have such a formula. The bootstrap may be used to obtain estimates for the standard errors of ^θ(x) and ^θ(x∗). The plug-in principle discussed by tibshirani1993introduction can be used to estimate the standard error of ^θ(x) with the square root of

 ^σ2 = 1/(B−1) ∑_{i=1}^{B} (^θ(x∗i)−¯^θ(x∗(⋅)))2,

where ¯^θ(x∗(⋅)) denotes the mean of the B bootstrap sample statistics.

In order to estimate the standard error of ^θ(x∗i), an iterative bootstrap method can be used. In this method one obtains M second-level bootstrap samples from each of the B original bootstrap samples. For each of these second-level bootstrap samples, statistics are then calculated and denoted as ^θ(x∗i,j) for i=1,…,B and j=1,…,M. From these we calculate the bootstrap estimate of standard error for the i-th bootstrap sample as the square root of

 ^σ2∗i = 1/(M−1) ∑_{j=1}^{M} (^θ(x∗i,j)−¯^θ(x∗i(⋅)))2,

where ¯^θ(x∗i(⋅)) now represents the mean of the M second-level bootstrap sample statistics.

While tibshirani1993introduction suggests that a value of B as small as roughly 200 is sufficient for estimating the standard error of a bootstrap estimate, a larger B is needed for estimating any desired quantiles. A few suggestions for B, under different scenarios, are also given by davison1997bootstrap. Depending on computational resources, these bootstrap methods, especially the studentized interval, may be considered computationally expensive. If, for example, B=999 and M=25, then over twenty-four thousand resamples must be performed in total.
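The studentized interval with the iterated (second-level) bootstrap standard error can be sketched as follows. This is a Python/NumPy illustration under our own naming, with small B and M chosen purely to limit run time, not a recommendation:

```python
import numpy as np

rng = np.random.default_rng(seed=4)

def studentized_interval(x, stat, B=199, M=25, alpha=0.05):
    """Studentized (bootstrap-t) interval, with SE(theta_hat*) from a second-level
    bootstrap and SE(theta_hat) from the plug-in principle."""
    n = len(x)
    theta_hat = stat(x)

    theta_star = np.empty(B)
    se_star = np.empty(B)
    for i in range(B):
        xb = rng.choice(x, size=n, replace=True)
        theta_star[i] = stat(xb)
        # Second level: M resamples of xb estimate the SE of theta_star[i].
        inner = [stat(rng.choice(xb, size=n, replace=True)) for _ in range(M)]
        se_star[i] = np.std(inner, ddof=1)

    se_hat = np.std(theta_star, ddof=1)               # plug-in SE of theta_hat
    z_star = np.sort((theta_star - theta_hat) / se_star)
    lo_idx = round((B + 1) * (alpha / 2)) - 1
    hi_idx = round((B + 1) * (1 - alpha / 2)) - 1
    return (theta_hat - se_hat * z_star[hi_idx],
            theta_hat - se_hat * z_star[lo_idx])

x = rng.normal(loc=5.0, size=40)
lower, upper = studentized_interval(x, np.mean)
```

The nested loop makes the B·M cost noted above concrete: every first-level resample triggers M further resamples.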

As with the basic and percentile bootstrap intervals, the accuracy of this interval depends on whether the distribution of (^θ(X)−θ)/^SE(^θ(X)) is indeed pivotal. It is noted by tibshirani1993introduction that the results of the studentized bootstrap interval can be largely influenced by outliers in the data. They also warn that the studentized bootstrap interval works best for variance-stabilized parameters and that it is especially applicable to location statistics.

Table 1 summarizes the three different bootstrap-based interval estimation methods discussed in this section along with their accompanying assumptions.

3.2 Hypothesis testing

The goal of hypothesis testing is to make an inference about some population parameter of interest, θ, specifically in regards to whether or not there is sufficient evidence to indicate that the parameter is a value other than one which we hypothesize to be true. Similar to confidence intervals, when the data do not meet the requirements needed to use traditional hypothesis testing methods, such as the t- or z-test, bootstrap hypothesis tests are an alternative so long as their own assumptions are met. Many early manuscripts and textbooks (e.g., beran1988prepivoting, hinkley1988bootstrap, tibshirani1993introduction, davison1997bootstrap) give guidance on bootstrap hypothesis testing and discuss possible approaches. The approach that we outline next is based on the idea of using a pivotal quantity. Readers who desire more details about this approach should reference Chapter 4 of davison1997bootstrap and Chapter 16 of tibshirani1993introduction. A summary is given in Table 2 for readers who are already familiar with these concepts.

In general, to conduct a one-sample level-α bootstrap hypothesis test of H0: θ=θ0, two components must be obtained: (1) t(x), a test statistic, and (2) ^F0, an estimate of F0, the distribution of t(X) under H0. Pivotal bootstrap hypothesis tests use test statistics whose distributions do not depend on any unknown parameters, including θ, so that only F0 needs to be estimated, without regards to θ0.

Using the plug-in principle, bootstrap test statistics, t(x∗), can be generated from the bootstrap sample data and used to estimate F0. The accuracy of this estimate depends on how well the distribution of t(x∗) approximates that of t(X). As was the case when estimating quantiles for confidence intervals in the last subsection, the two will conform well when t(X) is actually pivotal.

For a one-sided lower alternative hypothesis, that is H1: θ<θ0, we can calculate the achieved significance level, an approximate p-value, with

 ASL=P∗(t(x∗)<t(x)).

Here t(x) is the observed test statistic, and we use an asterisk to note that this approximate probability is calculated using the distribution of the bootstrap test statistics. If the alternative hypothesis is one-sided upper, then

 ASL=P∗(t(x∗)>t(x))

and if it is two-sided, then

 ASL=P∗(t2(x∗)>t2(x)).

In all cases, we reject H0 if ASL<α, where α is the desired significance level. Note that the form of the two-sided ASL assumes that the distribution of bootstrap test statistics is symmetric about zero. In general, the calculation of a two-sided p-value works best when the distribution of the test statistic is symmetric about zero (see mudholkar2009defining, dunne1996two).
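The ASL calculations above can be sketched in code for the locational pivot t(x)=^θ(x)−θ0. This is a hedged Python/NumPy illustration; the names and the choice of example statistic are ours:

```python
import numpy as np

rng = np.random.default_rng(seed=5)

def asl_two_sided(t_obs, t_star):
    """Two-sided achieved significance level: ASL = P*(t*^2 > t_obs^2)."""
    t_star = np.asarray(t_star)
    return np.mean(t_star**2 > t_obs**2)

# Locational pivot: t(x) = theta_hat - theta_0, t(x*) = theta_hat* - theta_hat.
x = rng.normal(loc=5.0, size=40)
theta_hat = x.mean()
theta_0 = 5.0
t_obs = theta_hat - theta_0
t_star = [rng.choice(x, size=len(x), replace=True).mean() - theta_hat
          for _ in range(999)]

asl = asl_two_sided(t_obs, t_star)
reject = asl < 0.05  # reject H0 at the alpha = 0.05 level if ASL < alpha
```

The one-sided versions replace the squared comparison with P∗(t(x∗)<t(x)) or P∗(t(x∗)>t(x)), matching the expressions above.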

Studentized pivot: Suppose that (^θ(X)−θ)/^SE(^θ(X)) is a pivotal quantity - its distribution does not depend on θ. Then the bootstrap hypothesis test outlined above may be used. In this case, F0 is estimated with the distribution of z∗=(^θ(x∗)−^θ(x))/^SE(^θ(x∗)) and the observed test statistic is t(x)=(^θ(x)−θ0)/^SE(^θ(x)). Depending on the alternative hypothesis, the ASL can be calculated using one of the expressions given earlier.

Note that, if θ0 is contained in the studentized interval given in Table 1, then

 z∗((B+1)(α/2))<(^θ(x)−θ0)/^SE(^θ(x))<z∗((B+1)(1−α/2)).

The quantity in the center is t(x), the observed test statistic based on a studentized pivotal quantity. If we estimate the α/2 and 1−α/2 quantiles of the distribution of t(X) with the (B+1)(α/2)-th and (B+1)(1−α/2)-th smallest values of the distribution of z∗, then containment of θ0 in the studentized interval implies that t(x) is not in the rejection region of the two-sided level-α bootstrap hypothesis test. Therefore, performing this hypothesis test is equivalent to rejecting values of θ0 which are not contained in the studentized interval.

Locational pivot: If we suppose, instead, that ^θ(X)−θ is a pivotal quantity, then the bootstrap hypothesis test can again be used in a similar manner. In this case, the observed test statistic is t(x)=^θ(x)−θ0 and the bootstrap test statistics, t(x∗)=^θ(x∗)−^θ(x), can be used to estimate F0. The ASL can be calculated using the statements defined earlier.

If θ0 is contained in the basic bootstrap interval of Table 1 then

 r∗(B+1)(α/2)−^θ(x)<^θ(x)−θ0<r∗(B+1)(1−α/2)−^θ(x).

If it is contained in the percentile interval, then

 −(r∗(B+1)(1−α/2)−^θ(x))<^θ(x)−θ0<−(r∗(B+1)(α/2)−^θ(x)).

Again, we see that the quantity in the center of each statement is the observed test statistic, based on a locational pivot. Furthermore, by the symmetry assumption of the percentile interval, the bounds of these statements are the same. If the (B+1)(α/2)-th and (B+1)(1−α/2)-th smallest values of the distribution of ^θ(x∗)−^θ(x) are used to estimate the α/2 and 1−α/2 quantiles of ^θ(X)−θ, then the values of θ0 not in the rejection region of this test are the same as the values contained in the basic bootstrap interval, or the percentile bootstrap interval under the symmetry assumption.

The use of pivotal quantities is not unique to bootstrap hypothesis testing. The t- and z-tests use the same underlying idea, with additional assumptions about the shape of the distribution of the test statistic. When the test statistic is not approximately pivotal, the performance of these bootstrap hypothesis tests may be negatively impacted. For reference, these bootstrap hypothesis tests are summarized in Table 2.

3.3 Summary

The theoretical underpinnings discussed in this section show that two-sided basic, percentile, and studentized bootstrap intervals, and their corresponding hypothesis tests, rely heavily on the assumption that the distribution of ^θ(X) can be made approximately pivotal through shifting (by θ) or studentization (shifting by θ and scaling by ^SE(^θ(X))). Whether this is a reasonable assumption depends on the parameter of interest and the underlying population data. In many cases, such as that of the sample mean, the distribution of ^θ(X)−θ will depend on some scale parameter and the former assumption will be unreasonable. In the next section, we use simulations to investigate how these bootstrap methods perform when their assumptions are or are not reasonably met.

4 Simulation-Based Performance Evaluations of Bootstrapping

To evaluate the performance of these bootstrap intervals and their corresponding hypothesis tests, we applied their two-sided versions under a variety of simulated scenarios where their assumptions were or were not reasonably met. We discuss the following performance metrics:

Coverage proportion (C): the proportion of two-sided intervals that contained the true parameter value. For a bootstrap interval, it is desirable to have this equal to 1−α.

Significance level (^α): the proportion of times that the null hypothesis was rejected, in favor of a two-sided alternative, when it was actually true. In light of the bootstrap hypothesis testing methods discussed, ^α=1−C. That is, the proportion of times that H0: θ=θ0 was rejected in favor of H1: θ≠θ0, where θ0=θ, the true population parameter, at the α significance level, is equal to the proportion of two-sided intervals that did not contain the true parameter value.

Power (P): the proportion of times the null is rejected, in favor of a two-sided alternative, when it is in fact false. It is usually desirable to have this value increase to 1 as the sample size increases. For more insight, we studied the behavior of P as the distance between the hypothesized and true parameter values increased, for a variety of increasing sample sizes. Since the corresponding two-sided bootstrap hypothesis tests reject any values that are not contained in the two-sided interval, this is simply the proportion of two-sided intervals that did not contain each hypothesized value θ0=θ±δ, where δ is some constant specifying the absolute distance from the truth.

For simplicity, we refer to these performance metrics by their theoretical names; however, our results are simulation-based and, therefore, some deviations from what statistical theory predicts are to be expected.

Results pertaining to the proportion of intervals or hypothesis tests which exhibited some behavior (e.g. containment of a true or false parameter value) were calculated out of 10,000 intervals or tests, each constructed using a different random sample taken under the specified simulation constraints. However, in some cases, such as when the sample size was small, there was little to no variability to estimate and this produced studentized bootstrap intervals with undefined bounds (0/0 or a value divided by 0). In these cases we only considered intervals that did not contain undefined values when calculating performance metrics, so the performance metric was calculated out of fewer than 10,000 intervals. More information is given on this behavior as we discuss the simulation results and we note how many undefined intervals were observed in the results tables.

All studentized bootstrap intervals were constructed using the iterative method discussed in Section 3. We elected to use the bootstrap estimate of standard error for the studentized interval in order to gain insight into the performance of the method when a formula for the standard error is not available.

In order to determine if there was any difference in performance due to the number of bootstrap samples used, bootstrap intervals were constructed using two different values of B. Also, the significance level α was kept fixed throughout; that is, all confidence intervals were constructed with a desired coverage probability of 1−α and hypothesis tests were conducted with a desired Type I error rate of α. For comparison purposes, we included simulation results for traditional z- and/or t-methods as appropriate for a given problem. These were the one-sample t- and z-tests and intervals for the mean and the one-sample z-test and interval for the proportion (Wald interval). The details of these methods can be found in almost any introductory statistics textbook.
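The coverage-proportion metric can be sketched in code. The following Python/NumPy illustration estimates C for the percentile interval on the mean of an Exponential population; the replication count, B, and sample size are kept small here purely for speed (the article uses 10,000 replications), and the naming is ours:

```python
import numpy as np

rng = np.random.default_rng(seed=6)

def percentile_interval(x, stat, B=199, alpha=0.05):
    """Percentile interval from sorted bootstrap statistics (see Section 3)."""
    n = len(x)
    r_star = np.sort([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    return (r_star[round((B + 1) * (alpha / 2)) - 1],
            r_star[round((B + 1) * (1 - alpha / 2)) - 1])

def coverage(n=30, reps=200, alpha=0.05, mu=1.0):
    """Proportion of intervals, over repeated samples from an Exponential
    population with mean mu, that contain the true mean."""
    hits = 0
    for _ in range(reps):
        x = rng.exponential(scale=mu, size=n)
        lo, hi = percentile_interval(x, np.mean, alpha=alpha)
        hits += lo <= mu <= hi
    return hits / reps

C = coverage()  # compare against the nominal 1 - alpha = 0.95
```

The significance level ^α of the corresponding two-sided test is then 1−C, and power is computed the same way with the true mean replaced by a false hypothesized value.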

4.1 Simulation results

We began with the problem of constructing interval estimates for the population mean under different scenarios. In the first scenario, random samples of several sizes n were taken from a Normal population. In the second scenario, random samples of the same sizes were taken from an Exponential population (with a specified rate parameter), which is a right-skewed distribution. Connecting this problem to the notation used in the previous section, the statistic of interest is the sample mean and the parameter of interest is the population mean.

To determine if the assumptions of the basic and percentile bootstrap intervals were met, we calculated the shifted sample mean ten thousand times using samples of varying sizes from a variety of Normal and Exponential populations. We selected a range of rate parameters so that the Exponential populations varied in spread. For the Normal populations, the means and standard deviations were chosen so that the center and spread of the underlying population varied slightly. Figures 0(a) and 0(b) give the distributions of the shifted sample means.
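A minimal sketch of this kind of check, using two illustrative Exponential populations (the parameter values are ours, not the ones used in the article): if the shifted sample mean were pivotal, its distribution would not depend on the population.

```r
# Simulate the shifted sample mean, x-bar minus mu, from two Exponential
# populations and compare the spreads of the resulting distributions.
set.seed(1)
shifted_means <- function(rate, n, reps = 10000) {
  # the mean of an Exponential(rate) population is 1 / rate
  replicate(reps, mean(rexp(n, rate))) - 1 / rate
}
a <- shifted_means(rate = 1, n = 10)
b <- shifted_means(rate = 5, n = 10)
c(sd(a), sd(b))  # the spreads differ, so the shifted quantity depends on the population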

For the Normal populations, it was clear that the spread of the distributions of the shifted sample means depended on the population variance. For example, in the first row, third column of Figure 0(a), the spread of the distribution is greatest, while in the first row, fourth column it is least. This corresponds to changes in the variance of the underlying population. For the Exponential populations, inconsistencies were also observed between distributions, as the skew and spread varied with the rate parameter. However, as the sample size increased, the distributions became more consistent across populations. For both scenarios, we concluded that the assumptions of the basic and percentile intervals were not met when the sample size was small but became better met as it increased.

To determine if the assumption of the studentized bootstrap interval was met, we used the same simulations as before, but now we checked whether the distribution of the studentized sample mean was approximately the same across the different populations. Here, the standard error in the denominator is the bootstrap-based plug-in estimate defined in Section 3. We refer to scaling by this estimate more broadly as "studentization".
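A sketch of how one such studentized value can be computed with a bootstrap plug-in standard error (the function and variable names are ours, and the settings are illustrative):

```r
# Compute one studentized statistic, (x-bar - mu) / SE*, where SE* is the
# bootstrap plug-in estimate of the standard error of the sample mean.
set.seed(1)
n <- 30; B <- 1000
mu <- 1                   # true mean of the Exponential(1) population below
x <- rexp(n, rate = 1)

boot_se <- function(x, B) {
  # standard deviation of B bootstrap sample means
  sd(replicate(B, mean(sample(x, length(x), replace = TRUE))))
}
t_stat <- (mean(x) - mu) / boot_se(x, B)
t_stat
```

Repeating this over many samples from different populations, and comparing the resulting histograms, is the check described above.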

The simulated distributions are given in Figures 1(a) and 1(b). The dark and light gray shading correspond to the two numbers of bootstrap samples, B, that we used. The value of B did not have an impact on the resulting distribution, but the sample size did. The first is evidenced by the strong overlap of the dark and light gray bars in the histograms and the second by the differences between histograms within the same column.

For example, comparing the distributions in the first column of Figure 1(a), the spread of the distributions slightly decreases as n increases. However, making comparisons across the first row of Figure 1(a), the distributions are approximately the same in shape, spread, and center. For these reasons, we concluded that the assumptions behind the studentized interval were met when the underlying population was Normal or Exponential and the parameter of interest was the mean.

It is known that the t-interval does not perform well when n is small and the data are skewed (see huang2017uncertainty, meeden1999interval). Therefore, the assumptions of the t-interval were reasonably met in the first scenario, where the underlying population was Normal, but less reasonably met in the second scenario, where the population was right-skewed and n was small. The assumptions of the z-interval were met in both scenarios since samples were independent and identically distributed (iid) and the underlying population variance was technically known.

The coverage proportions of the bootstrap intervals and the t- and z-intervals for the mean are given in Table 3 for each scenario of interest.

When the underlying population was Normal, the coverage proportions of the t- and z-intervals were very close to the nominal 0.95 for almost all values of n. Larger discrepancies were observed for the bootstrap intervals, though. The percentile and basic bootstrap intervals had moderate under-coverage, especially for small n. The lowest coverage observed among these two intervals for the Normal(1,1) population was 0.902. In contrast, the studentized interval had over-coverage, with proportions as large as 0.965.

When the population was Exponential, the coverage proportions of the t-interval dropped well below 0.95, while those of the z-interval reached above 0.95. Marked decreases in the coverage proportions of the bootstrap intervals were also observed. The most severe changes were observed for the percentile and basic bootstrap intervals for small n. In these cases, some coverage proportions dropped by over 10%.

The coverage proportions of the studentized interval were higher than those of the z-interval when small samples were taken from an Exponential(1) population. However, the widths of the studentized bootstrap intervals were significantly larger than those of the z-intervals, especially when n was small. Figure 3 gives the distributions of the widths (upper bound minus lower bound) of the studentized and z-intervals for each sample size when the underlying population was Exponential. These are plotted on the log scale for ease of visibility. The dashed lines in each panel mark the width of the z-interval, which is constant for a fixed n and significance level. The widths of the studentized interval were quite large and varied greatly, especially for small n. This explains why its coverage proportions were higher than those of the z-interval in this case.

Large widths were observed when the denominator of the original or bootstrap t-statistic was near zero, and this occurred when there was little variability between the second-level bootstrap sample statistics. In some extreme cases, all of them were identical and the second-level bootstrap estimate of standard error was exactly zero, giving undefined values for the t-statistics. This behavior was observed in 52 (out of 10,000) intervals for the population mean. These intervals were removed before calculating the coverage proportions in Table 3.

In the case of the population proportion, which we discuss next, there was even less variability to estimate since only TRUE or FALSE values were sampled. Therefore, the original sample statistic, the bootstrap sample statistics, and the second-level bootstrap sample statistics were all identical in some cases. This also produced estimates of zero for the first- and second-level bootstrap estimates of standard error and, therefore, undefined bounds for the studentized intervals. We only considered intervals without undefined bounds when calculating performance metrics.
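A minimal sketch of how this degeneracy arises (the resample below is constructed by hand for illustration):

```r
# A small binary resample with no variability: every bootstrap resample of it
# is identical, so the bootstrap estimate of standard error is exactly zero.
x_star <- c(TRUE, TRUE, TRUE, TRUE, TRUE)
se2 <- sd(replicate(200, mean(sample(x_star, length(x_star), replace = TRUE))))
se2                            # exactly 0
(mean(x_star) - 0.5) / se2     # division by zero: the studentized statistic is unusable
```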

The population proportion, p, is analogous to the mean for binary data. We evaluated the performance of the bootstrap intervals and the z-interval for proportions, also called the Wald interval, under a variety of scenarios. Connecting the notation from Section 3 to this problem, the statistic of interest is the sample proportion and the parameter of interest is the population proportion. We selected samples of several sizes n from Bernoulli populations with proportions p ranging up to 0.5. These values were selected so that the distribution of the sample proportion would vary from right-skewed, when n and p were small, to symmetric, when n was large and p was 0.5.

The distributions of shifted and studentized sample proportions for samples from Bernoulli populations were given back in Figures 0(c) and 1(c), respectively. For small n or small p, the distributions of shifted and studentized sample proportions differed noticeably across populations. These differences subsided slightly as n increased, though they were still noticeable. When n was small, we obtained some zero estimates for the standard error of the sample proportion, resulting in undefined studentized sample proportions. These were removed before plotting, which likely contributed to inconsistencies between the distributions in Figure 1(c). Due to these observations, we again concluded that the assumptions of these bootstrap intervals were not well met for small n but were better met for large sample sizes.

The z-interval for proportions is known to be inappropriate when the sample size is small and p is near zero or one (see newcombe1998two, brown2001interval). When p is near zero, the distribution of the number of successes, and therefore of the proportion of successes, is right-skewed; when p is near one, it is left-skewed. When the sample size is also small, this skewness makes the Normal approximation inappropriate.
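The Wald interval is simple enough to compute directly, which also makes its failure modes easy to demonstrate. The helper below is our own sketch of the standard formula, phat plus or minus z times sqrt(phat(1 - phat)/n):

```r
# Wald (z) interval for a proportion.
wald <- function(successes, n, siglevel = 0.05) {
  phat <- successes / n
  z <- qnorm(1 - siglevel / 2)
  phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)
}
wald(1, 10)   # lower bound is negative: an invalid value for a proportion
wald(0, 10)   # both bounds are zero: equal bounds, zero width
```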

The coverage proportions of the bootstrap intervals and the z-interval for the population proportion are given in Table 4. The coverage proportions of most intervals were quite far from the desired 0.95, regardless of n, p, or B.

The basic bootstrap interval mostly had under-coverage: for small n and p, it never achieved a coverage proportion at or above 0.95, though it got close with 0.938. For larger n and p, the coverage proportions came closer to the desired 0.95. The percentile interval had a mixture of over- and under-coverage: for some values of p there was only under-coverage, but results were not consistent for the other values of p, with both over- and under-coverage as n varied.

The studentized interval mostly had over-coverage, with coverage proportions as large as 0.974 when no intervals were undefined. For small p, very few intervals had defined bounds, especially when n was also small. The z-interval mostly had under-coverage: its coverage proportions nearest the nominal level were 0.940 and 0.961, and its lowest coverage proportions were observed for small p.

Another, possibly more serious, issue that we observed pertained to the behavior of the intervals themselves. The basic intervals contained invalid values (outside [0, 1]) and the percentile intervals had bounds that were exactly equal. The z-intervals also had invalid values and equal bounds and, as we already noted, the studentized intervals had undefined bounds. The frequency with which these issues were observed is given in parentheses next to the coverage proportions in Table 4.

The basic and percentile bootstrap intervals exhibited odd behavior mostly when n or p was small. In one case, over 61% of basic bootstrap intervals contained invalid values and, in another, 59% of percentile bootstrap intervals had equal bounds. However, as n and p increased, this behavior was observed less frequently.

Use of the studentized interval produced many undefined bounds. When the underlying population proportion was near zero, or n was small, some second-level bootstrap estimates of standard error were zero, producing undefined values for the bootstrap t-statistics used to construct the interval, whose divisor is this estimated standard error. If there was also no variability in the original sample, then the t-statistic was 0/0, which is also undefined. Undefined values were removed before calculating the coverage proportion, which is why some coverage proportions were exactly zero or one. The number of intervals removed before calculating the coverage proportion is given in parentheses.

The z-interval exhibited behavior similar to that of the basic and percentile bootstrap intervals. This was especially true when n was small or p was near zero. In these cases, both invalid values and intervals with equal bounds were observed. This behavior, and other issues that arise with the use of the z-interval, are also discussed by newcombe1998two and brown2001interval.

As we mentioned earlier, the achieved significance level of the two-sided bootstrap hypothesis tests is equal to one minus the coverage proportion. That is, since we calculated the proportion of intervals that contained the true parameter value, we also had the proportion of times we would fail to reject this true value if two-sided bootstrap hypothesis tests were performed. Subtracting this from one gave the proportion of times we rejected this true null value. For brevity, we did not tabulate these, since they are simply one minus the values given in Tables 3 and 4. However, note that coverage proportions calculated when many studentized intervals had undefined bounds will less accurately reflect the achieved significance level.
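The duality used above can be made concrete with a small sketch (the interval and hypothesized values here are illustrative, not taken from our simulations):

```r
# A two-sided level-alpha test rejects H0: theta = theta0 exactly when theta0
# falls outside the corresponding (1 - alpha) confidence interval.
reject <- function(theta0, interval) theta0 < interval[1] || theta0 > interval[2]

ci <- c(2.2, 3.4)   # an illustrative 95% interval
reject(3.0, ci)     # FALSE: fail to reject a value inside the interval
reject(4.0, ci)     # TRUE: reject a value outside the interval
```

Averaging `reject(true_value, ci)` over many simulated intervals therefore gives the achieved significance level directly.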

In scenarios where few or no bootstrap intervals were removed, those which had coverage proportions near 0.95 also performed well in terms of significance levels near the desired 0.05. Those that had coverage proportions above or below 0.95 rejected too rarely or too often, respectively. Since many studentized intervals, both for the mean and the proportion, had coverage proportions well above 0.95, the studentized interval was the more conservative method compared with the basic and percentile bootstrap intervals.

For direct comparison, we obtained the rejection rates of the one-sample t- and z-tests for the mean as well as the z-test for the proportion. These are given in Tables 5 and 6. To obtain these, we performed each of these tests 10,000 times under each of the same scenarios used earlier and calculated the proportion of tests which rejected the true hypothesized value.

The rejection rates of the z-test for the mean were near the desired 0.05 in most cases; they were lowest, at 0.041, when small samples were taken from an Exponential population. When samples came from a Normal population, the t-test for the mean also produced rejection rates near the desired 0.05. However, a non-trivial increase in its rejection rates was observed when small samples from an Exponential population were used. This concurs with the decrease in coverage proportions that we observed earlier. The rejection rates of the z-test for proportions were consistently far from 0.05 in both directions, and there did not seem to be a clear pattern to these rates as n or p decreased.

We also investigated the performance of the basic, percentile, and studentized bootstrap hypothesis tests with regard to their ability to reject incorrect hypothesized values for the population mean and proportion. This performance metric was defined earlier as the power of these tests.

Figure 4 gives the rejection rates of the basic, percentile, and studentized bootstrap hypothesis tests for the mean under the same simulation constraints used earlier. The studentized intervals that produced invalid results were removed before calculating these rejection rates, so some rates were calculated out of fewer than ten thousand intervals.

When the underlying population was Normal, the rejection rates of the z- and t-tests were similar, though the former rejected incorrect values slightly more often. The studentized interval contained incorrect values, which were then not rejected, more often than the basic and percentile intervals. The value of B did not seem to have a noticeable impact on the rejection rates. For all methods, the rejection rates improved as n increased.

This was also true when samples came from an Exponential population, but the rejection curves were not nearly as well-behaved. For the traditional tests and the bootstrap hypothesis tests, the rejection rates were far less symmetric about the true mean. Both the distance and the direction in which the hypothesized value strayed from the true mean impacted the results. The rejection rates reached one more quickly as the hypothesized mean moved below the true mean than when it moved above it.

The test based on the studentized interval gave even more conservative results than in the Normal case and, even as the sample size increased, its rejection rates remained the lowest of all methods. The distance between the lowest rejection rates and the significance level, marked by a long-dashed line at 0.05, was quite large for the hypothesis tests based on the basic and percentile bootstrap intervals. This agrees with the low coverage proportions we observed for these intervals earlier in the same population scenario.

The rejection rates of the z-test for proportions and of the bootstrap hypothesis tests for the proportion are given in Figure 5.

Regardless of the method used or the value of B, the rejection rates went to one very slowly for some populations. For others, the rejection rates went to one more quickly as the hypothesized proportion strayed away from the true value p, and larger samples were needed to achieve high rejection rates quickly. The rejection rates of the hypothesis test based on the basic bootstrap interval were slightly less conservative than those based on the percentile bootstrap interval and the z-test. The studentized interval's performance was the most conservative. It had the lowest power since, as the hypothesized proportion strayed from p, its rejection rates were consistently lower than those of the other methods.

Next, we discuss the implications of these results. We also make connections to some results and suggestions pertaining to these bootstrap methods that are found in the literature. Options that teachers have for performing simulations such as these in the classroom are also presented.

5 Discussion

The results that we obtained primarily show that, when their underlying assumptions pertaining to pivotal quantities are not met, there can be non-trivial differences in the performance of the basic, percentile, and studentized bootstrap intervals. We observed decreased coverage proportions, increased Type I error rates, and decreased rejection rates when the null hypothesis was false. In many cases, the frequency with which these were observed increased as n decreased and rarely improved when B increased. Also, there did not appear to be an improvement in performance when these bootstrap methods were used as an alternative to traditional methods whose conditions were broken.

When the parameter of interest was the mean and the underlying population was Normal or Exponential, the shifted sampling distributions were inconsistent across populations, especially for small sample sizes. This provided evidence that the assumptions of the percentile and basic bootstrap intervals were less reasonable for small sample sizes. These intervals had lower coverage proportions and higher Type I error rates when the sample size was small. The assumptions of the studentized interval were reasonable, though, and its coverage proportions were better than those of the z-interval when small samples were taken from an Exponential population.

In some cases, though, these high coverage proportions were not simply due to its superiority over the z-interval, but rather to its larger widths. When the underlying population was Normal, the coverage proportions, Type I error rates, and correct rejection rates of these methods were better than when the population was Exponential. This indicates that non-Normality in the underlying population has an impact on the performance of these methods.

Taking this, and the coverage proportions of the t- and z-intervals, into account, we concluded that their performance was better than that of the basic and percentile bootstrap intervals and comparable to, if not better than, that of the studentized bootstrap interval. Moreover, these bootstrap intervals were not necessarily an improvement over the t-interval in scenarios where it is known to perform poorly, that is, when n is small and the data are skewed.

When the parameter of interest was the population proportion, the assumptions of the basic and percentile bootstrap intervals were still less reasonable for small sample sizes. Their coverage proportions and Type I error rates were not at the desired levels, especially for small n and p. In these cases, issues with the studentized interval also became more apparent and its assumptions were not reasonable. Estimating the first- and second-level standard errors was an issue when n was small and, in many cases, we obtained zero estimates and therefore undefined interval bounds. The reliability of the coverage proportions was negatively impacted by this since many intervals had to be disregarded.

In cases where few or no intervals were thrown out, the coverage proportions were high. However, this was again due to large widths rather than superiority over the other methods. This became apparent when we moved to assess the power of these methods. The studentized interval had lower power than the other methods, indicating that it contained incorrect values, which were then not rejected, more frequently than the other methods. Meanwhile, the z-interval and the basic and percentile bootstrap intervals had comparable power. Again, we found that the bootstrap methods were not necessarily an improvement over the z-interval for proportions when it is known to perform poorly, that is, when n and p are small.

It is worth noting that the behavior of the studentized intervals may have differed had we used a formula to estimate the standard error of the original and bootstrap sample statistics. However, in many cases such a formula does not exist; therefore, we wanted to investigate the performance of the studentized interval when bootstrap estimates of standard error are used.

The performance of the studentized interval using a formula for the standard error of the mean was investigated by hesterberg2015teachers, who found that the studentized interval (called the bootstrap t interval there) performed well for small samples from Normal and Exponential populations and outperformed the percentile and basic bootstrap intervals. Our results expand on this by analyzing its performance when the data are binary and by including direct comparisons with t- and z-methods.

The results that we obtained further emphasize the falsehood of the claim discussed by hayden2019questionable that "they are more accurate than traditional methods for small samples". When the sample size was small, the metrics that we observed for the bootstrap intervals were not always an improvement over those of traditional methods. Even when they were, other issues came to the forefront, such as very large widths or undefined bounds. Moreover, the assumptions behind these intervals, which pertain to pivotal quantities, were less reasonable for small sample sizes. Our results show that these bootstrap methods exhibit non-trivial changes in performance when their assumptions, given in Tables 1 and 2, are or are not met. It is important that the bootstrap intervals we discussed be constructed using quantities that are pivotal after shifting or studentization.

Though the unfavorable results that we obtained were generated under specific settings, they still show that there are situations in which the bootstrap can fail, especially when used for quantities which are not pivotal when the sample size is small. Therefore, it is pertinent that the assumptions of these bootstrap methods be discussed when teaching them so that students use more caution when applying them and are aware of changes in their performance when these assumptions are not met. Though their assumptions may be hard to verify in some cases, students should still be made aware of them and informed of cases where they are known to be unreasonable, such as those we reported and others given in the literature. This can help students to understand that these methods are not a direct solution when the assumptions of traditional methods are not met, but rather another option.

5.1 Tools for teaching bootstrapping

Section 4 showcased some ways in which simulations can be used to help students understand the assumptions behind these bootstrap intervals, verify whether they are met, and comprehend the repercussions of applying them when they are broken. Performing similar simulations with students in the classroom can help them better understand the methods taught so that they can reap the many benefits of applying the bootstrap and other statistical computing methods appropriately.

In order to assist with this, the functions used for the simulations in this article were compiled into an R package called bootEd. These functions are straightforward applications of the interval construction methods discussed in Section 3.1. We give a minimal example of how to use the package here. More information is given in the package repository at github.com/tottyn/bootEd. The code used to perform the simulations in this article is also given there. First, we install and load the package:

devtools::install_github("tottyn/bootEd")
library(bootEd)


Then we construct a 95% percentile bootstrap interval for the population median using 999 bootstrap samples:

percentile(sample = rnorm(n = 50, mean = 3, sd = 2.5), parameter = "median", B = 999,
siglevel = 0.05, onlyint = FALSE)


The sample argument takes the vector of data. The parameter argument can take the name of any base R summary function (e.g. sd, mean) or any user-defined summary function that returns a single value. The arguments B and siglevel take the number of bootstrap samples and the significance level, respectively. When the onlyint argument is set to TRUE, only the bootstrap interval is returned, which is useful for performing simulations.

The following output and plot are given:

The percentile bootstrap interval for the median is: (2.199231, 3.386698).
If it is reasonable to assume that the shifted sampling distribution of the statistic
of interest is symmetric and does not depend on any unknown parameters, such as the
underlying population variance, then this method can be used.

The output of the function is not only the bootstrap interval, but also information that can provoke thought about whether the assumptions of the method are reasonable. Verifying assumptions pertaining to the shifted or studentized sampling distribution can be difficult without prior knowledge of the underlying population. The bootstrap distribution is our best estimate of the sampling distribution and, though it is not exact, it can at least be used to gauge whether the assumptions of the method are reasonable.

In this example, the bootstrap distribution is not symmetric. Therefore, either the sampling distribution is also not symmetric or, if it is, the bootstrap distribution is not an accurate reflection of it. Either way, the assumptions behind the percentile bootstrap interval are not met. Each of the interval construction methods discussed in this article has a separate function in the package with similar syntax, output, and plots. These can be used to gain insight into the plausibility of the assumptions behind these methods so that students can learn to use them responsibly.

When selecting a tool for statistical computing, teachers should consider the scope of the course in which the tool will be used and the computational backgrounds of the students. Teachers should avoid tools that bind students in the "ritualized thinking" that, as learnsci indicates, the teaching practices of traditional methods have unfortunately led to. Also, the criteria and goals set forth in the statistics education literature should be brought into consideration.

For example, a goal given in the GAISE report is that students "should be able to interpret and draw conclusions from standard output from statistical software packages" (gaise, p. 8). Important aspects of a contemporary statistical computing tool, which were discussed in great detail by mcnamara2018key, should also be considered. These include accessibility, ease of entry, built-in documentation, and adjustable plot creation.

Other packages that are useful for teaching bootstrapping in introductory statistics courses include: boot (bootpackage), wboot (wBootpackage), simpleboot (simpleboot), bootstrap (bootstrappackage), mosaic (mosaicpackage), and resample (resamplepackage). Though the mosaic package performs many tasks that do not pertain to bootstrapping, its do function is useful for rerunning code multiple times, as is needed for creating a bootstrap distribution. Also, the resample package has options for multiple resampling methods, including the jackknife and permutation tests, as well as capabilities for both one-sample and two-sample problems.

6 Conclusions

In this article, the percentile, basic, and studentized bootstrap intervals and their corresponding hypothesis tests were discussed. We showed that these methods have important underlying assumptions which should be discussed in the classroom. Performance metrics such as the coverage proportion, Type I error rate, and power were obtained under a variety of simulation scenarios. It was shown that the performance of these intervals differs non-trivially when their assumptions pertaining to pivotal quantities are or are not met. Specifically, when the sample size was small, these assumptions were less reasonable.

The performance metrics of their traditional counterparts, the t- and z-methods for the mean and the z-interval for proportions (Wald interval), were also obtained under the same simulated scenarios. We found that when the assumptions of the traditional methods were not met, these bootstrap intervals were rarely an improvement. Furthermore, their performance was also impacted by small sample sizes and non-Normality.

When teaching these bootstrap methods, it is pertinent that teachers emphasize that they are not substitutes for traditional methods nor are they solutions for issues that arise from having a small sample size. Their assumptions pertaining to pivotal quantities should be clearly communicated in lectures, course materials, and textbooks so that students leave the classroom with a broader understanding of these methods and how they relate to traditional methods. Teachers should aim to make students well informed about situations where these methods are already known to perform poorly and equip them with the ability to judge whether these methods are best for a given situation.

These methods can also be used as a conceptual stepping stone to teaching more traditional methods. For example, hesterberg2015teachers suggests that the distribution of studentized bootstrap sample statistics can effectively be used to evaluate whether CLT-based methods are appropriate for a specific data set. This could be done in addition to checking whether the sample data are Normal or the sample size is above 30. When a formula for the standard error is not available, however, the computational intensity and the observed small-sample inaccuracies of the second-level bootstrap estimate of standard error should be kept in mind.
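A sketch of this diagnostic (our own code, using the formula standard error s/sqrt(n), as in the formula-based variant; the population and settings are illustrative): if the CLT-based t interval were appropriate, the bootstrap distribution of the t-statistic would resemble a t distribution with n - 1 degrees of freedom.

```r
# Bootstrap distribution of t* = (x-bar* - x-bar) / (s* / sqrt(n)) for a
# skewed sample; compare its quantiles against the symmetric t quantiles.
set.seed(1)
x <- rexp(30, rate = 1)
n <- length(x)

t_star <- replicate(2000, {
  xs <- sample(x, n, replace = TRUE)
  (mean(xs) - mean(x)) / (sd(xs) / sqrt(n))
})
quantile(t_star, c(0.025, 0.975))  # compare against c(-1, 1) * qt(0.975, n - 1)
```

Marked asymmetry in these quantiles, relative to the symmetric t quantiles, suggests that the CLT-based interval is not yet appropriate for the data at hand.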

Our results pertain to the performance of the basic, percentile, and studentized bootstrap intervals for the population mean or proportion. We studied Bernoulli populations with proportions up to 0.5 and did not investigate the performance of these methods for larger proportions; their performance could differ as the population proportion increases past 0.5. We also have not discussed the performance of the "better bootstrap intervals" introduced by tibshirani1993introduction, which are said to be an improvement. There is room for comparison between those bootstrap intervals and the ones that we have discussed here.

Future work could include an assessment of the performance of these methods when order statistics, such as the median, or non-location parameters, such as the correlation and variance, are used. When the data are skewed or there are outliers, we may encourage students to use the median as a measure of center. Statistical methods for the sample median that may be taught in undergraduate introductory statistics courses include the bootstrap intervals we have already discussed and the Sign test. The latter is known to have performance issues, in terms of power and Type I error, when observations are tied (see coakley1996versions, fong2003use). A comparison of the performance of the bootstrap intervals and the Sign test in these scenarios could also be undertaken.

Also, an effort could be made to assess the performance of these methods in two-sample scenarios and compare their performance to that of permutation or two-sample traditional methods. An assessment of another claim discussed by hayden2019questionable, that these methods are easier for students to understand, could also be undertaken. A qualitative analysis of student understanding and engagement with different forms of instruction and content could be used to accomplish this.

The use of statistical computing in the classroom equips students with a variety of tools to use in many situations. It also increases students’ retention of concepts and aids the teacher in explaining complex topics. With this article, we aim to benefit both teacher and student by making them aware of the assumptions behind simple bootstrapping methods which pertain to pivotal quantities. We did this so that they can better teach and implement bootstrapping with their introductory statistics students. It is important that these students understand the usefulness and the correct scope of this tool before leaving the classroom so that they are well equipped to handle a variety of situations.

Supplemental Materials

bootEd:

The R package bootEd is available at https://github.com/tottyn/bootEd. The GitHub repo contains instructions on downloading the package and information for getting started. (Web link)