Statistical Inference on Tree Swallow Migrations, Using Random Forests

10/26/2017 ∙ by Tim Coleman, et al. ∙ 0

Species migratory patterns have typically been studied through individual observations and historical records. In this work, we adopt a data driven approach to modelling the presence of the North American Tree Swallow, Tachycineta bicolor, throughout the Eastern United States, using data collected through the eBird project at Cornell University's Lab of Ornithology. Preliminary models suggest a qualitatively different pattern in Tree Swallow occurrence between the years of 2008 to 2009 and 2010 to 2013. We implement a global hypothesis test based on the functional predictions of Random Forests (RFs) to evaluate whether this effect is significant or not. In order to better understand the effect of climate change, we also conduct a test evaluating the effect of daily maximum temperature anomaly in predicting tree swallow occurrence. We implement a local test using the asymptotic normality of the predictions of a modified RF, which relies on subsampled trees. This test is conducted at 6 locations in space throughout the northeastern U.S. Finally, we present visual evidence that maximum temperature is affecting the predictions of RF models via a heat map of the differences in RF predictions. We also demonstrate that there is a spatial pattern in the effect using Moran's I statistic.



There are no comments yet.


page 3

page 13

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

1.1 Ornithological Background and Motivation

Tree Swallows (Tachycineta bicolor) are migratory aerial insectivores. In a recent breeding season study, Winkler et al. (2013) suggested that maximum daily temperature during the breeding season had a significant effect on the abundance of the flying insects that are the primary food source of Tree Swallows. This local-scale study conducted in upstate New York established how cold snaps, defined as two or more consecutive days when the maximum temperatures did not exceed 18.5C, can result in a diminished food supply, thereby suggesting an indirect link between lower temperatures and lower fledgling success. Other work supports the hypothesis that migratory birds, like Tree Swallows, have breeding patterns that are affected by climate change, (Dunn and Winkler, 1999; Hussell, 2003). While these papers focus heavily on breeding success and food availability, in this work, we investigate the effects of temperature on regional and local patterns of species occurrence during the autumn migration.

Most ornithological studies rely on controlled, local or regional level studies during a single season of the year, limiting the spatial and temporal scope of the analysis. The eBird project (Sullivan et al., 2009, 2014) hosted by the Cornell University Lab of Ornithology is a global bird monitoring project that allows for analysis on a much larger scale. This citizen science project compiles crowd-sourced observations of bird sightings, opening the door for a more data-driven approach to formally investigating scientific questions of interest. The eBird project harnesses the efforts of the bird-watching community, by encouraging bird-watchers (birders) to record checklists of the species they encountered on a particular outing. This allows for large-scale observational studies in lieu of more traditional small-scale experimental studies. These data have been used in a range of applications such as describing bird distribution across broad spatiotemporal extents (Fink et al., 2010, 2018), prioritizing priority habitat to conservation (Johnston et al., 2015), and identifying continental-scale constraints on migratory routes (La Sorte et al., 2016).

Using data collected as part of the eBird project, we study Tree Swallow populations during the autumn migration. During this time, the species are believed to be facultative migrants. Facultative migration is an opportunistic migration strategy where individuals migrate in response to local conditions, like the prevailing food supplies or weather conditions. Specifically, we study the low elevation New England / Mid-Atlantic Coast stretching north from the Chesapeake Bay to Boston, known as Bird Conservation Region 30 (BCR30) (Sauer et al., 2003), that forms the northern extent of the Tree Swallow winter range (Figure 1).

Figure 1: Map of the eastern United States. The study region, BCR30, is highlighted in yellow.

Anecdotal accounts from bird watchers in this region suggest that Tree Swallows inhabit this region for prolonged autumn periods only during relatively warm winters. In particular, in the years 2008 and 2009, it was widely believed in the ornithological community that the species departed earlier than usual in starting its southern migration. Alternatively, mortality in the northern parts of the range in those years may have been higher. Explanations for this phenomenon were not immediately obvious, but many believed that local temperature fluctuations were a principal cause. Thus, our primary objectives in this work are twofold: (1) To formally test whether the temporal pattern of Tree Swallow occurrence during the autumn migrations of 2008 and 2009 were substantially different than what would be expected during a typical autumn migration and (2) If there are differences, to investigate the association between local-scale patterns of occurrence and daily maximum temperature across broad geographic extents.

1.2 Challenges in Modelling Tree Swallows

While the ecological questions in the previous section are relatively straightforward to pose, providing accurate answers and provably valid statistical inference is challenging. In general, we expect that as daily temperatures decrease, the occurrence rate of Tree Swallows should also decrease as the species gravitates towards regions with more plentiful food or suffers higher mortality where food availability has been driven down by cold maximum temperatures. However, there are many strong sources of variation affecting the observed local-scale spatiotemporal patterns of species occurrence during the migration; sources of variation that can modify and mask the local-scale effects of temperature. Ecological patterns of local-scale occurrence are affected by elevation, land cover types (e.g. open fields vs forests), and weather. Because of the difficulty finding and identifying birds in the field, variation in detection rates further complicates modeling and inference about the underlying ecological processes. Based on previous work (see, for example, Zuckerberg et al. (2016)), we expect such effects to appear as complex, high-order interactions among the available covariates and that many of these effects and interactions will vary throughout the autumn migration.

Thus, one of the main analytical challenges is to develop models that can exploit rich covariate information to account for varied and complex sources of variation while facilitating statistical inference about potentially complex, local-scale effects. The large amount of available data, together with the presence of both nonlinear and high-order interactions, complicates the use of most traditional parametric and semiparametric models for this task. Thus, we rely on the more flexible alternative offered by random forests (Breiman, 2001)

which are designed to capture complex covariate interactions and naturally accommodate large numbers of predictors. Random forests have a well-documented history of empirical success and are generally considered to be among the best “off-the-shelf” supervised learning methods available

(Fernández-Delgado et al., 2014). This strong track record of predictive accuracy makes them an ideal “black-box” model for complex natural processes. Similar tree-based methods have also been used previously with success in other eBird related work (Robinson et al., 2017; Fink et al., 2018).

Though black-box models, almost by definition, are generally not amenable to statistical inference, recent asymptotic results from Mentch and Hooker (2016) and Wager et al. (2014)

on the distribution and variance estimation of predictions resulting from RF models provide a formal statistical framework for addressing our primary questions of interest. Moreover, as we demonstrate, traditional non-parametric inferential procedures can also be used to help draw inferences from these complex models.

In this paper, we begin with a brief overview of the data and available covariate information in Section 2. In Section 3, we provide further evidence for use of random forests to answer the questions posed earlier. We then construct preliminary RF models to assess the influence of temperature and produce maps of prediction differences between models to understand the spatial patterns in the effect of maximum daily temperature. In Section 4, we develop a permutation-style test to investigate how unusual the 2008 and 2009 migration patterns appear to be by treating the RF predictions over time as functional data. Finally, In Section 5, we make use of recent asymptotic results to test the significance of maximum daily temperature at a variety of local test locations throughout BCR30.

2 Data Overview

The eBird data we utilize contains both spatial and temporal information as well as covariates that describe the observation/detection process of the observer. This last set of predictors is included in order to account for variation in detection rates, a potential confounder when making inference about species distributions. Our outcome of interest is the probability that at least one Tree Swallow is observed given the spatial, temporal, and detection process information. We refer to this probability as

occurrence. Because we are interested in the eastern autumn migration, we restrict our attention to eBird observations located in BCR30 that were recorded on or after the 200th day of the year between the years 2008-2013. In total, the full dataset, , contains 173,002 observations on 30 variables.

Spatial information is captured by land cover and elevation data. To account for habitat-selectivity each eBird location has been linked to the remotely-sensed MODIS global land cover product (MCD12Q1) (Friedl et al., 2010). Here we use the 2011 MODIS land cover data as a static snapshot of the landcover. These landcover predictors were associated with eBird observations, collected from 2004 to 2012. Finally, we use the University of Maryland (UMD) classification scheme (Hansen et al., 2000)

to classify each 500m

500m pixel (25 hectare) as one of 14 classes, including classes such as water, evergreen needleleaf forest, and grasslands.

We summarized the land cover data as the proportion of each land cover class within a 3.0km 3.0km (900 hectare) pixel centered at each location using FRAGSTATS (McGarigal et al., 2012). Summarizing the land cover information at this resolution reduces the impact of erroneous land cover classifications and user-specified location information while accounting for the motility of birds.

Temporal information is included at three resolutions. At the finest temporal resolution, the time of the day at which the observation was made is used to model variation in availability for detection; e.g., diurnal variation in behavior, such as participation in the “dawn chorus” (Diefenbach et al., 2007), may make species more or less conspicuous. The day of the year (1-365) on which the search was conducted is used to capture day-to-day changes in occurrence, and, similarly, the year of the observation is included to account for year-to-year differences. For our purposes, we restrict our attention to the day of year (DoY) and the year itself, corresponding to our interest in anomalies in the fall migration.

Temperature data was collected from the DayMet project hosted by Oak Ridge National Lab (Thormton et al., 2017). The data includes daily maximum (max_temp), minimum, and mean temperature for each day in the training period. We also estimated an expected daily maximum temperature for each day by taking the mean daily maximum temperature for each eBird location from 1980-2007. The anomaly relative to this expected maximum (max_temp_anomaly, defined as max_temp minus the 1980-2007 normal max_temp) is of particular interest since max_temp alone is strongly correlated with DoY across years. To account for the effects of elevation, each eBird location is associated with the 30m gridded elevation from the ASTER Global Digital Elevation Model Version 2.

Finally, there are three user effort variables included in the model to account for variation in detection rates: the hours spent searching for species (eff_hours), the length of transects traveled during the search (eff_dist), and the number of people in the search party (n_obs). In addition, an indicator of observations made under the “traveling count” protocol was included to allow the model to capture systematic differences in species detection between the the counts recorded by traveling and stationary birders.

3 Preliminary Models

In this section, we provide further evidence for the use of random forests to model Tree Swallow migration. As noted earlier, random forests are typically used in datasets with many observations on many predictors, whose effect on the response may be a complex, nonlinear function of the predictors. As demonstrated in Fernández-Delgado et al. (2014), though random forests do not universally dominate other methods, they are exceptionally accurate and robust supervised learners.

Fundamentally, however, our interest in this work is in scientific understanding and statistical inference and certainly there are numerous alternative statistical models that provide a more direct means of accomplishing this. However, to trust such inference, we must trust that the model selected is able to accurately capture the complex underlying mechanisms. This suggests the question: are there notable gains in accuracy by using random forests, or would a more straightforward statistical model suffice? To answer this using the eBird data, we first perform a 3-fold cross validation (CV) analysis using a variety of popular techniques for predicting binary outcomes. In particular, we train a random forest with mtry = 5 (not chosen by cross validation) and 500 trees, a

-nearest-neighbors (KNN) regression model with

chosen from

, a 3-layer artificial neural network (ANN) with the number of neurons chosen from

at each layer (Bergmeir and Benítez, 2012)

, Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and a Generalized Additive Model (GAM) with degrees of freedom chosen from

(Hastie, 2017)

. Finally, we train an elastic-net penalized logistic regression (GLMNet) model

(Friedman et al., 2010), with weights , (0 corresponds to the LASSO,

corresponds to Ridge Regression), with cost parameter

. This model fits coefficients to all covariates and also every two-way interaction, to parsimoniously select the strongest interaction models. These models are trained using the caret package (Kuhn, 2017) in R, and, with the exception of random forests, the parameters chosen reflect those which lead to the smallest CV estimate of Root Mean Squared Error (RMSE). We also report the CV estimate of the Mean Absolute Error (MAE).

Figure 2: 3-fold CV estimates of RMSE and MAE for various predictive models, plotted in descending RMSE order.
Random Forest mtry = 5 0.22280 0.10863
ANN (100, 30, 15) 0.25300 0.11465
KNN () 0.25558 0.12195
GLMNet ( = 0) 0.27260 0.16163
GAM () 0.27380 0.17412
LDA 0.35350 0.19400
QDA 0.44427 0.24541
Figure 3: 3-fold CV estimates of RMSE and MAE for various predictive models, tabulated in ascending RMSE order. Parameters listed correspond to those chosen by the CV analysis, except for the random forest mtry parameter.

The results of this analysis are displayed in Figure 3 and are tabulated in Table 3. Even without tuning, the random forest model attains the lowest RMSE and MAE scores with the other flexible models, such as KNN and the ANN, not far behind. We see that the GAM and GLMNet models are similar in performance, with LDA and QDA lagging severely behind. Notably, the GLMNet model selected a tuning parameter that maximized model complexity, i.e.

, even with all two-way interactions considered. This suggests that a parametric model is unreasonable due to the complex interactions and functions of the covariates that go into predicting occurrence. An off-the-shelf random forest produces meaningful performance gains over other models which were tuned. The strong predictive performance of random forests, combined with the recent advancements in inferences for random forests, makes them an ideal model for drawing conclusions about tree swallow migrations. As a first step analysis, we make use of traditional means for drawing inferences from black-box methodology of RFs with tools such as partial effect plots and conclude with a spatial analysis of the effect of


. These analyses provide heuristic and provisional answers to the study questions, and motivate the more formal testing procedures developed and executed in later sections.

3.1 Inspecting Annual Migration Differences

Recall that the initial motivation for this study was a widely perceived difference in Tree Swallow autumn distribution patterns in the years 2008 and 2009. In particular, it was believed that Tree Swallows had remained in the northern regions for a shorter period in the fall during 2008-2009, but the ornithological community was unsure of the mechanism(s) behind this earlier departure/decline. Accordingly, we begin by partitioning the BCR30 Tree Swallow data into two training samples: one containing observations from 2008-2009 and the other containing the observations from 2010-2013. Formally, denote the entire training set as so that we can write our partitioned training datasets as and , respectively, with . For each day of the year beginning with DoY 200, 100 points were selected at random from to serve as a validation set. These 16600 points were then removed from the corresponding training set.

We first construct a RF on each of these temporally divided training sets. It is important to note however, that the eBird project has grown substantially in popularity since its inception in 2002 and thus later years contain many more observations than earlier years. In particular, contains a total of 21,907 observations, while contains 151,095. Because a RF trained on a larger dataset may be more stable, any differences observed between predictions generated by the two datasets may be partially explained by the difference in data sizes. To account for this, we also selected (uniformly at random, without replacement) a subsample of size 21,907 from the training data and with it, constructed a third RF. Predictions were made at all points in the validation set, and averaged by day. The results are shown in Figure 4.

Figure 4: Predicted occurrence by random forests trained on , , and a subsample of . Predictions shown are kernel smoothed estimates of the prediction surface using a Gaussian kernel with a bandwidth of 5.

From Figure 4, we see that the RF trained on predicts the largest occurrence until approximately DoY 285, after which the 2010-2013 forests are higher until approximately DoY 320, from which point the differences appear negligible. This seems to support the hypothesis that during the years 2008 and 2009, Tree Swallows remained in northern regions longer before departing more quickly. Importantly, the predictions from the RF trained on the reduced dataset from 2010-2013 forest differs only very slightly from those generated by the RF trained on the full data, suggesting that the more substantial departures observed between predictions generated by RFs trained on and have little to do with the differing training sample sizes.

An additional focus of our work is to investigate the significance and impact of maximum temperature (max_temp) in predicting occurrence. Using the entire training dataset, we estimated a partial effect function of max_temp with DoY removed as a covariate (Figure 5), to account for any confounding effects between DoY and max_temp. These functions are estimated by discretizing max_temp in the training data, , into a grid. For each grid point, predictions are made at each observation whose discretized max_temp corresponds to the grid value. The partial effect value at the grid point is then the average of all the predictions at that grid point. As expected, we see a steady increase in occurrence with max_temp starting around 7C which appears to begin leveling off around 32C. Interestingly, the sharpest increase appears to occur around C, which corresponds to a period of heightened insect activity suggested in Winkler et al. (2013). The partial effect plot further suggests that max_temp has an important effect on occurrence.

Figure 5: Partial effects plot for max_temp for a RF trained on the entire dataset. Shaded region corresponds to range of htemperatures where potentially flying insects shift from inactive to flying (Winkler et al., 2013).

Finally, recall that out-of-bag (oob) variable importance measures are a popular ad hoc measure that usually accompanies random forest predictions. Breiman (2001) introduced these oob measures as a means of quickly assessing variable importance by calculating the decline in prediction accuracy observed when the values of a particular variable are permuted amongst the oob samples. According to this metric, max_temp was determined to be the most important covariate, though we note this with caution as the oob measures are often unreliable. A substantial amount of previous work – see Strobl et al. (2007); Nicodemus et al. (2010); Toloşi and Lengauer (2011); Hooker (2007) for popular examples – has demonstrated serious flaws with such measures, most notably that they tend to inflate the importance of groups of correlated covariates. This is especially problematic in our context, where daily maximum temperature is one of several highly correlated spatial covariates of varying importance.

3.2 Visualizing the Spatial Effects of Maximum Temperature

Here, we examine the spatiotemporal effect of max_temp on occurrence throughout the Northeast. We construct two RFs, one with the original data and one with max_temp permuted, and compare the predictions generated by these two models. In both cases, we exclude DoY as a predictor. To remove variation associated with the detection process, a nuisance when investigating the effect of maximum temperature, we set the observer characteristics (n_obs, eff_hours, and eff_dist) to 1 which coincides with typical levels for many ebird observers. The test set consists of the points in the 3km grid within the longitude/latitude region, where landcover characteristics (UMD classes, elevation) are concatenated with max_temp. The values assigned to max_temp

in the test set were imputed from the 2014 DayMet observations, providing temperature information that was collected independently of the

max_temp values in .

We then make predictions using both forests and calculate the difference in predictions. Formally, for a test point in the grid , define the difference in predictions between the original and permuted forest as

where denotes the original training data, denotes the training data with max_temp permuted, and and denote the RFs trained on such data, respectively. In order to examine the temporal dynamics, we create 9 test grids, each 20 days apart, throughout the fall.

Figure 6: Predicted occurrence differences () between original and permuted RFs calculated at 9 time points throughout the fall. Red indicates larger predictions from the original RF; blue indicates larger permuted RF predictions.

The resulting heat maps of prediction differences demonstrating the effects of max_temp are shown in Figure 6. Red indicates the predictions of the original RF being higher than the permuted RF; blue indicates that the permuted forest made larger predictions. We see that earlier in the fall followed by roughly equal predictions around day 161 of the fall, and finally, by winter the original RF is making substantially lower predictions than .

Under the assumption that max_temp is unrelated to the response and therefore simply noise, we might expect that the differences in predictions between the original and permuted RFs across space are also simply random, uncorrelated noise. A Moran’s I test, (see Appendix A), provides strong evidence that the differences plotted in Figure 6 exhibit spatial autocorrelation. The purpose of this test is to search for local effects of max_temp in RF predictions; if max_temp is meaningful in predicting occurrence, we would expect the differences in predictions between two points near each other to be more strongly correlated than two points further away. This provides statistical backing to what is clear from Figure 6: there is certainly a “clumping” among the differences between the random forests, suggesting that max_temp’s effect has local homogeneity.

The results of the Moran’s I test should be interpreted with some caution; the predictions from the forests and are functions of the test set max_temp which itself has strong spatial correlation. Thus, the predictions and their differences may exhibit correlation regardless of the true association between occurrence and maximum temperature. A more direct approach to assessing the predictive importance of maximum daily temperature is provided in Section 5.

4 Testing for Regional Differences in Occurrence

Figure 4 in the previous section shows the predicted occurrence over time by both a RF trained with only as well as predictions from a random forest trained on . We now devise and implement a permutation test to explicitly assess whether the prediction curves in Figure 4 exhibit differences that could plausibly be due to chance. Here we consider regional hypotheses, meaning that we investigate differences in species occurrence throughout the entire BCR30 region as opposed to at a specific location or set of locations within that region.

4.1 Testing Procedure and Data

Treating the random forest prediction curves in Figure 4 as a function of time, denote the curve trained on the 2008-2009 data as and that trained on the 2010-2013 data as for . To account for differing training set sizes and also in the interest of both computational efficiency and being conservative in our testing procedures, we now construct a reduced training set from the data containing the same number of observations as . We construct this reduced set by taking each observation in and drawing a radius around it in both space (0.2 decimal degrees in both latitude and longitude, an area of approximately 352 km) and time (2 days). We then locate all observations from within this radius and select from these an observation uniformly at random, without replacement. This produces a sort of “nearest neighbor” training set, , with roughly the same spatiotemporal distribution of observations, allowing us to more closely examine the influence of the other covariates on the functional observations.

Our test set consists of points, with points taken for each day in the fall. We construct this test set by sampling 1000 locations from the grid used to make the heat maps in the previous section, referenced with their land cover and elevation characteristics, as well as max_temp for that day. The maximum temperature information included is the expected daily maximum temperature, estimated from the 1980-2007 temperature information provided by DayMet. The variables associated with the eBird user (e.g. eff_dist, eff_hours, and n_obs) are set to 1 uniformly, to again represent typical eBird user levels.

To make our prediction function more explicit, let denote the prediction function of the RF trained on . Then is defined as

where is a point in the test set corresponding to time . Thus, since the test points are stratified by time, denotes the average over predictions made at all 1000 test points on each day and therefore represents a time-smoothed version of the raw RF prediction function. The function is defined in exactly the same fashion for a RF trained on .

Treating the observed functions and as observations from functional distributions and , our goal is to determine whether these distributions that generated our observed prediction functions are, in fact, the same. More explicitly, we consider hypotheses of the form


Rejecting this null hypothesis supports the notion that migration patterns in the years 2008 and 2009 differed significantly from those observed from 2010-2013.

Recall from Figure 4 that in early fall (DoY 200-264), it appears that . Later in the fall (DoY 265-310), it appears and finally as winter sets in (DoY 311-365), we see . We therefore partition our time frame into three disjoint time periods, , , and , and let be an indicator function for each period. We then consider the restricted functional observations

which are defined in the same fashion for for . Each of these restricted functional observations are respectively from some distribution or , leading naturally to hypotheses of the form


for each .

To address each of the above hypotheses, we take a permutation-style approach. We begin by calculating the prediction functions over time using the original datasets and then, for each of many iterations, we permute the year covariate, re-partition the data into the two groups consisting of data from 2008-2009 and 2010-2013, and construct the new RF prediction functions. This generates a permutation distribution, , and under , .

Permutation tests for hypotheses of this form reject

if the test statistic,

, calculated on the original data, falls in the extreme (upper or lower

) quantile of the permutation distribution of test statistics. It can be shown that this test has exact size

when all possible permutations are used; see Chung and Romano (2013); Lehmann and Romano (2006) for more details on the theory of permutation tests. For our purposes, we approximate by considering 1000 permutations, so our test has approximate size . Permutation tests offer flexibility in the choice of test statistic, and different test statistics offer different levels of power. As such, we measure distance between and with a variety of statistics. In particular, we consider the following three measures of functional distance:


The measures in (3) and (4

) refer to the Kolmorgorov-Smirnov and Cramér-von Mises statistics traditionally used to determine whether an empirical cumulative distribution function could have come from a particular class. The use of these statistics for two sample functional testing procedures was studied in

(Hall and Van Keilegom, 2007). The and statistics are calculated across the full time period, testing for an overall difference in the underlying distributions and therefore are used to evaluate the hypotheses in (1). In contrast, our raw distance measures are designed to test for equality of the underlying functional distributions only in time periods , respectively. Based on the visual evidence in Figure 4, we may expect to see a difference during the first two time periods, but likely not during the third time period.

4.2 Global Test Results

(a) , with year unpermuted
(b) , with year permuted
Figure 7: Functional prediction curves generated by RFs all covariates, including max_temp and DoY. Functions were smoothed with a Gaussian kernel with bandwidth 7.5.

We utilize the randomForest package in R to calculate the RF predictions and then carry out the permutation tests described above (Liaw and Wiener, 2002). In order to maintain consistency with the local tests in the following section, we make all covariates available for splitting at each node so that the procedure reduces to what we would more commonly refer to as bagging. This reduces the number of trees needed to attain predictive stability. As in Section 3.1, the RFs are trained on the entire set of predictors, including both DoY and max_temp. The associated functions, are shown in Figure 7 for forests trained on the original data, and for forests trained on a particular (randomly selected) permutation. Based on a visual inspection, it appears that for RFs trained on the original 2008-2009 data () and the reduced nearest neighbor 2010-2013 data (), until around DoY 290, with negligible differences thereafter. However, it is unclear whether these visual differences in early autumn are substantially larger than what could be expected by random chance. The p-values associated with the permutation tests are shown in the second row of Table 1. Based on these results alone, there does not appear to be strong evidence of a difference in the underlying functional distributions, even early in the migration period.

(a) Original data, max_temp removed
(b) Permuted data, max_temp removed
(c) Original data, DoY removed
(d) Permuted data, DoY removed
Figure 8: Prediction functions generated by RFs with DoY included and max_temp removed (top row, (a) and (b)) and with max_temp included and DoY removed (bottom row, (c) and (d)).

To account for the correlation between max_temp and DoY, we conduct two additional followup versions of the original permutation tests: once with DoY included and max_temp removed and once with max_temp included and DoY removed. The RF prediction functions corresponding to these situations with both the original and (randomly selected) permuted datasets are shown in Figure 8. Here we begin to see evidence for the importance of max_temp: when only DoY is used as a predictor, the prediction functions trained on the original datasets closely resemble those trained on the permuted data. However, when max_temp is included and DoY is removed, we see a clear difference in predicted occurrence until late in the season which closely matches the anecdotal accounts.

Test Statistic
DoY & max_temp included 0.247 0.196 0.075 0.768 0.829
DoY included; max_temp removed 0.540 0.565 0.182 0.676 0.677
max_temp included; DoY removed 0.062 0.070 0.055 0.020 0.711
Table 1: Permutation test p-values with all covariates included in the datasets (row two), DoY included and max_temp removed (row three), and max_temp included and DoY removed (row four).

The p-values resulting from these additional permutation tests appear to tell a similar story. From Table 1 we see that when max_temp is removed, the smallest p-value is only 0.182 corresponding to the test for raw differences in time period one as measured with . However, when max_temp is included and DoY removed, the largest p-value among the first four tests is only 0.07. The p-value from the final test for raw differences in the third time period (measured by ) is large at 0.711, but recall that this is what was expected as the prediction curves appear very similar in all cases late in the season.

Before continuing with the localized tests the in the following section, it is worth pausing to acknowledge that the p-values from the third set of permutation tests, though reasonably small and substantially lower than in the other tests, fail to surpass the commonly accepted threshold in all but one instance and are a bit larger than we would ideally like to see in order to decisively conclude that migration patterns differed significantly in the two sets of years. However, note that these tests are conservative in two ways. First and most obviously, permutation tests themselves suffer lower power than their parametric counterparts. More subtly, the RF prediction curves generated by the data from 2010-2013 were not trained on the full available dataset, but for purposes of computational efficiency, were trained on a carefully selected subset designed to spatiotemporally mimic the observations collected in 2008-2009. Given this, it is reasonable to interpret the results of these tests as providing at least moderate evidence for a difference in migration patterns that is influenced by max_temp. The localized tests in the following section allow for a more direct means of measuring the precise questions of interest and provide far more decisive evidence.

5 Localized Hypothesis Testing

The tests carried out in the previous section average out occurrence and the covariate effects across the entire BCR30 spatial region, neglecting the local factors that likely play a role in Tree Swallow migration. Here we evaluate the effects of max_temp at a local spatial level by reexamining the above hypotheses at small sets of specific test points that are local with respect to both space and time. In order to conduct these tests, we make use of recent asymptotic theory which provides a closed form distribution for predictions originating from subsampled ensemble estimators. We summarize these recent theoretical advances in the following section before presenting our testing procedure.

5.1 Subsampled Random Forests

Instead of bootstrapping, we now consider constructing our base learners – randomized trees in the case of random forests – with proper subsamples. In recent years, our statistical understanding of these black box methods has been advanced substantially by considering this alternative form of construction. Under the subsampling regime, Scornet et al. (2015) provided the first consistency proof of Breiman’s original random forests whenever the underlying regression function is additive. Wager et al. (2014) utilize the infinitesimal jackknife variance estimation procedure developed in Efron (2014)

in order to provide confidence intervals for predictions from random forests. More recently,

Mentch and Hooker (2016) demonstrated that subsampled random forest predictions could be seen as an extended class of U-statistics and were therefore asymptotically normal and amenable to classical forms of statistical inference. In follow-up work, Mentch and Hooker (2017) demonstrated that these results could also be used to formally test for interactions between covariates. Wager and Athey (2017) further explored the asymptotic distributions of predictions of random forests under moderate assumptions on the underlying data-generating process.

In developing our localized tests, we rely on the work of Mentch and Hooker (2016) so we now briefly review some key results. Suppose we have a training sample consisting of iid observations from some distribution with which we construct a (possibly randomized) ensemble consisting of base learners, each built with a subsample of size , and use this ensemble to predict at some location . Denote each base learner by so that we can write the expected prediction as and the (empirical) ensemble prediction as where each collection represents a subsample of size from . Subject to some minimal regularity conditions, the authors demonstrate that the ensemble prediction can be viewed as an incomplete, infinite-order U-statistic and thus, so long as ,


where and the other variance parameters are of the form


for and where denote additional iid observations from . The authors provide a direct method for consistently estimating this variance, though for large ensembles (), the infinitesimal jackknife approach taken in Wager et al. (2014) can also be used.

Importantly, this result can be utilized to construct formal hypothesis tests of variable importance. Suppose we have covariates and we want to test the predictive importance (significance) of . Recalling our notation from Section 3.2, let denote the prediction at from a subsampled random forest trained on the original data and let denote the prediction at from a subsampled random forest trained on a dataset where has been permuted. Define the difference in predictions as


and suppose that we have

such prediction points so that we can form the vector of prediction differences

. Mentch and Hooker (2016) show that when the same subsamples are used to construct the trees in each random forest, the differences are themselves infinite-order U-statistics and thus follow the asymptotic distribution in (6). Let be the estimated covariance matrix of the . Then, given our vector of pointwise differences,


and we can use this as a test statistic to formally evaluate the hypotheses


where denotes the true mean prediction at and the true mean prediction at across permuted versions of . This procedure naturally extends to the more general case where any subset of the features are tested for significance by simply permuting that entire subset of features. Furthermore, this procedure remains valid whenever those features are simply removed from the alternative random forest instead of being permuted, though the permutation-based approach is generally considered more robust and reliable.

5.2 Local Effects of Maximum Temperature

We return now to the question of determining whether maximum daily temperature can partially explain the different Tree Swallow patterns of occurrence observed in 2008-2009. The global tests in the previous section suggested that max_temp may provide information about the interannual variation in occurrence beyond what is provided by seasonal effects alone captured by DoY. We therefore want to distill the effect of max_temp from that of DoY. Figure 9 shows a time series of the average max_temp_anomaly for 2008-2009 and 2010-2013. Recall that max_temp_anomaly was defined as the observed max_temp for a particular eBird observation minus the expected daily maximum temperature at that place and time calculated from the external DayMet dataset. Anomalous temperatures are, by our definition, normalized for time of year, and so should be uncorrelated with DoY. The lack of a meaningful linear trend – see the gray line in Figure 9 – confirms this intuition. Moreover, there is a clear period of lower anomalous temperatures in 2008-2009, which suggests that the differences originally observed in Figure 4 and reflected in Figure 8 could be due in part to differences in max_temp_anomaly.

Figure 9: Average max_temp_anomaly in C for each day of fall, calculated from and (as defined in section 3.1). Smoothed with a Gaussian kernel with a bandwidth of 18.5. Average max_temp_anomaly for is superimposed in lighter gray.

To fit this into the hypothesis testing framework described in the preceding section, we calculate one subsampled random forest with the original data and another with a permuted version of max_temp_anomaly. Note that because the hypotheses in (10) are evaluated at only fixed test points, careful selection of these points is important. Since we are interested in evaluating the hypotheses across a variety of locations and times, we stratify our test points by location and conduct 6 different tests.

The training and test set used here are selected from points inside wildlife refuge areas. Wildlife refuges are of particular interest because they include areas that are resistant to local environmental changes due to environmental protections in place, helping to isolate the effect of regional temperature fluctuations on Tree Swallow occurrence. In total, we select 6 groups of 25 test points each, which are subsequently removed from the training set. These 6 groups and points are shown in Figure 10. These points are chosen to maintain consistent spatial coverage, allowing us to test the effect of max_temp_anomaly across BCR 30. The regions directly represented are: 1) The Boston Metropolitan Area, 2) Central Connecticut (CT), 3) Westchester County, New York (NY), 4) Western New Jersey (NJ), 5) the upper Chesapeake Bay, and 6) the lower Chesapeake Bay. Spatial centers for each of these regions had been manually selected based on a high density of observations and the test points were selected uniformly at random from within a 0.3 decimal degree radius. The final training set, after removing the 150 spatially-stratified test points, consists of 25727 observations.

Figure 10: Test points selected for study. Colors indicate the test region; dots indicate the specific locations of the test points. Light green overlay indicates Fish and Wildlife Service wildlife refuges. Data was provided by GoogleMaps, US Fish and Wild Life. Map created using the ggmap package in R (Kahle and Wickham, 2013)

We apply the above hypothesis testing procedure at each of our 6 regional test locations, building separate ensembles for each location. As in Section 4, we make all features available for splitting at each node in each tree so that our random forest procedure reduces to subsampled bagging (subbagging). As noted in Mentch and Hooker (2016), the independence between trees gained via the randomized trees in random forests is of little additional benefit here since our subsampling procedure already substantially decorrelates the base learners. These tests are implemented using the rpart package in R to construct the regression trees (Therneau et al., 2017). Keeping with the recommendations of Mentch and Hooker (2016), we take our subsample size to be . We build trees for each ensemble, corresponding to the number of trees necessary to attain high precision in the estimation of the covariance matrix, .

Testing Location Region Test Statistic p-value
1 Boston Metro Area 38.07 4.553E-02
2 Connecticut 58.12 1.889E-04
3 Westchester, NY 58.21 1.835E-04
4 Western NJ 62.14 5.275E-05
5 Upper Chesapeake 59.93 1.068E-04
6 Lower Chesapeake 28.44 2.880E-01
Table 2: Test statistics for the hypotheses in (10) at the points show in Figure 10

Table 1 summarizes the test statistics and p-values obtained from the tests in each region. These local tests for the significance of max_temp_anomaly suggest that the anomaly plays a substantial role in predicting occurrence in testing locations 2-5, with a damped effect in location 1 and no effect in location 6. These suggest a transition zone within BCR30 between testing locations 1 & 6 where max_temp_anomaly is important in predicting occurrence. North of this zone, temperatures may be too cold to allow insect activity in the fall and south of this zone, temperatures may be warm enough to allow insect activity year round. That is, we hypothesize that local temperature fluctuations in these areas may not be substantial enough to affect insect and therefore Tree Swallow occurrence. Importantly, recall that DoY was included as a covariate in each training set in these tests. Thus, these results suggest that not only is maximum daily temperature important for predicting occurrence, but that its influence is more than a mere seasonal effect.

6 Discussion

6.1 Ornithological Implications

Our goal in this work was to thoroughly examine Tree Swallow migration patterns from recent years and to examine the role temperature changes may have had in explaining differences between years. The global hypothesis tests evaluated over the entire BCR30 region in Section 4 provided evidence for the hypothesis that the seasonal patterns of distributions indeed differed in the years 2008 and 2009. The fact that this difference no longer seemed apparent whenever max_temp was excluded as a predictor supports the hypothesis that temperature plays an important role explaining year-to-year variation in occurrence. While these conclusions examine only the average region-wide effect, the corresponding localized hypothesis tests carried out at specific locations along the Tree Swallow migration route in Section 5 provide formal justification for the importance of maximum temperature beyond being merely a correlate of some other seasonally varying effect. These results are especially important in the context of climate change, providing the first statistically sound evidence from eBird data that variation in ambient temperatures is related to the mortality and/or migration of a wild bird.

6.2 Methodological Discussion

Finally, it is important to note that the procedures we implemented in addressing our hypotheses of interest are very general. Exactly the same approach could be taken to investigate distribution dynamics for any species as well as for far more general problems completely outside of an ecological context. Fundamentally, our problem involved assessing and characterizing the significance of a subset of available predictor variables in which the underlying regression function was believed to consist of nonlinear and complex interaction terms which, localized in predictor space, precluded the

a priori specification of a suitable parametric model for which traditional inference methods may have been available. We present two forms of black box inference: development of non-parametric permutation-style tests that are agnostic to the underlying procedure, and asymptotic results about the predictions of ensemble learners. The recent asymptotic results for infinite-order U-statistics allowed us to model the data through a series of flexible but complex black-box models – in our case, regression trees – while retaining the ability to formally characterize results.


A variety of R files and workspaces are provided for managing the data, performing the analysis, and generating the figures.


LM was supported in part by NSF DMS-1712041. GH was supported in part by NSF DMS-1712554. DW’s research on swallows was supported most recently by NSF DEB 1242573. DF was supported in part by The Leon Levy Foundation. We thank the eBird participants for their contributions and the eBird team for their support. eBird was supported by the Wolf Creek Foundation, NASA (NNH12ZDA001N-ECOF), and the National Science Foundation (ABI sustaining: DBI-1356308; with computing support from CNS-1059284 and CCF-1522054). Thanks to Satish Iyengar for providing valuable feedback on an early draft of this paper.


  • Bergmeir and Benítez (2012) Bergmeir, C. and Benítez, J. M. (2012). Neural networks in R using the stuttgart neural network simulator: RSNNS. Journal of Statistical Software, 46(7):1–26.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
  • Chung and Romano (2013) Chung, E. and Romano, J. P. (2013). Exact and asymptotically robust permutation tests. The Annals of Statistics, pages 484–507.
  • Diefenbach et al. (2007) Diefenbach, D. R., Marshall, M. R., Mattice, J. A., and Brauning, D. W. (2007). Incorporating availability for detection in estimates of bird abundance. The Auk, 124(1):96–106.
  • Dunn and Winkler (1999) Dunn, P. O. and Winkler, D. W. (1999). Climate change has affected the breeding date of tree swallows throughout north america. Proceedings of the Royal Society of London B: Biological Sciences, 266(1437):2487–2490.
  • Efron (2014) Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American Statistical Association, 109(507):991–1007.
  • Fernández-Delgado et al. (2014) Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems. Journal of Machine Learning Research, 15(1):3133–3181.
  • Fink et al. (2018) Fink, D., Auer, T., Ruiz-Gutierrez, V., Hochachka, W. M., Johnston, A., La Sorte, F. A., and Kelling, S. (2018). Modeling avian full annual cycle distribution and population trends with citizen science data. bioRxiv, page 251868.
  • Fink et al. (2010) Fink, D., Hochachka, W. M., Zuckerberg, B., Winkler, D. W., Shaby, B., Munson, M. A., Hooker, G., Riedewald, M., Sheldon, D., and Kelling, S. (2010). Spatiotemporal exploratory models for broad-scale survey data. Ecological Applications, 20(8):2131–2147.
  • Friedl et al. (2010) Friedl, M. A., Sulla-Menashe, D., Tan, B., Schneider, A., Ramankutty, N., Sibley, A., and Huang, X. (2010). Modis collection 5 global land cover: Algorithm refinements and characterization of new datasets. Remote sensing of Environment, 114(1):168–182.
  • Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.
  • Hall and Van Keilegom (2007) Hall, P. and Van Keilegom, I. (2007). Two-sample tests in functional data analysis starting from discrete data. Statistica Sinica, pages 1511–1531.
  • Hansen et al. (2000) Hansen, M., DeFries, R., Townshend, J. R., and Sohlberg, R. (2000). Global land cover classification at 1 km spatial resolution using a classification tree approach. International journal of remote sensing, 21(6-7):1331–1364.
  • Hastie (2017) Hastie, T. (2017). gam: Generalized additive models. R package version 1.14-4.
  • Hooker (2007) Hooker, G. (2007). Generalized functional anova diagnostics for high-dimensional functions of dependent variables. Journal of Computational and Graphical Statistics, 16(3):709–732.
  • Hussell (2003) Hussell, D. J. (2003). Climate change, spring temperatures, and timing of breeding of tree swallows (tachycineta bicolor) in southern ontario. The Auk, 120(3):607–618.
  • Johnston et al. (2015) Johnston, A., Fink, D., Reynolds, M. D., Hochachka, W. M., Sullivan, B. L., Bruns, N. E., Hallstein, E., Marrifield, M. S., Matsumoto, S., and Kelling, S. (2015). Abundance models improve spatial and temporal prioritization of conservation resources. Ecological Applications, 25(7):1749–1756.
  • Kahle and Wickham (2013) Kahle, D. and Wickham, H. (2013). ggmap: Spatial visualization with ggplot2. The R Journal, 5(1):144–161.
  • Kuhn (2017) Kuhn, M. (2017). caret: Classification and regression training. R package version 6.0-78.
  • La Sorte et al. (2016) La Sorte, F. A., Fink, D., Hochachka, W. M., and Kelling, S. (2016). Convergence of broad-scale migration strategies in terrestrial birds. Proceedings of the Royal Society B: Biological Sciences, 283(1823):20152588.
  • Lehmann and Romano (2006) Lehmann, E. L. and Romano, J. P. (2006). Testing statistical hypotheses. Springer Science & Business Media.
  • Liaw and Wiener (2002) Liaw, A. and Wiener, M. (2002). Classification and regression by randomforest. R News, 2(3):18–22.
  • McGarigal et al. (2012) McGarigal, K., Cushman, S., and Ene, E. (2012). Fragstats v4: spatial pattern analysis program for categorical and continuous maps. university of massachusetts, amherst, massachusetts, usa. goo. gl/aAEbMk.
  • Mentch and Hooker (2016) Mentch, L. and Hooker, G. (2016). Quantifying uncertainty in random forests via confidence intervals and hypothesis tests. The Journal of Machine Learning Research, 17(1):841–881.
  • Mentch and Hooker (2017) Mentch, L. and Hooker, G. (2017). Formal hypothesis tests for additive structure in random forests. Journal of Computational and Graphical Statistics, pages 1–9.
  • Moran (1948) Moran, P. A. (1948). The interpretation of statistical maps. Journal of the Royal Statistical Society. Series B (Methodological), 10(2):243–251.
  • Nicodemus et al. (2010) Nicodemus, K. K., Callicott, J. H., Higier, R. G., Luna, A., Nixon, D. C., Lipska, B. K., Vakkalanka, R., Giegling, I., Rujescu, D., Clair, D. S., et al. (2010). Evidence of statistical epistasis between disc1, cit and ndel1 impacting risk for schizophrenia: biological validation with functional neuroimaging. Human genetics, 127(4):441–452.
  • Robinson et al. (2017) Robinson, O. J., Ruiz-Gutierrez, V., and Fink, D. (2017). Correcting for bias in distribution modelling for rare species using citizen science data. Diversity and Distributions.
  • Sauer et al. (2003) Sauer, J. R., Fallon, J. E., and Johnson, R. (2003). Use of north american breeding bird survey data to estimate population change for bird conservation regions. The Journal of wildlife management, pages 372–389.
  • Scornet et al. (2015) Scornet, E., Biau, G., Vert, J.-P., et al. (2015). Consistency of random forests. The Annals of Statistics, 43(4):1716–1741.
  • Strobl et al. (2007) Strobl, C., Boulesteix, A.-L., Zeileis, A., and Hothorn, T. (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC bioinformatics, 8(1):25.
  • Sullivan et al. (2014) Sullivan, B. L., Aycrigg, J. L., Barry, J. H., Bonney, R. E., Bruns, N., Cooper, C. B., Damoulas, T., Dhondt, A. A., Dietterich, T., Farnsworth, A., et al. (2014). The ebird enterprise: an integrated approach to development and application of citizen science. Biological Conservation, 169:31–40.
  • Sullivan et al. (2009) Sullivan, B. L., Wood, C. L., Iliff, M. J., Bonney, R. E., Fink, D., and Kelling, S. (2009). ebird: A citizen-based bird observation network in the biological sciences. Biological Conservation, 142(10):2282–2292.
  • Therneau et al. (2017) Therneau, T., Atkinson, B., and Ripley, B. (2017). rpart: Recursive partitioning and regression trees. R package version 4.1-11.
  • Thormton et al. (2017) Thormton, M., Thornton, P., Wei, Y., Vose, R., and Boyer, A. (2017). Daymet: Station-level inputs and model predicted values for north america, version 3.
  • Toloşi and Lengauer (2011) Toloşi, L. and Lengauer, T. (2011). Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27(14):1986–1994.
  • Wager and Athey (2017) Wager, S. and Athey, S. (2017). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, (just-accepted).
  • Wager et al. (2014) Wager, S., Hastie, T., and Efron, B. (2014). Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. Journal of Machine Learning Research, 15(1):1625–1651.
  • Winkler et al. (2013) Winkler, D. W., Luo, M. K., and Rakhimberdiev, E. (2013). Temperature effects on food supply and chick mortality in tree swallows (tachycineta bicolor). Oecologia, 173(1):129–138.
  • Zuckerberg et al. (2016) Zuckerberg, B., Fink, D., La Sorte, F. A., Hochachka, W. M., and Kelling, S. (2016). Novel seasonal land cover associations for eastern north american forest birds identified through dynamic species distribution modelling. Diversity and Distributions, 22(6):717–730.

Appendix A: Moran’s I

We now describe the formal procedure used to evaluate the significance of spatial autocorrelation in Figure 6, which was mentioned in section 3. In order to formally test the following hypotheses

we first must define a distance matrix between points in the grid. Here we calculate pairwise distances

as the Euclidean distances between the latitude/longitude coordinates in the grid and utilize an inverse distance weighting scheme

The distance between a point and itself is 0, but we assign , as is standard practice. For computational feasibility, we make these calculations on only a sub-grid consisting of every sixth point in the original test grid. To evaluate the hypotheses, we use the test statistic (Moran, 1948)

as our test statistic, where is the sum of all entries in the weight matrix and is the number of grid points (in our case, 5822). We then calculate the standardized statistic which is asymptotically standard normal under . The calculated values are reported in table 3. Note that .

Day of Fall 1 21 41 61 81 101 121 141 161
0.042 0.0407 0.030 0.105 0.077 0.167 0.072 0.064 0.080
113.4 109.4 81.28 280.3 203.2 448.42 194.4 171.4 216.1
p-value 0 0 0 0 0 0 0 0 0
Table 3: Moran’s I test for spatial autocorrelation for each of the days for which a prediction map was generated.