Autoregressive models for time series of random sums of positive variables: application to tree growth as a function of climate and insect outbreaks

by   Zinsou Max Debaly, et al.

We present a broad class of semi-parametric models for time series of random sums of positive variables. Our methodology allows the number of terms inside the sum to be time-varying and is therefore well suited to many examples encountered in the natural sciences. We study the stability properties of the models and provide a valid statistical inference procedure to estimate the model parameters. It is shown that the proposed quasi-maximum likelihood estimator is consistent and asymptotically normally distributed. This work is complemented by simulation results and applied to annual growth rate time series of white spruce (Picea glauca) trees from a few dozen sites in Quebec spanning 41 years, including one major spruce budworm (Choristoneura fumiferana) outbreak from around 1968 to 1991. We found significant growth reductions due to budworm-induced by defoliation up to two years in the past. Our results also revealed positive effects of maximum temperature, precipitation and the climate moisture index in the summer, as well as negative effects of the climate moisture index in the spring and the maximum temperature in the previous summer. However, considering the interaction of climate and defoliation on growth did not improve the model's performance on this dataset. This study represent a major advances and our result represent an useful tool in the understanding of the combined effects of climate and insect defoliation on tree growth in the face of climate change, where the frequency and the severity of outbreaks, as well as an increase of temperature is expected.



page 1

page 2

page 3

page 4


A functional autoregressive model based on exogenous hydrometeorological variables for river flow prediction

In this research, a functional time series model was introduced to predi...

Clustering high dimensional meteorological scenarios: results and performance index

The Reseau de Transport d'Electricité (RTE) is the French main electrici...

Bayesian Appraisal of Random Series Convergence with Application to Climate Change

Roy and Bhattacharya (2020) provided Bayesian characterization of infini...

Poisson Network Autoregression

We consider network autoregressive models for count data with a non-rand...

On the Evolution of U.S. Temperature Dynamics

Climate change is a multidimensional shift. While much research has docu...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In many studies in ecology, we measure positive dependent variables on random numbers of statistical individuals sampled over time (montoro2019post). One can find two main reasons why: first, researchers cannot observe the whole population; second, which individuals researchers can observe depends on their time-varying resources. Examples range from species behaviour to ecological services. For instance, in forestry, one can be interested in the time series of mass or size of certain species, taking a random sample of trees each year and observing the corresponding quantity (see VOURLITIS2022119917 for example). Another example consists in the occupied land area of colonies in relationship with the availables resources across time (labrecque2020dynamics). In fisheries, scientists often investigate the temporal changes in fish catch weight, see w12112974 for more details.

In this paper, we contribute to the ongoing ecological study of the impact of climate change and insect outbreaks on tree growth, as measured by growth rings. Spruce budworm (Choristoneura fumiferana; SBW) outbreak is the most important defoliator of conifer trees in the North American boreal forest (montoro2018secret). Only in the province of Quebec (Canada), the forest surface affected by this species of Lepidoptera over the last century is twice the size of Ukrania (navarro2018changes). At the epidemic stage, massive populations of larvae cause widespread damage to tree foliage: defoliation (lavoie2019vulnerability). SBW affect the main conifer boreal species in Canada, balsam fir (Abies balsamea), white spruce (Picea glauca) and black spruce (Picea Mariana). For this reason, SBW has a major impact in the regeneration and dynamics of boreal forest (martin2020driving). However, SBW outbreaks periods, not only have major ecological impacts but also produce important economic consequences due to the loss of forest productivity.

Previous works in this field have studied the changes of forests composition following insect outbreaks (see for instance MORIN2021463), the response of SBW outbreaks to climate change (fleming1995effects and 10.3389/fevo.2020.544088) and the demography i.e the rate of mortality of spruce during outbreaks (gauthier2015boreal). However, even the major implication in the face of climate change, there is limited knowledge regarding the combined effects of outbreaks and climate on tree growth. Given that variations of temperature and precipitation affect organisms’ survival, reproduction cycles and spatial dispersion (aber2001forest), it is critical to understand the links between past SBW outbreaks, climate and tree growth to understand how future climate change scenarios would impact forest productivity during outbreaks (klapwijk2013forest). This is a major concern due to the increase in the severity of frequency excepted in the future for SBW outbreaks (navarro2018changes; seidl2017forest).

We contribute to filling this gap by proposing a broad class of semi-parametric for positive-valued time series. Indeed, even though this type of data is common in forestry, the statistical approaches commonly used suffer from several drawbacks. These approaches range from descriptive exploratory techniques to linear mixed-effect models with time-varying variables on transformed data (see montoro2016radial and boulanger2004spruce for example) and correlated error terms (GirardinE8406). Whereas the first one (exploratory techniques) do not allow us to draw inferences, we can note at least two limits for the latter one. Firstly, as pointed out by several papers, see for example chou2015range, specifying a linear model on transformed data often leads to worst performances in prediction. Secondly, models with autocorrelated error terms do not take into account the complex dependency structure of tree-ring growth. The class of semi-parametric autoregressive models we present here will be applied to investigate the relationship between climate, insect outbreaks and growth of white spruce. It also presents the advantage of accommodating the usual repeated measures design in ecology.

Many previous works have focused on modelling non-Gaussian time series, such as positive-valued processes. Indeed, Gaussian processes can be represented as linear models, whereas time series of count or binary data are modelled by non-linear dynamics, see for example sim1990first and references therein. For positive-valued time series data, the range volatility model was proposed by 10.2307/2999632 as an alternative for garch model in finance and its use has been rapidly expanding due to its various applications. We refer the interested reader to the review of chou2015range. Recently, aknouche_francq_2020 have considered a positive-valued time series whose conditional distribution has a time-varying mean that can depend on exogenous variables. Our approach here is slightly different from theirs, since the positive process under consideration is itself the sum of a random number of other positive variables. It is strongly driven by the data we have to deal with. Indeed, these data consist of multiple time series collected over several ecological sites, where the number of individuals sampled changes over time as well as across sites. Hence, considering an aggregate value like the sum or the mean of growth rings leads to the loss of variability linked to the sampling schema. Moreover, one can note that in different fields like finance, some modeling strategies consisting in considering empirical quantities such as the realized volatility are employed. Historical returns of investment products within a defined time period are then analysed (see for example allen2010realized). However, unlike our framework and ecological studies in general, all transactions on investment products are recorded, i.e the whole statistical population is observed.

The rest of this paper is organised as follows. In Section 2, we define the model under consideration throughout this paper and discuss our modelling choices. Time-series properties of the models are also studied in that section. Maximum-likelihood based inference and its asymptotics properties are presented in Section 3. Section 4 contains a small simulation study and an application to empirical data on the growth of white spruce. All auxiliary lemmas and mathematical proofs are contained in Section 5.

2 Models and stability results

We introduce here a generalized linear dynamic model for time series of random sum of positive variables, motivated by the empirical application where we analyze the annual growth of spruce trees subject to climate variation and outbreaks of SBW. In this case, growth is measured by taking cores at 1.30 m heigh from the trunk of a sample of trees in a forest (montoro2017understanding). The samples were prepared, measured and analyzed conforming to standard dendroecological protocol (krause1995changes). Cores were air-dried, mounted on wood boards and sanded before tree rings were measured with WinDendro system (guay1992new) or a manual Henson micrometer with an accuracy of 0.01 mm. The tree-ring series measurements covered the last 41 years, and were cross-dated using TSAP-Wi (Rinntech, Heidelberg, Germany).

We denote by the time series of the total basal area increment related to the th observational site, i.e. the sum of the increases in trunk cross-sectional area for the trees sampled for site on year . We aim to model the dynamics of this process both in terms of its own past and additional covariates . In the empirical application presented in section 4, the covariate process encompasses climate variables such as temperature and precipitation, as well as the level of defoliation due to SBW in previous years. Our model is given by :


where conditionally on and , the variables

representing the basal area increments of individual sampled trees, are independent and identically distributed like a random variable

of mean . Moreover, is a sequence of i.i.d random variables and conditionally on the variable is independent from and . The mean process is given by


and is a real-valued function defined on that can depend on a parameter It is worth mentioning, without loss of generality, that the covariate process considered at time is included in the specification of

since multiple lags of a given set of variables can be included by simply stacking them into a vector. It is for example the case of the defoliation level in our application, since growth can be affected by defoliation up to 5 years prior (from

to ).

The variables will be referred to as the unity random variables. We do not make any assumption about the distribution of the variables . Any distribution on

can be chosen. For example, an Exponential distribution with parameter

, log-Normal distribution with parameters


or a Gamma distribution with parameters

and , to name a few. Whatever the distributions of unity random variables are, the conditional expectation of is . However, under the assumption of the indenpendence of

if they are exponentially distributed, the conditional variance is

i.e a quadratic function of . For our example of Gamma-distributed unity random variables, the conditional variance is , i.e. a linear function of . But in the case of the log-Normal distribution, the conditional variance is With our semi-parametric framework, we will only focus on the estimation of regression parameters without the need to perform any distributional goodness of fit test.

Notes 2.1 Copies of unity variables

In our general set up, the copies of the unity random variables are not required to be independent. In practice where for example represents the measure of annual growth for a sampled tree, the general assumption of identical distribution can be thought as a local stationary condition inside the site at time .

Notes 2.2 Marginal stationary distributions

Note from equations (1)-(2),
for Then for the distributions of and are not equal unless and is equal in distribution to . We will investigate in section 4 the consequences of the latter conditions on the proposed estimation procedure.

Notes 2.3 Form of regression function (2)

Note that in (2) does not depend linearly on , but on In fact, through (2), we make a link between the underlying mean process and the empirical estimate of the past mean process. Even for a constant size process, i.e. , since the regression parameter is free of , we still cannot yet express as a linear combination of . Moreover, one can expect or more generally for some mapping such that in (2) at the place of Indeed, with the latter two mentioned specifications, (1)-(2) define the so-called GLARMA model (see for example, weiss2018introduction for more details). In the present form (1)-(2) has some similarities with the well known ARCH model (Bollerslev(1986)). We leave the topic of GLARMA specification for furthers works.

Notes 2.4 Contrast with the non-linear mixed model

The model (1)-(2) has some similarities with the well-known mixed models. Indeed, as for mixed models, the stands for the site fixed effect and the random effect in embedded is the distribution of unity variables. The simple example of , where is a sequence of identically distributed random variables of mean 1, fit with the so-called multiplicative form random effect models (cameron2013regression). But more complex random effects can be handled. However, the model (1)-(2) is more general since it allows the individuals sampled over time to change. Indeed, as we will see in section 3, the individual measures are no longer needed when the sequence are available. Also, in terms of the application to resource management, it is often of interest to model and predict a population quantity like the sum of basal area growth in a forest.

Choice of the link function

The logarithmic link function is often applied and coincides with the well known log-linear model, see for example cameron2013regression for models for count data. This link function assumes a linear relationship between the logarithm of the mean process and the covariates. However, there exist some other link functions that preserve the linear correlation at least on the positive part of Consider for example, the threshold mapping . This mapping is not smooth and most of time, one makes some restrictions on model parameters to directly obtain the positiveness of the mean. Here, we will apply the inverse of the so-called softplus function as a link function. Indeed, the softplus function (see glorot2011deep) is interesting for two reasons. The first one related to modelling is that it preserves the linearity on the positive part of real line. This is also pertinent for our biological application, as we expect a linear effect of covariates on growth above a certain threshold representing the minimal favorable conditions for growth. The minimum growth expected may not be exactly zero, which is why we will consider later a slightly different version of softplus that we will refer to as for defined as . The second one and technical advantage is that the mapping is infinitely differentiable. The Figure 1 in the Appendix shows the difference between the link function and where stands for One can note that is lower bounded by

Notes 2.5 Model Interpretation

Obviously, with the softplus link function, the mean process increases with the th covariate process if and decreases with this one when Since softplus, the mean process can be approximated, all other things remaining equal, by for large values of and and then increases by for increasing value of . Let us denote by RG, the relative rate of growth of the mean process between and i.e RG where is the derivative function of softplus. For RG Therefore, the rate toward driven by is given by when Moreover, When by the l’Hôpital’s rule RGsoftplussoftplus Therefore, all other things remaining equal, the mean process will be divided by when increases by for large values of and

Theorem 1 gives some stability conditions of model (1)-(2) with the inverse of the softplus function as the link whereas Lemma 1 in the section 5 stands for a general result for .

Theorem 1.

Under the assumptions a-b in section 5 and , there exists a unique set of stationary, ergodic sequences that are the solution of equations (1)-(2) with .

3 Estimation and asymptotics properties

This section is devoted to the estimation of the conditional mean parameters by the Quasi-Maximum Likelihood Estimator (QMLE) based on a member of the exponential family. We consider the Exponential QMLE (EQMLE) because this estimator coincides with the Maximum Likelihood Estimator (MLE) when the unity random variables follow the Exponential distribution and the copies are independent.

For our application of the model, the time series are observed between the time points and We provide an asymptotic theory for the estimated parameters and present the results of a small simulation study investigating the finite-sample properties of the estimator. In the following, we will make depend on the parameter ( a compact set); that is

where . Let us denote the true, data-generating parameter value by .

The loss function from the Exponential quasi-maximum likelihood is given by :




The derivative of with respect to is given by :

where is a vector of size with at the th position and elsewhere. We will denote by (resp. ) the vector (resp. ) evaluated at the point

We will study the asymptotic properties of the QMLE estimator (4). To do so, we employ taniguchi2002asymptotic (Thm 3.2.23), which was extended in klimko1978conditional. The lemmas in the section 5 give the general result for the asymptotic properties of QMLE (4). The following theorem stands for the consistency and the asymptotic normality of (4) for link function. Let us set

Theorem 2.

Suppose that the assumptions a-d in section 5 are met. Then, almost surely,

If in addition A-C hold true and is located in the interior of ,

where and

4 Application

4.1 Simulation

We examined the finite-sample performance of the QMLE presented in the previous section through a small simulation study. We present the result for QMLE under two different data generating processes referred to as scenario 1 and scenario 2. For the first one, the number of covariates is ; does not depend on and is a sequence of i.i.d random variables distributed as exponential random variables of means For the second one, for a fixed is independently sampled from exponential distributions of mean For the two data generating processes, for a fixed , the process

is independently sampled from a Poisson distribution of mean

We sequentially choose and The samples are nested, i.e the sample for the first scenario and is a subset of that of . Indeed, we aim here to evaluate the consequences of increasing and on the performance of our estimator. For each sample, we compute the estimator (4

) and the corresponding standard errors. The table

1 presents the simulation results. It appears that the model parameters are well estimated except for the when is very small compared to , which coincides here with It is worth noting that these results are partial since they are based on a one sampling schema. We leave deep simulation studies for further works.

4.2 Application to the white spruce growth series

Dendrochronology, i.e. the studies of the time series of tree growth rings, is a powerful tool to reconstruct past natural and anthropic disturbances (montoro2016radial, boulanger2004spruce and labrecque2020dynamics). Tree-rings are hard disks of information, able to record each environmental change, having thus a strong potential to understand complex phenomena such as disturbance ecology. Many previous studies used dendrochronological data to better understand insect outbreak dynamics (navarro2018changes, camarero2003impact and speer2017creating).

In this research, we used the dendroecological series from the study by jardon2003periodicite, which includes annual tree-ring width measurements for 631 white spruce (Picea glauca) trees distributed across 45 sites in southwestern Quebec, Canada, with 1 to 23 trees per site. These time series comprise between 63 and 247 rings according to the tree’s age. We converted the ring width increments to basal area increments (BAI) using the full series, but due to covariate availability, we limit our analysis to the 1955-1995 time period (41 years) to study only one insect outbreak period (see fig 2 in Appendix).

We interpolated climate variables at the study sites for the 41-year period using BioSIM

(regniere2014biosim), a software package that interpolates daily climate station data based on latitudinal and elevational climate gradients, as well the spatial correlations estimated from 30-year climate normals. We computed the following climate summaries from daily data for the spring (April to June) and summer (July to September) seasons separately: mean of the daily maximum temperatures, total precipitation, and the climate moisture index (CMI) equal to the difference between precipitation and potential evapotranspiration (PET). Daily PET values were estimated by the Penman-Monteith equation as implemented in the SPEI package (pkgSPEI) in R, based on BioSIM-interpolated values of the minimum and maximum temperature, wind speed at 2 m, solar radiation, dew point temperature and atmospheric pressure, using the "tall" crop model in SPEI.

One major SBW outbreak occurred in Quebec during the study period, from 1967 to 1991. We obtained annual estimates of the severity of SBW outbreaks at the location of each study site from defoliation maps produced by the Quebec Ministry of Forests, Wildlife and Parks (mffp_tbe)

. These maps are digitized versions of hand-drawn outlines of defoliated areas produced by aerial surveys of the affected regions. The defoliation level for each area is classified on a scale of 1 to 3 corresponding to a low (approx. 1 – 35%), moderate (36 – 70%) or high (71 – 100%) fraction of the year’s foliage defoliated by the SBW. We note that these defoliation levels mainly reflect the status of balsam fir (

Abies balsamea) trees, which is the main SBW host and is generally more severely affected than white spruce. Therefore, these defoliation levels are a proxy for the outbreak severity, i.e. the potential herbivory pressure exerted by the budworm on spruce trees at the site.

Since tree growth and its vulnerability to both climate and defoliation depends on the tree-age, we split the dataset and separately fit our models for the following five age classes: < 75, 75 – 100, 100 – 125, 125 – 150 and > 150 years. We include as covariates the mean daily maximum of temperature, the total precipitation and the mean CMI for the current and previous spring and summer. Only one of precipitation and CMI appears in a given model version due to the correlation between those two variables. We also include as covariates the defoliation levels for the five previous years, a delay which estimates the time needed to fully regrow the lost foliage after an outbreak. Note that we do not expect defoliation to have a marked effect on the same year’s growth ring (krause2003temporal). Finally, we consider models with interaction effects of the previous year’s defoliation level and climate variables, representing the possibility that climate conditions can increase or decrease the tree’s sensitivity to SBW outbreaks.

Data processing and analyses were performed in rcitation with the package dplR (bunn2008dendrochronology) used to process tree-ring data. We minimize the criterion (3) with the R command nlm (dennis1983numerical). All the developed software are made available under the Creative Commons (CC) license (see data availability statements). The model selection was carried out through the the QAIC criterion. The primary analysis based on partial autocorrelation plots leads us to select .

According to the QAIC, the best models were those without an interaction between climate and defoliation. Our model results (Figures 3 and 4) reveal that higher defoliation levels leads to reduced tree-ring growth, but this effect vanishes after two years; however, note that while the direct effect vanishes, expected growth will remain lower in successive years due to the large estimated first-order autocorrelation coefficient (0.8 to 0.9, depending on age class). Moreover, there is no significant effect of defoliation on the next year’s growth for the youngest and oldest trees, even though it produces an effect two years following the defoliation. The result are quite different for middle aged trees, which are significantly affected one year following the defoliation but not in the second year. For the climate variables, high maximum temperatures in the summer produce increased growth, with up to 5.6 square centimetre increase in basal area from a 10 degree Celsius increase in summer maximum temperature. However, the previous summer’s temperature has a negative effect on growth. Finally, the spring CMI is negatively correlated with tree-ring growth whereas the summer CMI has a positive effect. However, both the CMI and precipitation in the previous spring increase tree-ring growth of the current year: an increase of 100 millimetres in precipitation leads to at least a 6.8 square centimetre increase in basal area growth.

5 Proofs for the main results

Throughout this section, we will denote by , the sequence of copies of the unity random variables Moreover can be decomposed into two components : its mean function of and a free random variable . For example, for a positive random variable of mean 1. We will write to denote the relationship between and and . Accordingly, with i.i.d with mean 1 or in general with . Let denote the -algebra generated by and generated by Finally, we will denote by the inverse of . For stability, we will consider the following set of assumptions :

  1. The function is Lipschitz and

  1. For is stationary, ergodic, is independent from and

  2. For

It is worth noting that the example for a positive random variable of mean 1 verifies the condition b.

Lemma 1.

Under the assumptions A-a, there exists a unique set of stationary, ergodic sequences that are a solution of equations (1)-(2) with .

The proof of lemma 1 uses the techniques of iterated random maps. We refer the interested readers to debaly_truquet_2021 theorem 2 and 4 which investigated the problem of the solution of recursive stochastic equations with covariates or debaly2019stationarity in the case where no covariates are included in the dynamic.

Proof of lemma 1

From (2),

Then under the condition a, the processes obey some recursive stochastic equations

and with A, for ,

with Moreover . Then, from debaly_truquet_2021 theorem 4, we get the stationary and ergodic solution with .

The theorem 1 is a straight consequence of lemma 1 and follows the Lipschitz property of for any . For the asymptotic results for , the following assumptions will be needed.

  1. The conditons A and a are met and verifies :

  2. For and

  3. For ,

  1. The conditons a and b are met and verifies :

  2. For

  3. For conditionally on , the distribution of

    is not supported by an hyperplan of

  4. For the distribution of is not degenerate.

Lemma 2.

Suppose that the assumptions B-D and c-d are met. Then, almost surely,

We do not prove the lemma 2. Similar results for time-series models can be found in diop2021inference, aknouche_francq_2020 or debaly2021multivariate among others.

Proof of consistency part of theorem 2

We will check B to D.

  • B comes from a.

  • One can note that here and and where are continuous functions of Then C holds since Indeed since

  • For D, we note that

    and Then and

Let us set , the derivative of with respect to , the vector of parameters without We will consider the following assumptions for the asymptotic distribution of

  1. The function is twice continuously differentiable and for

  2. For the distribution of is not degenarate.

  3. For , where is one of the following quantities for all pairs

  1. The stationary sequences solution of (1)-(2) are independent of each other.

  2. For

  3. For

Lemma 3.

Under the assumptions of lemma 1 and if E-G and A hold, then

where and ,

As for lemma 2, we do not prove the lemma 3. We refer the interested reader to diop2021inference, aknouche_francq_2020 or debaly2021multivariate among others.

Proof of asymptotic normality part of theorem 2

For the proof of asymptotic normality part of theorem 2, one can note that in the single framework (), assumptions B yield the asymptotic normality of

using the central limit theorem for difference martingale. Next,

For the first term,


where for means and a function of Then, under the assumption C. It can be shown similarly that and By the Taylor expansion of between and ,

The independence condition on paths A, assumption B and central limit theorem for difference martingale allow us to conclude converges in distribution to central Gaussian vector of variance as tends to infinity. The assumption C and ergodic theorem entails that