Latent Gaussian Count Time Series Modeling

11/01/2018
by   Yisu Jia, et al.

This paper develops theory and methods for the copula modeling of stationary count time series. The techniques use a latent Gaussian process and a distributional transformation to construct stationary series with very flexible correlation features that can have any pre-specified marginal distribution, including the classical Poisson, generalized Poisson, negative binomial, and binomial count structures. A Gaussian pseudo-likelihood estimation paradigm, based only on the mean and autocovariance function of the count series, is developed via some new Hermite expansions. Particle filtering methods are studied to approximate the true likelihood of the count series. Here, connections to hidden Markov models and other copula likelihood approximations are made. The efficacy of the approach is demonstrated and the methods are used to analyze a count series containing the annual number of no-hitter baseball games pitched in major league baseball since 1893.


1 Introduction

This paper develops the theory and methods for copula modeling of stationary discrete-valued time series. Since the majority of discrete cases involve modeling integer counts supported on some subset of the integers, we hereafter focus on non-negative integer count structures. Our methods are based on a copula transformation of a latent Gaussian stationary series and are able to produce any desired count marginal distribution. It is shown that the model class produces the most flexible pairwise correlation structures possible, including negatively dependent series. Model parameters are estimated via two methods: 1) a Gaussian pseudo-likelihood approach, developed with some new Hermite expansion techniques, that only employs the mean and the autocovariance of the series, and 2) a particle filtering approach that adapts hidden Markov model (HMM) techniques to general (not necessarily Markov) latent Gaussian stationary series to approximate the true likelihood. Extensions to non-stationary settings, particularly those where covariates arise, are also discussed.

The theory of stationary Gaussian time series is by now well developed. A central result is that a stationary Gaussian series $\{Z_t\}$ with lag-$h$ autocovariance $\gamma_Z(h)$ exists if and only if $\gamma_Z(\cdot)$ is symmetric ($\gamma_Z(-h) = \gamma_Z(h)$ for all $h$) and non-negative definite, viz.

$\sum_{i=1}^{n}\sum_{j=1}^{n} a_i a_j \gamma_Z(t_i - t_j) \ge 0$

for every $n$, time points $t_1, \ldots, t_n$, and real numbers $a_1, \ldots, a_n$ (see Theorem 1.5.1 in [6]). Such a result does not hold for count series. For example, existence of a stationary series with a Poisson marginal distribution is not guaranteed when the candidate autocovariance is merely symmetric and non-negative definite. In fact, in the Poisson case, such a result is false: one can write down a symmetric and non-negative definite function whose lag-one value demands a more negative correlation than any two variables sharing the same Poisson marginal distribution can achieve, so no stationary Poisson series has that autocovariance (the reader is challenged to verify this non-existence). In principle, distributional existence issues are checked with Kolmogorov’s consistency criterion (see Theorem 1.2.1 in [6]); in practice, one needs a specified joint distribution to check for consistency. Phrased another way, Kolmogorov’s consistency criterion is not a constructive result and does not illuminate how to build time series having the desired marginal distributions and correlation structures. Owing to this, count time series have been constructed from a variety of methods over the years. We now discuss past approaches to the stationary correlated count problem; a recent overview of stationary methods is contained in [34]. While extensions to non-stationary cases with covariates are considered later, the fundamental problem lies with constructing models for stationary count series.

Borrowing from the success of autoregressive moving-average (ARMA) models in describing stationary Gaussian series, early count authors constructed correlated count series from discrete autoregressive moving-average (DARMA) and integer autoregressive moving-average (INARMA) difference equation methods. Focusing on the first order autoregressive case for simplicity, a DAR(1) series $\{X_t\}$ with marginal distribution $F_X$ is obtained by generating $X_1$ from $F_X$ and then recursively setting

$X_t = B_t X_{t-1} + (1 - B_t)\varepsilon_t,$

where $\{\varepsilon_t\}$ is generated as independent and identically distributed (IID) with marginal distribution $F_X$ and $\{B_t\}$ are IID Bernoulli trials, independent of $\{\varepsilon_t\}$, with $P(B_t = 1) = p = 1 - P(B_t = 0)$. Induction shows that $X_t$ has marginal distribution $F_X$ for every $t$. INAR(1) series are built with a thinning operator $\circ$ and parameter $p \in [0, 1]$ defined by

$p \circ X = \sum_{i=1}^{X} B_i$

for any count valued random variable $X$, where $\{B_i\}$ is again a collection of IID Bernoulli trials with success probability $p$. The INAR(1) difference equation is

$X_t = p \circ X_{t-1} + \varepsilon_t,$

where $\{\varepsilon_t\}$ is an IID count-valued random sequence.

DARMA methods were initially explored in [25, 26], but were subsequently discarded by practitioners because their sample paths can remain constant for long periods in highly correlated cases (observe that $P(X_t = X_{t-1}) \ge p$ for DAR(1) series); INARMA series are still used today. Both INAR(1) and DAR(1) models have a correlation function of the form $\rho_X(h) = p^{|h|}$, which cannot be negative since $p \in [0, 1]$. While one can add higher order autoregressive and even moving-average terms to the DAR(1) and INAR(1) setups (see [37, 38, 39]), all correlations remain non-negative. In contrast to their Gaussian ARMA brethren, DARMA and INARMA models do not span the entire range of possible correlation structures for stationary count series. Extensions of DARMA and INARMA methods are considered in [28], but again, none can produce series with negative correlations.
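To make the thinning construction concrete, here is a minimal simulation sketch of a Poisson INAR(1) series (not code from the paper; the function name and the Poisson innovation choice are illustrative). Taking the innovations as Poisson with mean $\lambda(1-p)$ yields a Poisson($\lambda$) stationary marginal and lag-$h$ correlation $p^{h}$, which is never negative.

```python
import numpy as np

def inar1_poisson(n, lam=5.0, p=0.4, seed=0):
    """Simulate X_t = p o X_{t-1} + eps_t with binomial thinning 'o' and
    eps_t ~ Poisson(lam * (1 - p)), giving a Poisson(lam) stationary marginal."""
    rng = np.random.default_rng(seed)
    x = np.empty(n, dtype=int)
    x[0] = rng.poisson(lam)                      # start in the stationary distribution
    for t in range(1, n):
        thinned = rng.binomial(x[t - 1], p)      # p o X_{t-1}: X_{t-1} Bernoulli(p) survivals
        x[t] = thinned + rng.poisson(lam * (1 - p))
    return x

x = inar1_poisson(10_000)
print(x.mean(), x.var())                         # both near lam = 5
print(np.corrcoef(x[:-1], x[1:])[0, 1])          # lag-one correlation near p = 0.4
```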

The works [5] and [9] take a different approach, constructing the desired count marginal distribution by combining IID copies of a correlated Bernoulli series in various ways. By using a binary series constructed from a stationary renewal sequence [5, 9, 27], a variety of marginal distributions, including binomial, Poisson, and negative binomial, were produced; moreover, these models can have negative correlations. While these models do not necessarily produce the most negatively correlated count structures possible, they often come close to achieving this bound. The work in [33] derives explicit autocovariance functions when the binary series is made by binning a stationary Gaussian sequence into zero-one categories and gives an example of a hurricane count data set where negative correlations arise. That said, some important count marginal distributions, including the generalized Poisson, are not easily built from this suite of methods. The results here easily generate any desired count marginal structure.

Other count model structures studied include Gaussian-based processes rounded to their nearest integer [29], hierarchical Bayesian count model approaches [3], generalized ARMA methods [4], and others. Each approach has some drawbacks. For example, rounding Gaussian processes to their nearest integer makes it difficult to produce a specific prescribed marginal distribution. Many hierarchical Bayesian procedures also exist for count series. These methods typically posit conditional distributions in lieu of marginal distributions. In our setup, by contrast, the Poisson stipulation is imposed on the marginal distribution of the data. This is not the same as positing a conditional Poisson setup where one takes $X_t$, given some random intensity $\lambda_t$, to have a Poisson distribution with mean $\lambda_t$. Indeed, as [2] shows, once the randomness of $\lambda_t$ is taken into account, the true marginal distribution can be far from Poisson.

As noted above, a very flexible class of stationary count time series models is constructed here, employing a latent Gaussian process and copula transformations. These techniques have recently shown promise in spatial statistics [14, 22], multivariate modeling [42, 43], and regression [35], but the theory has yet to be developed for count time series ([35, 32] provide some partial results). Our objectives here are several-fold. On a methodological level, it is shown, through some newly derived expansions based on Hermite polynomials, that accurate and efficient numerical quantification of the correlation structure of the copula count model class is feasible. Based on a result of [46], the class produces the most flexible pairwise correlation structures possible — a property quantified in Remark 2.2. Connections of the copula model are also revisited to both importance sampling schemes, where the popular GHK sampler is adapted to the time series context, and to the HMM literature, which allows natural extensions of the GHK sampler and computation of quantities other than the likelihood. All methodological contributions are tested on both synthetic and real data. Other prominent count time series works include [17, 12, 11, 18, 16].

The works [35, 32] are perhaps the closest papers to this study. While the general latent Gaussian construct adopted is the same, our work differs in that explicit autocovariance expansions are developed via Hermite expansions, flexibility and optimality issues of the model class are addressed, simple Gaussian pseudo-likelihood estimation of model parameters is developed, and both the importance sampling and HMM connections are explored in greater depth. More detail on the connections to [35, 32] can be found in the main body of the paper. The works [22, 24] are also closely related and will be referenced below with commentary, but their focus is on spatial count modeling.

The rest of this paper proceeds as follows. The next section introduces our latent Gaussian count model and establishes its basic mathematical and statistical properties. Section 3 moves to estimation, developing two techniques. The first method is a Gaussian pseudo-likelihood approach that involves only the mean and covariance of the series and can be numerically optimized to rapidly obtain model parameter estimates. The second method uses particle filtering techniques to construct an approximation of the true likelihood of the series. Here, connections to HMM and importance sampling methodologies are made. Section 4 presents simulation results, showing some standard cases where the simpler Gaussian pseudo-likelihood approach performs reasonably well. The section also shows a case where particle filtering likelihood estimates, which feel the entire joint count distributional structure, are superior. Section 5 analyzes a count series containing the annual number of no-hitter Major League Baseball games since 1893. Here, two covariates are considered: pitching mound height and the number of games played in each season. Section 6 closes the paper with some comments and suggestions for future research.

2 Theory

We are interested in constructing stationary time series that have marginal distributions from several families of count structures supported in , including:

  • Binomial (Bin()): , , ;

  • Poisson (Pois()): , with ;

  • Mixture Poisson (mixPois()): , where with the mixture probabilities such that and with for each ;

  • Negative binomial (NB()): , with and ;

  • Generalized Poisson (GPois()): , with and ;

  • Conway-Maxwell-Poisson (CMP): , with , , and a normalizing constant making the probabilities sum to unity.

The negative binomial, generalized Poisson (when $\eta > 0$), and Conway-Maxwell-Poisson (when $\nu < 1$) distributions are over-dispersed in that their variances are larger than their respective means. This is the case for the sample variances and means of many observed count time series.
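Several of these marginals (binomial, Poisson, negative binomial) are available in standard statistical software, but the generalized Poisson typically is not. The sketch below, assuming Consul's common $(\lambda, \eta)$ parameterization with mean $\lambda/(1-\eta)$ and variance $\lambda/(1-\eta)^3$, tabulates its mass and cumulative probabilities; the function names are illustrative and not from the paper.

```python
import numpy as np
from math import lgamma, log, exp

def gpois_pmf(x, lam, eta):
    """Consul's generalized Poisson pmf:
    P(X = x) = lam * (lam + eta*x)**(x - 1) * exp(-(lam + eta*x)) / x!,
    for integer x >= 0, with lam > 0 and 0 <= eta < 1 (eta > 0 gives over-dispersion)."""
    return exp(log(lam) + (x - 1) * log(lam + eta * x) - (lam + eta * x) - lgamma(x + 1))

def gpois_cum_probs(lam, eta, tol=1e-12, max_n=10_000):
    """Cumulative probabilities C_n = P(X <= n), n = 0, 1, ..., truncated once the tail is below tol."""
    cum_probs, cum, n = [], 0.0, 0
    while cum < 1.0 - tol and n < max_n:
        cum += gpois_pmf(n, lam, eta)
        cum_probs.append(cum)
        n += 1
    return np.array(cum_probs)

C = gpois_cum_probs(lam=2.0, eta=0.3)
p = np.diff(np.concatenate(([0.0], C)))          # recover the pmf from the cumulative probabilities
n = np.arange(len(p))
mean = (n * p).sum()
var = ((n - mean) ** 2 * p).sum()
print(mean, var)   # roughly lam/(1-eta) ~ 2.86 and lam/(1-eta)**3 ~ 5.83: variance exceeds mean
```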

Let $\{X_t\}$ be the stationary count time series of interest. Suppose that one wants the marginal cumulative distribution function (CDF) of $X_t$ for each time $t$ of interest to be $F_X(\cdot; \boldsymbol{\theta})$, depending on a vector $\boldsymbol{\theta}$ containing all CDF model parameters. The series will be modeled through the copula type transformation

(1)  $X_t = G(Z_t)$.

Here,

(2)  $G(x) = F_X^{-1}(\Phi(x); \boldsymbol{\theta})$,

where $\Phi$ is the CDF of a standard normal variable and

(3)  $F_X^{-1}(u; \boldsymbol{\theta}) = \inf\{x : F_X(x; \boldsymbol{\theta}) \ge u\}$,  $u \in (0, 1)$,

is the generalized inverse (quantile function) of the non-decreasing CDF $F_X$. The process $\{Z_t\}$ is assumed to be standard Gaussian, but possibly correlated in time:

(4)  $Z_t \sim \mathcal{N}(0, 1)$;

that is, $E[Z_t] = 0$ and $\mathrm{Var}(Z_t) = 1$ for each $t$. This approach was recently used by [42, 35, 22] with good results. The autocovariance function (ACVF) of $\{Z_t\}$, denoted by $\gamma_Z(h) = \mathrm{Cov}(Z_t, Z_{t+h})$, is the same as the autocorrelation function (ACF) due to the standard normal assumption and depends on another vector $\boldsymbol{\eta}$ of ACVF parameters.

The model in (1)–(3) has appeared in other bodies of literature under different nomenclature. In particular, [7, 8] call this setup the normal to anything (NORTA) procedure in operations research and [20] calls this a translational model in mechanical engineering. The goal here is to give a reasonably complete analysis of the probabilistic and statistical properties of these models.
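A minimal simulation sketch of (1)–(4), assuming a latent Gaussian AR(1) series and a Poisson marginal (both choices, and all function names, are illustrative rather than taken from the paper):

```python
import numpy as np
from scipy.stats import norm, poisson

def latent_gaussian_poisson(n, lam=3.0, phi=0.7, seed=0):
    """Simulate X_t = F^{-1}(Phi(Z_t)) with a Poisson(lam) marginal, where {Z_t} is a
    unit-variance Gaussian AR(1) series with lag-one correlation phi."""
    rng = np.random.default_rng(seed)
    z = np.empty(n)
    z[0] = rng.standard_normal()
    for t in range(1, n):
        # innovation variance 1 - phi**2 keeps Var(Z_t) = 1, so gamma_Z is also the ACF
        z[t] = phi * z[t - 1] + np.sqrt(1.0 - phi ** 2) * rng.standard_normal()
    u = norm.cdf(z)                       # probability integral transform: U_t ~ Uniform(0, 1)
    x = poisson.ppf(u, lam).astype(int)   # quantile (generalized inverse) of the target marginal
    return x

x = latent_gaussian_poisson(20_000)
print(x.mean(), x.var())                  # both close to lam = 3
print(np.corrcoef(x[:-1], x[1:])[0, 1])   # positive, but somewhat below phi = 0.7
```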

The construction in (1) ensures that the marginal CDF of $X_t$ is indeed $F_X$. Elaborating, the probability integral transformation theorem shows that $\Phi(Z_t)$ has a uniform distribution over $(0, 1)$ for each $t$; a second application of the result justifies that $X_t = F_X^{-1}(\Phi(Z_t); \boldsymbol{\theta})$ has marginal distribution $F_X$ for each $t$. Temporal dependence in $\{Z_t\}$ will induce temporal dependence in $\{X_t\}$, as quantified in the next section. For notation, let

(5)  $\gamma_X(h) = \mathrm{Cov}(X_t, X_{t+h})$

denote the ACVF of $\{X_t\}$.

2.1 Relationship between autocovariances

The autocovariance functions of $\{X_t\}$ and $\{Z_t\}$ can be related using Hermite expansions (see Chapter 5 of [41]). More specifically, let

(6)  $G(x) = \sum_{k=0}^{\infty} g_k H_k(x)$

be the expansion of $G$ in terms of the Hermite polynomials

(7)  $H_k(x) = (-1)^k e^{x^2/2} \frac{d^k}{dx^k} e^{-x^2/2}$,  $k = 0, 1, 2, \ldots$

The first three Hermite polynomials are $H_0(x) = 1$, $H_1(x) = x$, and $H_2(x) = x^2 - 1$; higher order polynomials can be obtained from the recursion $H_{k+1}(x) = x H_k(x) - k H_{k-1}(x)$. The Hermite coefficients are

(8)  $g_k = \frac{1}{k!} E[G(Z) H_k(Z)] = \frac{1}{k!} \int_{-\infty}^{\infty} G(z) H_k(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz$,  $Z \sim \mathcal{N}(0, 1)$.

The relationship between $\gamma_X$ and $\gamma_Z$ is key and is extracted from Chapter 5 of [41] as

(9)  $\gamma_X(h) = L(\gamma_Z(h))$,

where the power series $L$ is

(10)  $L(u) = \sum_{k=1}^{\infty} k!\, g_k^2\, u^k$.

In particular,

(11)  $\gamma_X(0) = \mathrm{Var}(X_t) = \sum_{k=1}^{\infty} k!\, g_k^2$

depends only on the parameters in the marginal distribution $F_X$. Note also that

(12)  $\rho_X(h) = \ell(\rho_Z(h))$,

where

(13)  $\ell(u) = \sum_{k=1}^{\infty} \ell_k u^k$,  $\ell_k = \frac{k!\, g_k^2}{\gamma_X(0)}$,

and $\rho_X(h) = \gamma_X(h)/\gamma_X(0)$, $\rho_Z(h) = \gamma_Z(h)$ (recall that $\gamma_Z(0) = 1$). The function $\ell$ maps $[-1, 1]$ into (but not necessarily onto) $[-1, 1]$. For future reference, note that $\ell(0) = 0$ and by (11),

(14)  $\ell(1) = \sum_{k=1}^{\infty} \ell_k = 1$.

Using (6) and $H_k(-x) = (-1)^k H_k(x)$ gives

(15)  $\ell(-1) = \frac{1}{\gamma_X(0)} \sum_{k=1}^{\infty} (-1)^k k!\, g_k^2 = \frac{\mathrm{Cov}(G(Z), G(-Z))}{\gamma_X(0)}$;

however, $\ell(-1)$ is not necessarily $-1$ in general. As such, $\ell$ “starts” at $(-1, \ell(-1))$, passes through $(0, 0)$, and connects to $(1, 1)$. Examples will be given in Section 2.4.

From (12), one can see that

(16)  $|\rho_X(h)| \le |\rho_Z(h)|$.

This will be useful later. Equation (12) shows that a positive $\gamma_Z(h)$ leads to a positive $\gamma_X(h)$. The same holds for a negative sign since $\ell$ is, in fact, monotone increasing (see Proposition 2.1 below) and crosses zero at the origin (the negativeness of $\gamma_X(h)$ when $\gamma_Z(h) < 0$ can also be deduced from the non-decreasing nature of $G$ via an inequality on page 20 of [45] for Gaussian variables).

The quantity $\ell$ is called a link function, and $\ell_k$, $k \ge 1$, are called link coefficients. (Sometimes, slightly abusing the terminology, we shall also use these terms for $L$ and $k!\, g_k^2$, respectively.) A key feature in (9) is that the effects of the marginal CDF $F_X$ and the ACVF $\gamma_Z$ are “decoupled” in the sense that the correlation parameters in $\boldsymbol{\eta}$ do not influence the coefficients $g_k$ in (9); this will be useful in estimation later.

Remark 2.1.

The short- and long-range dependence properties of $\{X_t\}$ can be extracted from those of $\{Z_t\}$. Recall that a time series is short-range dependent (SRD) if $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$. According to one definition, a series is long-range dependent (LRD) if $\gamma(h) = L_s(h)\, h^{2d - 1}$ as $h \to \infty$, where $d \in (0, 1/2)$ is the LRD parameter and $L_s$ is a slowly varying function at infinity [41]. The ACVF of such LRD series satisfies $\sum_{h=-\infty}^{\infty} |\gamma(h)| = \infty$. If $\{Z_t\}$ is SRD, then so is $\{X_t\}$ by (16). On the other hand, if $\{Z_t\}$ is LRD with parameter $d$, then $\{X_t\}$ can be either LRD or SRD. The conclusion depends, in part, on the Hermite rank of $G$, which is defined as $k_0 = \min\{k \ge 1 : g_k \ne 0\}$. Specifically, if $k_0(1 - 2d) > 1$, then $\{X_t\}$ is SRD; if $k_0(1 - 2d) < 1$, then $\{X_t\}$ is LRD with parameter $k_0(d - 1/2) + 1/2$ (see [41], Proposition 5.2.4). For example, when the Hermite rank is unity, $\{X_t\}$ is LRD with parameter $d$ for all $d \in (0, 1/2)$; when $k_0 = 2$, $\{X_t\}$ is LRD with parameter $2d - 1/2$ for $d \in (1/4, 1/2)$.

Remark 2.2.

The construction in (1)–(2) yields models with very flexible autocorrelations. In fact, the methods achieve the most flexible correlation possible for the pair $(X_t, X_{t+h})$ when $X_t$ and $X_{t+h}$ have the same marginal distribution $F_X$. Indeed, let $\rho_{\max} = \mathrm{Corr}(F_X^{-1}(U), F_X^{-1}(U))$ and define $\rho_{\min}$ similarly with the second $U$ replaced by $1 - U$. Then, as shown in Theorem 2.5 of [46],

$\rho_{\min} \le \mathrm{Corr}(X_t, X_{t+h}) \le \rho_{\max}$,

where $U$ is a uniform random variable over $(0, 1)$. Since $\Phi(Z)$ is uniform over $(0, 1)$ and $1 - \Phi(Z) = \Phi(-Z)$ for a standard normal random variable $Z$, the maximum and minimum correlations $\rho_{\max}$ and $\rho_{\min}$ are indeed achieved with (1)–(2) when $\gamma_Z(h) = 1$ and $\gamma_Z(h) = -1$, respectively. The preceding statements are non-trivial for $\rho_{\min}$ only since $\rho_{\max} = 1$ is attained whenever $X_{t+h} = X_t$. It is worthwhile to compare this to the discussion surrounding (15). Finally, all correlations in $(\rho_{\min}, \rho_{\max})$ are achievable since $\ell(u)$ in (13) is continuous in $u$. The flexibility of correlations for Gaussian copula models in the spatial context was also noted and studied in [22], especially when compared to a competing class of hierarchical, e.g. Poisson, models.

The preceding remark all but settles flexibility of autocovariance debates for stationary count series. Flexibility is a concern when the count series is negatively correlated, an issue arising in the hurricane data in [33]. Since a general count marginal distribution can also be achieved, the model class appears quite general.
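The bounds in Remark 2.2 are easy to probe numerically. The sketch below (an illustration, not the paper's code) estimates the most negative correlation achievable by two variables sharing a Poisson marginal via the antithetic coupling, which the latent Gaussian model attains when the latent correlation equals $-1$.

```python
import numpy as np
from scipy.stats import norm, poisson

rng = np.random.default_rng(1)
lam = 2.0
z = rng.standard_normal(1_000_000)

# Antithetic coupling: Phi(z) and Phi(-z) = 1 - Phi(z) are the most negatively dependent
# uniforms, so this Monte Carlo estimate approximates rho_min for a Poisson(lam) marginal.
x_pos = poisson.ppf(norm.cdf(z), lam)
x_neg = poisson.ppf(norm.cdf(-z), lam)
rho_min = np.corrcoef(x_pos, x_neg)[0, 1]
print(rho_min)   # strictly greater than -1: the most negative correlation a Poisson(lam) pair can have

# The comonotone coupling (same Z in both coordinates) gives rho_max = 1, attained when gamma_Z(h) = 1.
```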

2.2 Covariates

There are situations where stationarity is not desired. Such scenarios can often be accommodated by simple variants of the above setup. For concreteness, consider a situation where non-random covariates are available to explain the series at time $t$, say $\boldsymbol{M}_t$. If one wants $X_t$ to have the marginal distribution $F_X(\cdot; \boldsymbol{\theta}_t)$, where $\boldsymbol{\theta}_t$ is a vector-valued function of $\boldsymbol{M}_t$ containing parameters, then simply set

(17)  $X_t = F_X^{-1}(\Phi(Z_t); \boldsymbol{\theta}_t)$

and reason as before.

Link functions, not to be confused with $\ell$ in (12)–(13), can be used when parametric support set bounds are encountered. As an example, a Poisson regression with correlated errors can be formulated via

$X_t = F_{\mathrm{Pois}(\lambda_t)}^{-1}(\Phi(Z_t))$,  $\lambda_t = e^{\boldsymbol{\beta}^\prime \boldsymbol{M}_t}$.

Here, the exponential link guarantees that the Poisson parameter $\lambda_t$ is positive and $\boldsymbol{\beta}$ are regression coefficients. The above construct requires the covariates to be non-random; should covariates be random, the marginal distribution may change.
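A sketch of the covariate construction in (17) with an exponential link follows; the trend covariate, coefficient values, AR(1) latent series, and function names are hypothetical illustrations, not prescriptions from the paper.

```python
import numpy as np
from scipy.stats import norm, poisson

def latent_gaussian_poisson_regression(M, beta, phi=0.6, seed=0):
    """X_t = F^{-1}_{Pois(lambda_t)}(Phi(Z_t)) with lambda_t = exp(beta' M_t):
    the marginal of X_t is Poisson(lambda_t), while {Z_t} carries the serial dependence."""
    rng = np.random.default_rng(seed)
    n = M.shape[0]
    lam = np.exp(M @ beta)                   # exponential link keeps lambda_t > 0
    z = np.empty(n)
    z[0] = rng.standard_normal()
    for t in range(1, n):
        z[t] = phi * z[t - 1] + np.sqrt(1 - phi ** 2) * rng.standard_normal()
    return poisson.ppf(norm.cdf(z), lam).astype(int)

n = 500
M = np.column_stack([np.ones(n), np.linspace(0, 1, n)])   # intercept plus a hypothetical trend covariate
x = latent_gaussian_poisson_regression(M, beta=np.array([0.5, 1.0]))
print(x[:10])
```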

2.3 Calculation and properties of the Hermite coefficients

Several strategies for Hermite coefficient computation are available. We consider the stationary setting here for simplicity. Since $G$ in (2) is discrete, the following approach proved simple, stable, and revealing. Let $\boldsymbol{\theta}$ denote all parameters appearing in the marginal distribution $F_X$. For $\boldsymbol{\theta}$ fixed, define the mass and cumulative probabilities of $X_t$ via

(18)  $p_n = P(X_t = n)$,  $C_n = P(X_t \le n)$,  $n = 0, 1, 2, \ldots$,

where dependence on $\boldsymbol{\theta}$ is notationally suppressed. Note that

(19)  $G(z) = \sum_{n=0}^{\infty} n\, \mathbf{1}_{\{\Phi^{-1}(C_{n-1}) < z \le \Phi^{-1}(C_n)\}}$

(take $C_{-1} = 0$ as a convention). When $C_n = 0$, we take $\Phi^{-1}(C_n) = -\infty$ and, when $C_n = 1$, we take $\Phi^{-1}(C_n) = \infty$. Using this in (8) provides, for $k \ge 1$,

$g_k = \frac{1}{k!} \sum_{n=0}^{\infty} n \int_{\Phi^{-1}(C_{n-1})}^{\Phi^{-1}(C_n)} H_k(z) \frac{e^{-z^2/2}}{\sqrt{2\pi}}\, dz.$

Plugging (7) into the above equation and simplifying provides

(20)  $g_k = \frac{1}{k!} \sum_{n=0}^{\infty} n \left[ H_{k-1}(\Phi^{-1}(C_{n-1}))\, \phi(\Phi^{-1}(C_{n-1})) - H_{k-1}(\Phi^{-1}(C_n))\, \phi(\Phi^{-1}(C_n)) \right],$

where $\phi$ denotes the standard normal density. The telescoping nature of the series in (20) provides

(21)  $g_k = \frac{1}{k!} \sum_{n=0}^{\infty} H_{k-1}(\Phi^{-1}(C_n))\, \phi(\Phi^{-1}(C_n))$

(convergence issues are addressed in Lemma 2.1 below). When $\Phi^{-1}(C_n) = \pm\infty$ (that is, $C_n = 0$ or $C_n = 1$), the summand is interpreted as zero. Before proceeding, the following results will clarify a number of coefficient issues. The key technical step is Lemma 2.1 below. As noted in these remarks and also in the next section, (21) is appealing from a numerical standpoint and also sheds light on the behavior of the Hermite coefficients.

Lemma 2.1.

The representation in (21) is valid whenever $E[X_t^{2+\epsilon}] < \infty$ for some $\epsilon > 0$.

Proof.

Observe that one obtains (21) from (20) if, after changing to for notational simplicity,

(22)

To see that this holds when for some , suppose that for all , since otherwise the sum in (22) has a finite number of terms. Since is a polynomial of degree , for some constant that depends on . The sum in (22) can hence be bounded (up to a constant) by

(23)

To show that (23) converges, it suffices to show that

(24)

since as . The Mills ratio for a standard normal distribution states that $1 - \Phi(x) \sim \phi(x)/x$ as $x \to \infty$. Substituting $x = \Phi^{-1}(C_n)$ gives as . Taking logarithms in the last relation and ignoring constant terms, order arguments show that as . Substituting into (24) provides

(25)

For any and , one can verify that . Using this in (25) and , it suffices to prove that

(26)

for some . Since and are assumed, the Markov inequality gives . Thus the sum in (26) is bounded by

(27)

But (27) converges whenever . Choosing such a proves (22) and finishes our work. ∎

Remark 2.3.

From a numerical standpoint, the expression in (21) is evaluated as follows. The families of marginal distributions considered in this work have fairly “light” tails, meaning that $C_n$ approaches unity rapidly as $n \to \infty$. This means that $C_n$ becomes exactly unity numerically for small to moderate values of $n$. Let $n^*$ be the smallest such value. For example, for the Poisson distribution with parameter and Matlab software, , , and . For $n \ge n^*$, the numerical value of $\Phi^{-1}(C_n)$ is infinite and the corresponding terms in (21) are numerically zero and can be discarded. Thus, (21) becomes

(28)  $g_k = \frac{1}{k!} \sum_{n=0}^{n^* - 1} H_{k-1}(\Phi^{-1}(C_n))\, \phi(\Phi^{-1}(C_n)).$

Alternatively, one could calculate the Hermite coefficients using Gaussian quadrature methods, as discussed e.g. in [22], p. 51. The approach based on (28) though is certainly simpler numerically. Furthermore, as noted below, the expression (28) can shed further light on the behavior of the Hermite coefficients.
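Assuming the telescoped representation (21) truncated as in (28), and the normalization $\ell_k = k!\, g_k^2/\gamma_X(0)$ used in (13), the Hermite and link coefficients can be computed in a few lines. The sketch below uses SciPy's probabilists' Hermite polynomials and a Poisson marginal; the function names are illustrative, not the paper's code.

```python
import numpy as np
from math import factorial
from scipy.stats import norm, poisson
from scipy.special import eval_hermitenorm   # probabilists' Hermite polynomials He_k

def hermite_and_link_coefs(cum_probs, var_x, K=25):
    """Hermite coefficients g_k (k = 1..K) from the truncated telescoped sum
    g_k = (1/k!) * sum_n He_{k-1}(Phi^{-1}(C_n)) * phi(Phi^{-1}(C_n)),
    plus normalized link coefficients k! g_k^2 / var_x.  Cumulative probabilities
    C_n that are numerically one contribute nothing and are dropped."""
    C = np.asarray(cum_probs, dtype=float)
    C = C[C < 1.0]
    q = norm.ppf(C)                          # Phi^{-1}(C_n)
    dens = norm.pdf(q)                       # phi(Phi^{-1}(C_n))
    g = np.array([np.sum(eval_hermitenorm(k - 1, q) * dens) / factorial(k)
                  for k in range(1, K + 1)])
    link = np.array([factorial(k) * g[k - 1] ** 2 / var_x for k in range(1, K + 1)])
    return g, link

lam = 5.0
C = poisson.cdf(np.arange(0, 200), lam)      # Poisson(lam): variance equals the mean
g, link = hermite_and_link_coefs(C, var_x=lam)
print(link[:5])
print(link.sum())    # should be close to 1 (cf. (14)) if the K = 25 truncation suffices
```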

Remark 2.4.

Assuming that the $g_k$ are evaluated through (28), their asymptotic behavior as $k \to \infty$ can be quantified. We focus on $\sqrt{k!}\, g_k$, whose squares are the (unnormalized) link coefficients. The asymptotic relation for Hermite polynomials gives the behavior of $H_k(x)$ for fixed $x$ as $k \to \infty$. Using this and Stirling’s formula ($k! \sim \sqrt{2\pi k}\,(k/e)^k$ as $k \to \infty$) show that

(29)

Numerically, this approximation, which does not involve Hermite polynomials, was found to be accurate for even moderate values of $k$. It implies that the link coefficients decay (up to a constant) only at a polynomial rate in $k$. While this might seem slow, these coefficients are multiplied by $\gamma_Z(h)^k$ in (9), which decays geometrically rapidly in $k$ to zero, except in degenerate cases when $|\gamma_Z(h)| = 1$.

Figure 1: The link coefficients on a log-vertical scale for the Poisson (left) and negative binomial (right) distributions.

The computation and behavior of the link coefficients are now examined for several families of marginal distributions. Figure 1 shows plots of the link coefficients $\ell_k$ on a vertical log scale over a range of parameter values for the Poisson and negative binomial distributions. A number of observations are worth making.

Since $\ell_k \ge 0$ and $\sum_{k=1}^{\infty} \ell_k = 1$ by construction, parameter values in Figure 1 with $\ell_1$ close to unity imply that most of the “weight” in the link coefficients is contained in the first coefficient, with higher order coefficients being considerably smaller and decaying with increasing $k$. This takes place over part of the displayed parameter range for the Poisson distribution and part of the range for the negative binomial distribution. Such cases will be called “condensed”. As shown in Section 2.4 below, in the condensed case the link function $\ell$ is close to the identity, and correlations in $\{X_t\}$ and $\{Z_t\}$ are similar.

Non-condensed cases are referred to as “diffuse”. Here, weight is spread across many link coefficients. This happens over the remaining portions of the parameter ranges for the Poisson and negative binomial distributions. This was expected for small parameter values: these cases correspond to discrete random structures that are nearly degenerate in the sense that they concentrate at zero. For such cases, large negative correlations in (15) are not possible; hence, $\ell(-1)$ cannot be close to $-1$ and correlations in $\{X_t\}$ and $\{Z_t\}$ differ. The diffuse range for the negative binomial distribution remains to be understood, although it likely again reflects some form of degeneracy.

2.4 Calculation and properties of link functions

We now study calculation of the link function $\ell(u)$ in (13), which requires truncation of the sum to $k \le K$ for some finite $K$. Note again that the link coefficients are multiplied by $\gamma_Z(h)^k$ in (9) before they are summed, which decays to zero geometrically rapidly in $k$ for most stationary $\{Z_t\}$ of interest when $|\gamma_Z(h)| < 1$. The link coefficients for large $k$ are therefore expected to play a minor role. We now set $K = 25$ and explore consequences of this choice.
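Given a vector of retained link coefficients, evaluating the truncated link, and inverting it using the monotonicity established in Proposition 2.1, might look as follows (an illustrative sketch; the grid-based inversion is one convenient choice, not the paper's prescription):

```python
import numpy as np

def link_function(link_coefs, u):
    """Truncated link ell(u) = sum_{k=1}^{K} ell_k u^k for the K = len(link_coefs) retained terms."""
    k = np.arange(1, len(link_coefs) + 1)
    u = np.atleast_1d(np.asarray(u, dtype=float))
    return (np.asarray(link_coefs)[None, :] * u[:, None] ** k[None, :]).sum(axis=1)

def link_inverse(link_coefs, target, grid_size=2001):
    """Invert the (monotone increasing) truncated link numerically on a grid, mapping a
    desired count correlation back to the latent Gaussian correlation."""
    u = np.linspace(-1.0, 1.0, grid_size)
    ell = link_function(link_coefs, u)
    return np.interp(target, ell, u)      # np.interp needs ell increasing, which monotonicity gives

# Example, reusing the link coefficients `link` from the Section 2.3 sketch:
# print(link_function(link, [0.5, 1.0]))   # ell(1) should be close to 1
# print(link_inverse(link, 0.3))           # latent correlation needed for a count correlation of 0.3
```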

Remark 2.5.

An alternative procedure would bound the approximation (29) from above and let $K$ be the smallest $k$ for which the bound falls below some predetermined error tolerance. In the Poisson case, for example, the resulting values of $K$ are close to the chosen value of 25. A different bound and resulting truncation in the spatial context can be found in [22], Lemma 2.2.

Figure 2: The link function for the Poisson distribution (left) and the negative binomial distribution (right), each for several parameter values.

Figure 2 plots the link function $\ell(u)$ (solid line) for the Poisson and negative binomial distributions for several parameter values. The link function is computed by truncating its expansion to $K = 25$ terms as discussed above. The condensed cases lead to curves that are close to the diagonal $\ell(u) = u$. However, the diffuse cases appear more delicate. Diffusivity and truncation of the infinite series in (13) lead to a computed link function that does not have $\ell(1) = 1$ (see (14)); in this case, one should increase the number of terms in the summation.

Though deviations from $\ell(1) = 1$ might seem large (most notably for the negative binomial distribution), this seems to arise only in the more degenerate cases associated with diffusivity; moreover, it matters only when linking an ACVF of $\{Z_t\}$ at lags for which $\gamma_Z(h)$ is close to unity. For example, if the truncated link deviates from unity at $u = 1$ by a given amount, the error incurred when linking a $\gamma_Z(h)$ of 0.8 (or smaller but positive) is far smaller, since the omitted terms are further damped by $\gamma_Z(h)^k$. In practice, any link deviation could be partially corrected by adding one extra “pseudo link coefficient”, in our case a 26th coefficient, which would make the link function pass through $(1, 1)$. The resulting link function is depicted as the dashed line in Figure 2 around the point $(1, 1)$ and essentially coincides with the original link function for all $u$’s except possibly for values that are close to unity.

The situation for negative $u$ and, in particular, around $u = -1$ is different: the theoretical value of $\ell(-1)$ in (15) is not explicitly known. However, a similar correction could be achieved by first estimating $\ell(-1)$ through a Monte-Carlo simulation and adding a pseudo 26th coefficient making the computed link function connect to the desired value at $u = -1$. This is again depicted for negative $u$ via the dashed lines in Figure 2, which are visually distinguishable from the solid lines only near $u = -1$ (and then only in some cases).
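A possible implementation of this two-sided pseudo-coefficient correction is sketched below; it is one reading of the description above, and the Monte-Carlo estimate of $\ell(-1)$ together with the piecewise use of a single extra order-$(K+1)$ coefficient are assumptions, not the paper's code.

```python
import numpy as np
from scipy.stats import norm, poisson

def ell_minus_one_mc(quantile_fn, n_mc=1_000_000, seed=0):
    """Monte Carlo estimate of ell(-1): the correlation of F^{-1}(Phi(Z)) and F^{-1}(Phi(-Z))."""
    z = np.random.default_rng(seed).standard_normal(n_mc)
    return np.corrcoef(quantile_fn(norm.cdf(z)), quantile_fn(norm.cdf(-z)))[0, 1]

def corrected_link(link_coefs, u, ell_m1):
    """Truncated link with one pseudo coefficient of order K+1: for u >= 0 the extra term forces
    the curve through (1, 1); for u < 0 it forces it through (-1, ell_m1)."""
    link_coefs = np.asarray(link_coefs, dtype=float)
    K = len(link_coefs)
    k = np.arange(1, K + 1)
    u = np.atleast_1d(np.asarray(u, dtype=float))
    base = (link_coefs[None, :] * u[:, None] ** k[None, :]).sum(axis=1)
    ell_p1 = link_coefs.sum()                        # truncated link value at u = +1
    ell_n1 = (link_coefs * (-1.0) ** k).sum()        # truncated link value at u = -1
    c_pos = 1.0 - ell_p1                             # pseudo coefficient for the positive half
    c_neg = (ell_m1 - ell_n1) * (-1.0) ** (K + 1)    # pseudo coefficient for the negative half
    extra = np.where(u >= 0, c_pos, c_neg) * u ** (K + 1)
    return base + extra

# Example (Poisson(2) marginal, link coefficients from the Section 2.3 sketch):
# ell_m1 = ell_minus_one_mc(lambda q: poisson.ppf(q, 2.0))
# print(corrected_link(link, [-1.0, 0.0, 1.0], ell_m1))   # approximately [ell_m1, 0, 1]
```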

Remark 2.6.

In our ensuing estimation work, a link function needs to be evaluated multiple times; hence, running Monte-Carlo simulations to evaluate $\ell(-1)$ can become computationally expensive. In this case, the estimation procedure is fed precomputed values of $\ell(-1)$ on a grid of parameter values, and interpolation is used for intermediate parameter values.
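A minimal sketch of this precompute-and-interpolate idea, with a hypothetical grid of Poisson means (grid choices and function names are illustrative):

```python
import numpy as np
from scipy.stats import norm, poisson

def ell_minus_one_mc(quantile_fn, n_mc=200_000, seed=0):
    """Monte Carlo estimate of ell(-1) for a marginal given by its quantile function."""
    z = np.random.default_rng(seed).standard_normal(n_mc)
    return np.corrcoef(quantile_fn(norm.cdf(z)), quantile_fn(norm.cdf(-z)))[0, 1]

# Precompute ell(-1) on a hypothetical grid of Poisson means, then interpolate between them.
lam_grid = np.linspace(0.5, 20.0, 20)
ell_m1_grid = np.array([ell_minus_one_mc(lambda q, l=l: poisson.ppf(q, l)) for l in lam_grid])

def ell_m1_lookup(lam):
    """Linear interpolation of the precomputed ell(-1) values at intermediate parameter values."""
    return np.interp(lam, lam_grid, ell_m1_grid)

print(ell_m1_lookup(3.7))
```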

The next result further quantifies the structure of the link function. The result implies that $\ell(u)$ is nondecreasing as a function of $u$. The link’s strict monotonicity is known from [20] when $G$ is non-decreasing and differentiable, which does not hold in our case. (Non-strict) monotonicity for arbitrary non-decreasing $G$ is also argued in [7]. Our argument extends strict monotonicity to our setting and identifies an explicit form for the link function’s derivative.

Proposition 2.1.

Let $\ell$ be the link function in (13). Then, for $u \in (-1, 1)$,

(30)

In particular, $\ell$ is monotone increasing for $u \in (-1, 1)$.

This result is proven in Appendix A.

Remark 2.7.

The antiderivative appearing in (30) does not seem to have a closed form expression for general marginal distributions. (If it did, then one could integrate (30) explicitly and get a closed form expression for $\ell$.) But a number of numerical ways to evaluate the above integral over a finite interval have been studied; see, for example, [19], Section 2.

2.5 Particle filtering and the HMM connection

This subsection studies the implications of the latent structure of our model, especially as it relates to HMMs and importance sampling approaches. This will be used in constructing particle filtering (PF) approximations to the true likelihood of the model and in goodness-of-fit assessments. The first suggested PF approximation of the likelihood is essentially the popular Geweke-Hajivassiliou-Keane (GHK) sampler for the truncated multivariate normal distribution, as discussed in more detail in Remark 2.9 below. Otherwise, we adhere more closely to the terminology and approaches from the HMM literature.

Our main HMM reference is [15]. As in that monograph, the observations are taken to start at time zero. The following notations are key: let $\hat{Z}_t$ denote the one-step-ahead linear prediction of the latent Gaussian series from the history $Z_0, \ldots, Z_{t-1}$. This will be expressed as $\hat{Z}_t = \sum_{j=1}^{t} \phi_{t,j} Z_{t-j}$. The weights $\phi_{t,j}$, $j = 1, \ldots, t$, can be computed recursively in $t$ and efficiently from the ACVF of $\{Z_t\}$ via the Durbin-Levinson (DL) algorithm, for example. By convention, $\hat{Z}_0 = 0$. Let $r_t = E[(Z_t - \hat{Z}_t)^2]$ be the corresponding mean-squared prediction error.
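For reference, a standard Durbin-Levinson recursion (a textbook implementation sketch, not the paper's code) that produces the prediction weights and mean-squared errors from the ACVF of $\{Z_t\}$:

```python
import numpy as np

def durbin_levinson(gamma, T):
    """Durbin-Levinson recursion: from the ACVF values gamma(0), ..., gamma(T), compute
    weights phi[t, j] (j = 1..t) with Zhat_t = sum_j phi[t, j] * Z_{t-j} and the
    mean-squared errors v[t] = E(Z_t - Zhat_t)^2 (v[0] = gamma(0), since Zhat_0 = 0)."""
    gamma = np.asarray(gamma, dtype=float)
    phi = np.zeros((T + 1, T + 1))
    v = np.zeros(T + 1)
    v[0] = gamma[0]
    for t in range(1, T + 1):
        acc = gamma[t] - np.sum(phi[t - 1, 1:t] * gamma[1:t][::-1])
        phi[t, t] = acc / v[t - 1]
        for j in range(1, t):
            phi[t, j] = phi[t - 1, j] - phi[t, t] * phi[t - 1, t - j]
        v[t] = v[t - 1] * (1.0 - phi[t, t] ** 2)
    return phi, v

# AR(1) latent ACVF gamma_Z(h) = 0.7**h: the prediction weights collapse to a single lag.
phi, v = durbin_levinson(0.7 ** np.arange(0, 11), T=10)
print(phi[5, 1:6])   # approximately [0.7, 0, 0, 0, 0]
print(v[:4])         # [1.0, 0.51, 0.51, 0.51]
```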

The following problems take center stage:

Filtering
Prediction

as well as numerically computing $E[f(Z_t) \mid X_0, \ldots, X_t]$ and $E[f(Z_{t+1}) \mid X_0, \ldots, X_t]$ for a given function $f$. Here, $E[\cdot \mid X_0, \ldots, X_t]$ refers to an expectation conditioned on $X_0, \ldots, X_t$. These quantities are needed to evaluate likelihoods in Section 3.2 and for model diagnostics in Section 3.3.

Our first task is to derive expressions for the above distributions. We use the notation

(31)

its role stemming from