Semiparametric Bayesian Forecasting of Spatial Earthquake Occurrences

02/05/2020 ∙ by Aleksandar A. Kolev, et al. ∙ 0

Self-exciting Hawkes processes are used to model events which cluster in time and space, and have been widely studied in seismology under the name of the Epidemic Type Aftershock Sequence (ETAS) model. In the ETAS framework, the occurrence of the mainshock earthquakes in a geographical region is assumed to follow an inhomogeneous spatial point process, and aftershock events are then modelled via a separate triggering kernel. Most previous studies of the ETAS model have relied on point estimates of the model parameters due to the complexity of the likelihood function, and the difficulty in estimating an appropriate mainshock distribution. In order to take estimation uncertainty into account, we instead propose a fully Bayesian formulation of the ETAS model which uses a nonparametric Dirichlet process mixture prior to capture the spatial mainshock process. Direct inference for the resulting model is problematic due to the strong correlation of the parameters for the mainshock and triggering processes, so we instead use an auxiliary latent variable routine to perform efficient inference.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The Hawkes process (hawkes1971spectra) is a widely used point process model for describing events which are clustered in time and/or space. Applications of the Hawkes process span a diverse range of fields, such as credit risk (errais2010affine), criminology (mohler2011self), neuroscience (chornoboy1988maximum), social interaction modelling (crane2008robust), and terrorism (porter2012self). A particular form of the Hawkes process, known as the Epidemic-Type Aftershock Sequence (ETAS) model, is used in seismology to forecast earthquake occurrences and quantify seismic risk (ogata1988statistical; marzocchi2009real; schoenberg2013facilitated; omi2014estimating; omi2015intermediate; fox2016spatially).

The standard ETAS model has two components: an inhomogeneous spatial background process which describes the long-term rate of mainshock activity in a geographical area, and a triggering component which describes the rate and location of aftershock events. Parameter estimation in the ETAS model is known to be difficult for two reasons (veen2008estimation; zhuang2002stochastic): first, the likelihood function is complex and exhibits multi-modality and extended flat regions. Second, the model requires a specification of the inhomogeneous background process which can be challenging when only a limited number of mainshock events are available. Due to these complications, the majority of the ETAS literature focuses only on point estimates of the model parameters and ignores any estimation uncertainty, although some recent exceptions are (fox2016spatially) and (Wang2010).

While the majority of previous work on the ETAS model is carried out in a frequentist framework, the Bayesian paradigm provides a natural alternative for incorporating parameter uncertainty. However the application of Bayesian methods to the ETAS model have been limited by the complexity of the likelihood function, and many of the approaches which purport to be Bayesian still rely on point estimates of model parameters (ebrahimian2013adaptive)

with the prior only used for regularisation purposes. Limited attempts have been made at performing full Bayesian inference for the temporal ETAS model (without a spatial component) using Markov Chain Monte Carlo

(omi2015intermediate; Ebrahjmian2017; kumazawa2014nonstationary) but a recent technical report by one of the current authors (Ross2018) casts doubt on the validity of this procedure since the high correlation of parameters in the likelihood function means the effective sample size will be unreasonably low even for a large number of simulations. Additionally, the Bayesian estimation of the inhomogeneous background rate has not to our knowledge been previously considered.

In this paper we present the first fully Bayesian treatment of the ETAS model. Our main innovation is the use of a nonparametric Dirichlet process prior to estimate the inhomogenous spatial background rate. Due to the complexity of the ETAS likelihood function, direct estimation of the resulting nonparametric model is not feasible. As such, we develop a novel estimation procedure using auxiliary latent variables which is an extension of recent work by (Markwick2020) originally proposed for temporal (non-spatial) Hawkes processes. Although our application here is focused on seismology, a similar version of our model could be deployed in other spatial Hawkes applications such as modelling crime hotspots (mohler2011self).

We begin in Section 2 with a brief introduction to the spatial ETAS model and its use in modelling clustered event sequences. Then in Section 3 we introduce the Dirichlet process prior that will be used for nonparametric spatial estimation. We next discuss posterior inference strategies using auxiliary latent variables in Section 4. Issues relating to prior choice and goodness-of-fit testing are discussed in Sections 5, 6 and 7. Finally, we study the performance of the resulting spatial ETAS model on synthetic and real earthquake catalogs in Sections 8 and 9 respectively.

2 Spatio-temporal ETAS model

In the ETAS formulation (ogata1998space), earthquakes are modelled as events from a marked point process on the temporal interval and spatial region , where denotes the time at which the earthquake occurred, with and denoting its magnitude and spatial location. It is well known that earthquakes cluster together in both space and time, since large earthquakes tend to trigger further earthquakes nearby, known as aftershocks (utsu1995centenary). To capture this behaviour, the ETAS point process uses the following conditional intensity function:


where denotes the history of the process before time , and is the magnitude of completeness of the catalog, which is determined empirically and corresponds to the minimum magnitude above which all earthquakes are successfully detected (gutenberg1944frequency; wiemer2000minimum). The magnitudes are then assumed to be independent and identically distributed according to the usual Gutenberg-Richter (G-R) law (gutenberg1944frequency; fox2016spatially). As a point of notation, we will suppress all explicit dependence on and simply write for the conditional intensity function.

The function is known as the background process and specifies the baseline intensity, while the functions determine the contribution of each previous earthquake at to the intensity at time , and are typically chosen to be monotonic decreasing. This implies that each earthquake causes the process intensity to temporarily increase for a period of time, producing local clusters of events.

The ETAS model can equivalently be viewed as a branching process, as first noted by (hawkes1974cluster). At each time , suppose that events occurred prior to . From Equation 1, the conditional intensity at time is a linear superposition of independent inhomogeneous Poisson processes, where the first is the background process which contributes intensity and the remainder are indexed by each of the previous events, with each contributing intensity respectively. Since these processes are independent, each event time can be assumed to have been generated either by the background process processes triggered by one of the previous events. As a point of convention, we will refer to the events from the background process as immigrant events or mainshocks, with the triggered events being aftershocks.

A visual example of such a branching structure is shown in Figure 1. Events , and are the immigrants that initiate all other events and which are uncaused by any other event in the sequence. The events , and are caused by , while is an aftershock of . Similarly and are aftershock of , while is caused by , which is a aftershock of . There are no detected aftershock events for , although some might occur in the future.

Figure 1: Example of a Branching structure

2.1 Specific functional form.

To complete the specification of the ETAS model, we need to make choices for the functions , and . It is common to take to be the modified Omori law which has been empirically shown to capture the temporal decay of earthquake productivity (utsu1995centenary):

where and are parameters controlling the decay rate, while controls the average productivity. Similarly, many empirical studies (ogata1988statistical; ogata2006space; ogata2011significant; fox2016spatially) have shown that a good choice for is:

Several forms have been proposed for the spatial kernel (ogata2011significant; fox2016spatially), however we will focus on the most widely used version:


with parameters and , which corresponds to long-range power law decay.

Finally, the background intensity defines the long term average seismicity over the spatial region being considered. This is the most difficult function to choose, since it will depend on the particular spatial region being studied. Many studies assume that seismicity is constant over space so that , however this is highly unrealistic since it is known that earthquakes tend to occur along geological fault lines. As such, several methods have been proposed to model and estimate the spatial dependence of , such as (ogata2011significant; fox2016spatially; Gerstenberger2004; Helmstetter2006), which typically uses some kind of kernel smoothing. However the limitation of this approach is that it is difficult to capture the inherent uncertainty in the estimate, which can be especially problematic in regions where only a small number of historical earthquakes are mainshocks from rather than triggered events.

The primary goal of this paper is to develop a novel nonparametric Bayesian method for the efficient estimation of which captures all underlying uncertainty. We will take this up in Section 3 after first discussing the likelihood function for the ETAS model.

2.2 (Log-)Likelihood

The log-likelihood function for a general space-time point process with intensity , observed data

and parameter vector

is given by the following expression (daley2003introduction):


The evaluation of the triple integral term in Equation 3 is slow and can be numerically unstable, and so several approximations are provided in the literature (harte2012bias; ogata1998space; schoenberg2013facilitated; lippiello2014parameter). The calculation of this integral is more feasible if the spatial and temporal kernels and

are reparameterised to be probability densities that integrate to

, which is achieved by splitting into a product of normalisation constants where:


Using similar notation, we write where is a probability density that integrates to 1. Under this notation, the triple integral in Equation 3 can then be approximated as:

This is only an approximation since it assumes that the temporal and spatial domains are infinite, when in fact the point process is defined on the finite region . In schoenberg2013facilitated, it was shown that the assumption of an infinite spatial domain has negligible effect on the likelihood and so we will use this approximation to reduce the computational cost of evaluating the triple integral. However, the assumption of an infinite time domain can be problematic. In kolev2019inference, it was shown that the infinite time assumption provides a poor performance regarding the North California seismic sequence, thus we prefer not to use it whenever possible. As such, we will avoid using it and instead work on the finite temporal region .

In summary, the log-likelihood of the Spatial ETAS model based on the finite time and infinite space assumptions can be approximated by the following expression:


with a parameter vector along with the parameters that definite the background kernel .

3 Nonparametric Estimation of Background Intensity

We will consider a variety of methods for estimating the spatial background rate in the ETAS model. Using the parametrisation above, we can write:

where is a scaling constant and is a probability density that integrates to . We will consider three different Bayesian models for the background rate. In all three, the constant will be estimated separately, with the models varying in terms of how they treat :

  1. , in which case the background intensity is constant over space. This is highly unrealistic due to the known fact that seismic activity is highly spatially dependent, and we use this model only as a baseline.

  2. A nonparametric model where

    is learned using Kernel Density Estimation (KDE). Different versions of this method are fairly common in the seismology literature, albeit in a non-Bayesian context

    (zhuang2002stochastic; marsan2008extending; sornette2009limits; fox2016spatially; Helmstetter2006). Note that fitting a KDE directly to the observed earthquake catalogs will result in a biased estimate of since is specifically a model for the background (immigrant) events only, and not for the triggered events.

  3. A nonparametric model where is learned in a fully Bayesian manner using a spatial Dirichlet Process prior, in a way which distinguishes between background and triggered events. This is substantially more complex than the KDE approach since it require declustering the earthquakes into background and immigrant events, with only the background events used to estimate . This corrects the bias in the KDE approach.

The first model using the uniform density is self-explanatory. We will now discuss the other two in more detail.

3.1 Kde Etas

The second method described above uses kernel density estimation to learn . Suppose we observe observations from the point process. Let be the spatial coordinates of the earthquake. Then a KDE estimate of can be given by:


where is , symmetric and positive-definite which is also referred to as bandwidth matrix and is a symmetric kernel function. Without loss of generality we can choose it to be:

The estimated is then treated as being a fixed constant while the other model parameters including are estimated. We refer to this model as KDE ETAS. Although Kernel Density Estimation is a powerful and flexible nonparametric method, it has two drawbacks (marsan2008extending; sornette2009limits). First, it can be difficult to choose in a manner which produces an accepted level of smoothing across the whole spatial domain rather than under/over fitting in particular regions. Second, the resulting KDE estimate of is based on smoothing over all earthquakes in the historical catalog. However this is not the correct behaviour in the context of an ETAS model, since the function specifies the occurrence of only background, rather than triggered events. As such, we would expect the KDE estimate to be biased, and assign too much probability mass to spatial regions where large magnitude earthquakes have occurred, since their large number of triggered aftershocks will be incorporated into the estimate of . To some extent this problem can be avoided by first declustering the catalog before fitting the KDE as in Helmstetter2006 however it is difficult to choose an adequate bandwidth when fitting to such a declustering catalog, and the inability to incorporate estimation uncertainty into the KDE is problematic when fitting to declustered catalogs where the number of background events may be low. These problems are avoided in the fully Bayesian approach which we will now discuss.

3.2 Dp Etas

To avoid the drawbacks of KDE, we propose estimating using a Dirichlet process mixture prior. The Dirichlet process (DP) was introduced by ferguson1973bayesian; antoniak1974mixtures

as a probability distribution over probability distributions, and is commonly used as a prior in Bayesian nonparametric modelling. If a probability distribution

has a DP prior then we write:

where is the base distribution which defines the expected value of the DP and

is a measure of the variance. The DP is a conjugate prior in the following sense: suppose that

where . Then the posterior distribution of is:

where is a Dirac delta function (dirac1947principles).

A constructive definition of the DP was given by sethuraman1994constructive, who showed that samples from a DP can be written in stick breaking form:

where , , and is the Dirac delta function (dirac1947principles). This provides a practical method for drawing a sample from a DP, by approximating the stick breaking as a finite (truncated) sum:

Combining this with the conjugacy result above, we can hence sample from its posterior distribution given some observed data as:

where and

An alternative representation of the DP is based on the Chinese restaurant process (neal2000markov), which shows that the marginal prior distribution of the samples (with integrated out) can be written as:


The Dirichlet Process as a Spatial ETAS Prior

In this work, we propose to use the Dirichlet Process (DP) as a nonparametric prior for the background ETAS intensity . From the above results, we can see that samples from a DP follow a discrete distribution. In order to adapt the DP to continuous data, it is common to instead use it as a prior distribution for a mixture model. This leads to the following specification:

where is a mixture kernel. This formulation corresponds to an infinite dimensional mixture model where the DP is used as a prior on the mixing distribution parameter.

Since is a two-dimensional spatial distribution, we will model it as a mixture of bivariate Gaussians, where is the mean vector and precision matrix. For conjugacy, we choose to be the Normal Inverse-Wishart distribution. This leads to the following model:

where and are the parameters of the Normal Inverse-Wishart distribution where the mean

follow a Gaussian distribution

and the precision matrix comes from a Wishart distribution (wishart1928generalised)

From here after, this ETAS alternative will be referred to as DP ETAS model.

4 Posterior Simulation

We propose to estimate all model parameters of the ETAS model using Bayesian inference. Given a prior distribution and earthquake catalog data , the resulting posterior is:


The multi-dimensional integral in Equation 5 cannot be solved analytically for the ETAS model. For this reason, we will use simulation methods to approximate the posterior instead.

It may initially seem feasible to use Metropolis-Hastings or Gibbs sampling to draw samples from the posterior. However this are serious grounds to doubt that this will work. Recall from Equation 4, that the log-likelihood of the ETAS model is:


The first problem is that evaluating the likelihood function requires a double summation and this evaluation must take place each time a new parameter value is proposed. As such, direct MCMC would be computationally very demanding, and cannot feasibly be run on a catalog containing more than a few hundred earthquakes. The second problem is that schoenberg2013facilitated studied the performance of frequentist maximum likelihood estimation for the ETAS model based on directly maximising the above likelihood function when was a constant value. They found that the resulting parameter estimates often differed substantially from their true values. This is because the likelihood function is multi-modal and the components of the parameter vector are highly correlated. Since MCMC methods can also suffer from serious convergence issues when the parameters are correlated, it is reasonable to believe that this direct MCMC procedure will suffer from the same problem. Since this problem is already present in the simple parametric case with constant , it will be even worse in the more complex nonparametric setting.

As such, we instead propose a reparametrisation of the model based on latent variables that aims to break the parameter correlation in the likelihood function and lead to an efficient Metropolis-Hastings algorithm for posterior sampling. This is an extension of the method proposed by Markwick2020 for the temporal Hawkes process.

4.1 Latent Variable Formulation

As discussed in Section 2, the ETAS model can be reinterpreted as a branching process in the following sense. Suppose that at time there have been previous events. Then, Equation 1 can be interpreted as showing that the intensity function at time is a sum of different Poisson processes. The first is a time-homogenous Poisson process with intensity , while each of the other processes is triggered by the previous events. Specifically, for each , the event that occurred at time triggers an inhomogeneous Poisson process with intensity:


Now consider an event that occurs at time , where there have been previous events. Based on standard results about the superposition of Poisson processes (daley2003introduction) we can interpret event as having been generated by a single one of these processes. For an earthquake catalog that contains events in total, we hence introduce the latent branching variables where indexes the process which generated :

Conditional on knowing , we can partition the earthquakes into sets where:

so that is the set of immigrant events which were not triggered by previous earthquakes, and is the set of direct aftershocks triggered by the earthquake at time . It is clear that these sets are mutually exclusive and that their union contains all the earthquakes in the catalog. Additionally, we can see that the earthquakes in set are generated by an inhomogeneous Poisson process with intensity , while the events in each set for are generated by a single inhomogeneous Poisson process with intensity given by Equation 7. The ETAS likelihood function from Equation 4 can hence be rewritten (conditional on knowing the latent branching variables) as:


where as before where is a constant and has a KDE or DP mixture prior. From this factorisation, we can see that and are independent of the other model parameters in the likelihood, and hence will be independent in the posterior assuming prior independence. This latent variable formulation hence breaks the dependency which would have made the KDE or DP part of the model difficult to learn, and further weakens the dependence between , and which will allow for more efficient MCMC sampling.

We hence propose a Gibbs sampler which samples the parameters in the following conditionally independent blocks:

Given a set of model parameters at iteration of the Gibbs sampler, we now explain how to sample the next value from the above full conditional distributions.

4.1.1 Sampling

As shown by zhuang2002stochastic in a very different context (stochastic declustering), each individual branching variable can be sampled exactly from its conditional posterior. Note that each can take values only in the discrete set , i.e. each earthquake can be triggered only by either a previous earthquake, or the background process. Assuming a uniform prior on each , the probability of it being caused by any of the processes is simply the proportion of the overall intensity that can be attributed to that process, i.e.:


Each can hence be drawn independently from the discrete distribution on , with weights given by the above (kolev2019inference). Note that unlike the MCMC sampler used in rasmussen2013bayesian for Hawkes processes, this samples exactly from its conditional posterior, which drastically improves computational efficiency.

4.1.2 Update

This step is required only for DP ETAS since is constant in the KDE version of the model. Recall from Section 3.2 that

where is the generating function for the immigrant earthquakes that are assigned to the background process based on the current branching structure.

In order to simulate a value of from its conditional posterior, we first simulate values of from their posterior distributions given the earthquakes which are assigned to the background process, using the usual Chinese Restaurant process sampler. Given these values, we then have from Section 3.2 that:

We can hence sample a value of from its posterior using truncated stick breaking, i.e:

This hence fully defines a realisation of from its posterior. Note that the reason why we need to simulate a realisation of (and hence ) rather than working only with the samples is that we need to have a realisation of to evaluate the branching posterior in Equation 9.

As part of this step, we can also perform an update of the DP hyperparameters of

and of , if we wish to work with the full hierarchical version of the DP. This can be done by assigning them a sensible prior distribution, such as . The updates in this case are standard in the DP literature and are described in detail in (gorur2010dirichlet).

4.1.3 Update the value of

Using Equation 8 we can observe that depends only on the number of events in the background process , hence:

This is equivalent to estimating the intensity of a homogeneous Poisson process on , with event times

. In this case, the Gamma distribution is the conjugate prior:

. The posterior distribution is then which can be sampled from directly (Ross2018).

4.1.4 Update the values of and

Similar to the process for sampling , we can sample new values of and from . Based on Equation 8, we conclude that:

Although there is no conjugate prior in this case, it is straightforward to use (e.g.) random walk MCMC to draw a sample from this posterior as described in the beginning of this section.

4.1.5 Update the values of and

Again, based on Equation 8, we can see that the posterior distribution of and is given by:

The parameter sampling can be done using (e.g.) standard random walk MCMC sampler.

4.1.6 Update the values of and

As a last step of our MCMC sampler we update the offspring kernel space parameters and . The expression below is a simplified approximation that depends on an infinite space approximation which was discussed in Section 2.2

5 Prior Choice, and Implementation Details

A common problem with the estimation and simulation of an ETAS model is that certain parameter values can result in infinitely many earthquakes being generated from the process with non-zero probability. It can be shown (helmstetter2002subcritical) that to guarantee a finite catalog, the average number of aftershocks produced by each earthquake in the catalog must be less than 1. The intuition for this result is that if the average number of aftershocks is greater than 1, then the branching process representation of the process may never converge.

By integrating over the Gutenberg-Richter magnitude distribution, it can be shown (Ross2018) that the average number of aftershocks will be less than 1 if:


where is the parameter in the G-R distribution (Section 2).

As such, we will choose our prior distributions so that positive mass is only assigned to regions where the above parameter relations are satisfied. We choose to use relatively uninformative Uniform priors: , , , , and , with the regions not satisfying the above relations assigned zero mass. Note that the priors for and are slightly informative which is required since these parameters are only weakly identifiable (holschneider2012bayesian)

Finally in the above discussion of the MCMC sampler, we mentioned that random walk Metropolis-Hastings was used to update some blocks of parameters. For these, we used a Normal proposal distribution with standard deviation of 0.1.

6 Model comparison

We have proposed three different versions of the Bayesian ETAS model, which respectively use the Uniform distribution, KDE and a DP mixture model to represent

. In this section we discuss the in- and out-of-sample performance metrics that we will use to compare these models.

6.1 Deviance Information Criterion (DIC)

The DIC is a fully Bayesian alternative to the Akaike Information Criteria (AIC) (akaike1973maximum). It replaces the maximum likelihood parameter estimates of with their posterior mean . The correction associated with the number of used parameters is substituted with a measure of parameter adequacy () (gelman2014bayesian). Given a set of model parameters the model’s DIC value is:

where is the log-likelihood function and is the effective sample size, which evaluates the number of independent samples the MCMC draws are equivalent to. It is defined as:

where indicates the th parameters’ sample in the considered MCMC chain. Alternatively, we can compute the effective sample size as the variance of the obtained log-likelihood values for all sampled parameters as follows:

This method is not as numerically stable as the other one but it is easier to compute because it does not require the allocation of , which is a computationally demanding task with respect to the of DP ETAS model. This measure further guarantees to provide positive values. For all these reasons we will use this alternative of the DIC metric through this paper.

6.2 Out-of-sample log-likelihood

A common way to evaluate the performance of earthquake models is to consider the out-of-sample predictive distributions, i.e. how well we can predict the occurrence time and locations of earthquakes in the time window given that we have fitted a model to the time window . Several versions of this approach have been used in the literature, with a summary given in bray2013assessment.

Since all of our models are Bayesian with completely specified probability distributions, we will compare based on the out-of-sample posterior predictive likelihood (i.e. the likelihood of the future observations averaged over the samples which have been drawn from the posterior).

7 Forecasting

We note that since our models are fully specified, we can use them to create forecasts about future earthquakes, e.g. the probability of an earthquake with magnitude greater than occurring during some future time period . This can be done by simulation, based on the samples drawn from the posterior using MCMC to produce simulated point process trajectories over the time period of interest, and then extracting the quantities to be forecasted as summary statistics. This is essentially a Monte Carlo approximation to the forecasting distribution:

8 Simulation Study

In this section we will use synthetic (simulated) data to evaluate and compare the performance of the three Bayesian ETAS models for using 1) the Uniform distribution, 2) KDE, and 3) a Dirichlet process mixture. The next section will similarly compare them using real earthquake catalogs.

8.1 Initial Comparison

We simulate three catalogs, each with a different choice for the density

. The first density that we consider follows a standard bivariate normal distribution i.e.


The second one is a mixture of two Normal distributions. The first of them has a mean of and the second one . They share a common covariance matrix that comprises zero covariance and standard deviation in each dimension i.e.


The third density aims to simulate a seismic fault - all events are uniformly distributed on a line with known boundary conditions. This requires the specification of a fixed spatial region . We sample uniformly a realisation of the range of . Then we transform it into a point on a line defined by an intercept and a slope and further scale it by an error component i.e.


for , where is the range in dimension and . We chose , , and .

For the remainder of the ETAS parameters, we choose parameters based on the Tohoku District, Japan catalog from over , N and , , E, which were estimated using maximum likelihood by ogata1998space. These are: with immigrant intensity constant and margin of completion .

The same parametrisation is used in fox2016spatially subject to the following amendments - , , and where is an indicator function. These simulations spread over temporal interval .

For our simulation, we choose to set to provide denser catalogs within a shorter period of time. The overall event rate has increased by which allows us to run simulations for a shorter period of time compared to the previously introduced examples. However, this is not going to affect negatively the performance of the remaining parameters.

All simulated catalogs in this section have magnitudes following the G-R law (gutenberg1944frequency) with b-value of 1 i.e. . Hence, all marks are greater than the specific margin of completeness used for the simulation, . Within the simulation study, we set temporal window in with extension interval and magnitude of completeness of .

8.2 Model Fitting and Results

For each of the three datasets, we used MCMC to draw samples from the posterior (after thinning). The branching structure was sampled from its conditional posterior only at every 50 iterations of the latent variable MCMC algorithm, since this is an operation and slower than the other updates. For the DP ETAS model, we also resample the immigrant events density function when a new branching structure is sampled.

For the KDE ETAS model, the estimate of the immigrant spatial density is based on the whole catalog of observations, and so is estimated prior to running the MCMC and set to a fixed value, as in zhuang2002stochastic; marsan2008extending; sornette2009limits; marsan2010new; fox2016spatially.

In the simulation example we developed an out-of sample comparison with respect to 30 out-of-sample periods for each dataset. In order to evaluate the estimate performance we used every parameter set across the sets that were obtained as part of the MCMC procedure. This gives estimates of the out-of-sample log-likelihood for each of the catalog out-of-sample periods. Table 1 shows the average out-of-sample performance for the three models across these catalogs.

5985.64 2495.93 4347.91 -1150.57 -1079.11 -1084.31
5984.60 2372.27 4583.25 -913.29 -862.19 -832.30
5918.93 1780.47 3670.92 -830.16 -777.72 -778.89
Table 1: Comparison between the performance of Unif (U), KDE (K) and DP (D) ETAS models across three uncaused events’ spatial distributions () with respect to the Tohoku District (ogata1998space) MLE estimated based simulated catalogs.

The obtained results for and show that KDE ETAS model outperformed DP ETAS model, and that both outperformed the Fixed ETAS model (Table 1). However, the results obtained based on show that DP is better than KDE with respect to all diagnostic tests.

Since these results are mixed and show that both DP and KDE are capable of outperforming each other depending on the model parameters, we will now develop a larger simulation study to gain insight into the factors which determine when each is most suitable.

8.3 Large Scale Simulation Study

In order to examine further the behaviour of the spatial ETAS models, we created a number of simulated data sets by varying the model parameters. We set , and to be constant. Then we consider: , , and . We exclude the combinations of parameters which result in an expected productivity greater than since these can potentially generate infinite catalogs as discussed in Section 5. This resulted in different parameter sets.

Table 2 shows the results for DP and KDE ETAS on the 63 simulations, using the three specifications for discussed in Section 8.1 . This table presents the number of datasets that allocate either KDE or DP as the best model based on either DIC or out-of-sample maximum likelihood or out-of-sample mean log-likelihood or with respect to all previous metrics (referred as best). We further provided the aggregated counts across all 189 simulations. It can be seen that both the DP and KDE versions of the model can outperform each other for different values of the model parameters. This shows that they are both likely to have value for estimating real-world catalogs.

It is interesting to understand the factors that make DP superior to KDE for particular data-sets, since this will give us a general rule for deciding which one is most appropriate to use. We propose the following hypothesis, which seems intuitively reasonable: since KDE forms its estimate of by using all the earthquakes in the catalog rather than only the immigrant events, we would expect it to perform well either when most earthquakes are immigrants, or when the true distribution of mainshocks is not too dissimilar to the overall distribution of earthquakes in the catalog.

We would expect the DP approach to perform better when is large (since this results in a higher proportion of aftershocks relative to immigrants), and also when the parameters and are large since this results in the distribution of triggered events being spread out over a wider areas, which increases the spatial discrepancy between the mainshock distribution and the overall catalog distribution.

subset model best
KDE 49 54 49 52 47
DP 14 9 14 11 8
KDE 26 33 25 27 23
DP 37 30 38 36 29
KDE 32 35 26 27 24
DP 31 28 37 36 28
All KDE 107 122 100 106 94
DP 82 67 89 83 65
Table 2: Number of datasets that allocate either KDE or DP as the best model based on either maximum log-likelihood or DIC or out-of-sample maximum likelihood or out-of-sample mean log-likelihood or with respect to all previous metrics (best).
Figure 2: Standardised differences of performance metrics of DP ETAS related to KDE ETAS with respect to the logarithmic transformation of every catalog overall Area across the three uncaused events’ spatial densities (Section 8.1). stands for the difference between in-sample log-likelihood values for DP ETAS minus KDE ETAS; stands for the difference between DIC values for KDE ETAS minus DP ETAS; stands for the difference between out-of-sample log-likelihood values for DP ETAS minus KDE ETAS. For ease of display all values are re-scaled to follow a zero mean, unit variance Normal distribution. The three solid lines on each sub-plot represent the fitted lines of the pattern with respect to the three discussed difference (in their respective colours). The horizontal dashed line indicate the threshold for which DP ETAS will be considered to outperform KDE ETAS.

To test this hypothesis, we will try to create a single measure which represents the discrepancy between and the overall catalog distribution. This relationship is primarily influenced by the overall area that the catalog spans. Since all immigrant events are restricted to lie within the same area (), the overall area of the catalog is driven primarily by the values of and and , which affects how the triggered events spread out relative to the immigrants. As such, we compute the resulting areas for each of the catalogs which were simulated by varying the parameter values. In Figure 2, we plot how the area of the catalog relates to the degree to which DP performed KDE. Specifically, we plot the difference in the DIC and out-of-sample log-likelihood values between DP and KDE, as a function of catalog area. It can be seen that there is a clear relationship between the performance measures and the overall area of a catalog - a larger area is associated with a better performance of DP ETAS. The correlation between these measures and the proportion of immigrant events () is -0.27 and -0.26 for the DIC and ML differences respectively. This largely confirms our previous hypothesis; in general, the DP model outperforms KDE when there are a large number of aftershocks that are spread out over a wide area, while KDE performs best when the number of aftershocks is smaller and more localised.

9 Real earthquake sequences

In this Section we explore the performance of spatial ETAS model(s) across four different real earthquake catalogs. Since the Uniform model is unrealistic and performs substantially worse than the other two, we will not consider it further and instead focus on comparing the KDE and DP models. A summary of the results is given in Table 3.

9.1 Vrancea, Romania

Vrancea is an area in Romania that has a strong seismic influence on South-Eastern Europe. On a large magnitude earthquake occurred which caused substantial destruction and human loss in both Bulgaria and Romania. We analysed the earthquake catalog from to that covers the spatial region , N and , E with a magnitude of completeness of . The data was split into a training set over the period with events and a test set over the period with comprises events. The data was obtained from the United States Geological Survey (USGS) catalog (

The goodness-of-fit information is shown on Table 3. It can be seen that DP ETAS outperforms KDE when it comes to out-of-sample forecasting of the test set period. Interestingly, KDE does better when performance is evaluated using the DIC. Since this is computed using only the in-sample period, this suggests that KDE is overfitting to the data, while the DP method is estimating a more parsimonious model which results in superior forecasting.

9.2 Zakynthos and Kefalonia, Greece

Zakynthos and Kefalonia are subject to prolonged seismic activity. The area of interest spans , N and , E. The most important event in the region is the M Ionian earthquake that occurred on . Including data from this period is very challenging due to the high magnitude of catalog completeness caused by poor quality seismic detection equipment. For this reason we focused our study on a more recent time period from to and chose a magnitude of completeness of to ensure consistency throughout the catalog. The data was split into a training set with events and a test set that comprises events. The data was obtained from the United States Geological Survey (USGS) catalog ( The results in Table 3 are the same as above: DP ETAS performs better for forecasting the test set, while KDE has a superior in-sample performance. Again, this suggests that KDE is overfitting and that the DP approach is more useful for forecasting.

9.3 Friuli, Italy

Friuli is an area in Italy that is primarily known for the M earthquake that occurred on , followed by multiple aftershocks with considerably large magnitudes. We based our study on the earthquake occurrence from to that covers the area within , N and , E with minimum magnitude of . For inferential purposes we split the data into a train set with events and a test set that comprises of events. The data was obtained from the United States Geological Survey (USGS) catalog (

As before, the goodness-of-fit results are shown on Table 3 and show the same pattern as above: DP ETAS performs better when it comes to out-of-sample forecasting, due to KDE overfitting to the training set.

9.4 Central Italy

In 2006, Central Italy suffered a particularly damaging magnitude 6.2 earthquake that caused the death of nearly people (luzi2017central). Although the type of point process models discussed in this paper are not appropriate for predicting the occurrence of individual earthquakes, they are particularly useful for forecasting patterns in aftershock sequences. To this end, we obtained earthquake data from the Italian National Institute of Geophysics and Volcanology (Instituto Nazionale Di Geofisica e Vulcanalogia from to that spans Italy within , N and , E with magnitude of completeness taken to be . Then we split the data into a training set from to ( events) which was used to estimate the model parameters, followed by a test set from to ( events) used to measure performance.

The model comparison results are shown on Table 3. In this case, the KDE approach outperforms DP ETAS when it comes to out-of-sample forecasting. Investigating the catalog more closely, we found that most of the detected earthquakes were localised in a fairly small shore area. As discussed in the previous section, we would hence expect the KDE approach to perform particularly well here.

Vrancea 2710.23 2731.23 -43.88 -37.86
Zakynthos 846.51 903.64 -36.66 -31.00
Friuli 946.66 995.58 -36.36 -29.11
Italy - 6318.75 -7403.23 11580.85 11552.51
Table 3: KDE and DP based spatial ETAS model comparison across real catalogs. Lower values of the DIC and larger (less negative) values of the out-of-sample likelihood indicate superior performance. The large value of the likelihood for the Central Italy catalog is due to the very large number of events compared to the other catalogs.

10 Conclusions

The classic frequentist methods commonly used in seismic forecasting typically assume that model parameters are known exactly, which can result in forecasts which are overly confident. To mitigate this, Bayesian approaches are becoming more common in seismology. Despite being one of the most popular forecasting models, the ETAS framework has rarely received a fully Bayesian treatment, and even studies which attempt Bayesian forecasting often up resorting to frequentist-style plug-in estimates of the model parameters (omi2015intermediate; ebrahimian2013adaptive). This is due to the highly complex nature of the posterior distribution. In this work, we have introduced a new posterior sampling scheme which can be scaled up to catalogs containing thousands of earthquakes, and demonstrated its efficiency. Our approach can easily be deployed on realistic seismic catalogs containing thousands of earthquakes.

In this work, we explored the most commonly used version of the spatial ETAS model and showed how a nonparametric Dirichlet process prior could be used to allow fully Bayesian inference for the underlying spatial density. A comparison to a fixed Kernel Density Estimate of this density showed very promising performance in out-of-sample forecasting tasks. From our experiments on synthetic catalogs, we are able to draw the conclusion that the DP approach performs best when there are a large number of triggered events with a spatial distribution that is spread out over the region, while KDE is more suited to catalogs where all earthquakes occur in a compact area.