# Prediction of aftershocks with GPR

Uncovering the distribution of magnitudes and arrival times of aftershocks is a key to comprehend the characteristics of the sequence of earthquakes, which enables us to predict seismic activities and hazard assessments. However, identifying the number of aftershocks is very difficult due to contaminations of arriving seismic waves immediately after the main shock. To overcome the difficulty, we construct a likelihood based on the detected data incorporating a detection function to which the Gaussian process regression (GPR) is applied. The GPR is capable of estimating not only the parameter of the distribution of aftershocks together with the detection function, but also credible intervals for both of the parameters and the detection function. A property of both the Gaussian process and a distribution of aftershocks are exponential functions leads to an efficient Bayesian computational algorithm to estimate the hyperparameters. After the validations through numerical tests, the proposed method is retrospectively applied to the catalog data related to the 2004 Chuetsu earthquake towards early forecasting of the aftershocks. The result shows that the proposed method stably estimates the parameters of the distribution simultaneously their uncertainties even within three hours after the main shock.

## Authors

• 5 publications
• 3 publications
• 3 publications
• 8 publications
• 2 publications
• 2 publications
06/14/2020

### Prediction of aftershocks with Gaussian process regression

Uncovering the distribution of magnitudes and arrival times of aftershoc...
02/06/2015

### Marginalizing Gaussian Process Hyperparameters using Sequential Monte Carlo

Gaussian process regression is a popular method for non-parametric proba...
02/12/2018

### Gaussian Process Classification with Privileged Information by Soft-to-Hard Labeling Transfer

Learning using privileged information is an attractive problem setting t...
04/16/2020

### Gaussian Process Learning-based Probabilistic Optimal Power Flow

In this letter, we present a novel Gaussian Process Learning-based Proba...
12/15/2021

11/16/2018

### Efficient Neutrino Oscillation Parameter Inference with Gaussian Process

Neutrino oscillation study involves inferences from tiny samples of data...
12/04/2017

### Gaussian Process Regression for Arctic Coastal Erosion Forecasting

Arctic coastal morphology is governed by multiple factors, many of which...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A large earthquake triggers a number of aftershocks. There have been proposed reasonable models to describe the distribution of aftershocks with temporal information as well as magnitudes such as the Omori-Utsu (Omori, 1894; Utsu, 1961) and the Gutenberg-Richter (Gutenberg and Richter, 1944) formulae. The distribution of aftershocks enables us to forecast seismic activities and hazard assessments (Resenberg and Jones, 1989, 1994; Kagan and Jackson, 2000). Until now, many statistical methods to estimate parameters involved in the models have been proposed (Aki, 1965; Ogata, 1983). The disadvantages in these methods are that they assume the existence of complete data without missing, though detecting all aftershocks immediately after the main shock is impossible due to contaminations of a tremendous amount of seismic waves. Such incomplete data cause underestimations in the counting of aftershocks.

In statistics, situations only some detected data being available are known as biased sampling problem (Vardi, 1982, 1985)

. In our case, detection probability of aftershocks clearly depends on the magnitudes and elapsed times from the main shock. This type of biased sampling data is known as missing not at random (MNAR), in which the detection probability depends on the values of undetected data. Introducing a detection function, which is a model of the detection probability, enables us to correct the bias

(Qin, 2017)

. Several studies have tackled this problem by introducing a parametric model on the detection function for aftershocks

(Ringdal, 1975; Ogata and Katsura, 1993, 2006; Omi et al., 2013, 2014, 2015a, 2015b). The detection function enables us to construct valid estimators in MNAR. Three problems remain to be solved: (i) the resulting estimators are often unstable; (ii) misspecification of the detection function causes bias; (iii) estimation of the detection function is difficult even with a correct model. The first problem (i) arises from the simultaneous estimation of the detection function and the distribution of aftershocks. The second problem (ii) is obvious because the bias correction strongly depends on how close the defined detection function to the true one is. The third problem (iii) is because some integration is required in the likelihood, which makes estimations difficult in biased sampling problems.

In this study, we propose a nonparametric Bayesian estimator to overcome all the three problems. Appropriate prior information considering characteristics of seismic activities in a target area improves the first problem (i). A technique of Gaussian process regression (GPR), which is nonparametric Bayesian estimation, models the detection function, simultaneously solving the problems (i) and (ii). Owing to the nonparametric modeling, the detection function does not need to have a specific functional form. The GPR has been often used especially in the recent machine learning and deep learning researches due to its flexibility and wide coverage of function spaces

(Rasmussen et al., 2006; de G. Matthews et al., 2018). As for computation of the model parameters, we propose an efficient Bayesian estimation algorithm utilizing the property that both the GP and the distribution of aftershocks are exponential functions, which are compatible to compute solving the third problem (iii).

Another advantage in GPR is that it is capable of evaluating uncertainty of the estimated parameters naturally, which has been hard in the previous works though knowing the uncertainty is inevitable to decide statistical decision. In summary, the proposed method can solve the three problems and has an additional property that can estimate the uncertainty of the parameters.

This paper is organized as follows. Section 2 introduces the GPR, and proposes a method to estimate parameters for distribution of aftershocks with a detection function through the GPR. Section 3 validates the proposed method through numerical tests. Section 4 demonstrates the effectiveness of the proposed method by applying to the catalog data of 2004 Chuetsu earthquake. Section 5 concludes the present study including perspective future works.

## 2 Methodology

The detection function proposed in this paper that models temporal changes of the detection rate of the aftershocks right after a main shock bases GPR. This section gives first a brief explanation related to the GPR, then introduces the proposed method especially its theoretical properties and an efficient computational algorithm.

### 2.1 Gaussian process regression

Some recent studies in the solid Earth science used the GPR to construct models from given data in the cases that the physical or chemical process that produced the data was unknown or too complicated. The GPR estimates a regression function simultaneously with its uncertainty through Bayesian nonparametric estimation. For example, Kuwatani et al. (2018)

proposed to apply the GPR to interpolate the observed quantities of chemical compositions along with the radius of a rock. The estimated uncertainty often provides valuable information for observational or experimental design, such as a suggestion of times and/or places of the next new observations or measurements.

In this study, GPR is used as a tool for Bayesian nonparametric estimation for a regression function. Let and

be sets of explanatory and response variables, respectively, and both are assumed to relate each other through an unknown regression function

, i.e., , where denotes the number of time points. In this paper, we use the notation like to abbreviate the arguments by a dot emphasizing that is a function. The GPR estimates the function

from the given a dataset assuming a kernel function mentioned below. In probability theory and statistics, a Gaussian process (GP) is a stochastic process such that every finite collection of those random variables has a multivariate normal distribution, e.g.

follows a multivariate normal distribution. One can think of the GP as defining a distribution over functions for a target function that promotes effective Bayesian estimation giving a priori information of the function (Rasmussen et al., 2006). The GP is usually denoted as GP, , where is mean function and

is variance function or “kernel”. The mean function

is often assumed as identically zero since the mean is adjustable by subtracting the sample mean of

. The radial basis function is often chosen among various candidate functions for the variance function or kernel

:

 K(x1,x2)=ϕ1exp{−(x1−x2)2ϕ22}(ϕ1,ϕ2>0), (1)

where and are arbitrary real numbers. This study also adopts the radial basis function for the kernel since it adequately covers an infinite-dimensional function space of with only a few hyperparameters. See Rasmussen et al. (2006) for the detailed explanation of GP including other kernel functions.

Estimation of a function is equivalent to that of the value of at any fixed point , where the superscript “” is used to discriminate the fixed point from the data points. For any point a posterior distribution of given a dataset

is called the predictive distribution. By using the Bayes’ theorem, calculation of the predictive distribution is given by

 p(f∗∣x∗,D)=∫p(f∗∣x∗,f,D)p(f∣D)df, (2)

where and

are the conditional probability density function (PDF) of

given the fixed point , unobserved , and the dataset The PDF is the prior on given by a multivariate normal distribution as mentioned above. In this case, the predictive distribution can be explicitly obtained and becomes a normal distribution again because both and are normal distributions. Therefore, the predictive distribution for a point is a normal distribution with mean and variance computed as

 μf(x∗)=κ⊤∗K−1y,σ2f(x∗)=κ∗∗−κ⊤∗K−1κ∗,

where and . The maximization of the marginal likelihood

 n∏i=1p(yi∣xi;ϕ)=n∏i=1∫p(yi∣xi,f;ϕ)p(f∣xi;ϕ)df. (3)

determines the hyperparameters , where the integration in the right-hand side is explicitly computable since the integrand given as the product of normal distributions and is again a normal distribution.

Figure 1 shows the estimated predictive distributions from six data points with changing hyperparameters and in the kernel (eq. 1

). As mentioned above, the GPR successfully obtains not only the mean function but also its standard error or credibility. Figure

1 also indicates and strongly associate with scale and shape of the regression function, respectively. A comparison between Figures 1(a) and 1(c), or 1(b) and 1(d) indicates that small/large means a small large/credible interval. Another comparison between Figure 1(a) and 1(b), or 1(c) and 1(d) shows that small/large means an oscillating/smoothed regression function. These results indicate the importance of deciding the hyperparameters.

### 2.2 Notation and models

According to the Omori-Utsu law, aftershock occurrence rate at elapsed time from the main shock follows a non-stationary Poisson process (Omori, 1894; Utsu, 1961):

 n(t;τ)=K(t+c)p,

where

is a vector containing all the model parameters, i.e.,

. The parameter controls the level of seismic activity, i.e., large/small reflects the large/small number of aftershocks. The parameter is the slope of the occurrence rate in the logarithmic scale. It is known that the occurrence rate is saturated over a period of time right after the main shock (Utsu, 1961; Ogata, 1983), and the parameter determines the time length of the saturation. According to the Gutenberg-Richter’s law, intensity rate of the magnitude is modeled as a function proportional to an exponential function (Gutenberg and Richter, 1944):

 m(M;b)=A10−bM∝exp(−βM), (4)

where is a constant, and (or ), which is of our most interest, controls the amplitude of the magnitudes. Combining these two laws, the joint occurrence rate of aftershocks as a function of elapsed time and magnitude is represented by the product of and as

 λ(t,M;τ,β)=K′(t+c)pβe−β(M−M0), (5)

where and is the magnitude of the main shock used to adjust the scale of (Utsu, 1970). Note that even if is obtained, can not be uniquely identified due to the unidentifiability of the product . Resenberg and Jones (1989, 1994) found that estimation of enables us to forecast seismic acitivities, so that the identifiability problem does not matter in this context. Hereafter we use instead of for notational simplicity.

An exact count of all aftershocks right after a main shock is very difficult due to contaminations of arriving seismic waves, so that the occurrence rate of aftershocks are almost always underestimated. In order to correct the bias, the present study adopts probit-type detection function as is used in Ringdal (1975), Ogata and Katsura (1993), Ogata and Katsura (2006), and Omi et al. (2013, 2014, 2015a, 2015b):

 π(t,M;μ,s) = P(δ=1∣t,M;μ,s) (6) = ∫M−∞1√2πs2exp{−(x−μ(t))22s2}dx,

where is a detection indicator that takes 1/0 if the aftershock is detected/undetected, is the magnitude that makes aftershock detectable with probability 50% at elapsed time , and is a scale parameter. Roughly speaking, the function is decreasing with respect to the elapsed time because only large aftershocks are detectable immediately after the main shock and even small aftershocks are detectable after enough time passes.

To make the difference between complete and detected data clear, let each and be the elapsed time and magnitude of a detected aftershock, respectively. The subscript “1” indicates the detected data, i.e., . Note that, in this study, complete data and might not be detected, but and are always available. Supposing that aftershocks are detected, let the pair of elapsed time from the main shock and magnitude of -th aftershock be . By using technique of the thinning operation or the random deletion in point processes (Ogata and Katsura, 1993), the likelihood function for the detected magnitudes can be written by

 n∏i=1L(M1i∣t1i;β,μ,s2) =n∏i=1e−βM1iπ(t1i,M1i;μ,s2)∫∞−∞e−βMπ(t1i,M;μ,s2)dM =n∏i=1βexp{−β(M1i−μ(t1i))−12β2s2}π(t1i,M1i;μ,s2), (7)

where the function is assumed to be known, although it is estimated later in practice. Let and be the estimated values for and obtained by maximizing the likelihood. With estimated and , can be estimated by maximizing the log-likelihood function for detected elapsed times within any time interval (Ogata and Katsura, 1993),

 ∑0

where is an intensity function for detected data defined by

 ν(t;τ,β,s) =∫∞−∞λ(t,M;τ,β)π(t,M;s)dM (9) =K(t+c)pexp{−β(μ(t)−M0)+12β2s2}. (10)

This paper proposes a more efficient estimation method for in Section 2.4, obtained from an investigation of the problem in the estimations of and as mentioned in Section 2.3.

However, it remains one serious problem that is unknown. In frequentist ways, Ogata and Katsura (1993) applied a B-spline basis function to , and Ogata and Katsura (2006) proposed a specific parametric model based on the 2003 Miyagi-Ken-Oki earthquake as

 μ(t)=a0+a1exp{−α(3+lnt)γ},

where , , , and are parameters to be estimated. Recently, Omi et al. (2013) proposed a flexible nonparametric Bayesian estimation. They assumed a prior on , and compute the posterior mean. Their method does not require any specification of and, because of Bayesian estimation, it can naturally incorporate the prior information of , which considerably makes estimates stable. The distribution of can be easily computed in the same area before the main shock. However, it is hard to estimate credibility of neither nor by their method. Estimating the credibility of the estimator is inevitable to decide statistical decision.

### 2.3 Estimation for β, s2, and π(⋅)

In this study, we put a Gaussian process (GP) prior on the “function” , not on the “points” . The GP prior leads to an explicit form of the marginal likelihood of hyperparameters and posterior distribution of as seen in the following discussion.

A graphical model for complete data is shown at the left panel in Figure 2. The graphical model illustrates the relations among data, parameters, and hyperparameters in accordance with Omori-Utsu and Gutenberg-Richter’s laws and the detection function, which enables us to grasp the notation and models in this study. Each green, red, blue color indicates the relation by Omori-Utsu law, Guterberg-Richter’s law, and the detection function, respectively. To avoid the problem, we focus on the partial likelihood, which is the distribution of detected magnitude given , derived in eq. (7). Once the graphical model for the detected data was obtained, one would realize that relation among , , and is exactly the same as the regression. Based on this idea, we put a Gaussian process prior on , that corresponds to in Section 2.1, and consider nonparametric bayesian estimation. However, unlike the GPR explained in Section 2.1, the distribution (eq. 7) is not a normal distribution, and the predictive distribution shall be more complicated.

The Gaussian process assumption on is denoted by , where is defined as

 K(x1,x2)=ϕ0+ϕ1exp{−(x1−x2)2ϕ22}(ϕ1,ϕ2>0).

As for , some reasonable function should be chosen because adjusting the mean of the prior to identically zero is impossible unlike the usual GPR. In this paper, a simple linear model is used as the mean of the prior distribution. The unknown parameters and are estimated by maximizing eq. (7) with the simple linear model. It can be seen later that this simple prior works well in the numerical experiments in Section 3. One difference between the above kernel and eq. (1) is the first term . This term prevents from being a nonsingular matrix and is set very small value (e.g. ) so that it does not affect estimation of the other parameters. The additional parameters to be estimated are only and , so that the remaining parameters are . Let be the known parameters prescribing the distribution of , i.e. subjective priors.

Predictive distribution of and the marginal likelihood of the hyperparameters are computable as follows. The proof is relegated to the Appendix.

• Marginal likelihood of :

 p(D∣θ) =p(θ;θ0)∫p(M1∣t1;β,μ,s2)p(μ∣t1;ϕ1,ϕ2)dμ =βnexp{−βn∑i=1(M1i−μ0,i)−12β2(ns2−∑i,jKi,j)} ×p(θ;θ0)∫MN(x; ~μ,~K)dx, (11)

where , , is the integration interval, , , is the by identity matrix, and is density function of the normal distribution with mean and variance .

• Predictive distribution of :

 p(μ∗∣t∗1,D) =∫p(μ∗∣t∗1,μ,D)p(μ∣D)dμ =Etrunc{N(μ∗; D∗(X), (τ∗)2)}, (12)

where is expectation on a variable , is the multivariate normal distribution with mean , variance , and truncated outside a region ,

 D∗(x)=(x+s2K−1~μ)⊤{τ2~K−1+~K−1κ∗κ⊤∗~K−1}κ∗κ∗∗,

, , and . In particular, mean and variance of the predictive distribution are and , where . The symbol “” is used to explicitly represent that the variable depends on the value .

• Predictive distribution of detection probability :

 π∗(M∗1,t∗1) =∫P(δ=1∣M∗1,μ∗,D)p(μ∗∣t∗1,D)dμ∗ =Etrunc{Ψ(M∗1−D∗(X)√s2+(τ∗)2)}, (13)

where

is the cumulant distribution function of the standard normal distribution.

Although both the marginal likelihood and the predictive distribution include integration, the integration can be regarded as expectation on some truncated multivariate normal distribution. This property is very important, and it reminds us to come up with the following computational algorithm to obtain the posterior samples of .

### 2.4 Estimation for τ

Recall that the log-likelihood function for is given in eq. (8), but is unknown. One possible way, as in the previous studies (Ogata and Katsura, 1993; Omi et al., 2013), is replacing , , and with estimated one. However, we now have obtained a predictive distribution of the detection function itself as given in eq. (12). Therefore, it is possible to replace the detection function (eq. 9) with the predictive one, and expected this new approach provides more efficient estimates since it uses full information of the estimated detection function, while the previous method uses only information of .

By replacing the detection function (eq. 9) with the predictive one, we have

 ν∗(t;τ,β,s) =∫∞−∞λ(t,M;τ,β)π∗(t,M;s)dM =K(t+c)pexp{βM0+12β2(s2+(τ∗)2)} ×Etrunc[exp{−β(D∗(X))}]. (14)

The known truncated multivariate normal distribution makes it possible to approximate the computation of the expectation with the Monte Carol method. As for integration on in eq. (8), we use a left Riemann sum with , e.g. , rectangles of equal width.

### 2.5 Computational algorithm

Both the predictive distribution and the marginal likelihood involve integration on multivariate normal distribution over an -dimensional hyperrectangle . It is known that explicit computation of the integral is very hard even when the dimension is low (Genz and Bretz, 2009). We divide the estimation into two parts: (i) estimation of hyperparameters ; (ii) computation of mean and variance of the predictive distribution.

We first propose a sampling method to obtain from the posterior distribution of by using the data-augmentation method (Tanner and Wong, 1987). Regarding a variable in eq. (11) as latent variables leads to another representation of the marginal likelihood

 p(D∣θ)=∫Mp(D,x∣θ)dx,

where

 p(D,x∣θ) =βnexp{−βn∑i=1(M1i−μ0,i)−12β2(ns2−∑i,jKi,j)} ×N(x; ~μ,~K).

With the idea of Gibbs sampling, augmented data can be sampled from following two steps:

1. Draw ;

2. Draw ,

Omi et al. (2013) used normal distribution with subjective hyperparameters estimated from the prior knowledge as a prior of . A non-informative constant prior is assumed for that of .

We use Gibbs sampling for sampling from the truncated normal distribution (Geweke, 1991, 2005). In R programming language, a package tmvtnorm (Wilhelm and Manjunath, 2015) can be used to sample from truncated normal distributions. As for sampling of

, the classical random walk MCMC (Markov Chain Monte Carlo) method

(Metropolis et al., 1953) works. For example, in the next section, a random walk MCMC with normal distributions as the proposal distribution is used. By repeating the two steps alternately, obtained becomes samples from posterior of . Hyperparameters are estimated as the median of obtained samples. Once samples of are obtained, can be estimated by its sample mean since is just the mean of the truncated normal distribution. Therefore, mean and variance of the predictive distribution is also computable with estimated .

## 3 Numerical experiments

We conduct numerical tests under two cases of synthetic observation data in order to illustrate the performance of the proposed method. The two synthetic datasets of elapsed time were generated from the Omori-Utsu and the Gutenberg-Richter laws with the parameters given in Table 1. Assume that the main shock is . As for the prior distribution for , is used, but, we put no prior information on since the resulting estimates were stable without any prior distributions on . The detection function in the two cases is

1. ;

2. (ii)

where the scale parameter is common within the two cases set to . In the first case, is simple and strictly decreasing function, but in the second case, it is more complex. This supposed in mind that a large aftershock excites many subsequent aftershocks as pointed out in Omi et al. (2013). In Figure 3, the two functions and occurrence rate calculated with synthetic data are shown. We first estimate the predictive distribution of and , and their credibility with data detected within 3, 6, 12, 24 hours. The number of detected aftershocks are 483, 668, 875, and 1135 at 3, 6, 12, and 24 hours in Case 1; 275, 539, 695, and 925 in Case 2. With the estimated , , and , is estimated by the method stated in Section 2.3.

Figure 4 shows estimated predictive distribution of . Mean of the predictive distribution is quite close to the true and inside a region within , where is the estimated standard error of . The resulting estimates with data (a) does not change so much with (d) , which shows robustness of our proposed method for small datasets. It is worth noting that even if mean of the prior distribution (green line) is far from the true one (black dashed line), the predictive distribution can estimate the true considerably well. The estimated and are also reported in Table 2. The posterior distribution includes the true -values even data with in both cases, though in Case 2, the estimated -value is somewhat underestimated, possibly due to the sampling bias. As the elapsed time increases, the center of posterior samples closes to the true value, and the length of the credible interval becomes shorter. Table 2 also shows that comparison of estimated with different two methods: (i) proposed method that replaces the detection function with predictive one in eq. (9); (ii) previous method that replaces with estimated in eq. (10). The estimates are very similar, but in Case 2, it seems that it fails to estimate and (or) values with data , while our proposed method gives stable estimates as expected in Section 2.3.

## 4 Real data analysis

The proposed method is applied to the real catalog data related to the 2004 Chuetsu earthquake officially released from the Japan Meteorological Agency (JMA). The 2004 Chuetsu earthquake (magnitude , epicenter  N,  E) occurred in Niigata prefecture, Japan. Figure 5 shows the spatial distribution of aftershocks that occurred within 24 hours from the main shock. The dataset is perfectly the same as used in Omi et al. (2015b), in which the aftershocks occurred in a rectangle area having the lengths four times the Utsu-Seki aftershock zone for latitudinal and longitudinal directions are selected. The Utsu-Seki aftershock zone is a rectangle region, in which the epicenter is located at the center, having the angle lengths of for both latitudinal and longitudinal directions, where is the Utsu-Seki aftershock zone length defined using the magnitude of the main shock as (Utsu, 1969). In the case of the 2004 Chuetsu earthquake, is .

Figure 6 shows the results of applications to the real catalog data. Following the same procedure in the numerical tests in Section 3, the proposed method estimates the mean of the predictive distribution with the standard error starting from the prior , assuming that the data available for 3, 6, 12, and 24 hours from the main shock. The number of aftershocks during each of the elapsed periods is 192, 355, 655, and 1099, respectively. The magnitudes of aftershocks are ranging from 0.8 to 6.6 by the elapsed time of 24 hours.

The is found to not simply decrease but turn to increase at beyond the estimated credible interval . One possible reason for this may be the fact that a large aftershock occurred at with the magnitude of 4.8. The proposed method can extract such a hidden structure without missing, owing to a stable estimation of .

The estimated by Omi et al. (2014) shown in Figure 6 for comparison seems to be unstable when available data were insufficient (Figures 6(a) and 6(b)), which is caused by some approximation methods for the marginal likelihood (supporting information of Omi et al. (2014)), although it is improved when the available data were extended to 24 hours. A similar phenomenon is also seen in Figure 3 in Omi et al. (2013). The proposed method has been improved to make the estimation stable excluding any approximation for the marginal likelihood (eq. 11). Figure 7(a) shows the estimated predictive detection function (eq. 13) in the case of the 2004 Chuetsu earthquake, and Figure 7(b) plots its cross-sections at the magnitudes , , , and . These magnitudes correspond to the minimum (th percentile), th percentile, median (th percentile), and th percentile in the catalog data, where th-percentile is the magnitude below which of all the aftershocks are found. Figure 7(b) indicates that aftershocks with magnitudes larger than 2.3 are completely detected after , except when the detection probability is slightly less than 1 probably due to a large aftershock mentioned above.

Table 2 summarizes mean and standard error of the posterior distribution of within elapsed time , , , and hours. The credible interval for the -value computed with the mean and standard error includes the true -value irrespective to the time available. On the other hand, the estimates of and with 24 hours are significantly small compared to the estimates with . This indicates our method is successful in detecting the changes of seismic activity due to the large aftershock at .

## 5 Concluding remarks

Immediate prediction of seismic activities after the main shock is important to assess hazards for subsequent aftershocks. Contaminations of arriving seismic waves right after the main shock interfere with counting the number of the aftershocks correctly. so that the detected number of the aftershocks is underestimated. This underestimated count of the aftershocks causes distorted estimates for the distribution of the aftershocks or the seismic activities. In order to solve this problem, we introduced a detection function for the aftershocks through GPR to remove the effects of undetected aftershocks. Owing to the nonparametric and Bayesian property of GPR, the proposed detection function has four advantages superior to the previous methods: (i) the resulting estimates are stable by virtue of a subjective prior; (ii) specification of the detection function is not required; (iii) MCMC sampling is effective to compute the hyperparameters without computation of complicated integration; (iv) credible intervals can be obtained in a natural way.

The limitation of the proposed method is in the assumption on the joint intensity function (eq. 5). It is known that real catalog data should be described by more complicated intensity functions such as represented by Epidemic Type Aftershock Sequence (ETAS) models (Ogata, 1988). The proposed method can be extended straightforwardly to ETAS models as done by Omi et al. (2014), which remains as a future work.

###### Acknowledgements.
This work was mainly supported by Tokyo Metropolitan Resilience Project of National Research Institute for Earth Science and Disaster Resilience (NIED), and partially supported by JST CREST Grant Numbers JPMJCR1763 and JPMJCR1761 of Japan Science and Technology Agency (JST). The key ideas in this study came through the activities of KAKENHI Grant-in-Aids for Early-Career Scientists No. 19K14592, 19K14671, 20K19756, Grant-in-Aids for Scientific Research (B) No. 17H01703, 17H01704, 18H03210, and Grant-in-Aids for Scientific Research (S) No. 19H05662. The travel expense needed to discuss among co-authors was partially supported by ERI JURP 2020-A-05. One of the figures was drawn by using the General Mapping Tools (GMT) developed by (Wessel and Smith, 1998).

## References

• K. Aki (1965) Maximum likelihood estimate of in the formula and its confidence limits. Bull. Earthq. Res. Inst. Univ. Tokyo 43, pp. 237–239. Cited by: §1.
• A. G. de G. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani (2018)

Gaussian process behaviour in wide deep neural networks

.
arXiv preprint: 1804.11271. Cited by: §1.
• A. Genz and F. Bretz (2009) Computation of multivariate normal and probabilities. lecture notes in statistics.. Heidelberg: Springer. Cited by: §2.5.
• J. F. Geweke (1991) Efficient simulation from the multivariate normal and student- distributions subject to linear constraints. Computer Science and Statistics. Proceedings of the 23rd Symposium on the Interface. Seattle Washington, April 21-24, 1991, 571–578. Cited by: §2.5.
• J. F. Geweke (2005) Contemporary bayesian econometrics and statistics. Wiley & Sons. Cited by: §2.5.
• B. Gutenberg and C. F. Richter (1944) Frequency of earthquakes in california. Bull. Seism. Soc. Am. 34, pp. 185–188. Cited by: §1, §2.2.
• Δ. Y. Kagan and D. D. Jackson (2000) Probabilistic forecasting of earthquakes. Geophys. J. Int. 143, pp. 438–453. Cited by: §1.
• T. Kuwatani, H. Nagao, S. Ito, A. Okamoto, K. Yoshida, and T. Okudaira (2018) Recovering the past history of natural recording media by bayesian inversion. Phys. Rev. E 98, pp. 043311. Cited by: §2.1.
• N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, and E. Teller (1953) Equations of state calculations by fast computing machine. J. Chem. Phys. 21, pp. 1087–1091. Cited by: §2.5.
• Y. Ogata (1983) ESTIMATION of the parameters in the modified omori formula for aftershock frequencies by the maximum likelihood procedure. J. Phys. Earth 31, pp. 115–124. Cited by: §1, §2.2.
• Y. Ogata (1988) Statistical models for earthquake occurrences and residual analysis for point processes. J. Amer. Statist. Assoc. 83, pp. 29–39. Cited by: §5.
• Y. Ogata and K. Katsura (1993) Analysis of temporal and spatial hetero- geneity of magnitude frequency distribution inferred from earthquake catalogs. Geophys. J. Int. 113, pp. 727–738. Cited by: §1, §2.2, §2.2, §2.2, §2.4.
• Y. Ogata and K. Katsura (2006) Immediate and updated forecasting of aftershock hazard. Geophys. Res. Lett. 33, pp. L10305. Cited by: §1, §2.2, §2.2.
• T. Omi, Y. Ogata, Y. Hirata, and K. Aihara (2014) Estimating the etas model from an early aftershock sequence. Geophys. Res. Lett. 41, pp. 850–857. Cited by: §1, §2.2, §4, §5.
• T. Omi, Y. Ogata, Y. Hirata, and K. Aihara (2015a) Intermediate-term forecasting of aftershocks from an early aftershock sequence: bayesian and ensemble forecasting approaches. J. Geophys. Res. 120, pp. 2561–2578. Cited by: §1, §2.2.
• T. Omi, Y. Ogata, Y. Hirata, and K. Aihara (2013) Forecasting large aftershocks within one day after the main shock. Sci. Rep. 3, pp. 2218. Cited by: §1, §2.2, §2.2, §2.4, §2.5, §3, §4.
• T. Omi, Y. Ogata, K. Shiomi, B. Enescu, K. Sawazaki, and K. Aihara (2015b) Automatic aftershock forecasting: a test using real-time seismicity data in japan. Bull. Seismol. Soc. Am. 106, pp. 2450–2458. Cited by: §1, §2.2, §4.
• F. Omori (1894) On the aftershocks of earthquake. J. ColI. Sci. Imp. Univ. Tokyo 7, pp. 111–200. Cited by: §1, §2.2.
• K. B. Petersen and M. S. Petersen (2012) The matrix cookbook. Technical report, Technical University of Denmark, 2007. URL http://www2.imm.dtu.dk/pubdb/p.php?3274. Cited by: Appendix A.
• J. Qin (2017) Biased sampling, over-identified parameter problems and beyond. Singapore: Springer. Cited by: §1.
• C. E. Rasmussen, C. K. I. Williams, and K. I. Christopher (2006) Gaussian processes for machine learning. The MIT Press. Cited by: §1, §2.1.
• P. A. Resenberg and L. M. Jones (1989) Earthquake hazard after a mainshock in california. Science 243, pp. 1173–1176. Cited by: §1, §2.2.
• P. A. Resenberg and L. M. Jones (1994) On the estimation of seismic detection thresholds. Bull. Seismol. Soc. Am. 65, pp. 1631–1642. Cited by: §1, §2.2.
• F. Ringdal (1975) On the estimation of seismic detection thresholds. Bull. Seism. Soc. Am. 65, pp. 1631–1642. Cited by: §1, §2.2.
• M. A. Tanner and W. H. Wong (1987) The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 82, pp. 528–540. Cited by: §2.5.
• T. Utsu (1961) A statistical study on the occurrence of aftershocks. J. ColI. Sci. Imp. Univ. Tokyo 30, pp. 521–605. Cited by: §1, §2.2.
• T. Utsu (1969) Aftershocks and earthquake statistics (1): some parameters which characterize an aftershock sequence and their interrelations. J. Fac. Sci., Hokkaido Univ., Ser. VII (Geophysics) 3, pp. 129–195. Cited by: §4.
• T. Utsu (1970) Aftershocks and earthquake statistics (ii)–further investigation of aftershocks and other earthquake sequences based on a new classification of earthquake sequences. J. Fac. Sci. Hokkaido Univ., Ser. 7, pp. 197–266. Cited by: §2.2.
• Y. Vardi (1982) Nonparametric estimation in presence of length bias. Ann. Stat. 10, pp. 616–620. Cited by: §1.
• Y. Vardi (1985) Empirical distributions in selection bias models. Ann. Stat. 13, pp. 178–203. Cited by: §1.
• P. Wessel and W. H. F. Smith (1998) New, improved version of generic mapping tools released. Trans. Am. Geophys. Un. 79, pp. 579. Cited by: §5.
• S. Wilhelm and B. G. Manjunath (2015) tmvtnorm: truncated multivariate normal and student t distribution. Note: R package version 1.4-10 External Links: Link Cited by: §2.5.

## Appendix A Theoretical results

In this section, we prove that each the marginal likelihood and the predictive distribution is computed as eq. (11) and eq. (12), respectively. For the proof of eq. (11), a detailed proof is given, but for eq. (12), only a sketch proof is because they are almost same. In the proofs, a formula sum of two squared forms (Petersen and Petersen, 2012, section 8.1.7) is repeatedly used. The formula says, for any vectors and , and nonsingular matrices and , it holds that

 −12(x−m1)⊤Σ−11(x−m1) −12(x−m2)⊤Σ−12(x−m2) (15) =−12(x−mc)⊤Σ−1c(x−mc)+C,

where

 Σ−1c = Σ−11+Σ−12 mc = (Σ−11+Σ−12)−1(Σ−11m1+Σ−12m2) C = 12m⊤cΣ−1cmc−12(m⊤1Σ−11m1+m⊤2Σ−12m2).

Proof of Eq. (11). The definition of the marginal likelihood is

 L(θ)=p(β)∫n∏i=1p(M1i∣μi;β,s2)p(μ∣μ0,K)dμi, (16)

where , is mean of the prior distribution, and is variance of the prior distribution which is an by matrix with th element . Hereafter we ignore because it does not have effect on integration. Recall that definition of the conditional distribution of given is given in eq. (7). It follows from

 exp(βμi)Φ(M1i;μi,s2) =exp(βμi)√2πs2∫M1i−∞exp{−(x−μi)22s2}dx =exp(s2β2/2)√2πs2∫M1i−∞exp[βx−{μi−(x+s2β)}22s2]dx

that the conditional distribution can be rewritten as

 n∏i=1p(M1i∣μi;β,s2) =βnexp{−βn∑i=1M1i−n2β2s2}n∏i=1exp(βμi)Φ(M1i;μi,s2) =βnexp(−β∑ni=1M1i)√(2π)n|Σ|∫x∈Mexp(β1⊤x) ×exp[−12{μ−(x+βΣ1)}⊤Σ−1{μ−(x+βΣ1)}]dx,

where , is by identity matrix, and . Letting , , , and with the formula (eq. 15), leads to

 L(θ) =βnexp(−β∑ni=1M1i)√(2π)n|Σ|√(2π)n|K|∫Mexp(β1⊤x) ×∫exp{−12(μ−m1)⊤Σ−11(μ−m1)} ×∫exp{−12(μ−m2)⊤Σ−12(μ−m2)}dμdx =βnexp(−β∑ni=1M1i)(2π)n√|Σ||K| ×∫Mexp(β1⊤x)√(2π)n|Σc|exp(C)dx.

Next, we compute , , and in the formula. It follows from the standard argument in linear algebra that

 Σc = (K−1+Σ−1)−1=K~K−1Σ, mc = Σc{Σ−1(x+βΣ1)+K−1μ0}, C = −12μ⊤0~K−1μ0−12(x+βΣ1)⊤~K−1(x+βΣ1) +μ⊤0(K+Σ)−1(x+βΣ1),

where . Rearranging the integrand so that it becomes a quadratic form of , we have

 L(θ) =βn√(2π)n|~K|exp{−βn∑i=1(M1i−μ0i)−β221⊤(Σ−K)1} ×∫Mexp{−12(x−~μ)⊤~K−1(x−~μ)}dx,

where . This is the desired conclusion.

Proof of Eq. (12). Let be any data point may not be in the dataset. What we need to compute is

 p(μ∗∣t∗1,D) =∫p(μ∗∣t∗1,μ,t1)p(μ∣t1,M1)dμ =∫p(μ∗∣t∗1,μ,t1)p(M1∣μ)p(μ∣t1)dμ∫p(M1∣μ)p(μ∣t1)dμ.

Here, denominator of the predictive distribution is exactly the same as the marginal likelihood which is already computed. Hence it remains to show the numerator becomes

 βnexp{−βn∑i=1(M1i−μ0,i)−12β2(ns2−∑i,jKi,j)}

It is computed by using the formula (eq. 15) with respect to with some tedious calculus.

Proof of Eq. (13). It follows from the result of eq. (12) that

 P(δ=1∣M∗1,t∗1,D) =∫P(δ=1∣M