1 Introduction
A massive earthquake triggers a number of aftershocks. The classical representative models to describe the temporal distribution of aftershocks were established as the OmoriUtsu (Omori, 1894; Utsu, 1961) and the GutenbergRichter (Gutenberg and Richter, 1944) formulae, the latter of which also considered the information of magnitudes. Ogata (1988) proposed, having extended the OmoriUtsu formula, the Epidemic Type Aftershock Sequence (ETAS) model to describe more realistically that large aftershocks also excite subsequent aftershocks like the main shock. The distribution of aftershocks enables us to forecast seismic activities and hazard assessments (Resenberg and Jones, 1989, 1994; Kagan and Jackson, 2000). Until now, previous studies have proposed many statistical methods to estimate the parameters involved in the models (Aki, 1965; Ogata, 1983, 1988). A disadvantage of these parameter estimation methods is that they assume the existence of complete data without missing. However, detecting all aftershocks immediately after the main shock is unrealistic due to contaminations of a tremendous amount of seismic waves. Such incomplete data cause underestimations in the counting of aftershocks.
In statistics, situations in which only some detected data being available are known as biased sampling problems (Vardi, 1982, 1985)
. In our case, the detection probability of aftershocks clearly depends on the magnitudes and elapsed times from the main shock. This type of biased sampling data is termed as missing not at random (MNAR), in which the detection probability depends on the values of undetected data. Introducing a detection function, which is a model of the detection probability, enables us to correct the bias
(Qin, 2017). Several studies have tackled this problem by introducing a parametric model on the detection function for aftershocks
(Ringdal, 1975; Ogata and Katsura, 1993, 2006; Omi et al., 2013, 2014, 2015a, 2015b). The detection function enables us to construct valid estimators in MNAR. Three problems remain to be solved: (i) the resulting estimators are often unstable; (ii) misspecification of the detection function causes bias; (iii) estimation of the detection function is difficult even with a correct model. The problem (i) arises from the simultaneous estimation of the detection function and the distribution of aftershocks. The problem (ii) is obvious because the bias correction strongly depends on how close the defined detection function to the true one is. The problem (iii) is because some integration is required in the likelihood, which makes estimations difficult in the biased sampling problems. In this study, we propose a nonparametric Bayesian estimator to overcome these three problems. Appropriate prior information considering characteristics of seismic activities in a target area improves the problem (i). A modeling of the detection function based on the technique of Gaussian Process Regression (GPR), which enables us to estimate an arbitrary continuous function from a given dataset without assuming a specific functional form, solves the problems (i) and (ii) simultaneously. The GPR has been accepted widely in the recent machine learning researches due to its flexibility and wide coverage of function spaces
(Rasmussen et al., 2006; de G. Matthews et al., 2018). As for computation of the model parameters, we propose an efficient Bayesian estimation algorithm utilizing the property that distributions of both the Gaussian process (GP) and aftershocks are exponential functions, which are compatible to compute solving the third problem (iii).Another advantage in the GPR is that it is capable of evaluating uncertainties or credible intervals of the estimated parameters naturally, which has been hard in the previous works in spite that knowing the uncertainty is inevitable to decide statistical decision. In summary, the proposed method can solve the three problems mentioned above and has an additional property that can estimate the uncertainties of the parameters.
This paper is organized as follows. Section 2 introduces the GPR, and proposes a method to estimate parameters for distribution of aftershocks with a detection function through the GPR. Section 3 validates the proposed method through numerical tests. Section 4 demonstrates the effectiveness of the proposed method by applying to the catalog data of the 2004 Chuetsu earthquake. Section 5 concludes the present study including perspective future works.
2 Methodology
The present paper proposes to use a detection function based on the GPR to model temporal changes in the detection probability of aftershocks even immediately after a main shock. This section first gives a brief explanation of the GPR, and then introduces the proposed method, especially its theoretical properties and an efficient computational algorithm.
2.1 Gaussian process regression
Recent studies in the solid Earth science used the GPR to construct models from given data in the cases that the physical or chemical process that produced the data was unknown or too complicated. The GPR estimates a regression function simultaneously with its uncertainty through Bayesian nonparametric estimation. For example, Kuwatani et al. (2018)
proposed to apply the GPR to interpolate the observed quantities of chemical compositions along with the radius of a rock. The estimated uncertainty often provides valuable information for observational or experimental design, such as a suggestion of times and/or places of the next new observations or measurements. The present study adopts the GPR to obtain a nonparametric regression function in the framework of Bayesian estimation.
We first briefly explain a GP, which the GPR bases. The fundamental statistics tells that, for a random variable, there exists a corresponding distribution that generates random “values”. The GP is an extension of this concept to a random function, i.e., there exists a corresponding distribution that generates random “functions”. The distribution of a GP is usually denoted as GP
, , where is the mean function andis the variance function or “kernel”. The notation like
is used in this paper to abbreviate arguments emphasizing thatis a function. A radial basis function is often chosen among various candidate functions for the variance function or kernel
:(1) 
where and are arbitrary real numbers. Another reason why this study adopts the radial basis function for the kernel is that it adequately covers an infinitedimensional function space with only a few hyperparameters. See Rasmussen et al. (2006) for the detailed explanation of the GP including other kernel functions. Figure 1(a) illustrates three random “functions” generated from a GP, with , and . For any points , where is an arbitrary positive integer, the GP is mathematically equivalent to computed from a sampled function
that follows a multivariate normal distribution with mean
and variance . Here means a matrix having in its th element, and the superscript means transposition. Note that the variance of the value at any point is, and triple of its standard error
produces approximately 99.7% confidence interval for the GP shaded region in red in Figure
1(a). Figure 1(b) indicates that the values of the function (black line in Figure 1(a)) at arbitrary points follows a multivariate normal distribution , where . Intuitively speaking, when the number of points goes to infinity, a set of points forms the function that is a sample from the distribution of the GP. This consideration indicates that a GP is a stochastic process obtained by letting the dimension of a multivariate normal distribution go to infinity.The GPR is a Bayesian estimation method for a target function by using a distribution of the GP as the prior information. Let and
be sets of explanatory and response variables, respectively, and both are assumed to relate each other through an unknown regression function
, i.e., , where denotes the number of time points. The GPR estimates the function from a given dataset assuming a kernel function mentioned above. We assume GP, as a prior distribution of the target function . The mean function is often assumed to be identically zero since the mean is adjustable by subtracting the sample mean of . Estimation of a function is equivalent to that of the value of at any fixed point , where the superscript “” is used to discriminate the fixed point from the data points. A posterior distribution of given a datasetis called the predictive distribution. The Bayes’ theorem yields the predictive distribution as
(2) 
where
is the conditional probability density function (PDF) of
with the given fixed point , unobserved , and the dataset , and the PDF is the prior on given by a multivariate normal distribution as mentioned above. The predictive distribution becomes a normal distribution again. Therefore, the predictive distribution for a point is a normal distribution with mean and variance computed as(3) 
where , and . The maximization of the marginal likelihood
(4) 
determines the hyperparameters , where the integration in the righthand side is explicitly computable since the integrand given as the product of normal distributions and is again a normal distribution. Figure 2 shows the predictive distributions estimated from six data points with changing hyperparameters and in the kernel (eq. 1). As mentioned above, the GPR successfully obtains not only the mean function but also its standard error i.e., credible interval. Figure 2 also indicates that and strongly associate with scale and shape of the regression function, respectively. A comparison between Figures 2(a) and 2(c), or 2(b) and 2(d) indicates that small/large means a small large/credible interval. Another comparison between Figure 2(a) and 2(b), or 2(c) and 2(d) shows that small/large means an oscillating/smoothed regression function. These results indicate the importance of deciding the hyperparameters.
2.2 Notation and models
According to the OmoriUtsu law, aftershock occurrence rate at elapsed time from the main shock follows a nonstationary Poisson process (Omori, 1894; Utsu, 1961):
(5) 
where
is a vector containing all the model parameters, i.e.,
. The parameter controls the level of seismic activity, i.e., large/small reflects the large/small number of aftershocks. The parameter is the slope of the occurrence rate in the logarithmic scale. The parameter characterizes the length of “capped time”, which means the wellknown phenomenon that the occurrence rate is below some level for a while immediately after the main shock (Utsu, 1961; Ogata, 1983). According to the GutenbergRichter law, the intensity rate of the magnitude is described by an exponential function (Gutenberg and Richter, 1944):(6) 
where and (or ) are constants. The parameter is of the most interest since it reflects the intensity rate of magnitude in the statistical meaning. Combining eqs. (5) and (6), the joint occurrence rate of aftershocks as a function of elapsed time and magnitude is represented by the product of and as
(7) 
where and is the magnitude of the main shock used to adjust the scale of (Utsu, 1970). Note that a unique decomposition into and is mathematically impossible, even if is obtained. Resenberg and Jones (1989, 1994) pointed out that an estimation of leads to forecast seismic activities, so that the unique decomposition problem does not matter in this context. Hereafter we use instead of for notational simplicity.
An exact count of all aftershocks right after a main shock is too tough due to contaminations of arriving seismic waves, so that the occurrence rate of aftershocks is almost always underestimated. In order to correct the bias, the present study adopts the probittype detection function as is used in Ringdal (1975), Ogata and Katsura (1993), Ogata and Katsura (2006), and Omi et al. (2013, 2014, 2015a, 2015b):
(8)  
where is a detection indicator that takes 1/0 if the aftershock is detected/undetected, is the magnitude that makes aftershock detectable with probability 50% at elapsed time , and is a scale parameter. Roughly speaking, the function is decreasing with respect to the elapsed time because only large aftershocks are detectable immediately after the main shock and even small aftershocks are detectable after enough time passes.
To distinguish the notations related to complete and detected data, let and be the elapsed time and magnitude of a detected aftershock, respectively. The subscript “1” indicates the detected data, i.e., . Note that and are always available though the complete data and are usually unavailable. Supposing that aftershocks are detected, let the pair of elapsed time from the main shock and magnitude of th aftershock be . The thinning operation or the random deletion in point processes (Ogata and Katsura, 1993) yields the likelihood function for the detected magnitudes
(9) 
where the function is assumed to be known, although it is estimated later in practice. Using the estimated and obtained by maximizing eq. (9), can be estimated by maximizing the loglikelihood function for detected elapsed times within any time interval (Ogata and Katsura, 1993),
(10) 
where is an intensity function for detected data defined by
(11)  
(12) 
This paper proposes a more efficient estimation method for in Section 2.4, obtained from an investigation of the problem in the estimations of and as mentioned in Section 2.3.
However, a serious problem that is unknown remains. In frequentist ways, Ogata and Katsura (1993) applied a Bspline basis function to , and Ogata and Katsura (2006) proposed a specific parametric model based on the 2003 MiyagiKenOki earthquake as
(13) 
where , , , and are parameters to be estimated. Recently, Omi et al. (2013) proposed a flexible nonparametric Bayesian estimation. They assumed a prior on , and computed the posterior mean. Their method does not require any specification of and it can naturally incorporate the prior information of , which considerably makes estimates stable. The posterior distribution of can be estimated in the same aftershock area that occurred before the main shock. However, their method never evaluates the credibility of or , which is inevitable to decide statistical decisions.
2.3 Estimation for , , and
Figure 3(a) shows a graphical model for complete data, which illustrates how the data, parameters, and hyperparameters relate to each other in accordance with the OmoriUtsu law, the GutenbergRichter law, and the detection function. Since the complete data are not available, we construct a likelihood based on the detected data given as shown in Figure 3(b). The distribution of given is already derived in eq. (9). Once the graphical model for the detected data was obtained, one would realize that relation among , , and is exactly the same as the regression: each input , the regression , and output corresponds to , , and , respectively. Based on this idea, we put a GP prior on and consider a nonparametric Bayesian estimation. However, unlike the GPR explained in Section 2.1, the distribution derived in eq. (9) is not a normal distribution, and the predictive distribution shall be more complicated.
In this study, we put a GP prior on the “function” , not on the “points” . The GP prior leads to an explicit form of the marginal likelihood of hyperparameters and posterior distribution of as seen in the following discussion. The prior for can be denoted as , where is defined as
(14) 
The prior should be a function except for identically zero in this case since adjusting the mean of the prior like in the GPR is impossible. Section 3 shows later that this simple prior works well through numerical experiments.
The first term in eq. (14) is added to the usual kernel eq. (1) to make the matrix always nonsingular. A subjectively small value, e.g., , is set to for a stable estimation of the other parameters. In summary, the parameters to be estimated are .
Let be the hyperparameter of the subjective prior for estimated in the same aftershock area that occurred before the main shock (Omi et al., 2013). The marginal likelihood of the hyperparameters and the predictive distribution of are computable as follows (see Appendix for the proofs).

Marginal likelihood of :
(15) where is the integration interval, is the direct product of the sets , , is the dimensional vector that has one in its all elements, , is the
dimensional identity matrix, and
is the PDF of the normal distribution with mean and variance . 
Predictive distribution of :
(16) (17) where
(18) , , and . The expectation is on a variable , where is the truncated multivariate normal distribution with mean , variance , bounded by the region , In particular, mean and variance of the predictive distribution are and , where
(19) The symbol “” explicitly represents that the variable depends on .

Predictive distribution of detection probability :
(20) where
is the cumulative distribution function of the standard normal distribution, i.e.,
.
The integrations in eqs. (17), (19), and (20) are computable by the Monte Carlo method since they involve expectations on the known truncated multivariate normal distribution . However, computation of the integration in the marginal likelihood (eq. 15) is more difficult. We provide an efficient computational algorithm to obtain the posterior samples of the hyperparameter , instead of maximizing the marginal likelihood (eq. 15) directly.
2.4 Estimation for
Recall that the loglikelihood function for is given in eq. (10), but and are unknown in eq. (11) and eq. (12). Ogata and Katsura (1993) and Omi et al. (2013) replaced with the predictive distribution (eq. 16) based on eq. (12). The strong point in our procedure is that the detection function in eq. (11) can be replaced with its predictive distribution already obtained in eq. (20). Our new approach using full information of the estimated detection function is expected to provide more efficient estimates, which is a significant advantage compared to the previous studies using only information of mean function .
By replacing the detection function (eq. 11) with the predictive one (eq. 20), we have
(21) 
Computation of the expectation by the distribution is possible with the Monte Carlo method. As for the integration of the second term in eq. (10), we use a left Riemann sum discretizing into meshes having the same interval.
2.5 Computational algorithm
Both the predictive distribution and the marginal likelihood involve integration on multivariate normal distribution over an dimensional hyperrectangle . Since an explicit computation of the integral is hard even when the dimension is low (Genz and Bretz, 2009), we divide the estimation into two steps: (i) estimation of hyperparameters ; (ii) computation of mean and variance of the predictive distribution.
In the step (i), the dataaugmentation method (Tanner and Wong, 1987) is applied to obtain samples from the posterior distribution of . Regarding the variable in eq. (15) as a latent variable yields to another representation of the marginal likelihood as
(22) 
where
(23) 
In accordance with the Gibbs sampling, augmented data can be sampled via the following two steps:

Draw ;

Draw ,
We apply the Gibbs sampling to obtain samples from the truncated normal distribution (Geweke, 1991, 2005), and use the package tmvtnorm of R programming language (Wilhelm and Manjunath, 2015) in the practical programming. As for sampling of
, the classical random walk MCMC (Markov Chain Monte Carlo) method
(Metropolis et al., 1953) works with a normal distributions as the proposal distribution. By repeating the two steps alternately, obtained becomes samples from posterior of . Hyperparameters are estimated as the median of obtained samples. Once samples of are obtained, can be estimated by its sample mean. Mean of the predictive distribution of is computable with estimated by given in eq. (18), and variance is by .3 Numerical experiments
We conduct numerical tests for the two cases of synthetic observation data to validate the performance of the proposed method. Table 1 summarizes the assumed true values for the parameters in the OmoriUtsu and the GutenbergRichter laws that are used to generate the two synthetic datasets. The manitude of the main shock is assumed to be .
The true detection functions modeled by eq. (8) in the two cases are

;

where the same scale parameter
is assumed for both cases. Figure 4(a) shows the true detection functions for both cases. Case 1 supposes an ordinary situation that the detectable magnitude of aftershocks monotonically decreases as the signaltonoise ratio recovers with elapsed time from the main shock. Case 2 supposes a complicated situation that several large aftershocks excite many subsequent aftershocks again, and the detectable magnitude turns to increase for every large aftershock
(Omi et al., 2013). We first estimate the predictive distribution of and , and their credibility with data detected within 3, 6, 12, 24 hours. With the estimated , , and , is estimated by the method stated in Section 2.3. We assume for the prior distribution of , and no prior information for . Actually, the resulting estimates of would be stable without any prior distributions.Case  

1  0.9  8.700  1.100  5.809 
2  0.9  8.987  1.100  5.809 
Figure 5 shows the estimated predictive distribution of . The mean of the predictive distribution is quite close to the true and inside a region within , where is the estimated standard error of . The resulting estimates with data (a) are almost consistent with the case of (d) , which shows the robustness of our proposed method even with small datasets. It is worth noting that even if the mean of the prior distribution (green line) is not so similar to the true one (black dashed line), the predictive distribution reproduces the true considerably well. Table 2 summarizes the estimated and comparing the proposed method (M1) with the method of Omi et al. (2014) (M2). The method M1 replaces the detection function with the predictive one in eq. (11), and the method M2 replaces with the estimated in eq. (12). The estimates obtained by both methods are almost consistent. But in Case 2, M2 seems to fail to estimate and/or with data , while M1 gives stable estimates as expected in Section 2.3. As the elapsed time increases, the mean of posterior samples closes to the true value, and the length of the credible interval becomes shorter. Estimated credible intervals, computed by estimated mean 1.96 estimated standard error, successfully include the true value even data with in both cases. The reason why the estimated values in Case 2 are underestimated for all periods of available data is not due to the imperfectness of the proposed method but due to the sampling bias.
Case 1  Case 2  
Elapsed time  
# of data  483  668  875  1135  275  539  695  925  
0.846  0.860  0.889  0.901  0.846  0.850  0.851  0.859  
( 0.047)  ( 0.043)  ( 0.038)  ( 0.037)  ( 0.053)  ( 0.044)  ( 0.040)  ( 0.038)  
M1  5.482  4.902  4.342  3.703  5.193  4.320  3.753  3.083  
( 0.304)  ( 0.160)  ( 0.094)  ( 0.057)  ( 0.317)  ( 0.140)  ( 0.104)  ( 0.062)  
1.158  1.155  1.108  1.081  1.103  1.022  1.063  1.055  
( 0.111)  ( 0.068)  ( 0.045)  ( 0.031)  ( 0.119)  ( 0.062)  ( 0.053)  ( 0.037)  
5.616  5.649  5.844  5.950  5.838  6.298  6.052  6.057  
( 0.437)  ( 0.342)  ( 0.288)  ( 0.262)  ( 0.574)  ( 0.462)  ( 0.419)  ( 0.350)  
M2  5.476  4.906  4.513  3.701  5.192  4.316  3.648  3.077  
( 0.302)  ( 0.160)  ( 0.077)  ( 0.057)  ( 0.315)  ( 0.142)  ( 0.085)  ( 0.058)  
1.156  1.157  1.061  1.081  1.102  1.022  0.885  1.052  
( 0.110)  ( 0.068)  ( 0.035)  ( 0.031)  ( 0.118)  ( 0.063)  ( 0.037)  ( 0.034)  
5.621  5.644  6.443  5.949  5.849  6.303  7.554  6.072  
( 0.437)  ( 0.342)  ( 0.310)  ( 0.250)  ( 0.583)  ( 0.457)  ( 0.492)  ( 0.385) 
4 Real data analysis
The proposed method is applied to the real catalog data related to the 2004 Chuetsu earthquake officially released from the Japan Meteorological Agency (JMA). The 2004 Chuetsu earthquake (magnitude , epicenter N, E) occurred in Niigata prefecture, Japan at 17h56m on October 23, 2004 (JST). Figure 6 shows the spatial distribution of aftershocks that occurred within 24 hours from the main shock. The dataset is perfectly the same as used in Omi et al. (2015b), in which the aftershocks occurred in a rectangle area having the lengths four times the UtsuSeki aftershock zone for latitudinal and longitudinal directions are selected. The UtsuSeki aftershock zone is a rectangle region, in which the epicenter is located at the center, having the angle lengths of for both latitudinal and longitudinal directions, where is the UtsuSeki aftershock zone length defined using the magnitude of the main shock as (Utsu, 1969). In the case of the 2004 Chuetsu earthquake, is .
Figure 7 shows the results of applications to the real catalog data. Following the same procedure in the numerical tests in Section 3, the proposed method estimates the mean of the predictive distribution with the standard error starting from the prior , assuming that the data available for 3, 6, 12, and 24 hours from the main shock. The magnitudes of aftershocks are ranging from 0.8 to 6.6 by the elapsed time of 24 hours.
The is found not to simply decrease but turn to increase at beyond the estimated credible interval . One possible reason for this may be the fact that a large aftershock occurred at with the magnitude of 4.8. The proposed method can extract such a hidden structure without missing, owing to a stable estimation of .
The estimated by Omi et al. (2014) shown in Figure 7 for comparison seems to be unstable when available data are insufficient (Figures 7(a) and 7(b)), which is caused by some approximation methods for the marginal likelihood (supporting information of Omi et al. (2014)), although it is improved when the available data were extended to 24 hours. A similar phenomenon is also seen in Figure 3 in Omi et al. (2013). The proposed method always obtains a stable estimation even when the elapsed time is short (Figures 7(a) and 7(b)) owing to the exclusion of any approximations for the marginal likelihood (eq. 15). Figure 8(a) shows the estimated predictive detection function (eq. 20) in the case of the 2004 Chuetsu earthquake, and Figure 8(b) plots its crosssections at the magnitudes , , , and . These magnitudes correspond to the minimum (th percentile), th percentile, median (th percentile), and th percentile in the catalog data, where thpercentile is the magnitude below which of all the aftershocks are found. Figure 8(b) indicates that aftershocks with magnitudes larger than 2.3 are completely detected after , except when the detection probability is slightly less than 1 probably due to a large aftershock mentioned above.
Table 3 summarizes the mean and the standard error of the posterior distribution of within elapsed time , , , and hours. The estimates for the value are stable irrespective to the time available. On the other hand, the estimates of and with 24 hours are significantly small compared to the estimates with . This indicates that our method is successful in detecting the changes of seismic activity due to the large aftershock at .
Elapsed time  

# of data  192  355  655  1099 
0.787  0.745  0.794  0.779  
( 0.066)  ( 0.051)  ( 0.045)  ( 0.035)  
3.795  3.129  2.822  1.658  
( 0.438)  ( 0.264)  ( 0.117)  ( 0.045)  
1.529  1.484  1.537  1.270  
( 0.243)  ( 0.188)  ( 0.123)  ( 0.045)  
3.573  3.589  3.509  3.987  
( 0.478)  ( 0.492)  ( 0.391)  ( 0.362) 
5 Concluding remarks
Immediate prediction of seismic activities after the main shock is important to assess hazards for subsequent aftershocks. Contaminations of arriving seismic waves right after the main shock interfere with counting the number of the aftershocks correctly, so that the number of detected aftershocks is underestimated. This underestimated count of the aftershocks causes distorted estimates for the distribution of the aftershocks or the seismic activities. In order to rapidly and stably estimate the temporal changes in the occurrence rate of aftershocks with limited data right after a main shock, we introduced a GPRbased detection function for the aftershocks to remove the effects of undetected aftershocks. Owing to the nonparametric and Bayesian properties in the GPR, the proposed detection function has four advantages superior to the previous methods: (i) the resulting estimates are stable by virtue of adoption of prior information; (ii) specification of the detection function is not necessary; (iii) MCMC sampling is effective to compute the hyperparameters without computation of complicated integration; (iv) credible intervals can be obtained in a natural way.
The limitation of the proposed method is in the assumption on the joint intensity function (eq. 7). It is known that real catalog data should be described by more complicated intensity functions such as represented by the ETAS model (Ogata, 1988). The proposed method can be extended straightforwardly to the ETAS model as done by Omi et al. (2014), which remains as a future work.
Acknowledgements.
This work was mainly supported by Tokyo Metropolitan Resilience Project of National Research Institute for Earth Science and Disaster Resilience (NIED), and partially supported by JST CREST Grant Numbers JPMJCR1763 and JPMJCR1761 of Japan Science and Technology Agency (JST). The key ideas in this study came through the activities of KAKENHI GrantinAids for EarlyCareer Scientists No. 19K14592, 19K14671, 20K19756, GrantinAids for Scientific Research (B) No. 17H01703, 17H01704, 18H03210, and GrantinAid for Scientific Research (S) No. 19H05662. The travel expense needed to discuss among coauthors was partially supported by ERI JURP 2020A05. One of the figures was drawn by using the General Mapping Tools (GMT) developed by Wessel and Smith (1998).References
 Maximum likelihood estimate of in the formula and its confidence limits. Bull. Earthq. Res. Inst. Univ. Tokyo 43, pp. 237–239. Cited by: §1.

Gaussian process behaviour in wide deep neural networks
. arXiv preprint: 1804.11271. Cited by: §1.  Computation of multivariate normal and probabilities. lecture notes in statistics.. Heidelberg: Springer. Cited by: §2.5.
 Efficient simulation from the multivariate normal and student distributions subject to linear constraints and the evaluation of constraint probabilities. Computer Science and Statistics. Proceedings of the 23rd Symposium on the Interface. Seattle Washington, April 2124, 1991, 571–578. Cited by: §2.5.
 Contemporary bayesian econometrics and statistics. Wiley & Sons. Cited by: §2.5.
 Frequency of earthquakes in California. Bull. Seism. Soc. Am. 34, pp. 185–188. Cited by: §1, §2.2.
 Probabilistic forecasting of earthquakes. Geophys. J. Int. 143, pp. 438–453. Cited by: §1.
 Recovering the past history of natural recording media by Bayesian inversion. Phys. Rev. E 98, pp. 043311. Cited by: §2.1.
 Equations of state calculations by fast computing machine. J. Chem. Phys. 21, pp. 1087–1091. Cited by: §2.5.
 ESTIMATION of the parameters in the modified omori formula for aftershock frequencies by the maximum likelihood procedure. J. Phys. Earth 31, pp. 115–124. Cited by: §1, §2.2.
 Statistical models for earthquake occurrences and residual analysis for point processes. J. Amer. Statist. Assoc. 83, pp. 29–39. Cited by: §1, §5.
 Analysis of temporal and spatial heterogeneity of magnitude frequency distribution inferred from earthquake catalogs. Geophys. J. Int. 113, pp. 727–738. Cited by: §1, §2.2, §2.2, §2.2, §2.4.
 Immediate and updated forecasting of aftershock hazard. Geophys. Res. Lett. 33, pp. L10305. Cited by: §1, §2.2, §2.2.
 Estimating the ETAS model from an early aftershock sequence. Geophys. Res. Lett. 41, pp. 850–857. Cited by: §1, §2.2, §3, Figure 7, §4, §5.
 Intermediateterm forecasting of aftershocks from an early aftershock sequence: bayesian and ensemble forecasting approaches. J. Geophys. Res. 120, pp. 2561–2578. Cited by: §1, §2.2.
 Forecasting large aftershocks within one day after the main shock. Sci. Rep. 3, pp. 2218. Cited by: §1, §2.2, §2.2, §2.3, §2.4, §3, §4.
 Automatic aftershock forecasting: a test using realtime seismicity data in Japan. Bull. Seismol. Soc. Am. 106, pp. 2450–2458. Cited by: §1, §2.2, §4.
 On the aftershocks of earthquake. J. ColI. Sci. Imp. Univ. Tokyo 7, pp. 111–200. Cited by: §1, §2.2.
 The matrix cookbook. Technical report, Technical University of Denmark, 2007. URL http://www2.imm.dtu.dk/pubdb/p.php?3274. Cited by: Lemma 1.
 Biased sampling, overidentified parameter problems and beyond. Singapore: Springer. Cited by: §1.
 Gaussian processes for machine learning. The MIT Press. Cited by: §1, §2.1.
 Earthquake hazard after a mainshock in California. Science 243, pp. 1173–1176. Cited by: §1, §2.2.
 Earthquake aftershocks: Update. Science 265, pp. 1251–1252. Cited by: §1, §2.2.
 On the estimation of seismic detection thresholds. Bull. Seism. Soc. Am. 65, pp. 1631–1642. Cited by: §1, §2.2.
 The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 82, pp. 528–540. Cited by: §2.5.
 A statistical study on the occurrence of aftershocks. J. ColI. Sci. Imp. Univ. Tokyo 30, pp. 521–605. Cited by: §1, §2.2.
 Aftershocks and earthquake statistics (I) — Some parameters which characterize an aftershock sequence and their interrelations —. J. Fac. Sci., Hokkaido Univ., Ser. VII (Geophysics) 3, pp. 129–195. Cited by: §4.
 Aftershocks and earthquake statistics (II) — Further investigation of aftershocks and other earthquake sequences based on a new classification of earthquake sequences —. J. Fac. Sci. Hokkaido Univ., Ser. VII (Geophysics) 3, pp. 197–266. Cited by: §2.2.
 Nonparametric estimation in presence of length bias. Ann. Stat. 10, pp. 616–620. Cited by: §1.
 Empirical distributions in selection bias models. Ann. Stat. 13, pp. 178–203. Cited by: §1.
 New, improved version of generic mapping tools released. Trans. Am. Geophys. Un. 79, pp. 579. Cited by: §5.
 tmvtnorm: truncated multivariate normal and student t distribution. Note: R package version 1.410 External Links: Link Cited by: §2.5.
Appendix A Theoretical results
Lemma 1
(Formula of sum of two squared forms (Petersen and Petersen, 2012, section 8.1.7)).
For any vectors and , and nonsingular matrices and , it holds that
(24) 
where
(25)  
(26)  
(27) 
Proof of Eq. (15). The definition of the marginal likelihood is
(28) 
where , is mean of the prior distribution, and is variance of the prior distribution which is an by matrix. Hereafter we ignore because it does not have effect on integration. Recall that definition of the conditional distribution of given is given in eq. (9). It follows from
(29) 
that the conditional distribution can be rewritten as
where , is by identity matrix, and . Letting , , , and with the formula (eq. 24), leads to
(30) 
Next, we compute , , and in the formula. It follows from the standard argument in linear algebra that
Comments
There are no comments yet.