Predicting Infection of COVID-19 in Japan: State Space Modeling Approach

04/28/2020 ∙ by Genya Kobayashi, et al. ∙ 0

The number of confirmed cases of the coronavirus disease (COVID-19) in Japan has been increasing day by day and has had a serious impact on the society especially after the declaration of the state of emergency on April 7, 2020. This study analyzes the real time data from March 1 to April 22, 2020 by adopting a sophisticated statistical modeling tool based on the state space model combined with the well-known susceptible-exposed-infected (SIR) model. The model estimation and forecasting are conducted using the Bayesian methodology. The present study provides the parameter estimates of the unknown parameters that critically determine the epidemic process derived from the SIR model and prediction of the future transition of the infectious proportion including the size and timing of the epidemic peak with the prediction intervals that naturally accounts for the uncertainty. The prediction results under various scenarios reveals that the temporary reduction in the infection rate until the planned lifting of the state on May 6 will only delay the epidemic peak slightly. In order to minimize the spread of the epidemic, it is strongly suggested that an intervention is carried out for an extended period of time and that the government and individuals make a long term effort to reduce the infection rate even after the lifting.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Since the first case of the coronavirus disease 2019 (COVID-19) in Japan was confirmed on January 15, 2020, the number of confirmed cases has been increasing day by day. Although the Japanese government declared a state of emergency on April 7, it does not have a legal force to regulate individual activities and remains at only requesting the avoidance of outings. Consequently, the number of cases has been still increasing as shown in Figure 1. Needless to say, Japanese economy has been seriously shocked and the public interest mainly lies on how the number of infected persons transits in the future and when the outbreak will converge. Although there already exists a rapidly increasing number of statistical analyses of the epidemic, the statistical evidence focusing on the situations in Japan is still limited except for Kuniya (2020); Karako et al. (2020); Mizumoto et al. (2020). Therefore, the purpose of this study is to provide a statistical evidence regarding the future transition of the infectious proportion in Japan, including the intensity and timing of the epidemic peak, based on the real-time data on the cumulative number of confirmed, recovered and deceased persons, shown in Figure 1.

We consider the famous susceptible-infected-recovered (SIR) model Kermack and McKendrick (1927) for modeling the epidemic process as widely adopted in the existing literature on COVID-19. However, this deterministic model is not necessarily sufficient to explain the variability of the transition since the observed number is subject to nonignorable randomness. To handle such randomness in the data, we employ the state spate models combined with the SIR model (SS-SIR model) developed by Dukic et al. (2012); Osthus et al. (2017). The model was originally proposed for statistical modeling of the seasonal trend of influenza. The advantages of the SS-SIR model are mainly three points; (1) the unknown parameters in the SIR model can be estimated with little knowledge about the true values by adequately using the data information; (2) future prediction of a variety of quantities such as the number of infections or the epidemic peak as well as uncertainty quantification of the prediction can be carried out easily; (3) whether the real-time data follows the assumed SIR model or not can be assessed through the parameter estimate. These advantages are quite essential because (1)information required for modeling the epidemic trend of a new virus is scarce, (2) it is important to compute not only point prediction but also interval prediction to understand the possible worst and best scenarios of future transition, and (3) understanding if the real-time data actually follows the SIR model is critical for the reliability of future simulations based on the SIR model.

Figure 1: The cumulative numbers of confirmed, recovered and deceased persons in Japan.

2 Methods

2.1 Data

We use the numbers of confirmed, recovered and deceased persons collected on an open source platform ( Although the original data starts from January 22, the numbers before the end of February are treated collectively. This is because the confirmed numbers in this period are relatively small and using the data after March 2020 would be useful to reliably predict the future numbers of infectious persons after May 2020. Hence, the period of the data used in our analysis consists of days from March 1 to April 22. We use the difference between the cumulative numbers of confirmed persons and recovered plus deceased persons, denoted by for , which can be interpreted as the number of confirmed persons being infectious. It is further assumed that only fraction of infectious individuals can be identified by diagnosis, which is called identification rate hereafter. Then we define as where is the population of Japan, thereby is the proportion of the infectious population at time . Regarding the specific values of , we follow the discussion in Kuniya (2020). Since Bloomberg (Bloomberg) reported that 77 persons were confirmed among the possible 940 infected population, the confidence interval of is . Based on this argument, the results under the following three scenarios , and are compared.

2.2 Statistical model

Here the model proposed by Osthus et al. (2017) is described. Let , and denote the proportions of individuals being susceptible, infected and recovered population at the time , respectively, satisfying

. The SIR model describes the epidemic over time via the nonlinear ordinary differential equations (ODE) given by


where the unknown infection rate and removal rate control the transition from one compartment to the next and jointly determine the epidemic process. Let

define the three-dimensional vector of the unobserved true proportion at the time

. To allow randomness in the evolution of , the following model is considered:


where Dir denotes the Dirichlet distribution, is the solution of the deterministic SIR model (1) starting the ODE at and is the unknown parameter controlling the randomness in the evolution. In the above model, the conditional expectation of given the previous state is , so the distribution of is centered around the deterministic model (1

). It is noted that the conditional variance of

decreases as increases, thus the validity of the assumption of the deterministic model (1) can be verified through the estimate of .

Let be the observed value of . Since is not necessarily equal to the true , is observed based on the following probabilistic model:


where Beta

denotes the Beta distribution and

is an unknown parameter having a similar role to in (2). The statistical model for with the combination of (2) and (3) is seen as a state space model.

The unknown parameters in the model are the two parameters and in the SIR model and two scale parameters and that control the randomness in the two equations (2) and (3). The estimation of these parameters and future prediction is conducted within the Bayesian framework in which we assign prior distributions for these parameters and compute the posterior distribution via the Bayes rule. Due to the complexity of the model, the analytical derivation of the posterior distribution is not feasible. Instead, we rely on the simulation-based method known as Markov Chain Monte Carlo (MCMC) algorithm (Gamerman and Lopes, 2006) to generate random numbers from the posterior distribution. Then the parameter estimates are calculated and future prediction is carried out based on the output of the MCMC algorithm. Regarding the prior distributions, we assign slightly non-informative priors to reflect the uncertainty about the new epidemic and let the data tell the truth adequately. The details of the settings of the prior distributions and algorithm are provided in Supplementary Material.

3 Results

3.1 Prediction of epidemic peak

The SS-SIR model is applied to the Japanese data with the three identification rates . First, we found that the estimates of the precision parameters and are very large. For example, the point estimates are and for indicating that the deterministic SIR model explains the transition of the real-time data well. Table 1 reports the estimates and credible intervals of the representative parameters. Under the three settings for , the point estimates of are between and and those of are between and . The estimates of the basic reproduction number are between and . For , for example, the 95% credible interval of is (1.22–1.64). The estimates of PI and PT appear to vary depending on the identification rate. Figure 2 reports the future predictions of the proportion of the infectious proportion under the three identification rates. The figure allows us to easily understand the degree of uncertainty in prediction, and worst and best scenarios for the future epidemic process through the interval prediction. It is seen that the predicted timing of the epidemic peak and peak intensity depend on the identification rate through the differences in the estimates of PT and PI. Specifically, the point predictions of the trajectory of the infectious proportion have the timing of the peak on July 12, July 23 and July 30 with the intensities and 95% prediction intervals of 3.81% (1.30%–7.19%), 2.7% (0.48%–6.23%) and 2.09% (0.15%–6.04%) for , and , respectively. The sensitivity of prediction results with respect to was also found in Kuniya (2020), but that under our setting of is far less dramatic. Moreover, all the scenarios predict that the epidemic peak comes during the summer 2020. This result is also consistent with Kuniya (2020).

3.2 Effect of intervention

On April 7, 2020, the Japanese government declared a state of emergency aiming at reducing human contacts by , which is considered to be sufficient to terminate the epidemic. However, the government reports that the actual reduction is still limited to around or (, mainly because the state does not have a legal force to regulate individual activities. Also the Japanese government plans to lift the state on May 6, but the public concern lies on whether such a short period of the state of emergency is sufficient or not.

Through simulation, we here assess the efficacy of further intervention and public awareness on mitigating the infection risk under various scenarios. Specifically, we consider various settings for the degrees of reduction in human contacts that are achieved by the government during the intervention and by the public awareness after the intervention, and the period of intervention denoted by , under the state of emergency and predict the future epidemic transitions. Here, we focus on . The results under and are found in Supplementary Material. It is recognized that the realization of the effect of reducing human contacts takes about two weeks since the incubation period of COVID-19 is at most 2 weeks as reported by World Health Organization. Since April 22, the last date in the real-time data, is almost two weeks after the declaration of the state of emergency, we assume that the infection rate changes from to from April 23. For the degree of reduction in human contacts, the following six scenarios are considered: , and If reduction of human contacts was achieved, the reality would have corresponded to or . In view of the current situation, however, or would be closer to the reality. We also suppose that the intervention will continue for days from April 23 with the three scenarios, and . Note that corresponds to May 6 on which the government is planning to lift the state. The other two dates to respectively correspond to the two-week and one-month extension of the intervention that continue until May 20 and June 6, respectively. We further suppose that the infection rate becomes after the intervention period with the three scenarios: , and . The first scenario implies that the level of human mobility after the intervention returns to the original level before the intervention. The latter two scenario can reflect the remaining strain in the public awareness on mitigating the spread of infection through, for example, voluntary avoidance of outings and social distancing.

Figure 3 presents the nine panels on the future prediction under the combinations of the three scenarios of each and . Comparing the different scenarios of , the figure reveals that setting to smaller values is effective only when it is combined with larger . For example, the left upper panel of Figure 3 exhibits little differences among the six choices of when and . Contrary, the small values of such as with and can lead to a convergence of the epidemic. Under and , the epidemic can be terminated in terms of point prediction when , while the epidemic peak belatedly comes on September 8, 2020 with 2.4% of the peak intensity when . The result suggests that the termination of the intervention due to the currently planned lifting of the state of emergency on May 6 is too early and would only result in a slight delay in the epidemic peak and a slight reduction in the peak intensity.

The degree of reduction in after the intervention, , also has a dramatic effect on the consequence of the epidemic. The upper panels of Figure 3 show that the efficacy of the temporary reduction in under the intervention can be quite limited if returns to the original level after the intervention. In contrast, if at least reduction in can be achieved for a sufficiently long period of time after the intervention, the epidemic can be effectively suppressed. In the case of , for example, the peak intensity is more than halved to 1.21% with the peak on September 16 even under the mild degree of intervention for a short period of time ( and ). When a longer intervention is carried out, the peak is further delayed to November 14 with % of the peak intensity. Furthermore, in the case of , the figure shows the epidemic is almost completely suppressed in terms of point prediction regardless of the degree of intervention and length of intervention period. To summarize, our results show that not only the degree of reduction in during the intervention but also and more importantly the length of intervention and the long term level of after the intervention is critical to control the spread of the epidemic.

Estimate (95% interval)
Parameter Description
Infection rate 0.21 (0.13–0.34) 0.23 (0.13–0.43) 0.25 (0.13–0.55)
Removal rate 0.14 (0.08–0.25) 0.16 (0.09–0.34) 0.18 (0.09–0.45)
Basic reproduction number 1.48 (1.30–1.69) 1.43 (1.22–1.64) 1.41 (1.19–1.65)
PT Peak timing 145 (102–225) 161 (99–252) 157 (91–265)
PI() Peak intensity 4.48 (1.80–7.80) 3.67 (0.99–7.02) 3.34 (0.68–7.19)
Table 1: Estimates and credible intervals of parameters of the SS-SIR model under the three identification rates .
Figure 2: Results of the prediction of the proportion of the infectious population with (left), (center) and (right). The observed data points are shown by the black dots.
Figure 3: Future prediction under the nine combinations of (the period of the intervention) and (the multiplier for after the intervention) for . The red, black and grey curves respectively represent the future point prediction without intervention shown in Figure 2, point prediction under each scenario and one-sided upper prediction intervals.

4 Discussion

In this research, we have employed the probabilistic version of the famous SIR model, called SS-SIR model, to model the real-time data on the infectious population of COVID-19 in Japan. The advantage of the SS-SIR model is that we can obtain not only future point prediction but also uncertainty quantification through, for example, the future prediction intervals. The basic reproducing number is estimated to be approximately between 1.4 and 1.5 in this study. This is smaller than the estimate of in Kuniya (2020) obtained from the SEIR model applied to the early stage data in Japan. Note, however, that Kuniya (2020) did not estimate the removal rate and onset rate but fixed their values to those found in the existing studies. We also estimated using the subset of the data up to April 6 for and the estimate is with the credible interval (1.22–1.66). Therefore, our estimates for remain unchanged even when the observations after the state of emergency are excluded. Moreover, our estimate in the case of Japan is also smaller than those reported from the case studies in China (e.g. Imai, Cori, Dorigatti, Baguelin, Connelly, Riley, and Ferguson, Imai et al.; Liu et al., 2020; Liu et al., 2020; Tang et al., 2020; Wu et al., 2020). Our result may have reflected the fact that the number of cases in Japan does not increase as rapidly as other countries (,

Through the future prediction under various scenarios on the possible reduction in the infection rate and the length of the intervention period, we have obtained the following epidemiological insights:

  • Even if a large reduction in could be achieved during the intervention period (e.g. the state of emergency), the convergence of the epidemic can depends on the long term value of after the intervention.

  • As long as the value of can be maintained to be slightly smaller even after the intervention period than that before the intervention, there is a great possibility that the epidemic terminates with a significantly smaller epidemic size than the case without intervention.

  • The effort to reduce should be combined with a sufficiently long intervention period.

These findings suggest that the lifting of the state of emergency on May 6 planned by the Japanese government will unfortunately have a too limited effect for the economic and social disturbances caused by the epidemic and state. A long term effort to tackle this situation is indispensable.


This research was supported by Japan Society for the Promotion of Science (KAKENHI) Grant Numbers 18K12754 and 18K12757.


  • Bloomberg [Bloomberg] Bloomberg. Japan’s hokkaido may have 940 infected, researcher says available online: (3 march 2020).
  • Dukic et al. [2012] Dukic, V., H. F. Lopes, and N. G. Polson (2012). Tracking epidemics with google flu trends data and a state-space seir model. Journal of the American Statistical Association 107, 1410–1426.
  • Gamerman and Lopes [2006] Gamerman, D. and H. F. Lopes (2006).

    Matkov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition

    Chapman and Hall/CRC.
  • Imai, Cori, Dorigatti, Baguelin, Connelly, Riley, and Ferguson [Imai et al.] Imai, N., A. Cori, I. Dorigatti, M. Baguelin, C. Connelly, S. Riley, and N. Ferguson. Report 3: Transmissibility of 2019-ncov; imperial college london: London, uk, 2020.
  • Karako et al. [2020] Karako, K., P. Song, Y. Chen, and W. Tang (2020). Analysis of covid-19 infection spread in japan based on stochastic transition model. BioScience Trends, to appear.
  • Kermack and McKendrick [1927] Kermack, W. and A. McKendrick (1927). Contribution to the mathematical theory of epidemics. Proceedings of the Royal Society of London, Series A 115, 700–721.
  • Kuniya [2020] Kuniya, T. (2020). Prediction of the epidemic peak of coronavirus disease in japan, 2020. Journal of Clinical Medicine 9, 789.
  • Liu et al. [2020] Liu, T., J. Hu, M. Hang, L. Lin, H. Zhong, J. Xiao, G. He, T. Song, Q. Huang, Z. Rong, and et al (2020). Transmission dynamics of 2019 novel coronavirus (2019-ncov). bioRxive.
  • Liu et al. [2020] Liu, Y., A. A. Gayle, A. Wilder-Smith, and J. Rocklov (2020). The reproductive number of covid-19 is higher compared to sars coronavirus. Journal of Travel Medicine 27, taaa021.
  • Mizumoto et al. [2020] Mizumoto, K., K. Kagaya, A. Zarebski, and G. Chowell (2020). Estimating the asymptomatic proportion of coronavirus disease 2019 (covid-19) cases on board the diamond princess cruise ship, yokohama, japan, 2020. Eurosurveillance 25.
  • [] Coronavirus cases by country (24 april 2020).
  • Osthus et al. [2017] Osthus, D., K. Hickmann, P. C. Caragea, D. Higdon, and S. Y. Del Valle (2017). Forecasting seasonal influenza with a state-space sir model. The Annals of Applied Statistics 11, 202–204.
  • Tang et al. [2020] Tang, B., X. Wang, Q. Li, N. Bragazzi, S. Tang, Y. Xiao, and J. Wu (2020). Estimation of the transmission risk of the 2019-ncov and its implication for public health interventions. Journal of Clinical Medicine 9, 462.
  • Wu et al. [2020] Wu, J., K. Leung, and G. Leung (2020). Nowcasting and forecasting the potential domestic and international spread of the 2019-ncov outbreak originating in wuhan, china: A modelling study. Lancet 395, 689–697.

S1 Model

The state space SIR (SS-SIR) model consists of the following two equations:


for with in our analysis, where , and are the unknown infection rate and removal rate, respectively, and is the solution of the deterministic SIR model starting at :

S2 Prior distributions

s2.1 Precision parameters and

We set and , independently.

s2.2 Initial state

The joint prior is constructed via the following decomposition:

The prior distributions of and are first determined and then the prior distribution of is determined accordingly. We assume that of population is initially susceptible, , and follows a beta distribution. The parameters of the beta distribution is set such that for and for and , and for all the cases.

s2.3 Key parameters and in the SIR model

The prior distributions of the two important parameters and through the prior distributions of , peak intensity (PI) and timing of the peak intensity (PT). The prior distribution of , reciprocal of the basic reproduction number, is firstly derived. If , starts increasing, reaches its maximum and decreases to zero as so the model is designed as epidemic. Under the SIR model, PI can be expressed as

and it is known that the unique solution exists. Then, the prior distribution of is determined by specifying the prior distribution of PI. We assume , that is, the prior peak intensity is centered around 3% of the population. Similarly, the prior distribution of PT is given by , implying the prior timing of the epidemic peak is between April 22, 2020 and April 17, 2021, and is centered around August 27, 2020.

Next, the prior distribution of is specified following the regression approach of [12]. We first prepare the grids on the space of , PI and as follow:

  • : equally spaced 40 points between and ,

  • PI: points equally spaced between and by ,

  • I(0): equally spaced 20 points between and .

The SIR curves are simulated for all the combinations of and PT is identified. Then is regressed on a subset of a fourth degree polynomial interaction model using , , and as covariates. Since our grid for covers much smaller values than [12], our regression model contains 28 covariates in total including the constant and they are collectively denoted by . The regression coefficients and error variance are denoted by , respectively. Based on these estimates, we set . The list of the used covariates and estimate of the polynomial regression are presented in Table 2. Also, and is almost equal to one.

Covariate Estimate Std. Error Covariate Estimate Std. Error
Intercept -2.45 2.28 -7.22 1.16
-8.55 4.02 -1.16 1.97
-6.20 1.37 -3.77 7.23
-1.20 4.10 -1.78 4.42
-5.14 1.57 2.98 1.07
-2.58 2.52 2.51 8.68
-8.23 1.02 -4.95 1.93
-1.16 1.76 -2.16 1.19
-6.19 1.09 1.13 1.03
-5.52 4.55 1.01 2.33
-1.76 1.67 -5.61 1.51
-1.62 1.84 -9.43 7.19
-5.23 6.76 4.93 2.04
-2.23 3.17 -6.46 1.21
Table 2: Regression estimates for the prior distribution of

s2.4 Prior predictive distribution

Under the prior distributions specified as above, we generated random samples from the prior predictive distribution given by

where and . The prior predictive distribution presented in Figure S1 shows that the observed data are included in the prior prediction intervals indicating that our settings of the prior distributions are reasonable.

Figure S1: Prior predictive distributions for (left), and (right). The observed data points are plotted by the black dots.

S3 Posterior distribution and Markov Chain Monte Carlo sampling algorithm

Let denote the collection of the unknown parameters, , and denote the joint prior distribution specified in Section S2. The posterior distribution of the latent variables and is given by


where and are the conditional distribution of the first and second equations in the model (S1), respectively. Since the posterior distribution (S2) has a complicated form, the posterior inference is based on the Markov Chain Monte Carlo (MCMC) sampling method. Specifically, we adopt the Metropolis-Hastings (MH) within Gibbs sampling, in which the random numbers are alternately sampled from the full conditional distributions and . Our MCMC algorithm repeats the following steps:

  • Sample from for .

  • Sample from .

  • Sample from .

In each step, the sampling is carried out by using the Gaussian random walk MH algorithm where the step sizes are adjusted such that the acceptance rates are between 0.2 and 0.4. Note that given the sampled values of , PI and PT, the values of , and hence and are immediately determined through their prior distributions. We run the algorithm for 50000 iterations after 10000 iterations of initial burn-in period. Then every 10th draws (5000 draws) is retained to be used in our analysis and the point estimates and 95% credible intervals are the sample medians and th and

th sample quantiles of the MCMC output.

To carry out future predictions for the periods , we generate random numbers from the posterior predictive distribution given by

Given the MCMC outputs and values of , and as specified in our prediction scenarios, the future prediction of the epidemic is carried out by repeating the following steps for :

  • If , sample from , otherwise sample from .

  • sample from

Our point prediction of the trajectory of the infectious proportion is obtained from the sample medians of the simulated computed at each . Similarly, the prediction interval is obtained from the th and th sample quantiles.

S4 Additional results

s4.1 MCMC and posterior distribution of

Figure S2 presents the trace plots of the MCMC output for . The figure shows that our MCMC algorithm converges to the target distributions and mixes reasonably well. Figure S3 presents the posterior distribution of for . It is shown that the observed data points are well within the 95% credible interval.

s4.2 Additional prediction results for

In Section 3 of the main text, we assessed the effect of an intervention based on the future prediction under several scenarios. Here the additional results under a longer intervention period with are provided in Figure S4. The figure shows that even if returns to the original level the intervention (), as long as the intervention period is sufficiently long, even a somewhat mild but realistic degree of intervention, such as , can lead to the termination of the epidemic.

s4.3 Prediction results for and

Figure S5 and S6 present the prediction results for and under the same scenarios in the main text for . While the predicted timings of the peak and peak intensities vary depending on the choice of , the figures provide the same epidemiological insights as discussed in Section 4.

Figure S2: Trace plots of the MCMC output ().
Figure S3: Posterior distribution and 95% credible interval of (). The observed data points are shown by the black dots.
Figure S4: Future prediction with (75 days of intervention after April 22) for (left), (center) and (right). The red, black and grey curves represent the future point prediction without intervention shown in Figure 2 in the main text, point prediction under each scenario and one-sided upper 95% prediction intervals, respectively.
Figure S5: Future prediction under the nine combinations (the period of the intervention) and (the multiplier for after the intervention) for . The upper, middle and lower panels correspond to and , respectively. The red, black and grey curves respectively represent the future point prediction without intervention shown in Figure 2 of the main text, point prediction under each scenario and one-sided upper prediction intervals.
Figure S6: Future prediction under the nine combinations (the period of the intervention) and (the multiplier for after the intervention) for . The upper, middle and lower panels correspond to and , respectively. The red, black and grey curves respectively represent the future point prediction without intervention shown in Figure 2 of the main text, point prediction under each scenario and one-sided upper prediction intervals.