1 Introduction
The Ensemble Kalman Filter (EnKF) is a Bayesian filtering algorithm used to estimate unknown states and parameters of nonlinear systems by combining model predictions with available system observations [1, 2, 3]. While these algorithms are commonly used for data assimilation in applications to weather prediction [4, 5, 6] and guidance, navigation, and control [7, 8, 9], ensemble Kalman-type filters have recently been utilized for parameter estimation and forecast prediction in a variety of epidemiological studies [10, 11, 12, 13, 14, 15, 16].
The filtering process comprises a two-step sequential updating scheme of forward prediction via a prescribed model and correction with the available data. The inverse problem of estimating the system unknowns therefore involves defining two key components: the forward model and the observation model. The forward model describes the dynamics of the system, predicting how the model states are propagated forward in time. In particular, we consider ordinary differential equation (ODE) models of the form
(1.1) $\frac{dx}{dt} = f(t, x; \theta), \qquad x(t_0) = x_0,$

where $f$ is the right-hand-side function defining the dynamics, $x(t) \in \mathbb{R}^n$ is a vector representing the states of the system, $\theta \in \mathbb{R}^p$ is the parameter vector, and $x_0$ is the initial condition. The observation model relates the data set back to the system's inputs, assuming discrete observations of the form

(1.2) $y_k = h(x(t_k); \theta) + v_k, \qquad k = 1, \dots, T,$

where $y_k$ is an observation at time $t_k$, $h$ is the observation function relating the model states $x$ and parameters $\theta$, and $v_k$ is the observation error, assumed here to be Gaussian noise.
While the forward model (1.1) has an immediate effect on the filter output in driving the system dynamics, the choice of an appropriate observation function in (1.2) is vital in properly connecting the available data with the system variables. This plays a particular role in the analysis step of the EnKF, in which the observed data are compared to model observation predictions computed using $h$. Although it seems straightforward that using an observation function that accurately represents the data should result in more accurate estimates of model inputs, it is not immediately clear how significantly using a less accurate observation function affects results. Knowledge of this becomes especially important in cases where the exact observation function is not immediately known, or when modeling assumptions are necessary to simplify the situation compared to the actuality of data collection (i.e., when the bridge between the real data and the model input is more complex than that of the simplifying assumptions).
In the setting of epidemiology, forward models vary from general compartment models for the spread of infectious diseases [17, 18, 19, 20] to more complex, disease-specific models describing, e.g., the spread of influenza [13, 21, 22], the AIDS epidemic [17, 23], and the novel coronavirus (COVID-19) [10, 24, 25, 26, 27]. These models are often variations of the well-known Kermack-McKendrick model [28] and are otherwise referred to as Susceptible-Infectious-Recovered (SIR) models, with compartments representing susceptible, infectious, and recovered portions of a population. While SIR-type models are well-established in governing the forward dynamics of epidemiological systems, a variety of different observation functions are used in the literature to model the available data [29, 16, 30, 26, 31, 32, 33, 34]. The goal of this research is therefore to study the effects of observation function selection on ensemble Kalman filtering estimates for epidemic models. In particular, we aim to analyze the impact of using different observation functions with various degrees of complexity when employing the EnKF for both state and parameter estimation in this setting.
The types of data available in epidemiological applications vary due to key differences in data collection, which may differ by disease type, geographical spread, and frequency of reported cases (daily, weekly, monthly, etc.) [18, 17, 19]. For example, Figure 1 shows reported measles incidence data from two cities in the United States with different population sizes, New York City (NYC) and Baltimore, MD, during the pre-vaccine era [35]. While each data set was collected on a monthly basis, there are clear differences in the magnitude and frequency of the outbreaks in each city: Baltimore starts with infrequent outbreaks of larger magnitude in the early years but tends towards smaller, more frequent outbreaks in the later years; on the other hand, NYC has a more regular biannual peaking pattern, with the larger outbreaks typically much larger in magnitude than those of Baltimore. The magnitudes of the outbreaks are further tied to the reporting probability of cases for each location, estimated at around 1 in every 8 cases reported in NYC and 1 in every 3 or 4 cases reported in Baltimore during this time [35].

In the epidemic modeling literature, the choice of observation function varies from assuming direct measurements of the infectious population [29, 33, 30] to more elaborate functions accounting for the cumulative number of cases over a given collection period [16, 32, 34]. These functions may also include a reporting probability to account for the underreporting of cases [16, 26]. Considering the different types of data available, it may not be immediately clear which observation function is best suited for a given problem when using the EnKF in this setting. Further, if a suboptimal observation function is selected (in simplifying the model or due to lack of supporting information), questions arise regarding how this choice affects the resulting EnKF estimates. Therefore, in this research we aim to analyze the impact of observation function selection on state and parameter estimation using the EnKF, specifically when utilizing simple observation functions to model more complex data.
To this end, we demonstrate the effects of using four different observation functions of various complexity levels inspired by epidemic modeling. In particular, using synthetic data generated with the most elaborate of the four observation functions considered, we show how using a suboptimal observation function affects the resulting filter estimates of both the model states and parameters in different implementations of the EnKF. We consider three filtering scenarios with increasing complexity: state estimation with known parameters; combined state and constant parameter estimation; and combined state and time-varying parameter estimation. We further demonstrate how the observation noise covariance matrix can be utilized in certain cases to help offset the effects of using a suboptimal observation function.
The paper is organized as follows: Section 2 provides a review of ensemble Kalman-type filtering algorithms, first outlining the steps of the classic Kalman Filter and then highlighting the differences in the EnKF for state estimation and the augmented EnKF for combined state and parameter estimation in nonlinear systems. Section 3 describes both the SIR model and the four different observation functions used for the epidemic application considered in this work. Section 4 contains the numerical results, which analyze the use of each observation function in the different filtering scenarios described, as well as the effects of modifying the observation noise covariance. Section 5 presents a discussion of the results, and Section 6 gives a brief summary and conclusions.
2 Review of Kalman Filtering Methods
Named after Rudolf E. Kalman, an electrical engineer and mathematician who received the National Medal of Science for his work on the algorithm, the Kalman filter was famously used by NASA during the Apollo missions [36, 37]
and in a variety of application areas since. The algorithm was derived specifically for linear and Gaussian systems, utilizing the fact that Gaussian distributions remain Gaussian under linear transformations. Various extensions of the Kalman filter have been derived to accommodate systems in which these assumptions cannot be met. Here we review the classic Kalman Filter (KF) and two of its nonlinear extensions, the Ensemble Kalman Filter (EnKF) for state estimation and the augmented EnKF for combined state and parameter estimation.
Kalman filtering comprises a two-step process: predicting through the use of a forward model and correcting through the use of available data. Under the Bayesian framework [38, 39, 40], where unknowns are treated as random variables, the goal of Kalman filtering is to sequentially update the posterior distribution of the unknowns conditioned on the observed data. The classic KF estimates the model states of linear systems by updating the mean and covariance of an underlying Gaussian probability distribution through analytically derived formulas [41]. The EnKF accommodates nonlinear models by incorporating ensemble statistics into the classic KF, using an ensemble of discrete realizations from the underlying probability distribution to calculate the ensemble mean and covariance [1, 2]. The augmented EnKF incorporates the simultaneous estimation of constant parameters via cross-correlation information with the model states [42, 43]. The augmented EnKF can be further modified to accommodate time-varying parameter estimation through use of parameter tracking [44, 45, 46].

Each algorithm begins with a prior distribution on the system unknowns (which may include unknown model states and/or parameters) and proceeds in a predictor-corrector-type process to sequentially update the joint probability distribution of the unknowns conditioned on the available data. Letting $\pi(x_k, \theta_k \mid y_{1:k})$ denote the joint distribution of states $x_k$ and parameters $\theta_k$ conditioned on the data $y_{1:k}$ up to time $t_k$, the forward model describing the state evolution, as in (1.1), first propagates the state prediction forward to time $t_{k+1}$, forming a prediction density $\pi(x_{k+1}, \theta_{k+1} \mid y_{1:k})$. The algorithm then corrects the prediction by using the observation model (1.2) to compare the model predictions with the available data, updating the model states and parameters to form the joint posterior distribution $\pi(x_{k+1}, \theta_{k+1} \mid y_{1:k+1})$. This process is repeated for each $k$ over the time span of available data. Note that if the model parameters are known, the posterior distribution of interest simply reduces to $\pi(x_k \mid y_{1:k})$ for state estimation.

2.1 Classic Kalman Filter for State Estimation
The classic KF for state estimation begins with a Gaussian prior distribution

(2.1) $x_0 \sim \mathcal{N}(\bar{x}_0, C_0)$

with mean $\bar{x}_0$ and covariance matrix $C_0$. The state evolution and observation equations forming the state-space model are both assumed to be linear discrete-time Markov models:
(2.2) $x_k = M x_{k-1} + w_k$

(2.3) $y_k = H x_k + v_k$

for $k = 1, \dots, T$, where the random variables $x_k$ and $y_k$ denote the model states and observations, respectively. The operators $M$ and $H$ are matrices (assumed here to be constant over time) and the noise processes $w_k$ and $v_k$ are Gaussian random variables with respective covariance matrices $Q$ and $R$ (also assumed here to be time-invariant).
During the prediction step at time $t_k$, the state prediction mean $\bar{x}_{k-1}$ and covariance matrix $C_{k-1}$ of the prior distribution are propagated forward in time through the use of the state evolution equation (2.2) via the analytically derived formulas

(2.4) $\bar{x}_k^- = M \bar{x}_{k-1}$

(2.5) $C_k^- = M C_{k-1} M^{\mathsf{T}} + Q,$

where $\bar{x}_k^-$ and $C_k^-$ are the predicted mean and covariance of the underlying Gaussian distribution at time $t_k$ without yet taking into account the data at this point.
In the analysis step, the predicted mean and covariance estimates are corrected using the data $y_k$ and the observation model (2.3) via the updating formulas

(2.6) $\bar{x}_k = \bar{x}_k^- + K_k (y_k - H \bar{x}_k^-)$

(2.7) $C_k = (\mathbb{I} - K_k H) C_k^-,$

where $\mathbb{I}$ is the identity matrix and $K_k$ is the Kalman gain matrix, defined as

(2.8) $K_k = C_k^- H^{\mathsf{T}} (H C_k^- H^{\mathsf{T}} + R)^{-1}.$

After completion of the analysis step, we are left with the Gaussian posterior distribution $\mathcal{N}(\bar{x}_k, C_k)$ at time $t_k$. The filter continues through this process, letting $k \to k+1$ for $k = 1, \dots, T$, where $t_T$ is the time of the last observation.
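To make the formulas above concrete, one predict-correct cycle of the classic KF can be sketched in a few lines of NumPy (an illustrative sketch; the variable names are ours, and the paper's experiments were run in MATLAB):

```python
import numpy as np

def kf_step(x_bar, C, y, M, H, Q, R):
    """One predict-correct cycle of the classic Kalman filter.

    x_bar, C : posterior mean and covariance at time k-1
    y        : observation at time k
    M, H     : state evolution and observation matrices
    Q, R     : model and observation noise covariances
    """
    # Prediction step, Eqs. (2.4)-(2.5)
    x_pred = M @ x_bar
    C_pred = M @ C @ M.T + Q

    # Kalman gain, Eq. (2.8)
    K = C_pred @ H.T @ np.linalg.inv(H @ C_pred @ H.T + R)

    # Analysis step, Eqs. (2.6)-(2.7)
    x_post = x_pred + K @ (y - H @ x_pred)
    C_post = (np.eye(len(x_bar)) - K @ H) @ C_pred
    return x_post, C_post
```

With a small observation noise covariance the posterior mean is pulled most of the way toward the data; with a large one it stays near the model prediction.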
2.2 Ensemble Kalman Filter for State Estimation
While the classic KF provides the optimal solution under assumptions of linearity and Gaussian distributions, these assumptions limit the algorithm’s applicability to nonlinear models. The EnKF allows for nonlinear models by incorporating ensemble statistics into the updating equations. The state-space model in this case is given by

(2.9) $x_k = M(x_{k-1}) + w_k$

(2.10) $y_k = h(x_k) + v_k,$

where $M$ and $h$ are nonlinear operators. In this work, $M$ is the solution to the ODE system (1.1) and $h$ is the same observation function defined in (1.2), here with the dependence on parameters suppressed. The EnKF maintains a similar two-step procedure of predicting and correcting, but the probability distributions are represented in terms of discrete samples, and each sample point (or ensemble member) is propagated independently. Ensemble statistics are employed to calculate the mean and covariance of the sample at each step.
Let $\{x_{k-1}^{(j)}\}_{j=1}^{J}$ represent the discrete sample from the underlying probability distribution at time $t_{k-1}$, such that

(2.11) $x_{k-1}^{(j)} \sim \pi(x_{k-1} \mid y_{1:k-1}),$

where $J$ is the ensemble size and $j = 1, \dots, J$. In the prediction step, each individual ensemble member is updated via the state evolution equation (2.9) by

(2.12) $x_k^{-,(j)} = M(x_{k-1}^{(j)}) + w_k^{(j)},$

where each $w_k^{(j)}$ is a realization of the random variable $w_k$. The prediction mean and covariance are then computed using the following ensemble statistics formulas:

(2.13) $\bar{x}_k^- = \frac{1}{J} \sum_{j=1}^{J} x_k^{-,(j)}$

(2.14) $C_k^- = \frac{1}{J-1} \sum_{j=1}^{J} \left( x_k^{-,(j)} - \bar{x}_k^- \right) \left( x_k^{-,(j)} - \bar{x}_k^- \right)^{\mathsf{T}}.$
The analysis step is performed by correcting each individual ensemble member via the equation

(2.15) $x_k^{(j)} = x_k^{-,(j)} + K_k \left( y_k^{(j)} - \hat{y}_k^{(j)} \right),$

where

(2.16) $y_k^{(j)} = y_k + v_k^{(j)}, \qquad v_k^{(j)} \sim \mathcal{N}(0, R),$

is an artificial observation ensemble of size $J$ generated around the data point $y_k$,

(2.17) $\hat{y}_k^{(j)} = h(x_k^{-,(j)})$

is the model prediction of the observation, and $K_k$ is the Kalman gain. To accommodate nonlinear observation models, the Kalman gain can be defined using cross-correlation information via the formula

(2.18) $K_k = C_k^{xy} \left( C_k^{yy} + R \right)^{-1},$

where $C_k^{xy}$ is the cross covariance of the state and observation predictions, $C_k^{yy}$ is the forecast error covariance of the observation prediction ensemble, and $R$ is the observation noise covariance [47]. This results in the posterior ensemble

(2.19) $\{x_k^{(j)}\}_{j=1}^{J} \sim \pi(x_k \mid y_{1:k}),$

and the corresponding posterior ensemble mean and covariance computed via ensemble statistics. As in the classic KF, this process continues while $k \le T$.
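The ensemble analysis step above can be sketched as follows, with the cross-covariance Kalman gain of (2.18) computed from ensemble anomalies (an illustrative sketch; the function and variable names are ours):

```python
import numpy as np

def enkf_analysis(X_pred, y, h, R, rng):
    """EnKF analysis step with a (possibly nonlinear) observation function h.

    X_pred : (n, J) array of prediction ensemble members
    y      : observed data vector of length m
    h      : observation function mapping a state to an m-vector
    R      : (m, m) observation noise covariance
    """
    n, J = X_pred.shape
    # Model prediction of the observation for each member, Eq. (2.17)
    Y_pred = np.column_stack([h(X_pred[:, j]) for j in range(J)])
    # Artificial observation ensemble around the data point, Eq. (2.16)
    Y_obs = y[:, None] + rng.multivariate_normal(np.zeros(len(y)), R, size=J).T
    # Ensemble anomalies about the prediction means
    A_x = X_pred - X_pred.mean(axis=1, keepdims=True)
    A_y = Y_pred - Y_pred.mean(axis=1, keepdims=True)
    # Cross covariance and observation forecast covariance, Eq. (2.18)
    C_xy = A_x @ A_y.T / (J - 1)
    C_yy = A_y @ A_y.T / (J - 1)
    K = C_xy @ np.linalg.inv(C_yy + R)
    # Correct each ensemble member, Eq. (2.15)
    return X_pred + K @ (Y_obs - Y_pred)
```

Because the gain uses only ensemble covariances and evaluations of h, no linearization of the observation function is required.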
2.3 Augmented EnKF for State and Constant Parameter Estimation
While the EnKF allows for more flexibility in accommodating nonlinear models for both the states and observations, it can be further modified to simultaneously estimate both model states and constant parameters through use of augmented vectors. The state-space model becomes

(2.20) $x_k = M(x_{k-1}, \theta) + w_k$

(2.21) $y_k = h(x_k, \theta) + v_k,$
where $\theta$ denotes the unknown system parameters, and the goal is to estimate a joint probability distribution representing the model states and parameters conditioned on the available data. To this end, the discrete sample from the probability distribution at time $t_{k-1}$ becomes

(2.22) $\left\{ \left( x_{k-1}^{(j)}, \theta_{k-1}^{(j)} \right) \right\}_{j=1}^{J},$
where each ensemble member is now a joint state-parameter sample. The prediction step of the filter remains largely the same, propagating the model states forward via the state evolution equation (2.20), such that

(2.23) $x_k^{-,(j)} = M(x_{k-1}^{(j)}, \theta_{k-1}^{(j)}) + w_k^{(j)},$

while the parameter estimates remain unaltered, with

(2.24) $\theta_k^{-,(j)} = \theta_{k-1}^{(j)}$

for each $j = 1, \dots, J$.
After completion of the prediction step, the state predictions and parameters are augmented to form the joint sample vectors

(2.25) $z_k^{-,(j)} = \begin{bmatrix} x_k^{-,(j)} \\ \theta_k^{-,(j)} \end{bmatrix}.$

The prediction ensemble mean and covariance are then computed using ensemble statistics via the joint sample vectors, thereby obtaining correlation information between the model states and parameters that is used in the analysis step

(2.26) $z_k^{(j)} = z_k^{-,(j)} + K_k \left( y_k^{(j)} - \hat{y}_k^{(j)} \right)$

to simultaneously update the state and parameter estimates, where the cross-correlation information is encoded in the Kalman gain $K_k$.
2.4 Augmented EnKF for Time-Varying Parameter Tracking
While the augmented EnKF is commonly employed for estimating constant parameters, it can be modified in the prediction step to track time-varying parameters if the parameters change more slowly than the system states. More specifically, during the prediction step of the filter, time-varying parameters are updated using a random walk of the form

(2.27) $\theta_k^{-,(j)} = \theta_{k-1}^{(j)} + \xi_k^{(j)}, \qquad \xi_k^{(j)} \sim \mathcal{N}(0, E),$

where $E$ defines the parameter drift. Parameter tracking allows the filter to estimate time-varying functions without a priori assumptions on functional form. The drift covariance matrix $E$ plays a vital role in the algorithm’s ability to successfully track the time-varying parameter of interest without diverging [46].
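The random-walk prediction of (2.27) simply perturbs each parameter ensemble member with drift noise; a minimal sketch (the function name and drift covariance are illustrative, not from the paper):

```python
import numpy as np

def parameter_tracking_prediction(Theta, drift_cov, rng):
    """Random-walk prediction for time-varying parameters, Eq. (2.27).

    Theta     : (p, J) array of parameter ensemble members at time k-1
    drift_cov : (p, p) parameter drift covariance matrix E
    """
    p, J = Theta.shape
    # Draw one drift realization per ensemble member
    xi = rng.multivariate_normal(np.zeros(p), drift_cov, size=J).T
    return Theta + xi
```

The analysis step is unchanged: the perturbed parameters are augmented with the state predictions as in (2.25) and corrected jointly. Too small a drift covariance leaves the filter unable to follow the parameter; too large a value inflates the posterior spread.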
3 Observation Functions for Epidemic Modeling
In this section we describe both the forward model and the observation functions used in the EnKF framework for our epidemic application. As seen in Section 2, the observation function plays a significant role in connecting the filter estimates in the prediction step back to the data in the analysis step. For our analysis in this work, we model the forward dynamics of the system using the Susceptible-Infectious-Recovered (SIR) model detailed below, along with four different observation functions of increasing complexity used in epidemic modeling.
As noted in the introduction, the SIR model is a classic model in epidemiology [28, 19, 18, 17]. The standard form of the model comprises three compartments describing three different states of a population at a given time $t$: the susceptible population, $S(t)$, are healthy with the chance of contracting the disease; the infectious population, $I(t)$, are able to transmit the disease to others; and the recovered population, $R(t)$, are those who have recovered from and have become immune to the disease [28]. Figure 2 gives a schematic representation of the SIR model used in this work. Assuming a constant population size $N$, the recovered population can be written as a function of the susceptible and infectious populations, so that $R(t) = N - S(t) - I(t)$. The corresponding system of differential equations describing the rate of change of the susceptible and infectious populations is given by
(3.1) $\frac{dS}{dt} = \mu N - \beta(t) \frac{S I}{N} - \mu S$

(3.2) $\frac{dI}{dt} = \beta(t) \frac{S I}{N} - (\gamma + \mu) I,$

where $\mu$ is a constant birth and death rate, $\beta(t)$ is the transmission parameter at time $t$, and $\gamma$ is the constant recovery rate. The solution to the system (3.1)–(3.2) defines the forward propagation in the prediction step of the filter; i.e., the function $M$ in (2.12) for state estimation and (2.23) for joint state and parameter estimation, where $x = (S, I)$ and $\theta$ denotes the unknown model parameters.
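The SIR dynamics (3.1)–(3.2) translate directly into a right-hand-side function for a standard ODE solver; a sketch using SciPy (the signature and example values are ours, standing in for the paper's MATLAB implementation):

```python
import numpy as np
from scipy.integrate import solve_ivp

def sir_rhs(t, x, beta, gamma, mu, N):
    """Right-hand side of the SIR system (3.1)-(3.2).

    x    : state vector (S, I); R = N - S - I is recovered implicitly
    beta : transmission parameter, a constant or a function of time
    """
    S, I = x
    b = beta(t) if callable(beta) else beta
    dS = mu * N - b * S * I / N - mu * S
    dI = b * S * I / N - (gamma + mu) * I
    return [dS, dI]

# Example: solve over one year with constant transmission,
# starting from 95% susceptible and 2% infectious
N = 90_000
sol = solve_ivp(sir_rhs, (0.0, 1.0), [0.95 * N, 0.02 * N],
                args=(1800.0, 100.0, 0.02, N), rtol=1e-8)
```

Since d(S + I)/dt is negative whenever S + I reaches N, the trajectories remain within the population bound throughout the solve.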
The focus of this work is the observation function $h$ in (1.2), which is responsible for relating the observed data back to the model predictions in the analysis step of the Kalman filtering algorithms. More specifically, this function is used to compute the observation predictions in (2.17), which are then compared with the available data in either (2.15) for the EnKF or (2.26) for the augmented EnKF. In a given application, the observation function aims to model how the set of observed data corresponds to the forward model variables. For the epidemiological applications considered in this work, the range of observation functions used in the literature varies from linear models assuming direct measurements of $I(t)$ to nonlinear models accounting for the cumulative number of cases over a specified time interval.
To illustrate the effects of observation model selection on the results of the filtering algorithms described in Section 2, we employ four observation functions of varying complexity, representing four different data collection assumptions:

Case 1. The observed data at time $t_k$ is assumed to be a direct measurement of the infectious population at time $t_k$:

(3.3) $h(x(t_k)) = I(t_k).$

Case 2. The observed data at time $t_k$ is assumed to be an underreported measure of the infectious population at time $t_k$:

(3.4) $h(x(t_k); \eta) = \eta \, I(t_k).$

Here the constant parameter $\eta \in (0, 1]$ is the reporting probability, which denotes the percentage of the population actively reporting cases. This function accounts for the fact that not every infectious person will report their illness, and thus data sets may not be complete.

Case 3. The observed data at time $t_k$ is assumed to be a measurement of the total number of cases accumulated from time $t_{k-1}$ to time $t_k$:

(3.5) $h(x(t)) = \int_{t_{k-1}}^{t_k} \beta(t) \frac{S(t) I(t)}{N} \, dt.$

Case 4. The observed data at time $t_k$ is assumed to be an underreported measure of the total number of cases accumulated from time $t_{k-1}$ to time $t_k$:

(3.6) $h(x(t); \eta) = \eta \int_{t_{k-1}}^{t_k} \beta(t) \frac{S(t) I(t)}{N} \, dt,$

where, as in (3.4), $\eta$ is the reporting probability.
In the following sections, we refer to the observation functions in (3.3)–(3.6) as Cases 1–4, respectively. Using these four observation functions together with the SIR model (3.1)–(3.2), we perform numerical tests to demonstrate the effects that the observation function selection has on the resulting filter estimates in three different Kalman filtering scenarios: the EnKF for state estimation with known parameters; the augmented EnKF for joint state and constant parameter estimation; and the augmented EnKF with parameter tracking for joint state and time-varying parameter estimation. More specifically, we show in the numerical results how employing the observation functions in Cases 1–3 with synthetic data generated using the most elaborate observation function in Case 4 results in less accurate filter estimates.
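The four data collection assumptions can be written compactly in code. In the sketch below, the cumulative-case integrals of Cases 3 and 4 are approximated by trapezoidal accumulation of the incidence term over the collection interval (an illustrative sketch; the function and variable names are ours):

```python
import numpy as np

def observe(case, t_grid, S, I, beta, N, eta):
    """Evaluate the Case 1-4 observation functions over one collection interval.

    t_grid, S, I : solver output over [t_{k-1}, t_k] (arrays of equal length)
    beta         : transmission function of time
    eta          : reporting probability
    """
    if case == 1:            # direct measurement of infectious, Eq. (3.3)
        return I[-1]
    if case == 2:            # underreported measurement, Eq. (3.4)
        return eta * I[-1]
    # Cumulative incidence over the interval, Eqs. (3.5)-(3.6),
    # via the trapezoidal rule on the incidence term beta*S*I/N
    incidence = beta(t_grid) * S * I / N
    dt = np.diff(t_grid)
    total = float(np.sum(0.5 * (incidence[1:] + incidence[:-1]) * dt))
    return total if case == 3 else eta * total
```

Cases 1 and 2 depend only on the state at the observation time, while Cases 3 and 4 require the trajectory over the whole collection interval, which is why they are the more expensive, and more faithful, observation models.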
4 Numerical Results
In this section we demonstrate the effects of observation function selection on the resulting EnKF estimates by incorporating the four observation functions from Section 3 into the EnKF framework with synthetic data generated using the observation function in Case 4. We analyze the results in three different filtering scenarios, including state estimation with known parameters and combined state and parameter estimation with unknown constant and time-varying parameters. We further illustrate how the observation noise covariance matrix in the filter affects the results.
All numerical experiments were performed using the MATLAB® programming language. The filtering algorithms were hand-coded in MATLAB, using ode15s to numerically solve the ODE system (3.1)–(3.2) in the prediction step. For each numerical experiment, we used $J = 100$ ensemble members to represent the underlying probability distributions. While not shown, we performed additional numerical tests to ensure that the results obtained using 100 ensemble members were consistent with those using larger ensemble sizes. Initial ensembles for the state and parameter values were drawn from uniform prior distributions containing but not centered at the true initial values and parameter values used in generating the synthetic data. For the numerical experiments in Sections 4.1 and 4.2, we set the model noise covariance matrix to be $Q = \sigma^2 \mathbb{I}$, where $\mathbb{I}$ is the identity matrix, and the observation noise covariance matrix to be $R = \rho^2 \mathbb{I}$. In Section 4.3 we increase the value of $\rho^2$ to include additional error accounting for potential use of suboptimal observation functions.
Data was generated using the observation function in (3.6) (i.e., Case 4) and the parameter values in Table 1, comparable to those used in modeling the spread of measles; a similar example was considered in [16]. The time-varying transmission parameter was modeled as

(4.1) $\beta(t) = \beta_0 \left( 1 + \alpha \cos(2\pi t) \right),$

with constant average transmission $\beta_0$ and amplitude $\alpha$. Assuming that 95% of the population was initially susceptible and 2% initially infected, we solved the SIR model (3.1)–(3.2) numerically using ode45 for 120 years to first reach a steady state, then restarted the model simulation for data collection over a time span of 10 years. Observations of the accumulated number of cases were collected monthly with 70% reporting and were corrupted with a small amount of Gaussian noise, for a total of 120 noisy data points. The added observation error was normally distributed with mean 0 and standard deviation 0.1. Any negative data points were set to zero. Figure 3 plots the data, along with the true solution curves for $S(t)$ and $I(t)$.

Table 1: Parameters Used in Synthetic Data Generation

Notation  Meaning  Value
$N$  Population Size  90,000
$\beta_0$  Average Transmission  1800
$\alpha$  Amplitude  0.08
$\gamma$  Recovery Rate  100
$\mu$  Birth/Death Rate  0.02
$\eta$  Reporting Probability  0.70
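The data-generation procedure above can be sketched as follows, using the Case 4 observation function with the Table 1 values. This is a simplified reconstruction: the seasonal form of β(t) and the incidence-accumulator formulation of the cumulative cases are our assumptions, and the 120-year spin-up described in the text is omitted for brevity:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Table 1 values; beta(t) reconstructed as a seasonally forced transmission
N, beta0, alpha = 90_000, 1800.0, 0.08
gamma, mu, eta = 100.0, 0.02, 0.70
beta = lambda t: beta0 * (1.0 + alpha * np.cos(2.0 * np.pi * t))

def rhs(t, x):
    # Augmented SIR system: the third component accumulates incidence
    S, I, C = x
    inc = beta(t) * S * I / N
    return [mu * N - inc - mu * S, inc - (gamma + mu) * I, inc]

rng = np.random.default_rng(0)
x = [0.95 * N, 0.02 * N, 0.0]
data = []
for month in range(120):                    # 10 years of monthly observations
    t0, t1 = month / 12.0, (month + 1) / 12.0
    sol = solve_ivp(rhs, (t0, t1), x, rtol=1e-6)
    x = sol.y[:, -1].copy()
    cases = eta * x[2]                      # Case 4: underreported cumulative cases
    x[2] = 0.0                              # reset the accumulator each month
    # Add Gaussian noise and clip negative values to zero, as in the text
    data.append(max(cases + rng.normal(0.0, 0.1), 0.0))
```

Augmenting the state with an accumulator avoids post-hoc quadrature: the solver integrates the incidence term alongside the dynamics, so each monthly observation is exact up to solver tolerance.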
Using the data described above, we perform the following numerical experiments: In Section 4.1 we test how using the four different observation functions affects the EnKF for state estimation in accurately estimating the time series of both the susceptible and infectious populations. In Section 4.2 we study the effects when using the augmented EnKF to estimate the model states along with the constant parameters $\beta_0$ and $\alpha$ relating to the functional form of the transmission parameter in (4.1). We then analyze results when using the augmented EnKF with parameter tracking to estimate the full time series of the transmission parameter without assuming a functional form. Finally, in Section 4.3 we illustrate how the observation noise covariance matrix in the filter can be utilized in certain cases to help account for uncertainty in the observation function selection.
We present the results for each experiment in figures plotting the time series estimates of $S$, $I$, the monthly number of cases (when applicable), and any estimated parameters (as relevant). Along with the plots, we also compute the mean squared error (MSE) of the model states in each of the trials by calculating the average of the squared differences between the true and estimated susceptible and infectious populations at each time step:

(4.2) $\mathrm{MSE}_S = \frac{1}{T} \sum_{k=1}^{T} \left( S(t_k) - \hat{S}(t_k) \right)^2$

(4.3) $\mathrm{MSE}_I = \frac{1}{T} \sum_{k=1}^{T} \left( I(t_k) - \hat{I}(t_k) \right)^2,$

where $S(t_k)$ and $I(t_k)$ are the true model states in Figure 3, and $\hat{S}(t_k)$ and $\hat{I}(t_k)$ are the EnKF mean estimates of the states, respectively. Here $T$ is the total number of time points of comparison. As previously noted, we refer to the observation functions defined in (3.3)–(3.6) as Cases 1–4.
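The error metrics (4.2)–(4.3) reduce to a mean of squared differences at the comparison time points; a minimal sketch (function name is ours):

```python
import numpy as np

def state_mse(S_true, S_est, I_true, I_est):
    """Mean squared errors of the state estimates, Eqs. (4.2)-(4.3)."""
    S_true, S_est = np.asarray(S_true), np.asarray(S_est)
    I_true, I_est = np.asarray(I_true), np.asarray(I_est)
    mse_S = np.mean((S_true - S_est) ** 2)
    mse_I = np.mean((I_true - I_est) ** 2)
    return mse_S, mse_I
```

Because $S$ and $I$ differ in scale, the two MSE values are best compared across cases for the same state rather than against each other.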
4.1 State Estimation with Known Parameters
Assuming known values for the parameters in Table 1, we employ the EnKF for state estimation (outlined in Section 2.2) to estimate $S$, $I$, and the monthly number of cases when applicable. Note that when using the observation functions in Cases 3 and 4, the monthly number of cases is computed as a separate function; in Cases 1 and 2, the data is interpreted instead as a direct percentage of the estimated $I$. Figure 4 shows the results when using the observation functions in each of the four cases, and Table 2 lists the corresponding MSE values.
Since the observation function in Case 4 was used in generating the synthetic data, it is not surprising that the filter performs well when using this function. As seen in Figure 4, in this case the EnKF is able to accurately estimate all components of the model, with the EnKF mean estimates nearly identical to the underlying true model states. Although the estimated standard deviation curves are plotted, the standard deviations around the mean are so small that they are not easily visible in the plots, suggesting a high level of confidence in the mean estimates. This test validates that the filter can accurately estimate the corresponding model states when the observation function correctly interprets the available data.
By assuming full reporting in the Case 3 observation function, some detrimental effects on the corresponding filter estimates already become clear. In particular, since the Case 3 function does not account for underreporting, the filter fits the monthly number of cases more directly to the observed data points. While the estimated $S$ and $I$ curves follow the same general shape as in the original model, the magnitudes of the estimates are generally higher, with alternating higher peaks. These results are further evident in the MSE values recorded in Table 2.
When using the observation functions in Cases 1 and 2, we see more significant effects due to the interpretations of the data: these functions interpret the observed data as either direct observations of the infectious population (Case 1) or a proportion of the population (Case 2) at a given time. The results in Figure 4 show that when using the observation function in Case 2, neither the shape nor magnitude of the EnKF estimates resemble the true model states. Similar results follow when using the observation function in Case 1, with discrepancies in both the shape and magnitude of the underlying true curves. For each of these cases, the corresponding MSE values are noticeably high; while some remain of similar magnitude as in Case 3, the errors for the remaining state estimates are much larger.
Table 2: MSE for State Estimation

Case  $\mathrm{MSE}_S$  $\mathrm{MSE}_I$
1
2
3
4  22.30  0.01
4.2 Combined State and Parameter Estimation
In this section we consider the augmented EnKF for combined state and parameter estimation. More specifically, we analyze the effects of observation function selection when estimating both constant and time-varying parameters relating to the disease transmission $\beta(t)$, modeled as in (4.1). For constant parameter estimation, we assume the form of $\beta(t)$ in (4.1) and aim to estimate the constants $\beta_0$ and $\alpha$, representing the average transmission and amplitude of variation, respectively. For time-varying parameter estimation, we do not assume a known form for $\beta(t)$ and instead use parameter tracking to approximate the time series. In both scenarios, we assume that the remaining model parameters are known and set to the values in Table 1.
4.2.1 Constant Parameter Estimation
Here we apply the augmented EnKF (outlined in Section 2.3) to estimate $S$, $I$, the monthly number of cases when applicable, and the constant parameters $\beta_0$ and $\alpha$ from the transmission function (4.1). Table 3 lists the MSE values when employing the observation functions in Cases 1–4, and Figures 5 and 6 show the results for Cases 4 and 3, respectively.
As in the previous experiment, the augmented EnKF using Case 4 is able to estimate the true model states and unknown parameters with high accuracy. As shown in Figure 5, the initial uncertainty around the estimates shrinks around year 2, when the parameter estimates converge to their true values. The resulting EnKF posterior mean estimates for $\beta_0$ and $\alpha$ each converge to the true values with small relative errors.
Table 3: MSE for Constant Parameter Estimation

Case  $\mathrm{MSE}_S$  $\mathrm{MSE}_I$
1
2
3
4
However, not accounting for underreporting when using the Case 3 observation function causes the filter to have much more difficulty in estimating $\beta_0$ and $\alpha$. Figure 6 shows a drop in magnitude for the estimate of $\beta_0$, resulting in a posterior estimate close to half of the true value. There is more instability in the estimate for $\alpha$, resulting in a posterior estimate not as far from the true value but not fully converged. In this case, the relative errors in the corresponding posterior estimates are considerably larger than in Case 4. The effects of the inaccurate parameter estimates are also seen in the state estimates, most noticeably in Figure 6 and in the corresponding MSE values listed in Table 3.
Similar to the state estimation experiments, constant parameter estimation results continue to degrade when using the Case 1 and Case 2 observation functions. While not shown here to avoid redundancy, in both of these cases the filter is unable to find the true values of $\beta_0$ and $\alpha$, and the estimates do not converge, diverging more drastically than in the previous case. While the MSE values in Table 3 stay within the same magnitude for Cases 1 and 2, there is a large increase in these values as compared to Cases 3 and 4.
4.2.2 Time-Varying Parameter Estimation
In this section we apply the augmented EnKF with parameter tracking (outlined in Section 2.4) to estimate $S$, $I$, the monthly number of cases when applicable, and the time-varying transmission parameter $\beta(t)$. While the true $\beta(t)$ used in generating the data has the form given in (4.1), in this setting we do not assume a known form of the parameter and instead use a random walk to track the parameter time series as the algorithm progresses. Table 4 lists the MSE values when employing the observation functions in Cases 1–4, and Figures 7 and 8 show the results for Cases 4 and 3, respectively.
By not assuming a known form for $\beta(t)$, the time-varying parameter estimation presents the most challenging of the filtering problems considered in this work. As shown in Figure 7, even using the Case 4 observation function does not result in a highly accurate reconstruction of the time series of $\beta(t)$. While the periodicity of the underlying true function is not well captured, the parameter tracking estimate of $\beta(t)$ does remain within the same range of values and fully captures the true function within the standard deviation curves around the mean. The corresponding EnKF estimates of the model states and monthly number of cases still remain relatively accurate, with noticeably larger uncertainty bounds.
Table 4: MSE for Time-Varying Parameter Estimation

Case  $\mathrm{MSE}_S$  $\mathrm{MSE}_I$
1
2
3
4
Similar to the constant parameter estimation experiments, using the observation function in Case 3 here degrades the filter’s performance in estimating as well as the corresponding model states. As shown in Figure 8, while somewhat maintaining the sinusoidal shape, the estimate of the transmission function drifts downward, outside the range of true function values. This drift leads to a corresponding overestimation of the susceptible population and underestimation of the infectious population and monthly number of cases.
Consistent with the previous experiments, using the observation functions in Cases 1 and 2 causes further degradation in performance: the filter is unable to track the underlying transmission parameter well, resulting in similarly poor estimates of the model states. This is reflected in the corresponding MSE values listed in Table 4.
4.3 Observation Noise Covariance
The numerical experiments conducted in the previous sections demonstrate how using different observation functions for the same observed data within the EnKF framework affects the resulting estimates for both the model states and parameters. More specifically, the results show that using an observation function that inaccurately interprets the data can lead to inaccurate estimates of the system inputs. In this section, we illustrate how the observation noise covariance matrix in the filter can be utilized in certain cases to offset some of the detrimental effects of using a suboptimal observation function.
In the observation model (1.2), the observed data are assumed to be outputs from the observation function corrupted by Gaussian noise, where the noise is normally distributed with zero mean and covariance . The noise covariance is therefore directly linked to the level of uncertainty in the observed data. While the noise term in (1.2) represents observation error, generally due to error in measurement or data collection, we can interpret this term as also including modeling error relating to uncertainty in the observation function. Error due to misfit between the observation function and the data is sometimes referred to as representation error or observation-operator error [48, 49]. More specifically, we assume that the observation noise covariance matrix is the sum of two error covariances, , where represents the error covariance due to measurement noise and represents the error covariance due to representation error.
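In matrix form, this additive decomposition is straightforward to assemble. The sketch below uses illustrative variance values for a scalar observation; the numbers are placeholders, not values from the experiments.

```python
import numpy as np

# Illustrative variances only (not values from the experiments):
R_meas = np.diag([0.1 ** 2])   # error covariance from measurement / data collection
R_repr = np.diag([0.3 ** 2])   # error covariance from observation-function misfit

# Total observation noise covariance supplied to the filter's analysis step
R = R_meas + R_repr
```

Because the two error sources are assumed independent, their covariances add, so enlarging the representation-error term is one way to tell the filter to trust the observation function less.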
Letting with assumed constant variance , we analyze the effects of increasing the standard deviation of the perceived noise in the data to account for additional uncertainty in the observation function selection due to possible representation error (i.e., ). Figure 9 shows the MSE results (in log-scale), averaged over 10 simulations each, for increasing values of (namely, ) when using the observation functions in Cases 4, 3, and 2 for state estimation with known parameter values. Figure 10 shows state estimation results when using the Case 4, 3, and 2 observation functions, respectively, with (i.e., the value used in the comparable numerical experiments in Section 4.1).
As the MSE plots in Figure 9 suggest, increasing when using the true observation function in Case 4 leads to an increase in the MSE for both and . Even so, the resulting state estimates retain the shape and magnitude of the true model states, with slightly larger uncertainty bounds, as illustrated in Figure 10. For Cases 3 and 2, however, increasing results in an order of magnitude decrease in the MSE for that remains fairly consistent for and , while maintaining the same order of magnitude in the MSE for . Compared to the results in Figure 4 for Case 3 with , the results in Figure 10 demonstrate that increasing (to 10 in this case) allows the filter to better accommodate the lack of a reporting probability in the Case 3 observation function. The results for Case 2 show that, while the magnitude of the estimate improves compared to the previous results in Figure 4, increasing in this case does not improve the overall shape of the state estimates, which still notably overestimate the infectious population in several peaking years. While not shown, similar results hold for Case 1 as for Case 2.
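The noise-level sweep just described can be organized as a simple loop. The sketch below is illustrative: `run_filter` is a hypothetical stand-in for one EnKF state-estimation run at a given noise level, and `mse` scores its output against the true trajectory.

```python
import numpy as np

def mse(estimate, truth):
    """Mean squared error between estimated and true trajectories."""
    return float(np.mean((np.asarray(estimate) - np.asarray(truth)) ** 2))

def sweep_sigma(sigmas, run_filter, truth, n_reps=10):
    """Average the MSE of repeated filter runs for each noise level sigma.

    run_filter(sigma) stands in for one stochastic EnKF run and returns
    an estimated trajectory; it is purely illustrative here.
    """
    return {s: float(np.mean([mse(run_filter(s), truth) for _ in range(n_reps)]))
            for s in sigmas}
```

Averaging over repeated runs at each noise level smooths out the stochastic variability of the ensemble, which is why the MSE curves in Figure 9 are reported as averages over 10 simulations.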
A similar procedure can be applied to analyze the effects of increasing in the augmented EnKF for constant and time-varying parameter estimation. For constant parameter estimation, increasing when using the Case 4 observation function slightly widens the uncertainty bounds but still yields parameter estimates for both and that converge well to their true values. When using the Case 3 observation function, increasing can help to improve the convergence of the resulting parameter estimates, as illustrated in Figure 11 with . Compared to the results in Figure 6 with , while both parameters are still underestimated, here the estimates remain closer to the true parameter values, especially for . The corresponding estimate of is also noticeably improved, with an MSE value of . For time-varying parameter estimation, increasing does not significantly improve the estimates of for any of the observation functions considered.
5 Discussion
The goal of this work was to analyze the effects of observation function selection on ensemble Kalman filtering state and parameter estimates for epidemic models. In particular, we considered a standard SIR model and four different observation functions of varying levels of complexity within three EnKF frameworks: state estimation with known parameters; combined state and constant parameter estimation; and combined state and time-varying parameter estimation. With synthetic data generated using the most complex observation function, we performed numerical experiments estimating the time series of two of the three compartments of the SIR model, namely, the susceptible and infectious populations, and . We also estimated the parameters and from the transmission function in (4.1) when testing constant parameter estimation and the full time series of for time-varying parameter estimation.
Our results demonstrate that when utilizing the true underlying observation function (3.6), the filter is able to estimate the components of the original system with little error. However, when the true observation function is not known, using the suboptimal observation functions (3.5), (3.4), and (3.3) results in less accurate estimates, with accuracy decreasing as the observation function departs further from the truth. While in this work we generated data with the most complex observation function, the reverse situation of generating data with the least complex observation function (3.3) yields similar results.
For the state estimation with known parameters in Figure 4, when using the true observation function in (3.6) (i.e., Case 4), the filter was able to estimate the true model states with high accuracy. When using the observation function in Case 3, the absence of the reporting probability mainly affected the magnitude of the results, initially underestimating with some larger peaks occurring biannually in later years and consistently overestimating . In the estimated monthly number of cases, the filter fits the observed data points and does not account for underreporting, due to the lack of a reporting probability parameter in the Case 3 observation function. The results when using the observation functions in Cases 2 and 1 degrade further: these functions interpret the data as direct observations of the infectious population (Case 1) or a proportion of the infectious population (Case 2), leading to issues in both the magnitude and shape of the corresponding state estimates.
Combined state and parameter estimation increases the complexity of the problem, since the parameters relating to disease transmission are assumed unknown and must be estimated along with the model states. For constant parameter estimation, the filter converged to the true values of and after approximately 2 years of data when using the Case 4 observation function, as shown in Figure 5. However, when using the function in Case 3, the filter had significant difficulty estimating the parameters: while the estimate for stayed within the same general range, neither estimate converged well, and the estimate for diverged toward about half of its true value, as shown in Figure 6. The effects are also clearly seen in the estimate for , which no longer tracks the true shape and increases well outside of the original range of values. As with state estimation, the parameter estimation results degraded further when using the observation functions in Cases 1 and 2, with the parameter estimates diverging to inaccurate values.
For time-varying parameter estimation, tracking the full transmission function without assuming a known form was challenging even when using the true observation function in Case 4, as shown in Figure 7. While the filter mean estimate generally follows the shape and the true transmission function is fully captured within relatively small uncertainty bounds, the filter estimate is less smooth and does not maintain the periodicity of the true function. However, Figure 8 shows that when using the observation function in Case 3, the filter mean estimate drifts away from the true transmission function, thereby underestimating transmission and overestimating the susceptible population. Parameter tracking using the functions in Cases 1 and 2 showed similar tendencies to diverge and inaccurately estimate the time-varying transmission parameter.
Numerical experiments also considered the effects of increasing the standard deviation of the observation noise in the filter to account for potential representation error due to uncertainty in the choice of observation function. The observation functions considered in this work range from linear misfit (between Case 3 and Case 4) to more substantial nonlinear discrepancies (between Cases 1 and 2 and Case 4). As illustrated in Figure 10 for state estimation, when using the true observation function in Case 4, increasing caused a small increase in corresponding MSE values for and but overall did not negatively impact the estimation. Results in Figures 10 and 11 further show that increasing helped to improve state estimation and constant parameter estimation results when using the observation function in Case 3, which differs from the true observation function by a constant. However, while the results for Case 2 show some improvement in magnitude for the estimated , the overall shape of the true function was not captured. These results suggest that increasing the observation noise covariance may help to offset the effects of a missing constant or linear difference in the observation function, but it is unable to fully account for more significant modeling discrepancies.
Further, the results in Figure 9 suggest that continuing to increase does not result in better estimates after a certain point, with MSE values for Cases 2 and 3 leveling off after in the state estimation considered. In certain cases, continued increase of may also lead to numerical integration issues. Future work aims to incorporate systematic mechanisms within the EnKF framework to automatically adjust to account for suboptimal observation function selection, with the goal of improving filter estimates while avoiding overinflation. Recently proposed alternative methods for addressing representation error in data assimilation include: using the ensemble covariance to update the observation noise covariance [50]; applying an iterative scheme to adjust the observation function with nearest-neighbor information [51]; and using a secondary filter with machine learning tools to estimate the representation error [52]. Future work also aims to address observation function selection in the context of particle filters [53, 54] and deterministic optimization algorithms [55, 56] for parameter estimation in epidemiological applications.
6 Summary and Conclusions
The observation function plays a critical role in connecting the observed data with the forward model variables in the ensemble Kalman filtering framework. In this work we analyze the effects of observation function selection when using the EnKF for state and parameter estimation in the setting of epidemic modeling, where a variety of different observation functions are commonly used in the literature. More specifically, we examine the use of four observation functions of different forms and various levels of complexity in connection with the classic SIR model. Our results demonstrate the importance of choosing an observation function that accurately interprets the given data for the quality of the corresponding EnKF state and parameter estimates, for both constant and time-varying parameters, especially as the problem becomes more difficult with additional unknowns. Numerical experiments further illustrate how increasing the observation noise covariance can help to account for representation error in the selected observation function in cases of linear misfit.
Acknowledgments
The authors would like to thank the Department of Mathematical Sciences at WPI, with a special thanks to Suzanne Weekes, Sarah Olson, Michael Yereniuk, and Caroline Johnston, for encouraging this research.
Funding: This work was partially supported by the Henry Luce Foundation under grant number 9136 to WPI (Clare Boothe Luce Research Scholarship to L. Mitchell) and the National Science Foundation under grant number NSF/DMS-1819203 (PI A. Arnold).
Declaration of Competing Interest: None.
References
 [1] G. Evensen. Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics. J. Geophys. Res., 99:10143–10162, 1994.
 [2] G. Burgers, P. J. van Leeuwen, and G. Evensen. Analysis scheme in the ensemble Kalman filter. Mon. Weather Rev., 126:1719–1724, 1998.
 [3] G. Evensen. The ensemble Kalman filter: theoretical formulation and practical implementation. Ocean Dynamics, 53:343–367, 2003.
 [4] P. L. Houtekamer and F. Zhang. Review of the ensemble Kalman filter for atmospheric data assimilation. Mon. Weather Rev., 144:4489–4532, 2016.
 [5] J. D. Annan, J. C. Hargreaves, N. R. Edwards, and R. Marsh. Parameter estimation in an intermediate complexity earth system model using an ensemble Kalman filter. Ocean Modeling, 8:135–154, 2005.
 [6] M. Buehner, R. McTaggart-Cowan, and S. Heilliette. An ensemble Kalman filter for numerical weather prediction based on variational data assimilation: VarEnKF. Mon. Weather Rev., 145:617–635, 2017.
 [7] G. Chowdhary and R. Jategaonkar. Aerodynamic parameter estimation from flight data applying extended and unscented Kalman filter. Aerosp. Sci. Technol., 14:106–117, 2010.
 [8] N. Shantha Kumar and T. Jann. Estimation of attitudes from a lowcost miniaturized inertial platform using Kalman filterbased sensor fusion algorithm. Sadhana, 29:217–235, 2004.
 [9] Ngatini, E. Apriliani, and H. Nurhadi. Ensemble and fuzzy Kalman filter for position estimation of an autonomous underwater vehicle based on dynamical system of AUV motion. Expert Syst. Appl., 68:29–35, 2017.
 [10] R. Li, S. Pei, B. Chen, Y. Song, T. Zhang, W. Yang, and J. Shaman. Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV-2). Science, 368:489–493, 2020.
 [11] P. Narula, V. Piratla, A. Bansal, S. Azad, and P. Lio. Parameter estimation of tuberculosis transmission model using ensemble Kalman filter across Indian states and union territories. Infection, Disease and Health, 21:184–191, 2016.

 [12] J. Mandel, J. D. Beezley, L. Cobb, and A. Krishnamurthy. Data driven computing by the morphing fast Fourier transform ensemble Kalman filter in epidemic spread simulations. Procedia Computer Science, 1:1221–1229, 2012.
 [13] W. Yang, A. Karspeck, and J. Shaman. Comparison of filtering methods for the modeling and retrospective forecasting of influenza epidemics. PLOS Comput. Biol., 10:e1003583, 2014.
 [14] Z. Zhan, W. Dong, Y. Lu, P. Yang, Q. Wang, and P. Jia. Real-time forecasting of hand-foot-and-mouth disease outbreaks using the integrating compartment model and assimilation filtering. Scientific Reports, 9:2661, 2019.
 [15] J. Jang, K. Jang, H.D. Kwon, and J. Lee. Feedback control of an HBV model based on ensemble Kalman filter and differential evolution. Math. Biosci. Eng., 15:667–691, 2018.

 [16] A. Arnold and A. L. Lloyd. An approach to periodic, time-varying parameter estimation using nonlinear filtering. Inverse Problems, 34:105005, 2018.
 [17] J. D. Murray. Mathematical Biology. Springer-Verlag, Berlin, Heidelberg, 1989.
 [18] M. Martcheva. An Introduction to Mathematical Epidemiology. Springer, New York, NY, 2015.
 [19] D. Calvetti and E. Somersalo. Computational Mathematical Modeling: An Integrated Approach Across Scales. SIAM, Philadelphia, PA, 2013.
 [20] L. J. S. Allen and A. M. Burgin. Comparison of deterministic and stochastic SIS and SIR models in discrete time. Mathematical Biosciences, 163:1–33, 2000.
 [21] V. Colizza, A. Barrat, M. Barthelemy, A.J. Valleron, and A. Vespignani. Modeling the worldwide spread of pandemic influenza: baseline case and containment interventions. PLOS Med., 4:e13, 2007.
 [22] E. Lofgren, N. H. Fefferman, Y. N. Naumov, J. Gorski, and E. N. Naumova. Influenza seasonality: underlying causes and modeling theories. J. Virol., 81:5429–5436, 2007.
 [23] S. Schwager, C. Castillo-Chavez, and H. Hethcote. Statistical and mathematical approaches in HIV/AIDS modeling: a review. In C. Castillo-Chavez, editor, Mathematical and Statistical Approaches to AIDS Epidemiology, pages 2–35. Springer-Verlag, Berlin, Heidelberg, 1989.
 [24] J. Wangping, H. Ke, S. Yang, C. Wenzhe, W. Shengshu, Y. Shanshan, W. Jianwei, K. Fuyin, T. Penggang, L. Jing, L. Miao, and H. Yao. Extended SIR prediction of the epidemics trend of COVID19 in Italy and compared with Hunan, China. Front. Med., 7:169, 2020.
 [25] D. Calvetti, A. P. Hoover, J. Rose, and E. Somersalo. Metapopulation network models for understanding, predicting, and managing the coronavirus disease COVID19. Front. Phys., 8:261, 2020.
 [26] G. C. Calafiore, C. Novara, and C. Possieri. A modified SIR model for the COVID19 contagion in Italy. Preprint, arXiv:2003.14391v1, 2020.
 [27] L. Wang, Y. Zhou, J. He, B. Zhu, F. Wang, L. Tang, M. Eisenberg, and P. X. K. Song. An epidemiological forecast model and software assessing interventions on COVID19 epidemic in China. Preprint, medRxiv, https://doi.org/10.1101/2020.02.29.20029421, 2020.
 [28] W. O. Kermack and A. G. McKendrick. A contribution to the mathematical theory of epidemics. Proc. R. Soc. Lond. A, 115:700–721, 1997 (originally published 1927).
 [29] A. Capaldi, S. Behrend, B. Berman, J. Smith, J. Wright, and A. L. Lloyd. Parameter estimation and uncertainty quantification for an epidemic model. Math. Biosci. Eng., 9:553–576, 2012.
 [30] D. Osthus, K. S. Hickmann, P. C. Caragea, D. Higdon, and S. Y. Del Valle. Forecasting seasonal influenza with a state-space SIR model. Ann. Appl. Stat., 11:202–224, 2017.
 [31] T. T. Marinov, R. S. Marinova, J. Omojola, and M. Jackson. Inverse problem for coefficient identification in SIR epidemic models. Computers and Mathematics with Applications, 67:2218–2227, 2014.
 [32] R. Brookmeyer and J. Liao. Statistical modelling of the AIDS epidemic for forecasting health care needs. Biometrics, 46:1151–1163, 1990.
 [33] M. A. Capistran, J. A. Christen, and J. X. Velasco-Hernandez. Towards uncertainty quantification and inference in the stochastic SIR epidemic model. Mathematical Biosciences, 240:250–259, 2012.
 [34] J. Pan, A. Gray, D. Greenhalgh, and X. Mao. Parameter estimation for the stochastic SIS epidemic model. Statistical Inference for Stochastic Processes, 17:75–98, 2014.
 [35] W. P. London and J. A. Yorke. Recurrent outbreaks of measles, chickenpox and mumps: I. Seasonal variation in contact rates. Am. J. Epidemiol., 98:453–468, 1973.
 [36] L. A. McGee and S. F. Schmidt. Discovery of the Kalman filter as a practical tool for aerospace and industry. Technical report, NASA, 1985.
 [37] M. S. Grewal and A. P. Andrews. Applications of Kalman filtering in aerospace 1960 to the present [historical perspectives]. IEEE Contr. Syst. Mag., 30:69–78, 2010.
 [38] M. Dashti and A. M. Stuart. The Bayesian approach to inverse problems. In R. Ghanem, D. Higdon, and H. Owhadi, editors, Handbook of Uncertainty Quantification, pages 311–428. Springer, Cham, Switzerland, 2017.
 [39] D. Calvetti and E. Somersalo. An Introduction to Bayesian Scientific Computing: Ten Lectures on Subjective Computing. Springer, New York, NY, 2007.
 [40] J. P. Kaipio and E. Somersalo. Statistical and Computational Inverse Problems. Springer, New York, NY, 2005.
 [41] R. E. Kalman. A new approach to linear filtering and prediction problems. J. Basic Eng., 82:35–45, 1960.
 [42] G. Evensen. The ensemble Kalman filter for combined state and parameter estimation. IEEE Contr. Syst. Mag., 29:83–104, 2009.
 [43] A. Arnold, D. Calvetti, and E. Somersalo. Parameter estimation for stiff deterministic dynamical systems via ensemble Kalman filter. Inverse Problems, 30:105008, 2014.
 [44] H. U. Voss, J. Timmer, and J. Kurths. Nonlinear dynamical system identification from uncertain and indirect measurements. Int. J. Bifurcation Chaos, 14:1905–1933, 2004.
 [45] A. Arnold. Exploring the effects of uncertainty in parameter tracking estimates for the time-varying external voltage parameter in the FitzHugh-Nagumo model. In P. Nithiarasu, M. Ohta, and M. Oshima, editors, 6th International Conference on Computational and Mathematical Biomedical Engineering. 2019.
 [46] K. Campbell, L. Staugler, and A. Arnold. Estimating time-varying applied current in the Hodgkin-Huxley model. Applied Sciences, 10:550, 2020.
 [47] H. Moradkhani, S. Sorooshian, H. V. Gupta, and P. R. Houser. Dual stateparameter estimation of hydrological models using ensemble Kalman filter. Adv. Water Resour., 28:135–147, 2005.
 [48] T. Janjic, N. Bormann, M. Bocquet, J. A. Carton, S. E. Cohn, S. L. Dance, S. N. Losa, N. K. Nichols, R. Potthast, J. A. Waller, and P. Weston. On the representation error in data assimilation. Q. J. R. Meteorol. Soc., 144:1257–1278, 2018.
 [49] P. J. van Leeuwen. Representation errors and retrievals in linear and nonlinear data assimilation. Q. J. R. Meteorol. Soc., 141:1612–1623, 2015.
 [50] E. Satterfield, D. Hodyss, D. D. Kuhl, and C. H. Bishop. Investigating the use of ensemble variance to predict observation error of representation. Mon. Weather Rev., 145:653–667, 2017.
 [51] F. Hamilton, T. Berry, and T. Sauer. Correcting observation model error in data assimilation. Chaos, 29:053102, 2019.
 [52] T. Berry and J. Harlim. Correcting biased observation model error in data assimilation. Mon. Weather Rev., 145:2833–2853, 2017.
 [53] A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: fifteen years later. In D. Crisan and B. Rozovskii, editors, The Oxford Handbook of Nonlinear Filtering, pages 656–704. Oxford University Press, New York, NY, 2011.
 [54] A. Doucet, N. de Freitas, and N. Gordon, editors. Sequential Monte Carlo Methods in Practice. Springer, New York, NY, 2001.
 [55] M. L. Johnson and L. M. Faunt. Parameter estimation by least-squares methods. Methods Enzymol., 210:1–37, 1992.
 [56] H. T. Banks, S. Hu, and W. C. Thompson. Modeling and Inverse Problems in the Presence of Uncertainty. CRC Press, New York, NY, 2014.