1 Introduction
The COVID19 pandemic has led to a maturing of methods for epidemic modeling and forecasting with the CDC establishing the first Center for Forecasting and Outbreak Analytics in 2021. A variety of forecasting innovations in machine learning and deep learning were developed (e.g.,
[Rodríguez and others2021a, Kamarthi and others2021b, Arik and others2020]), with many lessons learned for COVID19 and future pandemics. As the current experience has shown, predicting and preventing epidemics is one of the major challenges with far reaching impacts on health, economy and broad social well being^{1}^{1}1https://www.who.int/activities/preventingepidemicsandpandemics.From this perspective, active participation by several academic and industrial teams (including by coauthors) in multiple CDCled forecasting initiatives has led to two broad themes that are important for epidemic modeling. First, modern disease surveillance has grown by leaps and bounds yielding novel data sources that can shed light into happenings realtime. ML and data science techniques on these data sources provide dramatic improvements in shortterm forecasting (usually 14 weeks ahead). At the same time, as these methods do not explicitly learn mechanistic dynamics, such methods do not provide understanding of how the epidemic will unfold at even longer time horizons, and do not support posing causal and counterfactual questions (e.g., design of countermeasures). Such longerterm forecasting remains the province of mechanistic models that can support scenariobased understanding of epidemic progression (e.g., ”what will happen if I closed schools?”). However, these methods present scalability issues, their calibration is prone to noise, and they have limited capability to ingest multimodal data sources. At the intersection of these two modeling approaches, we have hybrid models that make compartmental models more informed of these data sources
[Shaman and others2013, Arik and others2020]. However, most of these approaches are still restrictive (very few tunable parameters), are not easily generalizable to new models/data sources or do not aim to incorporate ODE dynamics from first principles (e.g. predict ODE parameters instead of the ODE states).In this paper, we develop a general framework for incorporating epidemic dynamics from a mechanistic model into a neural framework for forecasting, which enables seamlessly integration of multimodal data, greater representation power, and inclusion of composable neural modules of learned representations. Our focus is to leverage the selective superiorities of both approaches (see Figure 1) to have predictions that are accurate (low error) and wellcorrelated with even longerterm epidemic trends, than is usually studied in past literature.
Scientific AI [Karniadakis and others2021] is a recent line of work for bridging scientific models (usually represented as differential equations) and ML algorithms. Specifically, the rapidly growing literature in physicsinformed neural networks (PINNs) have demonstrated robust solutions in spite of noise [Yang and others2021] and large scalability improvements thanks to the differentiable nature of ODEs and gradientbased learning [Lu and others2021]. We propose to leverage this body of work to incorporate the dynamics of an mechanistic epidemic model into deep neural models. Our goal requires technical innovations as many of the compartments in epimodels are latent (e.g. the actual number of people exposed to the disease) while most work in PINNs has only experimented with all states of the system dynamics being observable. In addition, PINNs are limited to working with the variables that are described in the mechanistic equations which limits their capabilities to ingest data sources.
(a) Method vs epidemic  (b) Modeling spectrum 
forecasting task  and performance 
Our contributions are as follows: Push the boundaries of datadriven epidemic forecasting: We introduce EpidemiologicallyInformed deep Neural Networks (EINN), a new framework to bridge the gap between mechanistic and neural models for epidemiology by allowing for general integration of auxiliary sources and ode models for better performance over longer horizons. We believe this opens new venues for exploring how AI can better complement domain knowledge in traditional epidemiology. Transfer learning via gradient matching of dynamics: We propose a novel method based on matching ODE gradients to transfer knowledge from one neural network dedicated to learning the dynamics from a mechanistic model to another neural network that ingests a variety of features, which avoids integrating the ODEs. Extensive empirical evaluation: We evaluate our method in the challenging task of weekly COVID19 forecasting in 48 geographical regions and a period of 8 months. Our results showcase that our method can indeed leverage the ‘best of both worlds’ compared to other nontrivial ways of merging such approaches.
2 Related Work
Mechanistic epidemic models. Epidemic mechanistic models [Hethcote2000] like the popular SIR are designed using domain knowledge of the epidemic dynamics. They model causal underpinnings to explain empirically observed variables (e.g., mortality), and ODEbased models have been a workhorse of epidemiology since the late 18th century [Marathe and others2013].
ML epidemic models. There are several successful applications of machine learning to nowcasting and shortterm forecasting [Rodríguez and others2021a, Osthus and others2019, Brooks and others2018, Adhikari and others2019] which led them to be often ranked among the top performing models in these tasks [Cramer and others2021, Reich and others2019]. Some of the recent deep learning innovations include advances in incorporting multiview and multimodal data [Kamarthi and others2021a], spatial correlations [Deng and others2020, Jin and others2021], transfer learning for domain adaptation [Rodríguez and others2021b] and nonparametric approaches [Kamarthi and others2021b, Zimmer and others2020].
Hybrid epidemic models.
There has been recent work aimed at integrating the advantages of mechanistic models and ML approaches. Some lines of work use statistical techniques to estimate the mechanistic parameters (e.g. transmission rate)
[Arik and others2020], learn from simulationgenerated data [Wang and others2019], or use the ODEs as regularization [Gao and others2021]. The most prominent in this line of research is the work by [Shaman and others2013] that integrated search volume data into an SEIR ODE model via the principle of data assimilation [Wang and others2000]. However developing such data assimilation approaches requires a lot of domain knowledge. In summary, most of these hybrid models, in contrast to EINN, cannot be easily generalized to other (possibly more complex) models like for COVID and in presence of novel data sources.Theoryinformed machine learning. This line of work aims to enable ML exploit knowledge of the underlying mechanisms and/or causal relations [Karpatne and others2017]. PINNs have been used for forward and inverse problems with ODEs in a variety of domains including computational biology where data is noisy and samples are irregular [Yazdani and others2020]. However, incorporating exogeneous variables to this framework and working with partially observable systems are largely unexplored problems [Cai and others2021, Wang and others2021]. Our approach EINN is a new class of PINN that directly addresses these two limitations in context of epidemiology. A remotely related but popular line of work for learning dynamical systems is neural ODEs [Chen and others2018] but they cannot incorporate domain ODEs.
3 Buidling Blocks
In this section we describe the basic building blocks that we use in this paper.
3.1 SEIRM mechanistic model
The SEIRM model is a popular compartmental ODE model in epidemiology consisting of the Susceptible, Exposed, Infected, Recovered, and Mortality compartments. It is parameterized by four variables , where is the infectivity rate, is the mean latent period for the disease, is the mean infectious period, and is the mortality rate. Due to COVID19’s prolonged incubation period, the SEIRM has been broadly used in modeling its progression [Wu and others2020, Morozova and others2021]. It has also been used by the CDC in modeling transmission of Ebola [Gaffey and others2018]. To capture the evolving nature of the dynamics and spread of COVID19 (e.g. consider the multiple variant waves), we leverage the dynamic version of the SERIM model, where the parameters governing the disease progression themselves evolve over time. In such a setting, the dynamics is governed by the set of parameters at the given timestamp . Let be the values taken by the states at time
. Then, the Ordinary Differential Equations (ODEs) describing the SEIRM model is given by
:(1)  
Typically ODEbased epidemiological models are calibrated using simulation optimization or Bayesian techniques [Venkatramanan and others2018]. The evolving nature of the models coupled with limited observation, noise and spikes make calibration harder. Recently, physics informed neural networks were shown to be helpful for calibration for ODEs in general [Yazdani and others2020].
3.2 RNN architecture
The Recurrent Neural Network (RNN) encoder has been extensively used in neural epidemic forecasting as a central building block
[Adhikari and others2019, Kamarthi and others2021b, Rodríguez and others2021b, Wang and others2019]. Here, we introduce the base architecture of our model.Informally, at prediction time we are given a multivariate time series of features/signals and we are tasked to predict steps ahead in the future. We encode the feature time series until
by passing it through a Gated Recurrent Unit (GRU)
[Cho and others2014] to obtain a condensed representation for each time step:(2) 
where is the hidden state of the GRU for time step . To capture longterm relations and prevent overemphasis on last terms of sequence we use selfattention layer:
(3) 
where SelfAtten [Vaswani and others2017] involves passing the embeddings into linear layers to extract meaningful similarities before normalizing the similarities using Softmax. Then, we use the attention weights to combine the latent representations and obtain a single embedding representing the time series of features from to : and for the final predictions, we pass over a feedforward network: .
4 Our Approach
4.1 Problem formulation
As mentioned earlier, we aim on harnessing the strengths of both machine learning/deep learning approaches (which have been very successful in shortterm forecasting) and mechanistic models (which are useful for longterm trend projections). Hence, our problem is one of merging neural models with mechanistic model dynamics while maintaining benefits from both the techniques. To capture this specific intention, we modify traditional forecasting problems [Adhikari and others2019, Rodríguez and others2021a, Rodríguez and others2021b] in the following manner:
EpiInformed Neural Forecasting problem Given: A base epidemiological model mathematically represented as a set of ODEs (for example, see the SEIRM model in Section 3.1). A base RNN (See Section 3.2). Data: an observed multivariate time series of COVIDrelated signals and corresponding values for the forecasting target (new COVIDassociated deaths) , where is the first day of the outbreak and is the current date. Predict: next values of the forecasting target, i.e. (here is the size of the forecasting window), such that predictions are accurate and wellcorrelated with the trends of the epidemic curve.
We want to focus on pushing the prediction horizon which has not been studied much: most prior work in this space focuses on shortterm forecasting [Rodríguez and others2021b] where as opposed to in this paper. Note that just minimizing the prediction error as measured by root mean squared error (RMSE) is not enough to ensure the long term trends between forecasts and the actual values are similar. Hence, we focus on matching long term RMSE (accuracy) as well as trends (correlation).
4.2 Motivation and Overview
To tackle the EpiInformed Neural Forecasting problem, one can easily conjure ‘naïve’ approaches. A simple approach is to calibrate the given mechanistic model with the observed target variables, deaths and incidence, and train the base RNN using the generated curve. Similarly, one could also use the ODEs to regularize the neural predictions and could even train an ensemble with nueral network’s and ODEmodel’s predictions. However, as we show in the experiments, while these simple approaches can maintain the predictions of the base RNN model, they are not able to generate wellcorrelated predictions.
Our main idea to overcome the shortcomings of the simple models discussed above is to augment the base encoderdecoder RNN with the latent dynamics inferred by the mechanistic model. Recently, several works [Yazdani and others2020, Karniadakis and others2021] in the field of physics informed Neural Networks and Systems Biology have tried to tie neural networks and mechanistic ODEs for a variety of tasks including ODE parameter inference and forecasting. This line of work assumes that the neural network is a function of single variable and ODE system is in the form describing the rate of change of some function of . This allows for easy computation of the gradient (often implemented using Automatic Differentiation–autograd), which in turn, makes it possible to train the neural network while minimizing the difference between two gradients, e.g. using the term . However, note that our problem asks us to learn a mapping from a multivariate time series of COVIDrelated signals to the the targets . Hence, the strategies employed by existing works do not easily extend to our setting. Predicting ODEs parameters and forward integrating, as in [Arik and others2020], is also not a good fit here as it does not ensure the similarity between the predictions made by the ODEs and the nerual network.
4.3 Time module: learning latent timevarying dynamics
The first major component of EINN is the time module. It interfaces with the set of ODEs describing the epidemic progression to learn the lantent dynamics. We follow PINNs and use a neural network to solve ODE by jointly minimizing the observational error and the residual between gradient given by the ODE and the gradient of the neural network with respect to the time inputs (computed via automatic differentiation). Our time module is a feedforward network that ingests time as input and outputs , but except for cumulative deaths , where we predict new deaths to match the reported data.
where are the learned ODE parameters for time . We minimize the ODE loss (unsupervised loss) while fitting the observed data (supervised loss).
(4)  
(5) 
where is new deaths predicted by the time module.
Constraining the optimization via epi domain knowledge. In contrast to the setting in most prior work in PINNs where most states of the system dynamics are observed, most states in our SEIRM model are latent. Learning in such a scenario is extremely challenging. We alleviate this by infusing additional epidemiological domain knowledge in the form of monotonocity constraints. We adapt monotonicity losses from [Muralidhar and others2018] to our setting. They proposed to penalize consecutive predictions if they are not monotonic in the required direction. Note that the difference between consecutive predictions are discrete approximation to the derivatives. Here, we generalize this loss to continuous derivatives (this can be achieved by taking the limit ). For our specific ODE described above, we know that the Susceptible state monotonically decreases and the Recovered state monotonically increases (these are easy to derive for other epidemiological models as well). Now, we add a penalty when is positive and when is negative.
(6) 
where is the rectified linear function. Note that and are part of , which is the output of the time module; thus, and are computed via autograd.
Coping with spectral bias in neural networks. One of the central issue in fitting PINNs is the spectral bias of neural networks, which is the tendency of neural networks to fit low frequency signals [Wang and others2021]. To overcome this spectral bias, usually the neural networks are given more flexibility to fit high frequency systems. Here, we adopted Gaussian Random Fourier feature mappings [Tancik and others2020]: , where each entry in is sampled from , where
is a hyperparameter.
Handling timevarying ODE parameters. As mentioned earlier, our ODE model is time varying, therefore we have to learn mechanistic parameters for each time step, which increases the difficulty of the optimization. To make this more tractable, we propose a consistency loss between consecutive parameters.
(7) 
4.4 Feature module: connecting features to epidemic dynamics via gradient matching
The second component of our model is the feature module, which maps the multivariate timeseries of COVID19 related signals to the forecasting targets . Here, we want to ensure that the predictions made by the feature module are consistent with the ones given by the ODE model. Hence, we want the feature module to capture the latent dynamics learned by time module. By succeeding on this, we will connect features to the epidemic dynamics of our SEIRM. We can formalize our goal as follows:
where are the ODE states predicted by the feature module and are the same ODE parameters used by in the time module.
Matching the ODE gradient. We cannot directly calculate via autograd from the inputs as we did for the time module because our base RNN ingests features. We propose to use internal representations (embeddings) so that we can approximate the gradient to an expression that can be computed via autograd directly. Let and
be embeddings for the time module and feature module. Then, by using the chain rule, we propose to approximate the gradient of
assuming and have our gradient trick:(8) 
where can be calculated in the feature module using autograd because is the only variable that is needed to compute . Similarly, is the only input needed for computing , thus, we can use autograd to compute . To make this approximation valid, we have to make these embeddings and similar. We do this with the following loss:
(9) 
This derivation allow us to make the feature module to learn the gradients learned by the time module by minimizing and ODE loss for the feature module:
(10) 
Aligning with data and time module outputs. Matching the ODE gradient is not enough to ensure the dynamics will be transferred. We have to make sure that the feature module outputs are aligned with data and with the ODE dynamics as found by the time module. For fitting the data, we define data loss in a manner similar to the time module:
(11) 
where is the new deaths prediction of the feature module. To align the time and feature modules, we use knowledge distillation (KD) [Ba and others2014], a popular transfer learning method. We impose our KD loss on the outputs of these two modules:
(12) 
Note that our time module is able to predict for any given time but our base RNN makes prediction for one target in the future. To align these two, we make our feature module to make joint prediction using a decoder GRU which takes as the initial hidden state and roll the GRU forward for every prediction step ahead taking time as input. Thus, our decoder equations will be and our final predictions .
4.5 Model training and inference
Training. Step 1: During the first training step, our goal is to make so that later we can use the gradient approximation stated in Equation (8) so that we can then match the gradient of the ODE. For this, we we can train all parameters jointly with all the losses except for the feature ODE loss . Step 2: Once is small, we can train all losses together, however, it might be unstable and may start to increasing which in turn invalidates our gradient matching trick and makes the minimization of misleading. We found the training is more stable when freezing all previous layers to in the time module and all previous layers to in the feature module. In this case, we only focus the learning in the last layers, therefore, they should have enough representation power for this task.
Inference. At inference, although we have predictions from both time and feature modules, we solely utilize the feature module predictions as it ingests features and we want to emphasize the utility of inserting dynamics in ML models.
5 Experiments
5.1 Setup
All the results are for forecasting COVID19 mortality in the US up to 8weeks ahead in the future. Our code is available for research purposes^{2}^{2}2https://github.com/AdityaLab/EINNs. We evaluate in 47 states excluding 3 states where the SEIRM mechanistic model had difficulties fitting due to sparsity of death counts (specifically Alaska, Montana, and Wyoming). Our evaluation period is 8 months from Sept. 2020 to March 2021 (specifically CDC epidemic weeks 202036 to 202109) which includes the complete Delta wave in the US, and we make 8weeks ahead predictions for every two weeks in this period. We used June 2020 to Aug. 2020 to tune our models. For each forecasting week, all models are trained with historical data available until that week.
Shortterm (14 wks)  Longterm (58 wks)  Trend correlation  
Model  NRMSE1  NRMSE2  ND  NRMSE1  NRMSE2  ND  Pearson 
RNN  1.09  0.50  0.86  1.19  0.53  0.96  0.08 
SEIRM  2.35  1.13  1.36  7.14  2.99  3.11  0.53 
Generation  0.79  0.35  0.60  0.93  0.40  0.74  0.01 
Regularization  1.05  0.48  0.81  1.19  0.53  0.97  0.09 
Ensembling  0.91  0.41  0.68  0.93  0.40  0.69  0.01 
PINN (time module standalone)  0.84  0.38  0.64  0.93  0.40  0.72  0.24 
EINNNoGradMatching  0.82  0.36  0.61  0.89  0.38  0.68  0.04 
EINN  0.54  0.24  0.38  0.85  0.37  0.66  0.46 
Metrics. Our focus is in predictions that are accurate and wellcorrelated the epidemic trends, thus we measure two aspects of our predictive performance: error and trend correlation. Error metrics: As previous work in this area [Adhikari and others2019], we adopt metrics based on root mean squared error and absolute deviation. Because the number of deaths largely vary across regions, we use normalize versions of popular error metrics so that we can aggregate performance over multiple regions. We use two different versions of Normalized Root Mean Squared Error (NRMSE) and Normal Deviation (ND) following [Roy and others2021, Remy and others2021]. For all of these metrics, we calculate them at shortterm forecasting (14 weeks) and longterm forecasting (58 weeks) and calculate their mean value. Correlation metric: Following [Deng and others2020], we use Pearson correlation and use their median across weeks.
Data. We collected important publicly available^{3}^{3}3Data links: apple.com/covid19/mobility; google.com/covid19/mobility; coronavirus.jhu.edu; healthdata.gov; delphi.cmu.edu; gis.cdc.gov/grasp/COVIDNet/COVID19_3.html signals from a variety of trusted sources that are relevant to COVID19 forecasting. The features in this dataset are mobility from Google and Apple, social media surveys from Facebook, hospitalization data from the U.S. Department of Health & Human Services and CDC, and cases and mortality from Johns Hopkins University. In total, we collected 13 features.
Baselines. As we are the first to pose the EpiInformed Neural Forecasting problem, we do not have offtheshelf baselines. Instead, we adopt different ways to incorporate ODE dynamics into a neural model that have been explored in literature [Dash and others2022]. Generation: Similar to [Wang and others2019, SanchezGonzalez and others2020], the NN learns directly from data generated by the analytical solution of SEIRM. Regularization: Similar to [Gao and others2021, Gaw and others2019], the NN predicts both the ODE states and the ODE parameters. Then uses the ODE parameters to regularize the states via a loss based on the SEIRM equations. Ensembling: As per [Adiga and others2021, Yamana and others2017], combines predictions from RNN and SEIRM via a NN that outputs final predictions. Note that, unlike Generation and Ensembling, our method does not need integrating the ODE.
5.2 Results
Q1: Leveraging advantages of both mechanistic models and neural models. Our RNN has a lower error in short and longterm forecasting than the SEIRM, but its predictions are much less correlated with epidemic trends (see comprehensive results in Table 1). By integrating ODE and neural models, EINN is capable of taking advantage from both. Comparing with the ODE, our Pearson correlation is similar but our predictions are much more accurate (up to 77% less error). Indeed, EINN not only improves RNN correlation by 475% but also its accuracy up to 55% thanks to the incorporation of short and longterm dynamics.
Q2: Benefits over other ways to incorporate epidemic dynamics into neural models. EINN has the lowest error and best correlation in comparison with other existing ways to incorporate epidemic dynamics to neural networks. While we can see that all of them have better short and longterm error than the RNN (especially Generation and Ensembling), their correlation is worse or the same. Instead, EINN across 48 different regions (5376 predictions per model) shows significant improvements in both accuracy and correlation by a large difference.
Q3: Ablation: time module PINN and gradient matching. We perform ablation studies to understand what are the contributions of the main components of our model. First, we analyze our time module trained standalone, i.e., being trained without the feature module with losses in Equations (812) (PINN in Table 1). We can see that, although our time module PINN directly interacts with the ODE and their behavior will be coupled during training, they have different behavior in test. In fact, this points to the need that we need features to be able to extract representations that generalize in test.
Second, we assess the contribution of our gradient matching trick (EINNNoGradMatching), for which we train with all losses except for the ones in Equations (12) and (13). In this scenario where only helps to transfer the dynamics, we can see that it is a less effective way.
Q4: Case study in Delta wave: US National and New York state. In Figure 2, we exemplify the advantages of EINN (in red) in both accuracy and correlation. The RNN (blue) is accurate in short term forecasting, but we can see how the longterm predictions (see US National when the epidemic curve is declining). In addition, ODE predictions are wellcorrelated but may incur in large errors. EINN takes the best from these two.
Q5: Sensitivity to hyperpameters. We found most hyperparameters are not sensitive. See Appendix for more details.
6 Conclusions
We introduced a novel approach EINN to incorporate mechanistic dynamics (via the SEIRM model) into neural models (using a RNN style base model). We show the effectiveness of a principled method to transfer relevant knowledge via gradient matching of the ODE equations, without integrating (forward pass) the model. Through extensive experiments over 48 regions we also show the usefulness of EINN in COVID19 forecasting and also the importance of our various design choices. Overall, we believe this work opens up new avenues for leveraging epi domain knowledge into AI models for better decision making. Connecting complex mechanistic models to neural networks also enables us to have learned representations useful for other tasks downstream like whatif predictions, which would be worth exploring. In addition investigating more complex epidemiological models (like network based agent models) would be fruitful.
Acknowledgments: This work was supported in part by the NSF (Expeditions CCF1918770, CAREER IIS2028586, RAPID IIS2027862, Medium IIS1955883, Medium IIS2106961, CCF2115126), CDC MInD program, ORNL, faculty research award from Facebook and funds/computing resources from Georgia Tech. BA was supported by CDCMIND U01CK000594 and startup funds from University of Iowa. NR was supported by US NSF grants Expeditions CCF1918770, NRT DGE1545362, and OAC1835660.
References
 [Adhikari and others2019] Bijaya Adhikari et al. EpiDeep: Exploiting Embeddings for Epidemic Forecasting. In KDD, 2019.
 [Adiga and others2021] Aniruddha Adiga et al. All Models Are Useful: Bayesian Ensembling for Robust High Resolution COVID19 Forecasting. preprint, Epidemiology, 2021.
 [Arik and others2020] Sercan O. Arik et al. Interpretable Sequence Learning for COVID19 Forecasting. arXiv, 2020.
 [Ba and others2014] Jimmy Ba et al. Do deep nets really need to be deep? ANIPS, 2014.
 [Brooks and others2018] Logan C. Brooks et al. Nonmechanistic forecasts of seasonal influenza with iterative oneweekahead distributions. PLOS Comp. Biology, 2018.
 [Cai and others2021] Shengze Cai et al. Deepm&mnet: Inferring the electroconvection multiphysics fields based on operator approximation by neural networks. Journal of Comp. Physics, 2021.
 [Chen and others2018] Ricky TQ Chen et al. Neural ordinary differential equations. In NeurIPS, 2018.
 [Cho and others2014] Kyunghyun Cho et al. Learning phrase representations using rnn encoderdecoder for statistical machine translation. 2014.
 [Cramer and others2021] Estee Y Cramer et al. Evaluation of individual and ensemble probabilistic forecasts of COVID19 mortality in the US. medRxiv, 2021.
 [Dash and others2022] Tirtharaj Dash et al. A review of some techniques for inclusion of domainknowledge into deep neural networks. Scientific Reports, 2022.
 [Deng and others2020] Songgaojun Deng et al. ColaGNN: Crosslocation Attention based Graph Neural Networks for Longterm ILI Prediction. In CIKM, 2020.
 [Gaffey and others2018] Robert H Gaffey et al. Application of the cdc ebolaresponse modeling tool to disease predictions. Epidemics, 2018.
 [Gao and others2021] Junyi Gao et al. STAN: spatiotemporal attention network for pandemic prediction using realworld evidence. JAMA, 2021.
 [Gaw and others2019] Nathan Gaw et al. Integration of machine learning and mechanistic models accurately predicts variation in cell density of glioblastoma using multiparametric mri. Scientific reports, 2019.
 [Hethcote2000] Herbert W. Hethcote. The Mathematics of Infectious Diseases. SIAM Review, 2000.

[Jin and others2021]
Xiaoyong Jin et al.
InterSeries Attention Model for COVID19 Forecasting.
In SDM. SIAM, 2021.  [Kamarthi and others2021a] Harshavardhan Kamarthi et al. CAMul: Calibrated and Accurate Multiview TimeSeries Forecasting. arXiv, 2021.
 [Kamarthi and others2021b] Harshavardhan Kamarthi et al. When in Doubt: Neural NonParametric Uncertainty Quantification for Epidemic Forecasting. In NIPS, 2021.
 [Kamarthi and others2022] Harshavardhan Kamarthi et al. Back2future: Leveraging backfill dynamics for improving realtime predictions in future. ICLR, 2022.
 [Karniadakis and others2021] George Karniadakis et al. Physicsinformed machine learning. Nature Reviews Physics, 2021.
 [Karpatne and others2017] Anuj Karpatne et al. TheoryGuided Data Science: A New Paradigm for Scientific Discovery from Data. TKDE, 2017.

[Liu and others2021]
Xiao Liu et al.
Selfsupervised learning: Generative or contrastive.
TKDE, 2021.  [Lu and others2021] Lu Lu et al. Deepxde: A deep learning library for solving differential equations. SIAM Review, 2021.
 [Marathe and others2013] Madhav Marathe et al. Comp. epidemiology. Comm. of the ACM, 2013.
 [Morozova and others2021] Olga Morozova et al. One year of modeling and forecasting covid19 transmission to support policymakers in connecticut. Scientific reports, 2021.
 [Muralidhar and others2018] Nikhil Muralidhar et al. Incorporating Prior Domain Knowledge into Deep Neural Networks. In ICBD, 2018.
 [Osthus and others2019] Dave Osthus et al. Dynamic Bayesian Influenza Forecasting in the United States with Hierarchical Discrepancy (with Discussion). Bayesian Analysis, 2019.
 [Reich and others2019] Nicholas G. Reich et al. A collaborative multiyear, multimodel assessment of seasonal influenza forecasting in the United States. PNAS, 2019.
 [Remy and others2021] Sekou L Remy et al. Overcoming digital gravity when using ai in public health decisions. arXiv, 2021.
 [Rodríguez and others2021a] Alexander Rodríguez et al. DeepCOVID: An Operational Deep Learningdriven Framework for Explainable Realtime COVID19 Forecasting. In AAAI, 2021.
 [Rodríguez and others2021b] Alexander Rodríguez et al. Steering a Historical Disease Forecasting Model Under a Pandemic: Case of Flu and COVID19. In AAAI, 2021.
 [Roy and others2021] Padmaksha Roy et al. Deep diffusionbased forecasting of covid19 by incorporating networklevel mobility information. In CASSNAM, 2021.
 [SanchezGonzalez and others2020] SanchezGonzalez et al. Learning to simulate complex physics with graph networks. In ICLR. PMLR, 2020.
 [Shaman and others2013] Jeffrey Shaman et al. Realtime inﬂuenza forecasts during the 20122013 season. Nature Comm., 2013.
 [Tancik and others2020] Matthew Tancik et al. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In NIPS, 2020.
 [Vaswani and others2017] Ashish Vaswani et al. Attention is all you need. NIPS, 2017.
 [Venkatramanan and others2018] Srinivasan Venkatramanan et al. Using datadriven agentbased models for forecasting emerging infectious diseases. Epidemics, 2018.
 [Wang and others2000] Bin Wang et al. Data assimilation and its applications. PNAS, 2000.
 [Wang and others2019] Lijing Wang et al. DEFSI: Deep Learning Based Epidemic Forecasting with Synthetic Information. AAAI, 2019.
 [Wang and others2021] Sifan Wang et al. Understanding and mitigating gradient flow pathologies in physicsinformed neural networks. SIAM, 2021.
 [Wu and others2020] Joseph T Wu et al. Nowcasting and forecasting the potential domestic and international spread of the 2019ncov outbreak originating in wuhan, china: a modelling study. The Lancet, 2020.
 [Yamana and others2017] Teresa K Yamana et al. Individual versus superensemble forecasts of seasonal influenza outbreaks in the united states. PLoS Comp. Bio, 2017.
 [Yang and others2021] Liu Yang et al. Bpinns: Bayesian physicsinformed neural networks for forward and inverse pde problems with noisy data. Journal of Comp. Physics, 2021.
 [Yazdani and others2020] Alireza Yazdani et al. Systems biology informed deep learning for inferring parameters and hidden dynamics. PLOS Comp. Bio, 2020.
 [Zimmer and others2020] Christoph Zimmer et al. Influenza Forecasting Framework based on Gaussian Processes. In ICML, 2020.
Appendix
Code and data are publicly available^{4}^{4}4https://github.com/AdityaLab/EINNs. Please refer to the README in code repository for more details.
Shortterm (14 wks)  Longterm (58 wks)  Trend correlation  
Hyperparameter  Value  NRMSE1  NRMSE2  ND  NRMSE1  NRMSE2  ND  Pearson 
1  0.41  0.50  0.35  0.52  0.73  0.48  0.68  
10  0.38  0.46  0.32  0.51  0.70  0.47  0.72  
100  0.34  0.39  0.29  0.57  0.73  0.50  0.55  
1  0.38  0.46  0.32  0.51  0.70  0.47  0.72  
10  0.39  0.46  0.33  0.53  0.70  0.49  0.75  
100  0.39  0.46  0.34  0.53  0.71  0.49  0.76 
Appendix A Experimental setup (extra details)
Computational setup.
All experiments were conducted using a 4 Xeon E74850 CPU with 512GB of 1066 Mhz main memory and 4 GPUs Tesla V100 DGXS 32GB. Our method implemented in PyTorch (implementation details in the appendix) trains on GPU in about 30 mins for one predictive task. Inference is takes only a few seconds.
Realtime forecasting. We follow the literature on evaluating epidemic forecasting methodologies [Shaman and others2013, Kamarthi and others2021a, Adhikari and others2019] and use the realtime forecasting setup. We simulate realtime forecasting by making models train only using data available until each of the prediction weeks and make predictions for 1 to 8 weeks ahead in the future. Data revisions in public health data are large and may affect evaluation and conclusions [Kamarthi and others2022, Cramer and others2021], therefore, we utilize fully revised data following previous papers on methodological advances [Adhikari and others2019, Rodríguez and others2021b].
Evaluation details. Some models may make daily predictions while others weekly predictions. We follow CDC evaluation papers [Cramer and others2021, Reich and others2019] and convert all forecasts to weekly.
Appendix B Implementation details
b.1 Data Preprocessing
Feature scaling. Timeseries of exogenous features can have wide range of values (e.g., number of confirmed cases vs percentage of people with COVIDlike symptoms in social media surveys) Therefore, we scale all signals per each region for which we use standard scaling (normalization).
Time series for training. Although we may have long time series, we found our RNN works better with no chunking. As during training we have variablelength input sequences, we use a mask that is utilized when calculating the attention scores of Equation (3). As we follow the realtime forecasting setup, at inference time we use the complete input sequence thus we do not need a mask.
b.2 Architecture and hyperparameters
We describe in detail the hyperparameters for EINN used for our experiments. As mentioned in Section 5, we used data from June 2020 to Aug. 2020 for model design and hyperparameter tuning.
Time module:
It is a feedforward network with input layer layers 40x40x40x20 followed by output layers 20x40x40x5, and activation function
. Note that the input to this module is time, which is of dimension 1, but this is immediately transformed to 40 different signals via Gaussian Random Fourier feature mapping which then enter to the neural network. Between the input and output layers, we have our embeddings , which are of smaller size (20) than the other hidden layers. We make this selection because we want to make embeddings and to be as close as possible, and this is hard to achieve when we deal with highdimensional embeddings as it has been noted in the Contrastive Learning literature [Liu and others2021]. Passing to the output layer gives us . With respect to the learnable ODE parameters, we a transformation following [Yazdani and others2020]. This ensures that the actual parameter values will be within their corresponding domain (01). In our experience, we found that initializing the ODE parameters with the ones found by the analytical solution works best.Feature module: As encoder, we used a bidirectional GRU with 2 layers and hidden states of dimension 32. As decoder, we use another bidirectional GRU with 1 layer and hidden states of dimension 32 and an feedforward output layer of 32x20 to obtain . To encourage transfer learning, we utilize shared layers between the time and feature modules. Therefore, the embedding is passed to the output layer of the time module to obtain .
Loss weights: All our results use loss weight of 1 except for the following losses. For our ODE loss we use weight loss of 10, and we weight the same in our monotonicity loss . Our parameter consistency loss is weighted with . Finally, we have a helper loss for the time module that ingests data from the analytical solution of the ODE, to which we put a weight of
b.3 Hyperparameter sensitivity
Overall, we found that most hyperparameters are not sensitive. The most sensitive ones are the weights for ODE loss for time module and feature module , and embedding and output losses . To illustrate this, we vary the loss weight which is applied to both and and another loss weight which is applied to both and .
We analyze hyperparameters sensitivity on five geographical regions in the uptrend of the Delta wave (over 2 months, specifically epidemic weeks 202048 to 202101) which is one of the most difficult to predict due to the unprecedented infectiousness of the Delta variant and shift in human behavior. In Table 2, we can see that EINN performance is stable across different values of hyperparameters It is worth noting that it tuning is important to have a correct balance on these losses as increasing one weight on may improve error or correlation but degrade the one facet. As noted in the literature [Wang and others2021], this is common when working with theorybased constrains and how to best proceed remains an open problem.
Appendix C Evaluation metrics
As noted in Section 5, we used two different versions of Normalized Root Mean Squared Error (NRMSE) and Normal Deviation (ND) following [Roy and others2021, Remy and others2021]. Given prediction at prediction week for weeks ahead in the future, and the corresponding ground truth value , our error metrics are the following:
where is the number of predictions weeks and is number of weeks ahead in the future; for our setup we have and , which makes 112 different predictions for model for a single region and makes 5376 predictions per model over all regions.
Regarding to correlation, we use Pearson correlation over the sequence of 8 predictions in the future as per [Deng and others2020]:
where and are the mean values of the ground truth and the model’s predictions, respectively.