Recent years have seen an increased interest in data-driven methods with the aim to develop cheap surrogate models to perform forecasting of dynamical systems. A main driver behind this endeavour is the potential saving of computational running time required for simulation. This is particularly important for stiff multi-scale systems for which the fastest time-scale puts strong restrictions on the time step which an be employed (Han et al., 2018, 2019, Raissi and Karniadakis, 2018, Raissi et al., 2019). Promising applications for such surrogate models come from atmospheric and ocean dynamics and from climate dynamics (Schneider et al., 2017, Dueben and Bauer, 2018, Rasp et al., 2018, Gagne II et al., 2020, Bolton and Zanna, 2019, Brajard et al., 2021, Nadiga, 2020, Cleary et al., 2021).
A particularly simple and efficient machine learning architecture are random feature maps (Rahimi and Recht, 2008, 2008, Bach, 2017a, b, Sun et al., 2019). Random feature maps provide a representation of the surrogate propagator for a dynamical system by a linear combination of randomly generated high-dimensional nonlinear functions of the input data. The training of random feature map networks only requires least-square regression and it was proven rigorously that random feature maps enjoy the so called universal approximation property which states that they can approximate any continuous function arbitrarily close (Park and Sandberg, 1991, Cybenko, 1989, Barron, 1993). The framework of random feature maps was extended to include internal dynamics in so called echo-state networks and reservoir computers with remarkable success in forecasting dynamical systems (Maass et al., 2002, Jaeger, 2002, Jaeger and Haas, 2004, Pathak et al., 2018, Jüngling et al., 2019, Algar et al., 2019, Nadiga, 2020, Bollt, 2021, Gauthier et al., 2021, Platt et al., 2021).
The skill of random feature maps in providing a reliable surrogate forecast model is however severely impeded when the data used for training are contaminated by measurement noise. In previous work Gottwald and Reich (2021) showed that by combining random feature maps with data assimilation, in a framework coined RAndom Feature map Data Assimilation (RAFDA), remarkable forecast skills could be obtained for noisy observations. Moreover, RAFDA was shown to be able to provide model closure in multi-scale systems allowing for subgridscale parametrization as well as to provide reliable ensembles for probabilistic forecasting. In RAFDA the parameters of the random feature maps are learned sequentially with incoming observations in conjunction with the analysis of the state variables. Data assimilation has recently also been applied to enhance the capabilities of other traditional machine-learning algorithms (Tomizawa and Sawada, 2021, Wikner et al., 2021).
In this work, we extend RAFDA to the situation when only partial observations are available. In this case, the propagator map, that updates the system in the partially observed subspace only, is non-Markovian and knowledge of the past history is required. This problem was recently addressed in Levine and Stuart (2021) within a machine learning context using appropriate model closures. To account for the non-Markovian nature of the propagator map, we follow here a different path instead and employ phase-space reconstruction and Takens’ embedding theorem (Takens, 1981). Takens’ embedding theorem has successfully been used in data assimilation before where a forecast model was non-parametrically constructed using analogs (Hamilton et al., 2016). Here we use delay coordinates to learn a surrogate forecast model from noisy partial observations in the reconstructed phase-space.
The paper is organised as follows. In Section 2 we develop our RAFDA methodology for partial observations. In Section 3 we show how RAFDA performs on the Lorenz 63 system and that RAFDA exhibits significantly increased forecast skill when compared to standard random feature maps trained on delay coordinates. We close in Section 4 with a discussion and an outlook.
2. Random feature map and data assimilation (RAFDA)
Consider a -dimensional dynamical system which is accessed at equidistant times of interval length , , by partial noisy observations
with , observation operator , measurement error covariance matrix and
-dimensional independent and normally distributed noise, that is, .
The aim we set out to pursue is the following: using noisy observations for find an approximation to the propagator map which maps previous observations (or a suitable subset of them) to the unobserved variable at future time . In the Markovian case with the propagator map is only a function of the observation at the current time . This case was studied in our previous work (Gottwald and Reich, 2021). Here the focus is on partial observations with
for which the propagator map is non-Markovian. Our method judiciously incorporates three separate methods which we discuss in the following three subsections: 1.) Takens embedding and delay coordinates to deal with the aspect of having only access to partial observations and the resulting non-Markovian propagator map, 2.) random feature maps which form our approximation for the propagator map and 3.) data assimilation and Ensemble Kalman Filters (EnKFs) to control the observation noise and to determine the parameters of the random feature maps leading to the desired surrogate model.
2.1. Delay coordinates
When only partial observations are available Takens’ embedding theorem allows to represent the dynamics faithfully by means of phase space reconstruction and delay vectors(Takens, 1981, Sauer et al., 1991, Sauer and Yorke, 1993). Takens’ embedding theorem assumes noise-free observations, but has been successfully applied to noise-contaminated observations (Kantz and Schreiber, 2004, Schreiber, 1999, Small, 2005, Casdagli et al., 1991, Schreiber and Kantz, 1995).
We describe here the key steps for scalar-valued observations, that is, in (1). Hence, given a time-series , define the -dimensional delay vectors
for , with integer delay time . If the underlying dynamics lies inside an -dimensional phase space, then the delay reconstruction map is an embedding provided the embedding dimension is sufficiently large with where denotes the fractal box-counting dimension of the attractor (Sauer et al., 1991, Sauer and Yorke, 1993). Under these conditions, the reconstructed dynamics in faithfully represents the underlying dynamics. For finite time series, the choice of the embedding dimension and the delay time are crucial and need to be carefully chosen. See for example Kantz and Schreiber (2004), Schreiber (1999), Casdagli et al. (1991), Schreiber and Kantz (1995), Small (2005) for a detailed discussion. Specifically, if the embedding dimension is chosen too small, then the delay reconstruction map does not represent the underlying dynamics, if it is chosen too large, the reconstruction involves redundancy reducing the length of the delay vector time series . We choose the embedding dimension using the false nearest neighbour algorithm (Kennel et al., 1992). The embedding dimension is chosen as the smallest value such that the number of false nearest neighbours is less than . False nearest neighbours (at dimension ) are defined as those points for which the Euclidean distance to their nearest neighbour changes upon increasing the embedding dimension to by a factor of (relative to their smallest Euclidean distance). An appropriate delay time is determined by determining the time for which the average mutual information has its first zero-crossing (Fraser and Swinney, 1986). We use the command phaseSpaceReconstruction from Matlab (MATLAB, 2020) which implements these algorithms to determine the embedding dimension , the delay time as well as generating the delay vectors.
2.2. Random feature maps
. This is a particularly easy-to-implement network architecture in which the input is given as a linear combination of randomly sampled nonlinear functions of the input signal, and the coefficients of this linear combination are then learned from the whole available training data set via linear ridge regression. Our RAFDA extension, as described further below, instead determines the coefficients sequentially within a data assimilation procedure.
Applied to our setting, the delay coordinates are first linearly mapped by a random but fixed linear map into a high-dimensional subspace of with and then nonlinearly transformed by feature maps as
with weight matrix
and a bias
The weight matrix and the bias are chosen randomly and independently of the observed delay coordinates , , according to the distributions and , respectively. We choose here
The choice of the hyper-parameters and is system-dependent and needs to ensure that the observations cover the nonlinear domain of the -function. The reader is referred to Gottwald and Reich (2021) for further details. Note that the hyper-parameters and are kept fixed once drawn and are not learned. This restriction is made for computational simplicity and can be relaxed following E et al. (2020), Rotskoff and Vanden-Eijnden (2018).
An approximation of the propagator map in the delay coordinates is then provided by
where the matrix maps back into the delay coordinate vector space to approximate the delay vector at the next time step, and is to be learned from the noisy delay vectors , . More specifically, the weights should be chosen such that
which is achieved by minimizing the regularized cost function
where denotes the Frobenius norm of a matrix , is the matrix with columns , , and is the matrix with columns
. The parameter is used for regularization. The solution to the minimization problem for (7) can be explicitly determined as
and uses all available training data at once. In Gottwald and Reich (2021) it was shown that standard random feature maps have difficulty dealing with noisy observations, and instead it was proposed to learn the output weights sequentially within a data assimilation procedure which is described in the next section.
2.3. Data assimilation
RAFDA uses a combined parameter and state estimation within a data assimilation procedure. The main idea is to use the surrogate propagator (5) consisting of the random feature maps as the forecast model within an EnKF and to estimate sequentially the weight matrix . Concretely, we consider the forecast model for given weight matrix and given delay coordinates as
where the superscript f denotes the forecast and the superscript a denotes the analysis defined below.
To incorporate the parameter estimation into an EnKF analysis step, we consider an augmented state space with . Here is the vector consisting of all matrix elements of the weight matrix with its entries defined block-wise, that is, , and so forth.
We now also treat and ,, the analysis step for the mean is given by
with the observation matrix defined by and the measurement error covariance matrix of the delay vectors given by (compare (1)). Here denotes the Kronecker product of two matrices. The Kalman gain matrix is given by
with forecast covariance matrix
The angular bracket denotes the expectation value and the hat denotes the perturbation of from its mean , as in
Since , we can separate the state and parameter update of the Kalman analysis step (11) as
and covariance matrices and . Note that the delay vectors are correlated in time as after each steps components reappear. This implies that the setting described above is not strictly Bayesian. However, if the delay time is sufficiently large the dynamics of the combined forecast-analysis dynamical system will have sufficiently decorrelated and for practical purposes we can treat the observations as independent.
To implement the Kalman analysis step, we employ a stochastic EnKF (Burgers et al., 1998, Evensen, 2006). This allows to estimate the forecast covariances adapted to the dynamics and has advantages for the nonlinear forward model (10) and the non-Gaussian augmented state variables. Consider an ensemble of states consisting of members , , that is,
with empirical mean
and associated matrix of ensemble deviations
Ensembles for the forecast are denoted again by superscript f and those for the analysis by superscript a. In the forecast step each ensemble member is propagated individually using (10), updating the previous analysis ensemble to the next forecast ensemble . The forecast covariance matrix (13) used in the analysis step (11) can be estimated as a Monte-Carlo approximation from the forecast ensemble deviation matrix via
In the stochastic ensemble Kalman filter observations receive a stochastic perturbation , , drawn independently from the Gaussian observational noise distribution . The associated ensemble of perturbed observations is given by
The EnKF analysis update step is then given by
To mitigate against finite ensemble size effects covariance inflation with is typically introduced with (Anderson and Anderson, 1999). Such multiplicative inflation preserves the ensemble mean but increases the forecast error covariance. In Gottwald and Reich (2021), which considers the fully observed case with , localisation was employed by only considering covariances between a component of the state variable and its associated block in the parameter . This localisation was found here not to be advantageous; we believe that this is due to the fact that each component of the delay vectors appears in all components at different times. We therefore do not perform any localisation in the numerical simulations presented below.
The initial ensemble is initialized as and where is the vectorial form of the solution to the ridge regression formulation (7) and is a parameter specifying the spread of the initial ensemble.
The ensemble forecast step defined by (10), together with the EnKF analysis step (22) constitute our combined RAFDA method. We run RAFDA for a single long training data set of length . The approximation of the propagator map is given by the random feature model (5) where the weight matrix is given by the ensemble mean of the weight matrix at final training time and is denoted by .
We summarize our RAFDA method in Algorithm 1.
3. Numerical results
We consider the Lorenz-63 system (Lorenz, 1963)
with . Observations are only available for , i.e. and in (1), and are taken every time units. We employ observational noise with measurement error covariance and use unless stated otherwise (compare (1)). We ensure that the dynamics evolves on the attractor by discarding an initial transient of model time units. The optimal embedding dimension and delay time for these parameters are estimated as and . In the following times are measured in units of the Lyapunov time with the maximal Lyapunov exponent .
We use a reservoir of size with internal parameters and , and an ensemble size of for the ensemble Kalman filter. The regularization parameter is set to and the inflation parameter to . We employ a training set of length .
To test the propensity of RAFDA to learn a surrogate model (5) we generate a validation data set , sampled with the same rate as the training data set with from the -component of the Lorenz-63 system (24). To quantify the forecast skill we measure the forecast time , defined as the largest time such that the relative forecast error , where the are generated by the learned surrogate model. We choose here .
The forecast skill depends crucially on the choice of the randomly chosen internal parameters as well as on the training and validation data set. We therefore report on the mean behaviour over realisations, differing in the training and validation data set, in the random draws of the internal parameters and in the initial ensembles for RAFDA. Each training and validation data set is generated from randomly drawn initial conditions which are evolved independently over model time units to ensure that the dynamics has settled on the attractor.
In the following, we compare RAFDA with standard random feature maps using linear regression (LR). Figure1 shows the empirical histogram for both RAFDA and LR when the surrogate model (5) is trained on data contaminated with noise of strength . The mean forecast time obtained by RAFDA is and is roughly three times larger than the forecast time obtained using standard random feature maps which yields
. Furthermore, it is seen that the distribution of forecast times for RAFDA is heavily skewed towards larger forecast times. As expected, these forecast times are smaller than for the fully observed case with, as considered in Gottwald and Reich (2021), where extreme values of were reported. Figure 2 shows a typical example of for a forecast time of , which is close to the mean forecast time. Figure 3 shows that the surrogate model (5) obtained using RAFDA produces a model which very well approximates the long-time statistics of the full Lorenz system (24) in the sense that its trajectories reproduce the attractor in delay-coordinate space. Contrary, LR does not lead to surrogate models which are consistent with the dynamics of the Lorenz system (24). Reproducing dynamically consistent trajectories is, of course, paramount for the purpose of using such surrogate models for long-time integration when the interest is rather on the overall statistical behaviour rather than on accurate short-term forecasts.
The ability of RAFDA to control noise present in a training data set depends on the strength of the noise. We show results of the mean forecast time for a range of . RAFDA outperforms standard random feature maps with linear regression for all noise levels. Moreover, there is a robust plateau with mean forecast times of around for a large range of noise strengths with . Notice that for , maybe surprisingly, the mean forecast time of RAFDA does not converge to the one obtained by standard random feature maps but remains threefold larger. This was discussed in Gottwald and Reich (2021): although LR achieves a smaller value of the cost function (7) for it does not generalise as well to unseen data. The cost function is not zero as generically the output data do not lie in the span of the random feature maps. It is important to note that LR with assumes model error (which is not present in our application of simulating the Lorenz-63 system (24) rather than observational error.
We proposed a new data-driven method to estimate surrogate one-step forecast maps from noisy partial observations. Our method determines the surrogate map in the reconstructed phase space for delay vector coordinates, thereby dealing with the problem of having only access to partial observations. We employed the RAFDA framework in the reconstructed phase space, in which the surrogate map is constructed as a linear combination of random feature maps the coefficients of which are learned sequentially using a stochastic EnKF. We showed that learning the coefficients of the linear combination of random feature maps sequentially with incoming new observations rather than by learning them using the method of least-squares on the whole data set greatly increases the forecast capability of the surrogate model and significantly controls the measurement noise.
Our numerical results showed that the quality of the surrogate model exhibits some degree of variance, depending on the draw of the arbitrarily fixed internal parameters of the random feature map modeland . Being able to choose good candidates for those parameters or learning them in conjunction with the rest of the surrogate model would greatly improve the applicability of the method. This is planned for future research.
The random feature map architecture is closely related to reservoir computers (Jaeger, 2002, Jaeger and Haas, 2004, Pathak et al., 2018, Jüngling et al., 2019). In reservoir computers the incoming signal is used in conjunction with an internal reservoir dynamics to produce the output. It is argued that this helps to take into account non-trivial memory of the underlying dynamical system. It will be interesting to see if random feature maps in the space of delay vectors performs as well as reservoir computers. For systems with a fast decay of correlation such as the Lorenz 63 system the additional reservoir dynamics did not lead to an improvement of the surrogate map Gottwald and Reich (2021), but incorporating memory for systems with slow decay of correlation taking into account information from past observations may be important.
SR is supported by Deutsche Forschungsgemeinschaft (DFG) - Project-ID 318763901 - SFB1294
- Algar et al.  S. D. Algar, T. Lymburn, T. Stemler, M. Small, and T. Jüngling. Learned emergence in selfish collective motion. Chaos: An Interdisciplinary Journal of Nonlinear Science, 29(12):123101, 2019. doi: 10.1063/1.5120776. URL https://doi.org/10.1063/1.5120776.
Anderson and Anderson 
J. L. Anderson and S. L. Anderson.
A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts.Monthly Weather Review, 127(12):2741–2758, 1999. doi: 10.1175/1520-0493(1999)127¡2741:AMCIOT¿2.0.CO;2.
Breaking the curse of dimensionality with convex neural networks.J. Mach. Learn. Research, 18(19):1–53, 2017a. URL http://jmlr.org/papers/v18/14-546.html.
- Bach [2017b] F. Bach. On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Research, 18(21):1–38, 2017b. URL http://www.jmlr.org/papers/v18/15-178.html.
Universal approximation bounds for superposition of a sigmoidal function.IEEE Trans. on Inform. Theory, 39:930–945, 1993. doi: 10.1109/18.256500.
- Bollt  E. Bollt. On explaining the surprising success of reservoir computing forecaster of chaos? The universal machine learning dynamical system with contrast to VAR and DMD. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(1):013108, 2021. doi: 10.1063/5.0024890. URL https://doi.org/10.1063/5.0024890.
Bolton and Zanna 
T. Bolton and L. Zanna.
Applications of deep learning to ocean data inference and subgrid parameterization.Journal of Advances in Modeling Earth Systems, 11(1):376–399, 2019. doi: https://doi.org/10.1029/2018MS001472. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2018MS001472.
- Brajard et al.  J. Brajard, A. Carrassi, M. Bocquet, and L. Bertino. Combining data assimilation and machine learning to infer unresolved scale parametrization. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 379(2194):20200086, 2021. doi: 10.1098/rsta.2020.0086. URL https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2020.0086.
- Burgers et al.  G. Burgers, P. J. van Leeuwen, and G. Evensen. Analysis scheme in the ensemble Kalman filter. Monthly Weather Review, 126(6):1719–1724, 1998. doi: 10.1175/1520-0493(1998)126¡1719:ASITEK¿2.0.CO;2.
- Casdagli et al.  M. Casdagli, S. Eubank, J. Farmer, and J. Gibson. State space reconstruction in the presence of noise. Physica D: Nonlinear Phenomena, 51(1):52–98, 1991. ISSN 0167-2789. doi: https://doi.org/10.1016/0167-2789(91)90222-U. URL https://www.sciencedirect.com/science/article/pii/016727899190222U.
- Cleary et al.  E. Cleary, A. Garbuno-Inigo, S. Lan, T. Schneider, and A. M. Stuart. Calibrate, emulate, sample. Journal of Computational Physics, 424:109716, 2021. ISSN 0021-9991. doi: https://doi.org/10.1016/j.jcp.2020.109716. URL https://www.sciencedirect.com/science/article/pii/S0021999120304903.
- Cybenko  G. Cybenko. Approximation by superposition of a sigmoidal function. Math. Contr., Sign., and Syst., 2:303–314, 1989. doi: 10.1007/BF02551274.
- Dueben and Bauer  P. D. Dueben and P. Bauer. Challenges and design choices for global weather and climate models based on machine learning. Geoscientific Model Development, 11(10):3999–4009, 2018. doi: 10.5194/gmd-11-3999-2018. URL https://gmd.copernicus.org/articles/11/3999/2018/.
- E et al.  W. E, C. Ma, S. Wojtowytsch, and L. Wu. Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t, 2020.
- Evensen  G. Evensen. Data Assimilation: The Ensemble Kalman Filter. Springer, New York, 2006.
- Fraser and Swinney  A. M. Fraser and H. L. Swinney. Independent coordinates for strange attractors from mutual information. Phys. Rev. A (3), 33(2):1134–1140, 1986. ISSN 1050-2947. doi: 10.1103/PhysRevA.33.1134. URL https://doi.org/10.1103/PhysRevA.33.1134.
Gagne II et al. 
D. J. Gagne II, H. M. Christensen, A. C. Subramanian, and A. H. Monahan.
Machine learning for stochastic parameterization: Generative adversarial networks in the Lorenz ’96 model.Journal of Advances in Modeling Earth Systems, 12(3), 2020. doi: https://doi.org/10.1029/2019MS001896. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2019MS001896.
- Gauthier et al.  D. Gauthier, E. Bollt, A. Griffith, and W. Barbosa. Next generation reservoir computing, 2021.
- Gottwald and Reich  G. A. Gottwald and S. Reich. Supervised learning from noisy observations: Combining machine-learning techniques with data assimilation. Physica D: Nonlinear Phenomena, 423:132911, 2021. ISSN 0167-2789. doi: https://doi.org/10.1016/j.physd.2021.132911. URL https://www.sciencedirect.com/science/article/pii/S0167278921000695.
- Hamilton et al.  F. Hamilton, T. Berry, and T. Sauer. Ensemble Kalman filtering without a model. Phys. Rev. X, 6:011021, Mar 2016. doi: 10.1103/PhysRevX.6.011021.
Han et al. 
J. Han, A. Jentzen, and W. E.
Solving high-dimensional partial differential equations using deep learning.Proceedings of the National Academy of Sciences, 115(34):8505–8510, 2018. doi: 10.1073/pnas.1718942115.
- Han et al.  J. Han, C. Ma, Z. Ma, and W. E. Uniformly accurate machine learning-based hydrodynamic models for kinetic equations. Proceedings of the National Academy of Sciences, 116(44):21983–21991, 2019. doi: 10.1073/pnas.1909854116.
A tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the ”echo state network” approach, 2002.
- Jaeger and Haas  H. Jaeger and H. Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004. doi: 10.1126/science.1091277.
- Jüngling et al.  T. Jüngling, T. Lymburn, T. Stemler, D. Correa, D. Walker, and M. Small. Reconstruction of complex dynamical systems from time series using reservoir computing. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2019. doi: 10.1109/ISCAS.2019.8702137.
- Kantz and Schreiber  H. Kantz and T. Schreiber. Nonlinear time series analysis. Cambridge University Press, Cambridge, second edition, 2004.
- Kennel et al.  M. B. Kennel, R. Brown, and H. D. I. Abarbanel. Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys. Rev. A, 45:3403–3411, Mar 1992. doi: 10.1103/PhysRevA.45.3403. URL https://link.aps.org/doi/10.1103/PhysRevA.45.3403.
- Levine and Stuart  M. E. Levine and A. M. Stuart. A framework for machine learning of model error in dynamical systems, 2021.
- Lorenz  E. N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963. doi: 10.1175/1520-0469(1963)020¡0130:DNF¿2.0.CO;2.
- Maass et al.  W. Maass, T. Natschläger, and H. Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11):2531–2560, 2002. doi: 10.1162/089976602760407955.
- MATLAB  MATLAB. version 18.104.22.1687703 (R2020b). The MathWorks Inc., Natick, Massachusetts, 2020.
- Nadiga  B. Nadiga. Reservoir computing as a tool for climate predictability studies. Journal of Advances in Modeling Earth Systems, n/a(n/a):e2020MS002290, 2020. doi: https://doi.org/10.1029/2020MS002290. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2020MS002290.
Park and Sandberg 
J. Park and I. Sandberg.
Universal approximation using radial-basis-function networks.Neural Computation, 3:246–257, 1991. doi: 10.1162/neco.1922.214.171.124.
- Pathak et al.  J. Pathak, B. Hunt, M. Girvan, Z. Lu, and E. Ott. Model-free prediction of large spatiotemporally chaotic systems from data: A reservoir computing approach. Phys. Rev. Lett., 120:024102, Jan 2018. doi: 10.1103/PhysRevLett.120.024102.
- Platt et al.  J. A. Platt, A. Wong, R. Clark, S. G. Penny, and H. D. I. Abarbanel. Forecasting using reservoir computing: The role of generalized synchronization, 2021.
- Rahimi and Recht  A. Rahimi and B. Recht. Random features for large-scale kernel machines. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.pdf.
- Rahimi and Recht  A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561, 2008.
- Raissi and Karniadakis  M. Raissi and G. E. Karniadakis. Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357:125 – 141, 2018. doi: https://doi.org/10.1016/j.jcp.2017.11.039.
- Raissi et al.  M. Raissi, P. Perdikaris, and G. Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686 – 707, 2019. doi: https://doi.org/10.1016/j.jcp.2018.10.045.
- Rasp et al.  S. Rasp, M. S. Pritchard, and P. Gentine. Deep learning to represent subgrid processes in climate models. Proceedings of the National Academy of Sciences, 115(39):9684–9689, 2018. ISSN 0027-8424. doi: 10.1073/pnas.1810286115. URL https://www.pnas.org/content/115/39/9684.
- Rotskoff and Vanden-Eijnden  G. M. Rotskoff and E. Vanden-Eijnden. Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error, 2018.
- Sauer and Yorke  T. Sauer and J. A. Yorke. How manyu delay coordinates do you need? International Journal of Bifurcation and Chaos, 3(3):737–744, 1993.
- Sauer et al.  T. Sauer, J. A. Yorke, and M. Casdagli. Embedology. Journal of Statistical Physics, 65(3):579–616, 1991.
- Schneider et al.  T. Schneider, S. Lan, A. Stuart, and J. Teixeira. Earth system modeling 2.0: A blueprint for models that learn from observations and targeted high-resolution simulations. Geophysical Research Letters, 44(24):12,396–12,417, 2017. doi: https://doi.org/10.1002/2017GL076101. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2017GL076101.
- Schreiber  T. Schreiber. Interdisciplinary application of nonlinear time series methods. Physics Reports, 308(1):1–64, 1999. ISSN 0370-1573. doi: https://doi.org/10.1016/S0370-1573(98)00035-0. URL https://www.sciencedirect.com/science/article/pii/S0370157398000350.
- Schreiber and Kantz  T. Schreiber and H. Kantz. Noise in chaotic data: Diagnosis and treatment. Chaos: An Interdisciplinary Journal of Nonlinear Science, 5(1):133–142, 1995. doi: 10.1063/1.166095. URL https://doi.org/10.1063/1.166095.
- Small  M. Small. Applied nonlinear time series analysis, volume 52 of World Scientific Series on Nonlinear Science. Series A: Monographs and Treatises. World Scientific Publishing Co. Pte. Ltd., Hackensack, NJ, 2005. ISBN 981-256-117-X. doi: 10.1142/9789812567772. URL https://doi.org/10.1142/9789812567772. Applications in physics, physiology and finance.
Sun et al. 
Y. Sun, A. Gilbert, and A. Tewari.
On the approximation capabilities of ReLU neural networks and random ReLU features, 2019.
- Takens  F. Takens. Detecting strange attractors in turbulence. In D. Rand and L.-S. Young, editors, Dynamical Systems and Turbulence, Warwick 1980, pages 366–381, Berlin, Heidelberg, 1981. Springer Berlin Heidelberg. ISBN 978-3-540-38945-3.
- Tomizawa and Sawada  F. Tomizawa and Y. Sawada. Combining Ensemble Kalman Filter and Reservoir Computing to predict spatio-temporal chaotic systems from imperfect observations and models, 2021.
- Wikner et al.  A. Wikner, J. Pathak, B. R. Hunt, I. Szunyogh, M. i. Girvan, and E. Ott. Using data assimilation to train a hybrid forecast system that combines machine-learning and knowledge-based components. Chaos: An Interdisciplinary Journal of Nonlinear Science, 31(5):053114, 2021. doi: 10.1063/5.0048050. URL https://doi.org/10.1063/5.0048050.