Introduction
Forecasting a future system state is an important task in various fields such as neuroscience, ecology, finance, fluid dynamics, weather, and disaster prevention. If accurate system equations are unknown but a time series is observed, we can use modelfree forecasting approaches. Although there are various modelfree methods, if the time series is assumed to be nonlinear and deterministic, one of the most common approaches is delay embedding. According to embedding theorems[1, 2], we can reconstruct the underlying dynamics by delay embedding (see Methods for details). These theorems ensure that the map from the original attractor to the reconstructed attractor has a onetoone correspondence. These embedding theorems have also been extended to multivariate data and nonuniform embeddings[3].
When multivariate time series are observed, various embeddings can be obtained on the basis of the extended embedding theorem[3]. Ye and Sugihara[4] exploited this property; they developed multiview embedding (MVE), which combines multiple forecasts based on the topperforming multiple embeddings scored by the insample error. MVE can yield a more accurate forecast than the single best embedding, especially for short time series[4]. Although MVE works well with lowdimensional data, it cannot be simply applied to highdimensional data. This is because MVE requires forecasts derived from all possible embeddings, whose total number combinatorially increases with the number of variables. Although we can approximate MVE by randomly chosen embeddings if the total number of possible embeddings is beyond the computable one[5], it is difficult to find highperformance embeddings for highdimensional data. Recently, Ma et al.[6] presented an outstanding framework, randomly distributed embedding (RDE), to tackle these highdimensional data. Their key idea is to combine forecasts yielded by randomly generated “nondelay embeddings” of target variables. These “nondelay embeddings” can also reconstruct the original state space for some cases and drastically reduce the possible number of embeddings. Ma et al.[6]
also showed that small embedding dimensions worked fine, even for highdimensional dynamics, and successfully forecasted shortterm highdimensional data using the RDE framework. Although RDE showed outstanding results, there is room for improvement for some specific tasks. For example, “nondelay embedding” may overlook important pieces of information if delayed observations carry such pieces of information, e.g., the upstream river height for flood forecasting. A random distribution may also be a problem for the case where only partial variables are valid system variables; in this case, most of the random embeddings are invalid and not suitable to forecast. In addition, several important hyperparameters—the embedding dimension and the number of embeddings to combine—must be manually or empirically selected. Although ensembles tend to yield better results with significant diversity among its members
[7, 8], RDE only aggregates forecasts for a fixed embedding dimension with a fixed number of embeddings to combine, regardless of their forecast performance.In this study, we propose another forecasting framework that overcomes the disadvantages of MVE and RDE. Our key idea is to prepare diverse “suboptimal embeddings” via combinatorial optimization. We combine the optimal number of these embeddings to maximize the performance of the combined forecast. In contrast to the existing frameworks, the suitable embedding dimensions and the number of embeddings to combine are automatically determined through this procedure. Our proposed framework achieves better forecast performance than the existing frameworks for highdimensional toy models and a realworld flood dataset.
Results
Proposed forecasting framework
Our proposed framework yields a single forecast to combine multiple ones, which are obtained by multiple embeddings. These embeddings are obtained through two steps. First, we obtain suboptimal embeddings, which are suboptimal solutions to minimize the forecast error. The aim of this procedure is to yield diverse embeddings to improve the performance of the combined forecast because such diversity is important for the performance of the ensemble[7, 8]. Second, we make the insample forecasts based on all suboptimal embeddings following MVE and RDE but pick the optimal number of embeddings to minimize the error of the combined forecasts. A schematic of the procedure is illustrated in Figure 1. We describe each step in detail in the following subsubsections.
First step: Preparing suboptimal embeddings
In our framework, we first obtain diverse suboptimal embeddings to minimize the insample forecast error. With the observed time series , a set of possible embeddings
, and a corresponding delay vector
with , we can obtain the stepsahead forecast at time by a map as follows:(1) 
We can obtain such maps by suitable fitting algorithms, e.g., the conventional method of analogues[9], as in MVE, and Gaussian process regression[10], as in RDE. Throughout this paper, we apply the same algorithm, a variation of the method of analogues (see Supplementary Information), to all forecasting frameworks including the proposed one. This is because we need to compare the performance of the frameworks on a fair basis that is not affected by the performance of each fitting algorithm.
To obtain highperformance embeddings in an efficient manner, we solve combinatorial optimization problems instead of random selection or using brute force. We split the set of training time indices into datasets. Then, we solve the following optimization problem for each to minimize the error:
(2) 
where is the th set of training time indices and is an appropriate norm. We use the norm throughout this paper. Note that is computed in an insample manner; we use all of , except for the current time, as its library. When we consider embeddings up to lags to embed the latest observation of the target variable, the number of possible embeddings is . If is sufficiently large, it is almost impossible to compute all possible embeddings because of combinatorial explosion. Here, we take a straightforward approach to minimize the forecast error: an evolution strategy that interprets an embedding as a onedimensional binary series[11, 12]. See Methods for details. Throughout this paper, we applied the evolution strategy[13].
In the process of each optimization, we store not only the best embedding but also the “Hall of Fame,” which preserves the best individuals in each generation. From the Hall of Fame, we select the best solutions that satisfy the condition that the Hamming distance is larger than or equal to a certain threshold as follows:
(3) 
where is the th embedding of the sorted Hall of Fame. Namely, we sort the Hall of Fame by the fitness (the objective function of equation 2) and pick the best embeddings in order from the sorted Hall of Fame to satisfy equation (3). This condition prunes similar embeddings and keeps diverse embeddings for the next step. To apply this procedure, we obtain suboptimal embeddings in total. See also the first part of Algorithm 1 in Methods.
The procedure of the times optimizations yields diverse but highperformance embeddings. If a dataset contains a large number of variables, it is extremely difficult to obtain the exact solution by evolution strategies or any other metaheuristic algorithms. These algorithms find suboptimal solutions for most cases, and these solutions can be easily changed by slight changes in the optimization conditions. The procedure rather exploits this property; the times optimization can find sets of diverse suboptimal solutions. Instead of minimizing the whole insample error with different conditions, we minimize each split dataset and utilize the history of solutions to significantly reduce the computational time. We numerically show the effect of times optimization in Supplementary Information.
To create diverse embeddings and improve the performance of the combined forecast, we can optionally consider finite impulse response (FIR) filters for each suboptimal embedding; namely, embeddings are obtained in total. These filtered embeddings are justified by the Filtered Delay Embedding Prevalence Theorem[2], and a study has shown that the combination of various forecasts via linear filters improves the forecast result in some cases[14]. Although we can compute suboptimal embeddings for each filter, we simply consider filters based on the obtained embeddings. Empirically, the optimization of the embeddings for each filter does not realize a significant improvement in performance to justify the additional computational cost. See Methods for the details of forecasting filtered time series. In this paper, we set the filter coefficients as for all numerical examples.
Second step: Combining forecasts to minimize the insample error
Next, we pick the optimal number of embeddings from the embeddings to minimize the error of the combined forecast. We compute insample forecasts by the embeddings for all of , as done in the MVE or RDE aggregation schemes. We determine the optimal number of embeddings to integrate forecasts, which is neither the square root of the possible number of embeddings (as MVE does) nor the embedding dimension (as the RDE aggregation scheme does). Here, we define a tuple of the forecast indices sorted by the insample error for step as , and the th best embedding is written as . Note that is separately computed for each forecast horizon because accurate embeddings may differ according to the forecast horizons, as discussed later. We obtain to minimize the combined forecast as follows:
(4)  
(5)  
(6) 
We can obtain in a bruteforce manner with a very small computational cost because the th combined forecast can be recursively computed as follows:
(7) 
where . We compute and the corresponding error for all and pick the optimal value to minimize the squared error. Note that can be different for each ; empirically, tends to be larger for a longer step.
For unobserved samples , the forecast is computed as follows:
(8) 
Note that only forecasts need to be computed rather than forecasts. We summarize the whole algorithm in Algorithm 1 in Methods.
Numerical experiments
Toy models
We demonstrate our framework with toy models. We first tested with the 10dimensional Lorenz’96I model [15]. The model is described as 10dimensional differential equations, and we used five of their system variables and five random walks as a dataset—namely, Lorenz’96I’s and random walks . We forecasted up to 10 steps. We compared the proposed framework with existing frameworks—namely, randomly distributed embedding with an aggregation scheme (RDE), multiview embedding (MVE), a variation of MVE (statedependent weighting[5] (SDW)), and the single best embedding via the ES algorithm. See Supplementary Information for the detailed conditions.
We first tested the performance with the data length of 4000 for the database and 500 for the evaluation. Figure 2(a) shows that the proposed forecast achieved the best performance among the existing frameworks up to 10 steps ahead with the length of 4000.
We conducted a sensitivity analysis with the data length. We set the data length, i.e., the size of the training dataset, to {100, 200, 500, 1000, 2000, 4000} and tested the performance with 500 samples for all cases. Figure 2(b) shows that the proposed fivestepsahead forecast yielded the minimum error among the existing frameworks for all the data lengths.
We also checked the robustness against observational noise. We added Gaussian observational noise scaled by the standard deviation of the target variable and tested the performance with a scale range of {0.0, 0.1, 0.2, 0.3}. We set the training data length as 500 and evaluated with 500 samples. Figure
2(c) shows that a lower noise level results in better performance for the proposed framework compared with the other frameworks. Although the proposed framework achieved the best performance up to 10% noise, the performance was degraded by highlevel noise and is the second best for 30% noise.We profiled which variables were embedded for the combined forecasts for the case where the data length is 4000 without noise. We summed embeddings for each variable, where 1 means embedded and 0 means not embedded. Then, we computed the proportion of embedding of each variable, normalizing the sum to one for each . Figure 3(a) shows that random walks , and were seldom embedded with the proposed framework. This result suggests that our proposed framework can select the appropriate variables. The results also show that the proportion of embedding of the target variable is large for shortterm prediction and small for longerterm prediction. This result is consistent with that of an existing study[16].
We also profiled the filter coefficient and the embedding dimension of the proposed combined forecasts. We averaged and over the combined members for each step. Figure 3(b) shows that smaller dimensions are selected for shorter steps, and larger dimensions are selected for longer predictions. The figure also shows that smaller and larger values of are selected for shorter and longer steps, respectively. Because we employ highpass filtering with this model, a smaller focuses on a smaller time period and vice versa.
We investigated the number of combined forecasts and the forecast errors to check whether the proposed integration scheme works properly. Although the proposed framework determines on the basis of the insample error, also minimizes the test error in this example (see Figure 3(c)).
We also demonstrate our framework with the Kuramoto–Sivashinsky equations[17, 18]. We generated 20dimensional time series: the values of the first 10 grids of the Kuramoto–Sivashinsky equations and 10 random walks . We forecasted up to 10 steps using the same parameter values, data length, and noise scales as used in the Lorenz’96I example. See Supplementary Information for the detailed conditions.
Despite the increases in the variables, Figures 4(a) and (b) show that the proposed forecast yielded the minimum error among the existing frameworks, as in the Lorenz’96I example. Regarding the noise sensitivity, a lower noise level results in better performance for the proposed framework compared to the other frameworks. The proposed framework achieved the best performance for a wide data range and relatively small observational noise.
Finally, we tested the sensitivity to the number of variables. We changed the dimensions of Lorenz’96I to {10, 20, 40, 80, 160, 320}, and we substituted a half of each dataset with random walks. We set the training data length to 500 and evaluated with 500 samples. For all ranges of variables, the proposed framework achieved the best performance (Figure 5(a)). We also tested the case where all variables are available and the datasets do not contain random walks; this is the case in which delay embedding is not needed because all system variables are observed. In this case, the performance difference is negligible for all frameworks (except the single best embedding), and RDE achieved the best score for the case where the number of variables is 320. Although this condition is too ideal because the observation function is an identity map, we can expect RDE to perform well for such cases that include many informative variables.
Flood dataset
We tested the proposed framework with realworld data, namely the flood forecasting competition dataset at “Artificial Neural Network Experiment (ANNEX 2005/2006)”
[19]. This dataset contains nine variables: the river stages of the target site and three upstream sites and five rain gauges for three periods. We forecasted 6, 12, 18, and 24 h ahead of the target river stage using all nine variables. See Supplementary Information for the detailed conditions. We also compared our framework with four other conventional machinelearning methods used in an existing study
[5]: a recurrent neural network with long shortterm memory (LSTM)
[20], support vector regression (SVR)[21], and random forest regression
[22].The results in Table 1 show that the proposed framework yielded the best rootmeansquare error (RMSE) through 12–24 h ahead, and the differences compared to the other methods were expanded over the forecast horizon. Although the 6 h (one step)ahead forecast was not the best, the difference compared to the best score was negligible. The dataset contains only 1460 points, and longterm memory is not required for this task; hence, LSTM did not perform well. Although the test data include the largest river stage on 19950201, Figure 6(a) shows that the proposed framework properly forecasted the inexperienced river stage.
We compared the combined forecast with its individual members obtained with the ES algorithm. Regarding the individual members, the order of the insample score was different from the order of the test (see Figure 6(b)). In contrast, the combined forecast achieved the minimum error by far for the both the insample and test errors. This forecast stability is one of the largest advantages of combining forecasts.
We also profiled and averaged over the combined members for each step. The results in Figure 6(c) suggest the same trends as the toy models, even for the realworld data. Specifically, smaller dimensions and smaller values of were selected for shorter steps.
Discussion
Our numerical experiments suggest that our proposed framework works well for a wide range of the data length, including short time series such as 100. In contrast, the framework did not appear to work fine with very noisy data, and MVE achieved the best performance, as shown in the noise sensitivity analyses (Figures 2(c) and 4(c)). This suggests that noise prevents the framework from selecting optimal embeddings on the basis of the squared errors, and it results in obtaining invalid embeddings. MVE can still be a good option for very noisy data. Note that the performance of RDE was not satisfactory in some of our numerical experiments, but this is because the RDE framework was originally proposed for extremely highdimensional data, e.g., hundreds of variables. As shown in Figure 5(b), we can expect RDE to perform well for cases that include many informative variables, and delay embedding is not needed.
As shown in the numerical experiments, the proposed framework worked well for highdimensional dynamics and datasets containing highdimensional variables. On the other hand, it is not suitable to apply it to relatively lowdimensional data. We tested the framework with lowdimensional datasets, and the results suggest that the proposed framework does not always yield the best performance, especially for lowdimensional datasets such as the Rössler equations[23] (see Supplementary Information). This is because it is easy to find suitable embeddings to forecast in a bruteforce manner, especially for lowdimensional datasets, while it becomes combinatorially difficult as the number of variables increases. In addition, the aim of times optimization is to yield diverse embeddings (see the effect of times optimization in Supplementary Information). However, if a dataset contains a small number of variables, almost the exact solutions are obtained for every optimization.
The proposed method combines forecasts via the simple average, optimizing the number of forecasts. The simple average works fine for most cases because suboptimal embeddings can equally forecast well thanks to solving the combinatorial optimization problem, and the test performance order is not always the same as the corresponding training data, especially for realworld data (see Figure 6(b)). If suboptimal embeddings show quite different forecast performances, the weighted average based on the error distribution[6] may work better. We can also apply the expert advice framework[24] if better forecasts are assumed to be changed over time.
We need to select the values of several parameters to apply the proposed framework, i.e., the number of split datasets , the number of the suboptimal embeddings for each batch , the criteria of the Hamming distance , the number of filters , and their components. We can control the variations in forecasts with these parameters. In order to set , we consider the number of samples in each divided dataset; empirically, the framework works well with more than 50 samples for each divided dataset. Samples that are too small prevent the framework from finding good embeddings, but samples that are too large reduce the diversity of the embeddings. Although can be any large integer (up to the total number of individuals of the evolution strategy), is sufficient for most cases. Note that the value of that is too large is overkill and unnecessarily increases the computational cost. The criteria for the Hamming distance directly controls the similarity of suboptimal embeddings. We set in this study, and it worked fine for all cases. Note that conventional MVE and RDE guarantee that the Hamming distance among all embeddings is larger than two (i.e., ) for a fixed embedding dimension. In practice, appropriate filters depend on the prediction steps. If we predict shorter time periods, highpass filters will work fine; if we predict longer time periods, low frequencies are needed to obtain fine forecasts, as the numerical experiments suggest. Although we treat highpass filters in this study, lowpass filters may work for extremely noisy data. We need to understand the effects of these parameters and set the appropriate values within reason, , which is usually much smaller than the total number of embeddings evaluated in the first step.
Our numerical experiments suggest that the appropriate embeddings vary with the prediction steps. The results show that a smaller with a smaller is preferable for a shortterm forecast, and a larger with a larger is preferable for a longerterm forecast. These results are consistent with those of an existing study[16] and with the fact that the longerstep forecasts need more complex map from the current inputs to the forecasts. Therefore, more information is needed to express the complex map.
Although we applied the ES algorithm to obtain the suboptimal embeddings, it is worth considering other methods to solve the combinatorial optimization problem. We can apply other evolution algorithms, e.g., simulated annealing[25] and ant colony optimization[26]. Any algorithm may work as long as it can store the suboptimal embeddings through its optimization procedure. Testing and evaluating other algorithms are open problems.
Throughout this study, we used a variation of a local nonlinear prediction method to forecast the target variable. One of the most significant advantages of these local prediction methods is that we can compute insample forecasts very easily with a small computational cost; all we have to do is to exclude the current query from the nearest neighbors. In addition, owing to recent progress in approximate neighboring search algorithms (e.g., Refs. 27, 28), these algorithms are fast enough to apply the evolution algorithms. Moreover, it is also suitable to obtain suboptimal embeddings through the forecast errors because local nonlinear prediction methods are sensitive to the false nearest neighbors derived from invalid features. However, we can apply other regression methods, e.g., Gaussian process regression[10], as done in RDE[6], for further improvements. Note that these methods incur a very large computational cost to yield insample forecasts, as mentioned in Ref. 6. To reduce the computational cost, we can apply these methods to only the process in the second step and can approximate the insample error by the kfold crossvalidation error.
Methods
Delay embedding
Delay embedding is a method of reconstructing the original state space using the delay coordinates . Here, with the dimensional observed time series , variable , and time lag , is defined as follows:
(9) 
where is the embedding dimension, , and no duplication is allowed for any element. According to embedding theorems[1, 3], an attractor reconstructed by is an embedding with the appropriate . Note that the number of possible reconstructions increase combinatorially.
Optimizing embedding by an evolution strategy
One can optimize embedding to solve combinatorial optimization problems to treat embeddings as binary series[11, 12]. Here, we solve equation (2) by a combinatorial optimization problem. We treat embeddings as binary series, i.e., 1 meaning embedded and 0 meaning not embedded. For example, with and , the embedding can be expressed by the binary code : the first element 1 corresponds to to embed, the second element 0 corresponds to to embed, the third element 1 corresponds to to embed, and so on. The optimization of a binary series is a typical combinatorial optimization problem. In this paper, we apply the evolution strategy[13]. The parameter denotes the number of parents, denotes the number of offspring, and denotes plus selection, which means that the next parents are selected from both the previous parents and offspring. Precisely, the algorithm yields new offspring from the topperforming individuals selected from the previous parents and offspring. The algorithm is an elitist method since the algorithm retains the best individuals unless they are replaced by superior individuals. The algorithm tends to converge fast, but it is likely to become trapped at local optima. Therefore, the algorithm is suitable for our proposed framework. However, it is worth considering the application of other existing methods[29, 30, 31] to obtain suboptimal embeddings.
Forecasting through FIR filters
We apply FIR filters to the original time series to obtain various embeddings and forecasts. We apply the following linear transformation for each variable:
(10) 
where ’s are filter coefficients with elements, , and and are the coefficients for standardization. After forecasting , we restore for the original time series. For , we restore using as follows:
(11) 
For , we obtain the stepsahead forecast by substituting forecasts for the unobserved as follows:
(12) 
We can restore for any to apply equation (12) recursively.
We can apply FIR filters such as the moving average filter to suppress noise. In contrast, an existing study[14] focuses on the filtered delay coordinates constructed by blending the derivatives of time series and the original one to search the nearest neighbors. Although these are a type of highpass FIR filters, such filters improve the prediction accuracy in less noisy time series.
Overall algorithm of the proposed framework
We denote a multivariate time series as and the target variable to be forecasted as . The proposed forecasting framework is presented in Algorithm 1 in terms of the userdefined number of split training datasets , the number of topperforming embeddings chosen for each split , the linear filters, and the number of filters .
References
 [1] Takens, F. Detecting strange attractors in turbulence. Lecture Notes in Mathematics, Berlin Springer Verlag 898, 366, DOI: 10.1007/BFb0091924 (1981).
 [2] Sauer, T., Yorke, J. A. & Casdagli, M. Embedology. Journal of Statistical Physics 65, 579–616, DOI: 10.1007/BF01053745 (1991).
 [3] Deyle, E. R. & Sugihara, G. Generalized theorems for nonlinear state space reconstruction. PLoS ONE 6, e18295, DOI: 10.1371/journal.pone.0018295 (2011).

[4]
Ye, H. & Sugihara, G.
Information leverage in interconnected ecosystems: Overcoming the curse of dimensionality.
Science 353, 922–925, DOI: 10.1126/science.aag0863 (2016).  [5] Okuno, S., Aihara, K. & Hirata, Y. Combining multiple forecasts for multivariate time series via statedependent weighting. Chaos: An Interdisciplinary Journal of Nonlinear Science 29, 33128, DOI: 10.1063/1.5057379 (2019).
 [6] Ma, H., Leng, S., Aihara, K., Lin, W. & Chen, L. Randomly distributed embedding making shortterm highdimensional data predictable. Proceedings of the National Academy of Sciences 115, E9994–E10002, DOI: 10.1073/pnas.1802987115 (2018).
 [7] Sollich, P. & Krogh’", A. Learning with ensembles: how overfitting can be useful. In Advances in neural information processing systems, 190–196 (1996).

[8]
Kuncheva, L. I. & Whitaker, C. J.
Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy.
Machine Learning 51, 181–207, DOI: 10.1023/A:1022859003006 (2003).  [9] Lorenz, E. N. Atmospheric predictability as revealed by naturally occurring analogues. Journal of the Atmospheric Sciences 26, 636–646 (1969).
 [10] Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA, 2006).

[11]
Vitrano, J. B., Povinelli, R. J.,
B Vitrano, J. & Povinelli, R. J.
Selecting dimensions and delay values for a timedelay embedding using a genetic algorithm.
InProceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation
, GECCO’01, 1423–1430 (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001).  [12] Small, M. Optimal time delay embedding for nonlinear time series modeling (2003). arXiv:nlin/0312011.
 [13] Schwefel, H.P. Numerical Optimization of Computer Models (John Wiley & Sons, Chichester, 1981).
 [14] Okuno, S., Takeuchi, T., Horai, S., Aihara, K. & Hirata, Y. Avoiding underestimates for time series prediction by statedependent local integration. Mathematical Engineering Technical Reports METR 2017–22, The University of Tokyo (2017).
 [15] Lorenz, E. N. Predictability: a problem partly solved. In Seminar on Predictability, 1–18 (ECMWF, Reading, England, 1996).
 [16] Chayama, M. & Hirata, Y. When univariate modelfree time series prediction is better than multivariate. Physics Letters A 380, 2359–2365, DOI: https://doi.org/10.1016/j.physleta.2016.05.027 (2016).
 [17] Kuramoto, Y. & Tsuzuki, T. Persistent Propagation of Concentration Waves in Dissipative Media Far from Thermal Equilibrium. Progress of Theoretical Physics 55, 356–369, DOI: 10.1143/PTP.55.356 (1976).
 [18] Sivashinsky, G. I. Nonlinear analysis of hydrodynamic instability in laminar flamesI. Derivation of basic equations. Acta Astronautica 4, 1177–1206, DOI: 10.1016/00945765(77)900960 (1977).
 [19] Dawson, C. et al. A comparative study of artificial neural network techniques for river stage forecasting. In Proceedings of the International Joint Conference on Neural Networks, vol. 4, 2666–2670, DOI: 10.1109/IJCNN.2005.1556324 (IEEE, Montreal, Canada, 2005).
 [20] Hochreiter, S. & Schmidhuber, J. Long shortterm memory. Neural Comput. 9, 1735–1780, DOI: 10.1162/neco.1997.9.8.1735 (1997).

[21]
Boser, B. E., Guyon, I. M. &
Vapnik, V. N.
A Training Algorithm for Optimal Margin
Classifiers.
In
Proceedings of the Fifth Annual Workshop on Computational Learning Theory
, COLT ’92, 144–152, DOI: 10.1145/130385.130401 (ACM, New York, NY, USA, 1992).  [22] Breiman, L. Random Forests. Machine Learning 45, 5–32, DOI: 10.1023/A:1010933404324 (2001).
 [23] Rössler, O. E. An equation for continuous chaos. Physics Letters A 57, 397–398 (1976).
 [24] CesaBianchi, N. & Lugosi, G. Prediction, Learning, and Games (Cambridge University Press, 2006).
 [25] Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by simulated annealing. Science 220, 671–680 (1983).
 [26] Dorigo, M. & Stützle, T. Ant Colony Optimization (Bradford Company, Scituate, MA, USA, 2004).
 [27] Muja, M. & Lowe, D. G. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 2227–2240, DOI: 10.1109/TPAMI.2014.2321376 (2014).

[28]
Fu, C. & Cai, D.
EFANNA : An extremely fast approximate nearest neighbor search algorithm based on kNN graph (2016).
arXiv:1609.07228.  [29] Runge, J., Donner, R. V. & Kurths, J. Optimal modelfree prediction from multivariate time series. Physical Review E 91, 1–14, DOI: 10.1103/PhysRevE.91.052909 (2015). arXiv:1506.05822.
 [30] Vlachos, I. & Kugiumtzis, D. State space reconstruction from multiple time series. In Topics on Chaotic Systems: Selected Papers from Chaos 2008 International Conference, 378–387, DOI: 10.1142/9789814271349_0043 (World Scientific, 2009).
 [31] Chen, Y. & Wong, M. L. An ant colony optimization approach for stacking ensemble. Second World Congress on Nature and Biologically Inspired Computing (NaBIC) 146–151, DOI: 10.1109/NABIC.2010.5716282 (2010).
Acknowledgments
We thank Professor Christian W. Dawson for permission to use the flood dataset. This research is partially supported by Kozo Keikaku Engineering Inc., JSPS KAKENHI 15H05707, WPI, MEXT, Japan, AMED JP19dm0307009, and JST CREST JPMJCR14D2, Japan. We thank Dr. Takahiro Omi for fruitful discussions.
Author contributions statement
S.O., K.A., and Y.H. conceived the original design of this study. S.O. and A.K. designed the study of the toy models, and S.O. and Y.H. designed the study of the flood data. S.O. conducted the numerical experiment. All authors contributed to interpreting the results and writing the manuscript. All authors checked the manuscript and agreed to submit the final version of the manuscript.
Additional information
Competing Interests: The authors declare that they have no competing interests.
Prediction steps [h]  Proposed  RDE  MVE  LSTM  SVR  Random Forest 

6  0.068  0.069  0.060  0.123  0.073  0.079 
12  0.167  0.174  0.171  0.213  0.177  0.185 
18  0.247  0.278  0.273  0.291  0.268  0.275 
24  0.349  0.389  0.373  0.372  0.365  0.363 
Comments
There are no comments yet.