Forecasting high-dimensional dynamics exploiting suboptimal embeddings

07/02/2019 ∙ by Shunya Okuno, et al. ∙ 2

Delay embedding---a method for reconstructing dynamical systems by delay coordinates---is widely used to forecast nonlinear time series as a model-free approach. When multivariate time series are observed, several existing frameworks can be applied to yield a single forecast combining multiple forecasts derived from various embeddings. However, the performance of these frameworks is not always satisfactory because they randomly select embeddings or use brute force and do not consider the diversity of the embeddings to combine. Herein, we develop a forecasting framework that overcomes these existing problems. The framework exploits various "suboptimal embeddings" obtained by minimizing the in-sample error via combinatorial optimization. The framework achieves the best results among existing frameworks for sample toy datasets and a real-world flood dataset. We show that the framework is applicable to a wide range of data lengths and dimensions. Therefore, the framework can be applied to various fields such as neuroscience, ecology, finance, fluid dynamics, weather, and disaster prevention.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 13

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

Introduction

Forecasting a future system state is an important task in various fields such as neuroscience, ecology, finance, fluid dynamics, weather, and disaster prevention. If accurate system equations are unknown but a time series is observed, we can use model-free forecasting approaches. Although there are various model-free methods, if the time series is assumed to be nonlinear and deterministic, one of the most common approaches is delay embedding. According to embedding theorems[1, 2], we can reconstruct the underlying dynamics by delay embedding (see Methods for details). These theorems ensure that the map from the original attractor to the reconstructed attractor has a one-to-one correspondence. These embedding theorems have also been extended to multivariate data and nonuniform embeddings[3].

When multivariate time series are observed, various embeddings can be obtained on the basis of the extended embedding theorem[3]. Ye and Sugihara[4] exploited this property; they developed multiview embedding (MVE), which combines multiple forecasts based on the top-performing multiple embeddings scored by the in-sample error. MVE can yield a more accurate forecast than the single best embedding, especially for short time series[4]. Although MVE works well with low-dimensional data, it cannot be simply applied to high-dimensional data. This is because MVE requires forecasts derived from all possible embeddings, whose total number combinatorially increases with the number of variables. Although we can approximate MVE by randomly chosen embeddings if the total number of possible embeddings is beyond the computable one[5], it is difficult to find high-performance embeddings for high-dimensional data. Recently, Ma et al.[6] presented an outstanding framework, randomly distributed embedding (RDE), to tackle these high-dimensional data. Their key idea is to combine forecasts yielded by randomly generated “nondelay embeddings” of target variables. These “nondelay embeddings” can also reconstruct the original state space for some cases and drastically reduce the possible number of embeddings. Ma et al.[6]

also showed that small embedding dimensions worked fine, even for high-dimensional dynamics, and successfully forecasted short-term high-dimensional data using the RDE framework. Although RDE showed outstanding results, there is room for improvement for some specific tasks. For example, “nondelay embedding” may overlook important pieces of information if delayed observations carry such pieces of information, e.g., the upstream river height for flood forecasting. A random distribution may also be a problem for the case where only partial variables are valid system variables; in this case, most of the random embeddings are invalid and not suitable to forecast. In addition, several important hyperparameters—the embedding dimension and the number of embeddings to combine—must be manually or empirically selected. Although ensembles tend to yield better results with significant diversity among its members

[7, 8], RDE only aggregates forecasts for a fixed embedding dimension with a fixed number of embeddings to combine, regardless of their forecast performance.

In this study, we propose another forecasting framework that overcomes the disadvantages of MVE and RDE. Our key idea is to prepare diverse “suboptimal embeddings” via combinatorial optimization. We combine the optimal number of these embeddings to maximize the performance of the combined forecast. In contrast to the existing frameworks, the suitable embedding dimensions and the number of embeddings to combine are automatically determined through this procedure. Our proposed framework achieves better forecast performance than the existing frameworks for high-dimensional toy models and a real-world flood dataset.

Results

Proposed forecasting framework

Our proposed framework yields a single forecast to combine multiple ones, which are obtained by multiple embeddings. These embeddings are obtained through two steps. First, we obtain suboptimal embeddings, which are suboptimal solutions to minimize the forecast error. The aim of this procedure is to yield diverse embeddings to improve the performance of the combined forecast because such diversity is important for the performance of the ensemble[7, 8]. Second, we make the in-sample forecasts based on all suboptimal embeddings following MVE and RDE but pick the optimal number of embeddings to minimize the error of the combined forecasts. A schematic of the procedure is illustrated in Figure 1. We describe each step in detail in the following sub-subsections.

Figure 1: Schematic of the proposed forecasting procedure. We prepare a pool of suboptimal embeddings in the first step. We solve combinatorial optimization problems to obtain various embeddings in this step. Next, we pick embeddings to minimize the error of the combined forecast in the second step. We combine the forecasts obtained by the embeddings to test the time series.

First step: Preparing suboptimal embeddings

In our framework, we first obtain diverse suboptimal embeddings to minimize the in-sample forecast error. With the observed time series , a set of possible embeddings

, and a corresponding delay vector

with , we can obtain the -steps-ahead forecast at time by a map as follows:

(1)

We can obtain such maps by suitable fitting algorithms, e.g., the conventional method of analogues[9], as in MVE, and Gaussian process regression[10], as in RDE. Throughout this paper, we apply the same algorithm, a variation of the method of analogues (see Supplementary Information), to all forecasting frameworks including the proposed one. This is because we need to compare the performance of the frameworks on a fair basis that is not affected by the performance of each fitting algorithm.

To obtain high-performance embeddings in an efficient manner, we solve combinatorial optimization problems instead of random selection or using brute force. We split the set of training time indices into datasets. Then, we solve the following optimization problem for each to minimize the error:

(2)

where is the th set of training time indices and is an appropriate norm. We use the norm throughout this paper. Note that is computed in an in-sample manner; we use all of , except for the current time, as its library. When we consider embeddings up to lags to embed the latest observation of the target variable, the number of possible embeddings is . If is sufficiently large, it is almost impossible to compute all possible embeddings because of combinatorial explosion. Here, we take a straightforward approach to minimize the forecast error: an evolution strategy that interprets an embedding as a one-dimensional binary series[11, 12]. See Methods for details. Throughout this paper, we applied the evolution strategy[13].

In the process of each optimization, we store not only the best embedding but also the “Hall of Fame,” which preserves the best individuals in each generation. From the Hall of Fame, we select the best solutions that satisfy the condition that the Hamming distance is larger than or equal to a certain threshold as follows:

(3)

where is the th embedding of the sorted Hall of Fame. Namely, we sort the Hall of Fame by the fitness (the objective function of equation 2) and pick the best embeddings in order from the sorted Hall of Fame to satisfy equation (3). This condition prunes similar embeddings and keeps diverse embeddings for the next step. To apply this procedure, we obtain suboptimal embeddings in total. See also the first part of Algorithm 1 in Methods.

The procedure of the times optimizations yields diverse but high-performance embeddings. If a dataset contains a large number of variables, it is extremely difficult to obtain the exact solution by evolution strategies or any other metaheuristic algorithms. These algorithms find suboptimal solutions for most cases, and these solutions can be easily changed by slight changes in the optimization conditions. The procedure rather exploits this property; the times optimization can find sets of diverse suboptimal solutions. Instead of minimizing the whole in-sample error with different conditions, we minimize each split dataset and utilize the history of solutions to significantly reduce the computational time. We numerically show the effect of times optimization in Supplementary Information.

To create diverse embeddings and improve the performance of the combined forecast, we can optionally consider finite impulse response (FIR) filters for each suboptimal embedding; namely, embeddings are obtained in total. These filtered embeddings are justified by the Filtered Delay Embedding Prevalence Theorem[2], and a study has shown that the combination of various forecasts via linear filters improves the forecast result in some cases[14]. Although we can compute suboptimal embeddings for each filter, we simply consider filters based on the obtained embeddings. Empirically, the optimization of the embeddings for each filter does not realize a significant improvement in performance to justify the additional computational cost. See Methods for the details of forecasting filtered time series. In this paper, we set the filter coefficients as for all numerical examples.

Second step: Combining forecasts to minimize the in-sample error

Next, we pick the optimal number of embeddings from the embeddings to minimize the error of the combined forecast. We compute in-sample forecasts by the embeddings for all of , as done in the MVE or RDE aggregation schemes. We determine the optimal number of embeddings to integrate forecasts, which is neither the square root of the possible number of embeddings (as MVE does) nor the embedding dimension (as the RDE aggregation scheme does). Here, we define a tuple of the forecast indices sorted by the in-sample error for step as , and the th best embedding is written as . Note that is separately computed for each forecast horizon because accurate embeddings may differ according to the forecast horizons, as discussed later. We obtain to minimize the combined forecast as follows:

(4)
(5)
(6)

We can obtain in a brute-force manner with a very small computational cost because the th combined forecast can be recursively computed as follows:

(7)

where . We compute and the corresponding error for all and pick the optimal value to minimize the squared error. Note that can be different for each ; empirically, tends to be larger for a longer step.

For unobserved samples , the forecast is computed as follows:

(8)

Note that only forecasts need to be computed rather than forecasts. We summarize the whole algorithm in Algorithm 1 in Methods.

Numerical experiments

Toy models

We demonstrate our framework with toy models. We first tested with the 10-dimensional Lorenz’96I model [15]. The model is described as 10-dimensional differential equations, and we used five of their system variables and five random walks as a dataset—namely, Lorenz’96I’s and random walks . We forecasted up to 10 steps. We compared the proposed framework with existing frameworks—namely, randomly distributed embedding with an aggregation scheme (RDE), multiview embedding (MVE), a variation of MVE (state-dependent weighting[5] (SDW)), and the single best embedding via the -ES algorithm. See Supplementary Information for the detailed conditions.

We first tested the performance with the data length of 4000 for the database and 500 for the evaluation. Figure 2(a) shows that the proposed forecast achieved the best performance among the existing frameworks up to 10 steps ahead with the length of 4000.

We conducted a sensitivity analysis with the data length. We set the data length, i.e., the size of the training dataset, to {100, 200, 500, 1000, 2000, 4000} and tested the performance with 500 samples for all cases. Figure 2(b) shows that the proposed five-steps-ahead forecast yielded the minimum error among the existing frameworks for all the data lengths.

We also checked the robustness against observational noise. We added Gaussian observational noise scaled by the standard deviation of the target variable and tested the performance with a scale range of {0.0, 0.1, 0.2, 0.3}. We set the training data length as 500 and evaluated with 500 samples. Figure 

2(c) shows that a lower noise level results in better performance for the proposed framework compared with the other frameworks. Although the proposed framework achieved the best performance up to 10% noise, the performance was degraded by high-level noise and is the second best for 30% noise.

We profiled which variables were embedded for the combined forecasts for the case where the data length is 4000 without noise. We summed embeddings for each variable, where 1 means embedded and 0 means not embedded. Then, we computed the proportion of embedding of each variable, normalizing the sum to one for each . Figure 3(a) shows that random walks , and were seldom embedded with the proposed framework. This result suggests that our proposed framework can select the appropriate variables. The results also show that the proportion of embedding of the target variable is large for short-term prediction and small for longer-term prediction. This result is consistent with that of an existing study[16].

We also profiled the filter coefficient and the embedding dimension of the proposed combined forecasts. We averaged and over the combined members for each step. Figure 3(b) shows that smaller dimensions are selected for shorter steps, and larger dimensions are selected for longer predictions. The figure also shows that smaller and larger values of are selected for shorter and longer steps, respectively. Because we employ high-pass filtering with this model, a smaller focuses on a smaller time period and vice versa.

We investigated the number of combined forecasts and the forecast errors to check whether the proposed integration scheme works properly. Although the proposed framework determines on the basis of the in-sample error, also minimizes the test error in this example (see Figure 3(c)).

We also demonstrate our framework with the Kuramoto–Sivashinsky equations[17, 18]. We generated 20-dimensional time series: the values of the first 10 grids of the Kuramoto–Sivashinsky equations and 10 random walks . We forecasted up to 10 steps using the same parameter values, data length, and noise scales as used in the Lorenz’96I example. See Supplementary Information for the detailed conditions.

Despite the increases in the variables, Figures 4(a) and (b) show that the proposed forecast yielded the minimum error among the existing frameworks, as in the Lorenz’96I example. Regarding the noise sensitivity, a lower noise level results in better performance for the proposed framework compared to the other frameworks. The proposed framework achieved the best performance for a wide data range and relatively small observational noise.

Finally, we tested the sensitivity to the number of variables. We changed the dimensions of Lorenz’96I to {10, 20, 40, 80, 160, 320}, and we substituted a half of each dataset with random walks. We set the training data length to 500 and evaluated with 500 samples. For all ranges of variables, the proposed framework achieved the best performance (Figure 5(a)). We also tested the case where all variables are available and the datasets do not contain random walks; this is the case in which delay embedding is not needed because all system variables are observed. In this case, the performance difference is negligible for all frameworks (except the single best embedding), and RDE achieved the best score for the case where the number of variables is 320. Although this condition is too ideal because the observation function is an identity map, we can expect RDE to perform well for such cases that include many informative variables.

Figure 2: Forecast performance with the Lorenz’96I model: comparisons of the performance (a) up to 10 steps ahead with the fixed data length (4000) without noise, (b) with different values of the data length, and (c) with different scales of observational noise. We computed the RMSE of the five-steps-ahead forecasts with randomly distributed embedding (RDE), multiview embedding (MVE), state-dependent weighting (SDW), single-best embedding based on the

-ES algorithm (SBE), and the proposed framework. These tests were carried out with 20 datasets generated with different random initial conditions and noise. The median, upper quartile, and lower quartile are shown.

Figure 3: Forecast profile of the Lorenz’96I model with the data length of 4000. Panel (a) shows the proportion of embedding of the proposed forecasts. The color indicates the proportion of embedding for each variable averaged over the number of combined forecasts for each step. Note that for each prediction step, the sum of the proportions of all variables is one. Variables are the variables of the 10-dimensional Lorenz’96I equations, and the other variables are random walks. Panel (b) shows the averaged filter coefficient and embedding dimension over the combined forecasts for each step. Note that the averaged dimensions can be a decimal fraction since they are averaged over the number of combined forecasts for each step. Panel (c) shows the relation between the number of combined forecasts and the errors. The solid line shows the in-sample error, and the dashed line shows the test error. The vertical solid line shows the selected number of combined forecasts to minimize the in-sample error.
Figure 4: Forecast performance with the Kuramoto–Sivashinsky equations: comparisons of the performance (a) up to 10 steps ahead with the fixed data length (4000) without noise, (b) with different values of the data length, and (c) with different scales of observational noise. We computed the RMSE of the five-steps-ahead forecasts with randomly distributed embedding (RDE), multiview embedding (MVE), state-dependent weighting (SDW), single-best embedding based on the -ES algorithm (SBE), and the proposed framework. These tests were carried out with 20 datasets generated with different random initial conditions and noise. The median, upper quartile, and lower quartile are shown.
Figure 5: Forecast performance with the Lorenz’96I model for various numbers of variables: cases where (a) a half of the variables are substituted with random walks and (b) all variables are available. We computed the RMSE of the five-steps-ahead forecasts with randomly distributed embedding (RDE), multiview embedding (MVE), state-dependent weighting (SDW), single-best embedding based on the -ES algorithm (SBE), and the proposed framework. These tests were carried out with 20 datasets generated with different random initial conditions and noise. The median, upper quartile, and lower quartile are shown.

Flood dataset

We tested the proposed framework with real-world data, namely the flood forecasting competition dataset at “Artificial Neural Network Experiment (ANNEX 2005/2006)”

[19]

. This dataset contains nine variables: the river stages of the target site and three upstream sites and five rain gauges for three periods. We forecasted 6, 12, 18, and 24 h ahead of the target river stage using all nine variables. See Supplementary Information for the detailed conditions. We also compared our framework with four other conventional machine-learning methods used in an existing study

[5]

: a recurrent neural network with long short-term memory (LSTM)

[20], support vector regression (SVR)[21]

, and random forest regression

[22].

The results in Table 1 show that the proposed framework yielded the best root-mean-square error (RMSE) through 12–24 h ahead, and the differences compared to the other methods were expanded over the forecast horizon. Although the 6 h (one step)-ahead forecast was not the best, the difference compared to the best score was negligible. The dataset contains only 1460 points, and long-term memory is not required for this task; hence, LSTM did not perform well. Although the test data include the largest river stage on 1995-02-01, Figure 6(a) shows that the proposed framework properly forecasted the inexperienced river stage.

We compared the combined forecast with its individual members obtained with the -ES algorithm. Regarding the individual members, the order of the in-sample score was different from the order of the test (see Figure 6(b)). In contrast, the combined forecast achieved the minimum error by far for the both the in-sample and test errors. This forecast stability is one of the largest advantages of combining forecasts.

We also profiled and averaged over the combined members for each step. The results in Figure 6(c) suggest the same trends as the toy models, even for the real-world data. Specifically, smaller dimensions and smaller values of were selected for shorter steps.

Figure 6: Forecast results for the flood dataset. Panel (a) shows a comparison of the ground truth and the proposed 24-h-ahead forecast. The proposed forecast did not underestimate the maximum river stage, which is the maximum value of the whole dataset. Panel (b) shows a comparison of the in-sample and test errors of the ensemble members. The results demonstrate the difficulty of selecting the best forecast because the best in-sample forecast does not always perform the best. In contrast, the combined forecast performed the best by far for both training and test data. Panel (c) shows the filter coefficient and embedding dimension averaged over the combined forecasts for each step.

Discussion

Our numerical experiments suggest that our proposed framework works well for a wide range of the data length, including short time series such as 100. In contrast, the framework did not appear to work fine with very noisy data, and MVE achieved the best performance, as shown in the noise sensitivity analyses (Figures 2(c) and  4(c)). This suggests that noise prevents the framework from selecting optimal embeddings on the basis of the squared errors, and it results in obtaining invalid embeddings. MVE can still be a good option for very noisy data. Note that the performance of RDE was not satisfactory in some of our numerical experiments, but this is because the RDE framework was originally proposed for extremely high-dimensional data, e.g., hundreds of variables. As shown in Figure 5(b), we can expect RDE to perform well for cases that include many informative variables, and delay embedding is not needed.

As shown in the numerical experiments, the proposed framework worked well for high-dimensional dynamics and datasets containing high-dimensional variables. On the other hand, it is not suitable to apply it to relatively low-dimensional data. We tested the framework with low-dimensional datasets, and the results suggest that the proposed framework does not always yield the best performance, especially for low-dimensional datasets such as the Rössler equations[23] (see Supplementary Information). This is because it is easy to find suitable embeddings to forecast in a brute-force manner, especially for low-dimensional datasets, while it becomes combinatorially difficult as the number of variables increases. In addition, the aim of times optimization is to yield diverse embeddings (see the effect of times optimization in Supplementary Information). However, if a dataset contains a small number of variables, almost the exact solutions are obtained for every optimization.

The proposed method combines forecasts via the simple average, optimizing the number of forecasts. The simple average works fine for most cases because suboptimal embeddings can equally forecast well thanks to solving the combinatorial optimization problem, and the test performance order is not always the same as the corresponding training data, especially for real-world data (see Figure 6(b)). If suboptimal embeddings show quite different forecast performances, the weighted average based on the error distribution[6] may work better. We can also apply the expert advice framework[24] if better forecasts are assumed to be changed over time.

We need to select the values of several parameters to apply the proposed framework, i.e., the number of split datasets , the number of the suboptimal embeddings for each batch , the criteria of the Hamming distance , the number of filters , and their components. We can control the variations in forecasts with these parameters. In order to set , we consider the number of samples in each divided dataset; empirically, the framework works well with more than 50 samples for each divided dataset. Samples that are too small prevent the framework from finding good embeddings, but samples that are too large reduce the diversity of the embeddings. Although can be any large integer (up to the total number of individuals of the evolution strategy), is sufficient for most cases. Note that the value of that is too large is overkill and unnecessarily increases the computational cost. The criteria for the Hamming distance directly controls the similarity of suboptimal embeddings. We set in this study, and it worked fine for all cases. Note that conventional MVE and RDE guarantee that the Hamming distance among all embeddings is larger than two (i.e., ) for a fixed embedding dimension. In practice, appropriate filters depend on the prediction steps. If we predict shorter time periods, high-pass filters will work fine; if we predict longer time periods, low frequencies are needed to obtain fine forecasts, as the numerical experiments suggest. Although we treat high-pass filters in this study, low-pass filters may work for extremely noisy data. We need to understand the effects of these parameters and set the appropriate values within reason, , which is usually much smaller than the total number of embeddings evaluated in the first step.

Our numerical experiments suggest that the appropriate embeddings vary with the prediction steps. The results show that a smaller with a smaller is preferable for a short-term forecast, and a larger with a larger is preferable for a longer-term forecast. These results are consistent with those of an existing study[16] and with the fact that the longer-step forecasts need more complex map from the current inputs to the forecasts. Therefore, more information is needed to express the complex map.

Although we applied the -ES algorithm to obtain the suboptimal embeddings, it is worth considering other methods to solve the combinatorial optimization problem. We can apply other evolution algorithms, e.g., simulated annealing[25] and ant colony optimization[26]. Any algorithm may work as long as it can store the suboptimal embeddings through its optimization procedure. Testing and evaluating other algorithms are open problems.

Throughout this study, we used a variation of a local nonlinear prediction method to forecast the target variable. One of the most significant advantages of these local prediction methods is that we can compute in-sample forecasts very easily with a small computational cost; all we have to do is to exclude the current query from the nearest neighbors. In addition, owing to recent progress in approximate neighboring search algorithms (e.g., Refs. 27, 28), these algorithms are fast enough to apply the evolution algorithms. Moreover, it is also suitable to obtain suboptimal embeddings through the forecast errors because local nonlinear prediction methods are sensitive to the false nearest neighbors derived from invalid features. However, we can apply other regression methods, e.g., Gaussian process regression[10], as done in RDE[6], for further improvements. Note that these methods incur a very large computational cost to yield in-sample forecasts, as mentioned in Ref. 6. To reduce the computational cost, we can apply these methods to only the process in the second step and can approximate the in-sample error by the k-fold cross-validation error.

Methods

Delay embedding

Delay embedding is a method of reconstructing the original state space using the delay coordinates . Here, with the -dimensional observed time series , variable , and time lag , is defined as follows:

(9)

where is the embedding dimension, , and no duplication is allowed for any element. According to embedding theorems[1, 3], an attractor reconstructed by is an embedding with the appropriate . Note that the number of possible reconstructions increase combinatorially.

Optimizing embedding by an evolution strategy

One can optimize embedding to solve combinatorial optimization problems to treat embeddings as binary series[11, 12]. Here, we solve equation (2) by a combinatorial optimization problem. We treat embeddings as binary series, i.e., 1 meaning embedded and 0 meaning not embedded. For example, with and , the embedding can be expressed by the binary code : the first element 1 corresponds to to embed, the second element 0 corresponds to to embed, the third element 1 corresponds to to embed, and so on. The optimization of a binary series is a typical combinatorial optimization problem. In this paper, we apply the evolution strategy[13]. The parameter denotes the number of parents, denotes the number of offspring, and denotes plus selection, which means that the next parents are selected from both the previous parents and offspring. Precisely, the algorithm yields new offspring from the top-performing individuals selected from the previous parents and offspring. The algorithm is an elitist method since the algorithm retains the best individuals unless they are replaced by superior individuals. The algorithm tends to converge fast, but it is likely to become trapped at local optima. Therefore, the algorithm is suitable for our proposed framework. However, it is worth considering the application of other existing methods[29, 30, 31] to obtain suboptimal embeddings.

Forecasting through FIR filters

We apply FIR filters to the original time series to obtain various embeddings and forecasts. We apply the following linear transformation for each variable:

(10)

where ’s are filter coefficients with elements, , and and are the coefficients for standardization. After forecasting , we restore for the original time series. For , we restore using as follows:

(11)

For , we obtain the -steps-ahead forecast by substituting forecasts for the unobserved as follows:

(12)

We can restore for any to apply equation (12) recursively.

We can apply FIR filters such as the moving average filter to suppress noise. In contrast, an existing study[14] focuses on the filtered delay coordinates constructed by blending the derivatives of time series and the original one to search the nearest neighbors. Although these are a type of high-pass FIR filters, such filters improve the prediction accuracy in less noisy time series.

Overall algorithm of the proposed framework

We denote a multivariate time series as and the target variable to be forecasted as . The proposed forecasting framework is presented in Algorithm 1 in terms of the user-defined number of split training datasets , the number of top-performing embeddings chosen for each split , the linear filters, and the number of filters .

References

  • [1] Takens, F. Detecting strange attractors in turbulence. Lecture Notes in Mathematics, Berlin Springer Verlag 898, 366, DOI: 10.1007/BFb0091924 (1981).
  • [2] Sauer, T., Yorke, J. A. & Casdagli, M. Embedology. Journal of Statistical Physics 65, 579–616, DOI: 10.1007/BF01053745 (1991).
  • [3] Deyle, E. R. & Sugihara, G. Generalized theorems for nonlinear state space reconstruction. PLoS ONE 6, e18295, DOI: 10.1371/journal.pone.0018295 (2011).
  • [4] Ye, H. & Sugihara, G.

    Information leverage in interconnected ecosystems: Overcoming the curse of dimensionality.

    Science 353, 922–925, DOI: 10.1126/science.aag0863 (2016).
  • [5] Okuno, S., Aihara, K. & Hirata, Y. Combining multiple forecasts for multivariate time series via state-dependent weighting. Chaos: An Interdisciplinary Journal of Nonlinear Science 29, 33128, DOI: 10.1063/1.5057379 (2019).
  • [6] Ma, H., Leng, S., Aihara, K., Lin, W. & Chen, L. Randomly distributed embedding making short-term high-dimensional data predictable. Proceedings of the National Academy of Sciences 115, E9994–E10002, DOI: 10.1073/pnas.1802987115 (2018).
  • [7] Sollich, P. & Krogh’", A. Learning with ensembles: how over-fitting can be useful. In Advances in neural information processing systems, 190–196 (1996).
  • [8] Kuncheva, L. I. & Whitaker, C. J.

    Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy.

    Machine Learning 51, 181–207, DOI: 10.1023/A:1022859003006 (2003).
  • [9] Lorenz, E. N. Atmospheric predictability as revealed by naturally occurring analogues. Journal of the Atmospheric Sciences 26, 636–646 (1969).
  • [10] Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, Cambridge, MA, 2006).
  • [11] Vitrano, J. B., Povinelli, R. J., B Vitrano, J. & Povinelli, R. J.

    Selecting dimensions and delay values for a time-delay embedding using a genetic algorithm.

    In

    Proceedings of the 3rd Annual Conference on Genetic and Evolutionary Computation

    , GECCO’01, 1423–1430 (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001).
  • [12] Small, M. Optimal time delay embedding for nonlinear time series modeling (2003). arXiv:nlin/0312011.
  • [13] Schwefel, H.-P. Numerical Optimization of Computer Models (John Wiley & Sons, Chichester, 1981).
  • [14] Okuno, S., Takeuchi, T., Horai, S., Aihara, K. & Hirata, Y. Avoiding underestimates for time series prediction by state-dependent local integration. Mathematical Engineering Technical Reports METR 2017–22, The University of Tokyo (2017).
  • [15] Lorenz, E. N. Predictability: a problem partly solved. In Seminar on Predictability, 1–18 (ECMWF, Reading, England, 1996).
  • [16] Chayama, M. & Hirata, Y. When univariate model-free time series prediction is better than multivariate. Physics Letters A 380, 2359–2365, DOI: https://doi.org/10.1016/j.physleta.2016.05.027 (2016).
  • [17] Kuramoto, Y. & Tsuzuki, T. Persistent Propagation of Concentration Waves in Dissipative Media Far from Thermal Equilibrium. Progress of Theoretical Physics 55, 356–369, DOI: 10.1143/PTP.55.356 (1976).
  • [18] Sivashinsky, G. I. Nonlinear analysis of hydrodynamic instability in laminar flames-I. Derivation of basic equations. Acta Astronautica 4, 1177–1206, DOI: 10.1016/0094-5765(77)90096-0 (1977).
  • [19] Dawson, C. et al. A comparative study of artificial neural network techniques for river stage forecasting. In Proceedings of the International Joint Conference on Neural Networks, vol. 4, 2666–2670, DOI: 10.1109/IJCNN.2005.1556324 (IEEE, Montreal, Canada, 2005).
  • [20] Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780, DOI: 10.1162/neco.1997.9.8.1735 (1997).
  • [21] Boser, B. E., Guyon, I. M. & Vapnik, V. N. A Training Algorithm for Optimal Margin Classifiers. In

    Proceedings of the Fifth Annual Workshop on Computational Learning Theory

    , COLT ’92, 144–152, DOI: 10.1145/130385.130401 (ACM, New York, NY, USA, 1992).
  • [22] Breiman, L. Random Forests. Machine Learning 45, 5–32, DOI: 10.1023/A:1010933404324 (2001).
  • [23] Rössler, O. E. An equation for continuous chaos. Physics Letters A 57, 397–398 (1976).
  • [24] Cesa-Bianchi, N. & Lugosi, G. Prediction, Learning, and Games (Cambridge University Press, 2006).
  • [25] Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by simulated annealing. Science 220, 671–680 (1983).
  • [26] Dorigo, M. & Stützle, T. Ant Colony Optimization (Bradford Company, Scituate, MA, USA, 2004).
  • [27] Muja, M. & Lowe, D. G. Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 2227–2240, DOI: 10.1109/TPAMI.2014.2321376 (2014).
  • [28] Fu, C. & Cai, D.

    EFANNA : An extremely fast approximate nearest neighbor search algorithm based on kNN graph (2016).

    arXiv:1609.07228.
  • [29] Runge, J., Donner, R. V. & Kurths, J. Optimal model-free prediction from multivariate time series. Physical Review E 91, 1–14, DOI: 10.1103/PhysRevE.91.052909 (2015). arXiv:1506.05822.
  • [30] Vlachos, I. & Kugiumtzis, D. State space reconstruction from multiple time series. In Topics on Chaotic Systems: Selected Papers from Chaos 2008 International Conference, 378–387, DOI: 10.1142/9789814271349_0043 (World Scientific, 2009).
  • [31] Chen, Y. & Wong, M. L. An ant colony optimization approach for stacking ensemble. Second World Congress on Nature and Biologically Inspired Computing (NaBIC) 146–151, DOI: 10.1109/NABIC.2010.5716282 (2010).

Acknowledgments

We thank Professor Christian W. Dawson for permission to use the flood dataset. This research is partially supported by Kozo Keikaku Engineering Inc., JSPS KAKENHI 15H05707, WPI, MEXT, Japan, AMED JP19dm0307009, and JST CREST JPMJCR14D2, Japan. We thank Dr. Takahiro Omi for fruitful discussions.

Author contributions statement

S.O., K.A., and Y.H. conceived the original design of this study. S.O. and A.K. designed the study of the toy models, and S.O. and Y.H. designed the study of the flood data. S.O. conducted the numerical experiment. All authors contributed to interpreting the results and writing the manuscript. All authors checked the manuscript and agreed to submit the final version of the manuscript.

Additional information

Competing Interests: The authors declare that they have no competing interests.

Prediction steps [h] Proposed RDE MVE LSTM SVR Random Forest
6 0.068 0.069 0.060 0.123 0.073 0.079
12 0.167 0.174 0.171 0.213 0.177 0.185
18 0.247 0.278 0.273 0.291 0.268 0.275
24 0.349 0.389 0.373 0.372 0.365 0.363
Table 1: RMSEs of the flood dataset computed using the proposed framework, randomly distributed embedding (RDE), and multiview embedding (MVE) and also compared with those obtained by long short-term memory (LSTM), support vector regression (SVR), and the random forest. The proposed framework achieved the best accuracy for the 12-, 18-, and 24-h-ahead forecasts. Although the result obtained by the proposed framework for the 6-h-ahead forecast is the second best, the difference is slight.
1:Initialize suboptimal embeddings First step: Obtaining suboptimal embeddings
2:for each in  do
3:     Initialize a set of the Hall of Fame embeddings
4:     Initialize a set of embeddings (population)
5:     for each generation do
6:         Compute by map
7:         Evaluate the fitness by
8:         Append the elements of to
9:         Create a new for the next generation using an evolution strategy
10:     end for
11:     Sort by fitness
12:     for each in  do
13:         if  then Append to with filters
14:         if  embeddings are appended to then break
15:     end for
16:end for
17:for each in  do Second step: Combining forecasts to minimize the in-sample error
18:     Compute by map
19:     Evaluate the in-sample error of
20:end for
21: := Sort by the in-sample error of the th step
22:for each in the indices of  do
23:     Compute the combined forecast by a recursive formula
24:     Evaluate the in-sample error of
25:     Obtain the optimal number to combine
26:end for
27:while  do Forecast unobserved samples by embeddings
28:     for  to  do
29:         Compute
30:     end for
31:     Yield the combined forecast by
32:end while
Algorithm 1 Proposed forecasting framework