Worldwide, floods are considered one of the most common and widely distributed natural risks to life and property. According to a report by the Centre for Research on the Epidemiology of Disasters, floods were the most influential disaster of 2017 in terms of the number of people affected: 59.6% of the people affected by natural disasters were affected by floods. Because floods are sudden and uncertain, they remain comparatively hard to prevent, and more advanced control methods are urgently needed. Among these, flood forecasting is crucial. A timely and precise advance warning leaves more time for mitigating actions and reduces the damage caused by the disaster. However, medium and small rivers pose additional challenges for forecasting. Because these rivers have low capacity, their floods are often abrupt in appearance, rapid in confluence, and short in forecast period. More sophisticated forecasting methods are therefore in high demand.
The main traditional stream-flow forecasting methods employ either physical hydrologic models or traditional machine learning algorithms. Hydrologic models that use river stage, stream flow, or runoff volume data to forecast floods are mainly based on mathematical and physical analysis of the hydrologic process; they are therefore usually deterministic, and their forecasts are normally presented as time series of estimates. The results of these models deteriorate easily if the input data contain a certain degree of error or environmental noise. With the development of artificial intelligence and the arrival of big data, researchers began to use ‘data-driven’ models, instead of mathematical or physical models, to study various aspects of hydrological phenomena. ‘Data-driven’ models focus less on the exact logic and physical theory behind the forecast and more on the relationships hidden in large amounts of data. This remarkably reduces the work required to handle the non-linearity and noise complexity of hydrological processes, and it improves forecast accuracy. However, traditional machine learning models treat every input and output as an independent sample, so they perform poorly on prediction tasks over time-series data, in which each observation is related to the observations that immediately precede it. The predictions of most traditional models are therefore accompanied by considerable errors.
As described above, physical models and traditional machine learning algorithms both have limited performance because of (1) erroneous and chaotic data and (2) the special nature of time-series prediction. In this paper, we apply the LSTM (Long Short-Term Memory) model, a kind of recurrent neural network with memory developed from the RNN (Recurrent Neural Network), to stream-flow prediction in an attempt to address these two problems. As a sophisticated machine learning model, LSTM copes well with the chaotic data that result from the complexity of the real environment and the instability of medium and small rivers. Moreover, unlike methods based on traditional models, LSTM has “memory”: every output is based on previous outputs, so the model can exploit the information “between” time-series data points and is better at predicting stream-flow changes that follow a trend over time. The main idea of this paper is to use LSTM to analyze a large amount of stream-flow data, together with rainfall data collected from precipitation stations along the rivers, to estimate the future stream flow at a given point on a river, and to compare the prediction results with those of a traditional machine learning model, SVR (Support Vector Regression), and a deep learning model, MLP (Multilayer Perceptron). The results of the comparative experiment conducted in this paper show that the LSTM model contributes to stream-flow forecasting for small rivers in the following respects:
Better model stability. Unlike the other two models, LSTM does not produce frequent and obvious fluctuations in the forecast stream-flow curve when rainfall is small.
Better model reliability. LSTM is more accurate in forecasting stream-flow peaks, which is vital for early flood warning.
Better capture of the features of the data. In extended experiments, we observed that LSTM can read different combinations of input data, including historical stream-flow volume, rainfall data, and areal rainfall data, and improve its accuracy based on all of them.
The rest of this paper is organized as follows. Section 2 reviews work related to the development and current state of stream-flow forecasting. Section 3 introduces the RNN model, which is the origin and foundation of LSTM, and the LSTM model itself. Section 4 presents the complete experimental process for testing the performance of LSTM, including data preparation, model training, selection of comparative models, choice of evaluation criteria, final results, and extended experiments on LSTM performance. Section 5 concludes the paper.
II Related Works
In recent years, more and more data-driven AI models for stream-flow forecasting have been developed and put into practice. According to Yaseen et al., international research mainly focuses on five areas: ANN (artificial neural network), SVM (support vector machine), Fuzzy (fuzzy logic methods), EC (evolutionary computing methods), and W-AI (wavelet-complementary modeling).
An ANN is a kind of artificial intelligence information-processing system that resembles the biological neural networks of the human brain. In 2002, Hsu et al. proposed the self-organizing linear output map (SOLO), a kind of multivariate ANN procedure, to forecast rainfall-runoff. In 2005, Cigizoglu tested the performance of the GRNN (generalized regression neural network) in forecasting and estimating intermittent daily mean flows. In 2010, Kagoda et al. used an RBFNN (radial basis function neural network) to perform 1-day stream-flow forecasts and reported that it was a comparatively superior method.
SVM has become popular over the last 20 years as an effective method for noisy problems. In 2005, Sivapragasam and Liong tested SVM in stream-flow prediction and obtained promising results. In 2006, Asefa et al. used an SVM approach to predict seasonal and hourly multi-scale stream flow. In 2011, Noori et al. assessed how the choice of input variables affects SVM model performance, using PCA, the Gamma test, and forward selection techniques for monthly stream-flow prediction.
The theory of fuzzy sets was introduced by Lotfi A. Zadeh in 1965. Fuzzy logic has been used to deal with the uncertainties in the variables of models. In 2007, El-Shafie et al. introduced a neuro-fuzzy model to forecast the monthly inflow of the Nile River. In 2009, Özger used the Mamdani and the Takagi–Sugeno (TS) fuzzy inference systems for stream-flow prediction. In 2012, Sanikhani and Kisi developed two different adaptive neuro-fuzzy (ANFIS) techniques to estimate monthly river flow.
EC (Evolutionary Computing) is the collective name for Evolutionary Algorithms (EA), which apply selection, mutation, and reproduction to a population of individual structures that undergo evolution. In 1999, Savic et al. conducted the first research on the use of evolutionary computing in stream-flow modeling. In 2008, Makkeasorn et al. compared the performance of genetic programming and ANN in stream-flow forecasting. In 2009, Guven investigated the river inflow prediction ability of LGP (linear genetic programming) and compared it with the MLP and GRNN methods; the results showed that LGP performed better.
Wavelet Transform (WT) is a method for handling time-series data. In 2010, Shiri and Kisi employed a wavelet and neuro-fuzzy conjunction model to build daily, monthly, and yearly stream-flow models. In 2014, Sahay and Srivastava proposed a wavelet transform-genetic algorithm-neural network model (WAGANN) for forecasting monsoon river flows one day ahead.
III-A Recurrent Neural Network (RNN)
First developed in the 1980s, the RNN is distinguished by its structure: the neurons are connected with each other and self-looped, so the network can exhibit dynamic temporal behavior and “remember” information from the previous step. The basic, classic logic of an RNN is as follows.
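A commonly used way to write this recurrence is sketched below. The symbol names ($x_t$ for the input, $h_t$ for the hidden state, $y_t$ for the output, $W_{xh}$, $W_{hh}$, and $W_{hy}$ for the weights, $\phi$ for the second nonlinear function) and the omission of bias terms are assumptions made for illustration; they may differ from the notation of the cited work.

```latex
\begin{aligned}
h_t &= \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right) \\
y_t &= \phi\left(W_{hy}\, h_t\right)
\end{aligned}
```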
In one unit, $x_t$ is the input, $h_t$ is the hidden state, and $y_t$ is the output; the subscript $t$ denotes time. First, the hidden state from the previous time step, $h_{t-1}$, is combined with the current input $x_t$ (weighted by $W_{hh}$ and $W_{xh}$, respectively). The result is transformed by a nonlinear function, conventionally tanh or sigmoid, and becomes the new hidden state $h_t$. Then the hidden state is multiplied by its weight $W_{hy}$ and transformed by another nonlinear function, and the result is taken as the output $y_t$. In this way, the current output is affected by the previous hidden state, which gives the network a “short memory”.
One significant problem of the classical RNN is that, because of its looped structure, the back-propagated error depends exponentially on the weights. As a result, the error signals of an RNN vanish or blow up over long time spans.
III-B Long Short-Term Memory Network (LSTM)
To solve the exploding and vanishing gradient problem, Hochreiter and Schmidhuber introduced LSTM in 1997. It uses memory cells and gates to control whether long-term information saved in the network is kept or thrown away.
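One widely used formulation of the LSTM cell is sketched below. The symbols are assumptions chosen for illustration: $x_t$ is the input, $W$ and $U$ are the input and recurrent weight matrices of each gate, $b$ are the bias vectors, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. This is the common variant with a forget gate and is not guaranteed to match the notation of the cited work.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
g_t &= \tanh\left(W_g x_t + U_g h_{t-1} + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```

The additive update of $c_t$ is what lets information, and its gradient, pass through many time steps without being repeatedly squashed.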
$W$ and $U$ are the weights of the inputs to the different gates: the input gate ($i$), the input modulation gate ($g$), the forget gate ($f$), and the output gate ($o$). $b$ denotes the bias vectors, $c$ is the cell state, and $h$ is the hidden state. Together, these controllers determine how much information is received from the previous step and how much is passed to the new state.
By actively choosing useful information to store and others to reject, LSTM provides a solution to the gradient explosion and vanishing problem faced by RNN.
In this section, we present the complete experimental process for forecasting the stream flow of the rivers in Tunxi, China, using an artificial intelligence data-driven model, including data preparation, model training, selection of comparative models, choice of evaluation criteria, final results, and extended experiments on LSTM performance.
IV-A Data Collection and Division
The data for the experiment were collected in Tunxi District, Huangshan City, Anhui Province, China. The Tunxi catchment is reported to have a drainage area of 2696.76 km². Its altitude is low in the east and increases gradually towards the west. Under the influence of the continental monsoon climate, rainfall varies greatly from year to year and is also unevenly distributed within a year: more than 50% of the annual precipitation falls between April and June. Stream-flow changes in the Tunxi area show the complexity and abruptness typical of small rivers, which makes the area suitable for testing the forecasting ability of models.
The experimental data consist of stream-flow volume data for Tunxi, collected at a hydrologic station, and rainfall data from 11 precipitation stations located upstream of the hydrologic station. In total, 18648 records collected from 1981 to 2003 are used.
IV-B Data Pre-processing
The experiment uses the stream-flow data and the precipitation data from the 11 rainfall stations over the past 12 hours to forecast the stream-flow volume at the 6th hour in the future. To transform the raw data into a form suitable for supervised learning, a series_to_supervised function is used. After the transformation, the data take the form shown in Tab. I.
Q(t+X) represents the stream-flow data from (t-12) to (t+5), that is, from 12 hours in the past to 6 hours in the future. P1(t+X) to P11(t+X) represent the precipitation data of the 11 rainfall stations from (t-12) to (t+5). Each time step contributes 12 columns (one stream-flow value and 11 precipitation values), so the 12 past steps occupy 12 × 12 = 144 columns, and Q(t+5) follows the 17 × 12 = 204 columns of steps (t-12) to (t+4). Columns 1 to 144 (from Q(t-12) to P11(t-1)) are selected as the features (x set); they contain the stream-flow data and all the precipitation data of the past 12 hours. Column 205 (Q(t+5)) is selected as the target (y set); it is the stream-flow value at the 6th hour in the future.
When the data are transformed into time-series format, only rows with enough data before and after them to form a complete window are kept, so some rows at the beginning and at the end are discarded. After the 12-6 transformation described above, 18237 records remain. They are divided at a ratio of roughly 7:3: 13000 records are used as the training set and 5237 as the test set.
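The windowing and split described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: only the name series_to_supervised, the 12-hour history, the 6-hour horizon, and the column positions come from the text, while the function body, the synthetic data, and the split index are assumptions.

```python
import numpy as np

def series_to_supervised(values, n_in=12, n_out=6):
    """Flatten every window of n_in past and n_out future time steps into one row.

    values has shape (n_samples, n_vars); column 0 is assumed to be the stream
    flow Q and the remaining columns the precipitation stations P1..P11.
    Each output row holds steps t-n_in, ..., t-1, t, ..., t+n_out-1 in time order.
    """
    values = np.asarray(values, dtype=float)
    window = n_in + n_out
    # Rows without a complete window before and after them are dropped.
    n_windows = values.shape[0] - window + 1
    return np.array([values[i:i + window].ravel() for i in range(n_windows)])

# Synthetic stand-in for the real records: 100 hourly rows, 1 flow + 11 rainfall columns.
rng = np.random.default_rng(0)
raw = rng.random((100, 12))

supervised = series_to_supervised(raw, n_in=12, n_out=6)
X = supervised[:, :144]   # columns 1-144: Q(t-12) ... P11(t-1)
y = supervised[:, 204]    # column 205: Q(t+5)

# Chronological split at roughly 7:3 (the exact split index is an assumption).
split = int(0.7 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```

With 12 history steps and 6 future steps, every window spans 18 rows, so 17 rows are lost from a gap-free series.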
Because the data in this experiment were collected from different stations over a long time span, the different variables do not share the same scale. To improve model performance, the data are normalized with the MinMaxScaler function in the sklearn package, which maps each variable to [0,1]. The formula is:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x$ is an original value, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the corresponding variable.
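As a small illustration of this scaling, the sketch below applies a column-wise min-max mapping with NumPy. Fitting the minimum and maximum on the training rows only is an assumption made here to avoid leaking test information; the text does not say how the scaler was fitted. The helper names fit_min_max and apply_min_max are hypothetical.

```python
import numpy as np

def fit_min_max(train):
    """Return the per-column minimum and range computed from the training rows."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns: avoid division by zero
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Map each column with (x - min) / (max - min) using the fitted statistics."""
    return (data - col_min) / col_range

# Toy data: two variables on different scales.
train = np.array([[10.0, 0.0], [20.0, 5.0], [30.0, 10.0]])
test = np.array([[25.0, 2.5]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
```

scikit-learn's MinMaxScaler performs the same per-feature mapping with its default feature_range of (0, 1).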
IV-C Model Training
The LSTM model used in the experiment is built with Keras, a Python deep learning library. The number of hidden-layer nodes is one of the parameters that must be determined; in our experiments, the model with 64 nodes performed best. The optimizer, batch size, and number of epochs also influence model performance. The optimizer determines how the loss function is minimized and thus how the model converges to its final state. Standard choices include momentum, Adagrad, RMSProp, and Adam; based on our experiments, the Adam optimizer is chosen. The batch size is the number of samples processed per update. With batches, the model updates its weights several times before it has seen the whole dataset, which affects the training dynamics. Because a small batch size greatly slows down training and a large batch size can hurt generalization, the batch size is set to 72 as a compromise. The number of epochs is the number of passes over the whole training data. According to Fig. 1, the loss on the test set is lowest at approximately 30 epochs, so the number of epochs is set to 30 in this experiment.
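A minimal Keras sketch of a model with these settings is given below. The 64 hidden nodes, the Adam optimizer, the batch size of 72, and the 30 epochs come from the text. The single LSTM layer followed by one dense output, the mean absolute error loss, the reshaping of 144 features into 12 time steps of 12 variables, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from tensorflow import keras

def build_lstm(timesteps=12, n_vars=12, units=64):
    """One LSTM layer with `units` hidden nodes feeding a single regression output."""
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_vars)),
        keras.layers.LSTM(units),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mae")  # the loss choice is an assumption
    return model

# Synthetic stand-in data: 720 samples with 144 scaled features each.
rng = np.random.default_rng(0)
X_flat = rng.random((720, 144)).astype("float32")
y = rng.random(720).astype("float32")

# Assumes each row lists all variables of step t-12 first, then t-11, and so on,
# so it reshapes to (samples, timesteps, variables).
X_seq = X_flat.reshape(-1, 12, 12)

model = build_lstm()
history = model.fit(X_seq, y, epochs=30, batch_size=72, verbose=0)
predictions = model.predict(X_seq, verbose=0)
```

Passing a held-out set through the validation_data argument of fit would record a per-epoch loss curve that can be used to choose the number of epochs.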
IV-D Comparative Models Selecting
To evaluate the performance of the proposed LSTM model, two other models are chosen for comparison. The first is a traditional machine learning method, and the second is a deep learning method.
SVR: Support Vector Regression is a model derived from the Support Vector Machine. As stated in the literature, “The idea of SVR is based on the computation of a linear regression function in a high dimensional feature space where the input data are mapped via a nonlinear function.” The kernel function and two parameters, C and gamma, must be determined to set up the model. In this paper, the RBF kernel is selected, and grid search yields the combination (C=0.095, gamma=0.165).
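A grid search over C and gamma for an RBF-kernel SVR can be sketched with scikit-learn as follows. The candidate values, the time-ordered cross-validation, the scoring metric, and the synthetic data are all assumptions for illustration; the text reports only the selected pair (C=0.095, gamma=0.165), not the grid that was searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Synthetic stand-in data: 300 samples, 8 features, a noisy linear target.
rng = np.random.default_rng(0)
X = rng.random((300, 8))
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0.0, 0.01, 300)

# Hypothetical grid that happens to include the values reported in the text.
param_grid = {"C": [0.01, 0.095, 1.0], "gamma": [0.01, 0.165, 1.0]}

search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=TimeSeriesSplit(n_splits=3),  # keeps validation folds later in time than training folds
)
search.fit(X, y)
best = search.best_params_
```

TimeSeriesSplit is used here instead of shuffled folds so that no fold is validated on data older than its training data.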
IV-E Evaluation Criteria
Three metrics are used in this paper as the evaluation criteria: root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2).
RMSE is a common measure of the difference between predicted and observed values. Its formula is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$
where $n$ denotes the total number of values, $\hat{y}_i$ denotes the predicted value, and $y_i$ denotes the observed value. The square root brings the error back to the same scale as the input. RMSE is always non-negative, and a lower RMSE means a better prediction.
MAE works in a similar way to RMSE except that the error is linear. Its formula is:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$
Because it is linear, MAE does not penalize large errors more than small errors but presents them as they are. Like RMSE, MAE is always non-negative, and a lower MAE means a better prediction.
R2, or the coefficient of determination, is a metric based on MSE (the square of RMSE). It differs from the preceding two metrics in that the scale of its value does not depend on the scale of the input. The formula is:
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$
where $\bar{y}$ is the mean of all observed values. The denominator is the total variation of the observed values. In most cases, R2 lies in the range [0,1] (it becomes negative only when the model predicts worse than the mean of the observations), and a higher value means a better prediction.
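The three criteria can be computed as in the following NumPy sketch. It assumes the usual mean-based definition of MAE and uses the mean of the observed values in the denominator of R2; the function names and the toy numbers are illustrative only.

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean square error between observed and predicted values."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_pred - y_obs) ** 2)))

def mae(y_obs, y_pred):
    """Mean absolute error between observed and predicted values."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_pred - y_obs)))

def r2(y_obs, y_pred):
    """Coefficient of determination: 1 - residual variation / total variation."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy example: three exact predictions and one that overshoots by 2.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = {
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "R2": r2(observed, predicted),
}
```

In the toy example the single large error raises RMSE to twice the MAE, which illustrates how the squared metric weights large errors more heavily.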
Tab. II lists the errors of the SVR, MLP, and LSTM models after they are trained and run on the data.
The prediction results are plotted in Fig. 2.
From Tab. II, we can see that LSTM performs best among the three models on all three evaluation criteria; statistically, the LSTM model therefore has the best prediction accuracy. Fig. 2 shows that the LSTM prediction fits the actual stream flow closely. Unlike the SVR and MLP predictions, the LSTM prediction does not produce obvious spurious small peaks or valleys. Moreover, in predicting major stream-flow peaks, the LSTM model is considerably better than the MLP model and slightly better than the SVR model.
The results show that the remember-forget ability of LSTM greatly helps the model to predict non-linear time-series data and to achieve comparatively better performance in river stream-flow forecasting. However, LSTM still has errors in predicting major peaks: most of the predicted major peaks exceed the actual value by approximately 10 percent. Better results may be achieved by adjusting the training process or by using a larger and higher-quality data set.
IV-G Extended Experiments for LSTM
IV-G1 Combinations of input data
Tab. III shows the errors of LSTM models fed with different combinations of input data, with all other conditions the same as in the standard experiment. The results show that historical stream-flow data play a significant role in forecast accuracy, but rainfall (and areal rainfall) data are also indispensable. Once rainfall-related data are present, however, the particular combination of rainfall data types makes little difference to the result. Rainfall data are somewhat more helpful than areal rainfall data.
IV-G2 Change of predict time step
The predict time step is how far into the future the LSTM model predicts. The standard model in this paper has a predict time step of 6: upon receiving the newest data, the model outputs a prediction for the 6th hour in the future. Fig. 3 shows the values of the three evaluation criteria for LSTM models with different predict time steps, with all other conditions the same as in the standard model. The results imply that the predict time step is negatively correlated with model accuracy, which makes sense because it is harder for a model to predict further into the future.
IV-G3 Change of encoder time step
The encoder time step is the number of hours of historical data fed into the LSTM model. The encoder time step of the standard model in this paper is 12. Fig. 4 shows the errors of models with different encoder time steps. Models with an encoder time step in the range [12,14] have approximately the best forecast accuracy.
| Evaluation Criteria | SVR model | MLP model | LSTM model |
Forecasting is critical to protecting human lives and property from flood disasters. This paper proposed a method of stream-flow forecasting using the LSTM network, a kind of deep learning neural network derived from the RNN and equipped with a remember-forget mechanism that avoids exploding or vanishing gradients. To demonstrate its advantage in forecasting non-linear time series, it was compared with the machine learning SVR model and the deep learning MLP model in forecasting the stream flow of Tunxi, China. The experimental results show that the LSTM model provides more stable and more accurate predictions than the SVR and MLP models.
However, there is still room for improvement in the LSTM stream-flow forecasting model: the results show errors in peak volume forecasts that cannot be ignored. The model may be improved in the following ways. First, because of limited time and hardware capacity, the parameter choices for the LSTM model are based only on simple tests and lack thorough study. Moreover, most of the model's default parameters were left at their original values. Further study of parameter tuning may improve the model's accuracy. Second, the data used in the experiment span more than 20 years. Because of limitations in technology and management in the past, the original data contain a certain degree of disorder and gaps, and various corrections were made to the data. Feeding the model higher-quality data may improve its performance.
- (2006) Multi-time scale stream flow predictions: the support vector machines approach. Journal of Hydrology 318 (1-4), pp. 7–16.
- (2013) Parametric and physically based modelling techniques for flood risk and vulnerability assessment: a comparison. Environmental Modelling & Software 41, pp. 84–92.
- (2007) Support vector regression. Neural Information Processing - Letters and Reviews 11 (10), pp. 203–224.
- Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics 59 (4-5), pp. 291–294.
- (2005) Application of generalized regression neural networks to intermittent flow forecasting and estimation. Journal of Hydrologic Engineering 10 (4), pp. 336–341.
- (2019) Accurate prediction of streamflow using long short-term memory network: a case study in the Brazos River Basin in Texas. International Journal of Environmental Science and Development 10 (10).
- (2015) Flood forecasting model based on geographical information system. Proceedings of the International Association of Hydrological Sciences 368, pp. 192–196.
- (2007) A neuro-fuzzy model for inflow forecasting of the Nile River at Aswan High Dam. Water Resources Management 21 (3), pp. 533–556.
- (2018) A hydrologic forecast method based on LSTM-BP. Computer and Modernization (7), pp. 19.
- (2009) Linear genetic programming for time-series modelling of daily flow rate. Journal of Earth System Science 118 (2), pp. 137–146.
- (2017) Bayesian flood forecasting methods: a review. Journal of Hydrology 551, pp. 340–351.
- (1994) Neural networks: a comprehensive foundation. Prentice Hall PTR.
- (2002) Self-organizing linear output map (SOLO): an artificial neural network suitable for hydrologic modeling and analysis. Water Resources Research 38 (12), pp. 38–1.
- (2010) Application of radial basis function neural networks to short-term streamflow forecasting. Physics and Chemistry of the Earth, Parts A/B/C 35 (13-14), pp. 571–581.
- (2008) Short-term streamflow forecasting with global climate change implications: a comparative study between genetic programming and neural network models. Journal of Hydrology 352 (3-4), pp. 336–354.
- (2011) Assessment of input variables determination on the SVM model performance using PCA, Gamma test, and forward selection techniques for monthly stream flow prediction. Journal of Hydrology 401 (3-4), pp. 177–189.
- (2009) Comparison of fuzzy inference systems for streamflow prediction. Hydrological Sciences Journal 54 (2), pp. 261–273.
- (2013) How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026.
- (2014) Predicting monsoon floods in rivers embedding wavelet transform, genetic algorithm and neural network. Water Resources Management 28 (2), pp. 301–317.
- (2012) River flow estimation and forecasting by using two different adaptive neuro-fuzzy approaches. Water Resources Management 26 (6), pp. 1715–1729.
- (1999) A genetic programming approach to rainfall-runoff modelling. Water Resources Management 13 (3), pp. 219–231.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2010) Short-term and long-term streamflow forecasting using a wavelet and neuro-fuzzy conjunction model. Journal of Hydrology 394 (3-4), pp. 486–493.
- (2005) Flow categorization model for improving forecasting. Hydrology Research 36 (1), pp. 37–48.
- (2018) Natural disasters in 2017: lower mortality, higher cost. Brussels, Belgium: Centre for Research on the Epidemiology of Disasters.
- (2015) Artificial intelligence based models for stream-flow forecasting: 2000–2015. Journal of Hydrology 530, pp. 829–844.
- (2018) Developing a long short-term memory (LSTM) based model for predicting water table depth in agricultural areas. Journal of Hydrology 561, pp. 918–929.