1 Introduction
Digital payments, mobile banking apps, and digital money management tools such as personal financial management apps now have a strong presence in the financial industry. There is an increasing demand for tools that make interaction more efficient, improve the user experience, work well across different devices, and take, or help to take, automatic decisions.
The Machine Learning community has shown early signs of interest in this domain. For instance, several public contests have been introduced since 2015. These notably include a couple of Kaggle challenges (https://www.kaggle.com/c/santander-product-recommendation, https://www.kaggle.com/c/sberbank-russian-housing-market, https://www.kaggle.com/c/bnp-paribas-cardif-claims-management) and the 2016 ECML Data Discovery Challenge. Some studies have addressed Machine Learning for financial products, such as recommendation prediction [19], location prediction [22] or fraud detection [20].
Industrial context. Our present research has been carried out within the abovementioned context of digital money management functions offered via our organization’s mobile banking app and website. This includes the latest “expense and income forecasting” tool which, by using large amounts of historical data, estimates customers’ expected expenses and incomings. On a monthly basis, our algorithm detects recurrent expenses (either specific operations or aggregated amounts of expenses within a specific category). From now on, we refer to all of these as “expenses”, because our models treat income as “negative expenses”. We feed the mobile app with the generated results, which enables customers to anticipate said expenses and, hence, plan for the month ahead.
This function is currently in operation on our mobile app and website. It is available to 5M customers, and generates hundreds of thousands of monthly visits. Fig. 1 displays a screenshot.
The tool consists of several modules, some of which involve Machine Learning models. A specific problem solved by one of the modules is to estimate the amount of money a customer will spend in a specific financial category in the upcoming month. We have financial category labels attached to each operation. Categories include financial events, as well as everyday products and services such as ATM withdrawals, salary, or grocery shopping, among others (as we will explain hereafter). This problem can be expressed as a regression problem, where the input is a set of attributes of the customer history, while the output is the monetary amount to be anticipated. However, the problem presents several unique challenges. Firstly, as we work with monthly data, the time series are short, and limited to a couple of years; thus, we cannot capture long-term periodicities. Secondly, while personal expenses data can exhibit certain regular patterns, in most cases series can appear erratic or can manifest spurious spikes; indeed, it is natural that not all factors for predicting a future value are captured by past values.
Indeed, preliminary tests with classical time series methods such as the Holt-Winters procedure yielded poor results. One hypothesis is that classical time series methods perform inference on a per-series basis and therefore require a long history, as discussed in [17].
With the aim of evolving this solution, we asked ourselves whether Deep Learning methods can offer a competitive solution. Deep Learning algorithms have become the state-of-the-art in fields like computer vision or automatic translation due to their capacity to fit very complex functions $\hat{y} = \phi(x)$ to large data sets of pairs $(x, y)$.

Our proposed solution. This article recounts our experience in solving one of the main limitations of Deep Learning methods for regression: they do not take into account the variability in the prediction. The learned function $\phi$ provides a pointwise estimate of the output target, while the model does not predict a probability distribution of the target or a range of possible values. In other words, these algorithms are typically incapable of assessing how confident they are in their predictions.
This paper tackles the real-world problem of forecasting upcoming customer expenses in certain categories based on the historical data available. We pose the problem as a regression problem, in which we build features for a user history and fit a function that estimates the most likely expense in the subsequent time period (dependent variable), given a data set of user expenses. Previous studies have shown that Deep Networks provide lower error rates than other models [8].
At any rate, Neural Networks are forced to take a decision in every case. Rather than minimising a forecast error for all points, we seek mechanisms to detect, for each user, a small fraction of predictions where the forecast is confident. This is done for two reasons. First, since notifying a user about an upcoming expense is a value-added feature, we ask ourselves whether it is viable to reject forecasts about which we are not certain. Second, as a user may have dozens of expense categories, this is a way of selecting the relevant impending expenses.

To tackle the issue of prediction with confidence, potential solutions in the literature combine the ability of Deep Learning to estimate generic functions with probabilistic treatments, including Bayesian frameworks for Deep Learning ([18], [4], [3], [10]). Inspired by these solutions, we provide a formulation of Deep regression Networks which outputs the parameters of a certain distribution, corresponding to an estimated target and an input-dependent (i.e. Heteroscedastic) variance of the estimate; we fit such a network by maximum likelihood estimation. We evaluate the network both by its accuracy and by its capability to select “predictable” cases. For our purpose, involving millions of noisy time series, we performed a comprehensive benchmark and observed that it outperforms non-trivial baselines and some of the approaches cited above.
2 Method
This section presents the details of the proposed method.
2.1 Deep Learning as a point estimation
A generic and practical solution to tackle the regression problem is Deep Learning as it is able to approximate any continuous function following the universal approximation theorem [16].
Straightforward implementations of Deep Learning models for regression do not typically consider a confidence interval (or score) of the estimated target. However, in our particular setting, it is critical that the model identifies the reliable forecasts, and can ignore forecasts where user spending gets noisier. Furthermore, if we can find a correlation between the noisy time series and the error of prediction, a strategy could be devised to improve the general performance of the system.
In this paper we denote by $\phi$ the function computed by the Neural Network, with $x$ a vector of inputs (the attributes) and $\hat{y} = \phi(x)$ the output, i.e. the forecasted amount. We will denote the set of all weights by $w$.
In particular, we will use two standard types of layers. On the one hand, we consider non-linear stacking of Dense layers. A Dense layer is a linear combination of the weights $W$ (and bias $b$) with the input of the layer $z$, passed through an activation function, act, i.e.

$$\mathrm{Dense}(z) = \mathrm{act}(W z + b).$$
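On the other hand, we consider an LSTM layer ([15], [11]) which, like other typical recurrent layers [6], maps a time series with $T$ steps to an output that can have $T$ steps or fewer, depending on our needs. To do this, the input $x_t$ at each step $t$ is combined with the same weights and non-linearities for all the steps in the following way:

$$
\begin{aligned}
f_t &= \mathrm{sig}(W_f x_t + U_f h_{t-1} + b_f), \\
i_t &= \mathrm{sig}(W_i x_t + U_i h_{t-1} + b_i), \\
o_t &= \mathrm{sig}(W_o x_t + U_o h_{t-1} + b_o), \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
h_t &= o_t \circ \tanh(c_t),
\end{aligned}
$$

where $W$, $U$ and $b$ are the weights of the cell (shared for all the steps), sig are Sigmoid functions, $\circ$ denotes the element-wise product, $c_t$ is the cell state and $h_t$ is the output at step $t$.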
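To make the two layer types concrete, the following is a minimal plain-Python sketch of a Dense layer and of one step of an LSTM cell with a forget gate. It is an illustration under stated assumptions, not the Keras implementation used in the experiments: the cell is scalar (one input and one hidden unit), the Dense layer includes a bias term, and all function names and weight values are hypothetical.

```python
import math


def relu(v):
    """Rectified linear unit for a scalar."""
    return max(0.0, v)


def sig(v):
    """Logistic sigmoid for a scalar."""
    return 1.0 / (1.0 + math.exp(-v))


def dense(weights, bias, inputs, act=relu):
    """Dense layer: act(W z + b), with `weights` given as a list of rows."""
    return [act(sum(w * z for w, z in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    `params` maps each gate name ("f", "i", "o", "c") to a tuple
    (W, U, b); the same tuple is reused at every time step.
    """
    def gate(name, squash):
        w, u, b = params[name]
        return squash(w * x_t + u * h_prev + b)

    f_t = gate("f", sig)            # forget gate
    i_t = gate("i", sig)            # input gate
    o_t = gate("o", sig)            # output gate
    c_cand = gate("c", math.tanh)   # candidate cell state
    c_t = f_t * c_prev + i_t * c_cand
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t


def run_lstm(series, params):
    """Unroll the cell over a series and return the output of every step."""
    h, c = 0.0, 0.0
    outputs = []
    for x_t in series:
        h, c = lstm_step(x_t, h, c, params)
        outputs.append(h)
    return outputs


# Hypothetical weights, chosen only to make the example concrete.
dense_out = dense([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, -1.0], [2.0, 4.0, 3.0])
shared = {name: (0.5, 0.1, 0.0) for name in ("f", "i", "o", "c")}
hidden = run_lstm([1.0, -0.5, 2.0], shared)
```

Returning the output of every step corresponds to the case where the layer keeps all the steps; keeping only the last element of `hidden` corresponds to reducing the series to a single output.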
In the experimental section, we consider different Neural Network architectures by stacking Dense and/or LSTM layers. Fitting the network weights $w$ involves minimizing a loss function, which is typically done with the standard back-propagation procedure. Nowadays, several optimizers are implemented in standard Deep Learning libraries, which build on automatic differentiation once the loss is specified. All our Deep Learning models are implemented in Keras [7] with the TensorFlow [1] backend, and the specific loss functions are discussed below.

2.2 Types of uncertainty
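With our system it is critical to make highly accurate predictions and, at the same time, to assess the confidence of the predictions. We use the predictions to alert the user to potential impending expenses. Because of this, the cost of making a wrong prediction is much higher than that of not making any prediction at all. Considering the bias-variance tradeoff between the predicted function $\phi$ and the real function to be predicted $f$, the sources of the error could be thought of as:

$$E\big[(y - \phi(x))^2\big] = \big(E[\phi(x)] - f(x)\big)^2 + E\big[(\phi(x) - E[\phi(x)])^2\big] + \sigma^2 = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Irreducible\ error},$$

where $y = f(x) + \epsilon$ and $\sigma^2$ is the variance of the noise $\epsilon$.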
To achieve a mechanism to perform rejection of predictions within the framework of Deep Learning, we will introduce the notion of variance in the prediction. According to the Bayesian viewpoint proposed by [9], it is possible to characterise the concept of uncertainty into two categories depending on the origin of the noise. On the one hand, if the noise applies to the model parameters, we will refer to Epistemic uncertainty (e.g. Dropout, following [10], could be seen as a way to capture the Epistemic uncertainty). On the other hand, if the noise occurs directly in the output given the input, we will refer to it as Aleatoric uncertainty. Additionally, Aleatoric uncertainty can further be categorised into two more categories: Homoscedastic uncertainty, when the noise is constant for all the outputs (thus acting as a “measurement error”), or Heteroscedastic uncertainty when the noise of the output also depends explicitly on the specific input (this kind of uncertainty is useful to model effects such as occlusions / superpositions of factors or variance in the prediction for an input).
2.3 A generic Deep Learning regression Network with Aleatoric uncertainty management
This section describes the proposed framework to improve the accuracy of our forecasting problem by modelling Aleatoric uncertainty in Deep Networks.
The idea is to pose Neural Network learning as probabilistic function learning. We follow a formulation of Mixture Density Network models [3], where we do not minimise the typical loss function (e.g. mean squared error), but rather maximise the likelihood of the forecasts.

In that formulation, a likelihood function is defined over the output of a Neural Network with a Normal distribution, $\mathcal{N}(\phi(x), \sigma^2)$, where $\phi$ is the Neural Network function and $\sigma^2$ is the variance of the Normal distribution. However, nothing prevents us from using another distribution if it is more convenient for our very noisy problem. We therefore decided to use a Laplace distribution, defined as

$$LP(y \mid \phi(x), b) = \frac{1}{2b} \exp\left(-\frac{|y - \phi(x)|}{b}\right), \qquad (1)$$

which, like the Normal distribution, has two parameters (location and scale) and similar properties, but avoids the squared difference and the squared scale denominator of the Normal distribution, which behave unstably in practice at the start of the weight optimisation or when the absolute error is sizeable. Furthermore, because the logarithm is monotonic, maximising the likelihood is equivalent to minimising the negative logarithm of the likelihood (Eq. 1), i.e. our loss function $\mathcal{L}$ to minimise is

$$\mathcal{L}\big(w; \{(x_i, y_i)\}_{i=1}^{N}\big) = \sum_{i=1}^{N} \left( \log(2b) + \frac{|y_i - \phi(x_i)|}{b} \right), \qquad (2)$$

where $w$ are the weights of the Neural Network to be optimised.

In line with the above argument, note that $b$ captures the Aleatoric uncertainty. Therefore, this formulation applies to both the Homoscedastic and the Heteroscedastic case. In the former case, $b$ is a single parameter to be optimised, shared by all inputs.

In the latter, $b(x)$ is a function that depends on the input. Our assumption is that the features needed to detect the variance behaviour of the output are not directly related to the features needed to forecast the predicted value. Thus, $b(x)$ can itself be the result of another input-dependent Neural Network, $b(x) = \psi(x)$, and all parameters of $\phi$ and $\psi$ are optimised jointly. It is important to highlight that $b$ is a strictly positive scale parameter, meaning that we must restrict the output of $\psi(x)$, or the value of $b$, to positive values.
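The Laplace negative log-likelihood used as a loss in this section can be sketched in a few lines of plain Python. This is a minimal illustration rather than the loss code used in the experiments: the names `laplace_nll` and `mean_laplace_nll` are hypothetical, the constant term log 2 is kept, and the loss is averaged over the batch instead of summed, which does not change the minimiser.

```python
import math


def laplace_nll(y, location, scale):
    """Negative log-likelihood of y under Laplace(location, scale)."""
    if scale <= 0.0:
        raise ValueError("the scale must be strictly positive")
    return math.log(2.0 * scale) + abs(y - location) / scale


def mean_laplace_nll(targets, locations, scales):
    """Average loss over a batch.

    `scales` holds one value per sample: an input-dependent value in the
    Heteroscedastic case, or the same value repeated in the Homoscedastic one.
    """
    losses = [laplace_nll(y, m, b)
              for y, m, b in zip(targets, locations, scales)]
    return sum(losses) / len(losses)


# A large error is penalised less when the predicted scale is large,
# while a small error is penalised less when the predicted scale is small.
big_err_small_scale = laplace_nll(100.0, 0.0, 1.0)
big_err_large_scale = laplace_nll(100.0, 0.0, 50.0)
small_err_small_scale = laplace_nll(1.0, 0.0, 1.0)
small_err_large_scale = laplace_nll(1.0, 0.0, 50.0)
```

For a fixed absolute error e, the per-sample loss log(2b) + e/b is minimised at b = e, so a scale fitted this way tends to track the size of the error expected for that input. That is what makes it usable as an uncertainty score.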
3 Experimental settings and results
3.1 Problem setting
Our problem is to forecast upcoming expenses in personal financial records with a data set constructed with historical account data from the previous 24 months. The data was anonymized, e.g. customer IDs are removed from the series, and we do not deal with individual amounts, but with monthly aggregated amounts. All the experiments were carried out within our servers. The data consists of series of monetary expenses for certain customers in selected expense categories.
We cast this problem as a rolling-window regression problem. Denoting the observed values of expenses in an individual series over a window of the last $T$ months as $z = (z_1, \dots, z_T)$, we extract attributes $x$ from $z$. The problem is then to estimate the most likely value of the $(T+1)$-th month from the attributes, i.e. fit a function for the problem $\hat{y} = \phi(x(z))$, with $y = z_{T+1}$, where we made explicit in the notation that the attributes depend implicitly on the raw inputs. To fit such a model we need a data set of $N$ instances of pairs $(x, y)$, i.e. $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$.
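As an illustration of the rolling-window formulation, the sketch below turns one series into pairs of a fixed-length input window and the value of the following month. The function name `rolling_instances` and the toy amounts are hypothetical, and whether several windows or only the most recent one are taken from each series is an assumption made here for illustration.

```python
def rolling_instances(series, window):
    """Split one series into (inputs, target) pairs: `window` consecutive
    months as inputs and the month that follows them as the target."""
    pairs = []
    for start in range(len(series) - window):
        inputs = series[start:start + window]
        target = series[start + window]
        pairs.append((inputs, target))
    return pairs


monthly_expenses = [120.0, 80.0, 95.0, 110.0, 130.0]
pairs = rolling_instances(monthly_expenses, window=3)
```

With five observed months and a window of three, this yields two training pairs; a series shorter than or equal to the window yields none.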
To illustrate the nature and the variability of the series, in Fig. 3 we visualise the values of the raw series grouped by the clusters resulting from a $k$-means algorithm. Prior to the clustering, the series were centered by their mean and normalised by their standard deviation, so that the emergent clusters indicate scale-invariant behaviours such as periodicity.

As seen in Fig. 3, the series present clear behaviour patterns. This leads us to believe that casting the problem as a supervised learning one, using a large data set, is adequate and would capture general recurrent patterns.
Still, it is apparent that the nature of these data sets presents some challenges, as perceived from the variability of certain clusters in Fig. 3, and from some of the broad distributions of the output variable. In many individual cases, an expense may result from erratic human actions or spurious events, or depend on factors not captured by the past values of the series. One of our objectives is to have an uncertainty value of the prediction to detect those series, and only communicate forecasts for those cases for which we are confident.
3.1.1 Data preprocessing details
We generate monthly time series by aggregating the data over the contract id plus the transaction category label given by an automatic internal categorizer, which employs a set of strategies including text mining classifiers. While this process generates dozens of millions of time series in production, this study uses a random sample of 2 million time series for the training set, and 1 million time series for the test set. In order to ensure a head-to-head comparison with the existing system, the random sample is taken only from the subset of series which enter the forecasting module (many other series are discarded by different heuristics during the production process). It is worth noting that most of these series have non-zero target values. From these series we construct the raw data $z$ and the targets $y$, from which we compute attributes.

As attributes, we use the values of the series, normalized as follows:

$$\tilde{z}_t = \frac{z_t - \mathrm{mean}(z)}{\max(\mathrm{std}(z), \tau)}, \quad t = 1, \dots, T,$$

where $\tau$ is a threshold value.

We also add the mean and standard deviation as attributes, so that the full attribute vector can be written as $x = (\tilde{z}_1, \dots, \tilde{z}_T, \mathrm{mean}(z), \mathrm{std}(z))$.
The rationale for this choice is: (i) important financial behaviours such as periodicity tend to be scale-invariant, (ii) the mean recovers scale information (which is needed as the forecast is a real monetary value), (iii) the spread of the series could provide information for the uncertainty. We converged on these attributes after a few preliminary experiments.
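The attribute construction described in this section (normalised values of the series plus its mean and standard deviation) can be sketched as follows. The exact way the threshold enters the normalisation is an assumption: here it is taken as a lower bound on the denominator, and the population standard deviation is used. The name `build_attributes` and the default threshold are hypothetical.

```python
import statistics


def build_attributes(series, threshold=1e-6):
    """Return the normalised series followed by its mean and std."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)      # population standard deviation
    denom = max(std, threshold)          # assumed guard for flat series
    normalised = [(value - mean) / denom for value in series]
    return normalised + [mean, std]


attributes = build_attributes([10.0, 20.0, 30.0])
flat = build_attributes([5.0, 5.0, 5.0])
```

The guard matters for series that are constant over the window: without it, the division by a zero standard deviation would fail, whereas with it the normalised values are all zero and the scale is still carried by the mean attribute.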
3.2 Evaluation measures
We evaluate all the methods in terms of an Error vs. Reject characteristic in order to take into account both the accuracy and the ability to reject uncertain samples. Given the point estimates $\hat{y}_i$ and uncertainty scores $u_i$ of the test set, we can discard all cases with an uncertainty score above a threshold $\kappa$, and compute an error metric only over the set of accepted ones. Sweeping the value of $\kappa$ gives rise to an Error vs. Keep tradeoff curve: at any operating point, one reads the fraction of points retained and the resulting error on those, thus measuring simultaneously the precision of a method and the quality of its uncertainty estimate.

To evaluate the error of a set of points, we use the well-known mean absolute error (MAE):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|,$$

which is the mean of the absolute deviations between a forecast and its actual observed value. To specify explicitly that this MAE is computed only on the set of samples where the uncertainty score does not exceed $\kappa$, we use the following notation:

$$\mathrm{MAE}(\kappa) = \frac{\sum_{i=1}^{N} |y_i - \hat{y}_i| \, [u_i \le \kappa]}{\sum_{i=1}^{N} [u_i \le \kappa]}, \qquad (3)$$

and the kept fraction is simply $\mathrm{Keep}(\kappa) = \frac{1}{N} \sum_{i=1}^{N} [u_i \le \kappa]$, where $[\cdot]$ equals 1 when the condition holds and 0 otherwise. Note that $\mathrm{MAE}(\kappa) = \mathrm{MAE}$ when $\kappa$ is the maximum value of the uncertainty score.
Our goal with this curve is to select a threshold such that the maximum error is controlled, since large prediction errors would affect the user experience of the product. In other words, we prefer to withhold some predictions rather than to predict erroneously.
Hence, with the best model we will observe the behaviour of the MAE error and order the points to predict according to the uncertainty score for that given model.
In addition, we will add visualisations to reinforce the arguments of the advantages of using the models described in this paper that take into account Aleatoric uncertainty to tackle noisy problems.
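The Error vs. Keep evaluation described in this section can be sketched as follows. This is a minimal illustration with hypothetical names and toy numbers; it accepts a sample when its uncertainty score is at most the threshold, and reports the MAE as undefined when nothing is accepted.

```python
def error_keep_curve(targets, forecasts, scores, thresholds):
    """Return one (keep_fraction, mae) pair per threshold, computed over
    the samples whose uncertainty score is at most that threshold."""
    total = len(targets)
    curve = []
    for kappa in thresholds:
        errors = [abs(y - f)
                  for y, f, u in zip(targets, forecasts, scores)
                  if u <= kappa]
        if errors:
            curve.append((len(errors) / total, sum(errors) / len(errors)))
        else:
            curve.append((0.0, None))  # nothing accepted: MAE undefined
    return curve


y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [11.0, 18.0, 30.0, 50.0]
uncertainty = [0.1, 0.5, 0.2, 0.9]
curve = error_keep_curve(y_true, y_pred, uncertainty, thresholds=[0.2, 0.5, 0.9])
```

In this toy case the sample with the largest error also has the largest uncertainty score, so the MAE on the accepted samples grows as the threshold is relaxed. A useful uncertainty score is expected to produce this kind of increasing curve.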
3.3 Baselines and methods under evaluation
We provide details of the methods we evaluate as well as the baselines we compare against. Every method below outputs an estimate of the target along with an uncertainty score. We indicate in italics how each estimator and uncertainty score is referred to in Table 1. For methods which do not explicitly compute uncertainty scores, we will use the variance of the input series as a proxy.
Trivial baselines. In order to validate that the problem cannot be easily solved with simple forecasting heuristics such as a moving average, we first evaluate three simple predictors: (i) the mean of the series (mean), (ii) a constant forecast of zero (zero), and (iii) the value of the previous month (last). In the three cases, we use the variance of the input as the uncertainty score (var).
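The three trivial predictors and their shared uncertainty score can be written down directly. The sketch below uses hypothetical names and assumes the population variance of the input window as the var score.

```python
import statistics


def trivial_forecasts(series):
    """Return the three baseline forecasts and the shared uncertainty score."""
    return {
        "mean": statistics.fmean(series),     # mean of the series
        "zero": 0.0,                          # constant zero forecast
        "last": series[-1],                   # value of the previous month
        "var": statistics.pvariance(series),  # uncertainty score
    }


baseline = trivial_forecasts([100.0, 120.0, 80.0, 100.0])
```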
Random Forest. We also compare against the forecasting method which is currently implemented in the expense forecasting tool. The tool consists of a combination of modules. The forecasting module is based on a random forest (RF*). Attributes include the values of the 24 months but also dozens of other carefully hand-designed attributes. The module outputs the forecasted amount and a discrete confidence label (low/medium/high), referred to as prec. For the sake of completeness, we also repeat the experiment using the variance of the input (var) as the confidence score.

General Additive Model. We also compare against a traditional regression method able to output estimates and their distribution; specifically, we use Generalized Additive Models (GAM) [2], with the uncertainty score denoted as SE. A GAM is a generalized linear model with a linear predictor involving a sum of smooth functions of covariates [12], which takes the following form:

$$g(\mu_i) = A_i \theta + f_1(x_{1i}) + f_2(x_{2i}) + \dots,$$

where $\mu_i \equiv E[Y_i]$, $Y_i$ is the response variable, which follows some exponential family distribution, $A_i$ is a row of the model matrix for any strictly parametric model components, $\theta$ is the corresponding parameter vector, the $f_j$ are smooth functions of the covariates $x_j$, and $g$ is a known link function. We fit the GAM in a simple form. The parameters in the model are estimated by penalized iteratively reweighted least squares (PIRLS), using Generalized Cross Validation (GCV) to control overfitting.

Deep Learning Models. Finally, we compare a number of Deep Learning models with various strategies to obtain the forecasted amount and an uncertainty score.
Dense networks. We start with a plain Dense Network architecture (referred to as Dense). In particular, our final selected Dense model is composed of one layer of 128 neurons followed by another of 64 neurons, each with a ReLU activation, followed by a single output. We use the Mean Absolute Error (MAE) as the loss function. Since a Dense regression Network does not provide a principled uncertainty estimate, we again use the variance of the series as the uncertainty score.

Epistemic Models. We also evaluate Bayesian Deep Learning alternatives, which offer a principled estimate of the uncertainty score. These carry out an Epistemic treatment of the uncertainty (see Section 2.2), rather than our Aleatoric treatment. We consider two methods: (i) adding Dropout [10] (with a Dropout probability of 0.5), referred to as DenseDrop, and (ii) a Bayes-by-Backprop variation [4], denoted as DenseBBB. Both techniques model the distribution of the network weights, rather than weight values. This allows us to take samples from the network weights. Therefore, given a new input, we obtain a set of output samples rather than a single output; we can take the mean of those samples as the final target estimate and their standard deviation as our uncertainty score. In Table 1, we denote the uncertainty score of Dropout and BBB as drop and BBB, respectively. For the sake of completeness, we also consider the simple case of the variance (var).
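The sampling-based estimate used for the Epistemic models can be sketched independently of any particular network. In the sketch below, `predict_with_samples` is a hypothetical helper that calls a stochastic predictor repeatedly, and `toy_dropout_forward` is only a toy stand-in for a network whose Dropout is kept active at prediction time; neither reproduces the models evaluated here.

```python
import random
import statistics


def predict_with_samples(stochastic_forward, x, n_samples=100):
    """Call a stochastic predictor repeatedly and summarise its outputs."""
    samples = [stochastic_forward(x) for _ in range(n_samples)]
    estimate = statistics.fmean(samples)   # final target estimate
    score = statistics.pstdev(samples)     # uncertainty score
    return estimate, score


rng = random.Random(0)  # fixed seed so the toy example is repeatable


def toy_dropout_forward(x, keep_prob=0.5):
    """Toy stand-in: each input is dropped at random, the rest are
    rescaled by 1 / keep_prob, and the result is averaged."""
    kept = [v / keep_prob if rng.random() < keep_prob else 0.0 for v in x]
    return sum(kept) / len(kept)


estimate, score = predict_with_samples(toy_dropout_forward, [10.0, 12.0, 11.0])
```

A deterministic predictor yields a score of zero under this scheme, which is why a plain Dense network has to fall back on the variance of the input series instead.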
Homoscedastic and Heteroscedastic Models. Here we consider fitting a network under the Aleatoric model (see Section 2.3), both for the Homoscedastic and Heteroscedastic cases. It is important to highlight that, in both cases, we restrict the scale to positive values by applying to the value of $b$, or to the output of $\psi(x)$, a function $g$ defined as the ELU function shifted up by 1:

$$g(v) = \mathrm{ELU}(v) + 1 = \begin{cases} v + 1 & \text{if } v \ge 0, \\ \alpha (e^{v} - 1) + 1 & \text{if } v < 0, \end{cases}$$

where $\alpha$ is the ELU parameter.

In addition, in the Heteroscedastic case, we used the same Neural Network architecture for the “variance” approximation $\psi(x)$ as for the “mean” approximation $\phi(x)$.
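The positivity constraint obtained by shifting the ELU function up by 1 can be sketched as below. The function name is hypothetical and the default of 1 for the ELU parameter is an assumption; the output is strictly positive as long as that parameter does not exceed 1.

```python
import math


def elu_plus_one(v, alpha=1.0):
    """ELU(v) + 1: maps any real value to a positive scale."""
    if v >= 0.0:
        return v + 1.0
    return alpha * (math.exp(v) - 1.0) + 1.0


scales = [elu_plus_one(v) for v in (-20.0, -1.0, 0.0, 2.0)]
```

The mapping is linear for non-negative inputs and decays exponentially towards 1 minus the ELU parameter for negative ones. In floating point, extremely negative inputs can still round to zero, so an implementation may want to add a small constant before dividing by the scale.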
In the Homoscedastic case, since the variance is constant for all the samples, we have to resort to the variance of the series as the uncertainty score. In contrast, the Heteroscedastic variance estimate is input-dependent and can thus be used as the uncertainty score in Table 1.
All the Deep Learning models were trained with Early Stopping, using a held-out part of the training set as a validation set.
LSTM networks. To conclude, we repeat the experiment, replacing the Dense Networks with long short-term memory (LSTM) networks, which are better suited to sequential tasks. Here we consider both types of Aleatoric uncertainty (i.e. LSTMHom and LSTMHet). In this particular case, the architecture uses two LSTM layers of 128 neurons followed by a Dense layer of 128 neurons, and finally a single output.

3.4 Results
Models comparison. In Figure 4 we present a visual comparison between the Error-Keep curves (Eq. 3) of RF* (indicating the points of low/medium/high confidence), RF* using var as confidence score, the GAM (which uses its own uncertainty estimate), and the best Dense and LSTM models (which, as will be detailed later, use the Heteroscedastic uncertainty estimates).
First we notice that the Heteroscedastic solutions are better than all the other previous solutions. This confirms our assumption that, for our given noisy problem, taking into account the variability of the output provides better accuracy at given retention rates.
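Table 1: MAE of each predictor + uncertainty score combination at keep fractions K = 25%, 41%, 50%, 75%, 99.5% and 100%. Rows: Mean, Zeros, Last, GAM (two uncertainty scores), RF* + prec (N/A at three keep fractions), RF* with a second score, Dense, DenseDrop (two scores, including Drop), DenseBBB (two scores, including BBB), DenseHom, DenseHet (two scores), LSTM, LSTMHom, LSTMHet (two scores).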
For a more detailed analysis, Table 1 shows the MAE values of all the compared methods for several cut-off points of the Error-Keep curve, namely for keep = 25%, 50%, 75% and 100%, and additionally for 41% and 99.5% (for direct comparison with the currently implemented RF* method, because these are the cut-off values obtained by the RF* method with confidence=high and confidence=medium). Note that we are more interested in low values of keep (typically below 50%), as they correspond to “selecting” the most confident samples; but we show all the percentages for completeness.
For Deep Learning models, to avoid a known issue of sensitivity with respect to the random seeds, we repeated all experiments 6 times with different random initialisations and report the mean and standard deviation over the 6 runs.
Table 1 shows several interesting aspects. First, we confirm that the best performing model is the LSTM network with the proposed Heteroscedastic uncertainty treatment, except at the points where keep ≥ 99.5%. These points, however, are not of practical importance, because there is virtually no rejection and the total number of errors is high for all the methods. Had we not considered LSTM networks as an option, the best performing model would still be a Dense Network with Heteroscedastic uncertainty. We also observed that the performance ranking of the incremental experiments is consistent between Dense and LSTM.
We also note that, for this particular task, the Aleatoric methods (both Homoscedastic and Heteroscedastic) perform better than Epistemic models such as Dropout or BBB. We believe this to be specific to the task at hand, where we have millions of short and noisy time series. We conclude that the variability of human spending, with complex patterns, erratic behaviours and intermittent spending, is captured more accurately by directly modelling an input-dependent variance with a complex function than by considering model invariance (which may be better suited for cases where the data is more scarce or the noise smoother).
Another interesting observation is that the Random Forest baseline used more attributes than the proposed solutions based on Deep Learning, which just used values of the historical time series. This confirms that fitting a Deep Network on a large data set is a preferred solution; even more when being able to model the uncertainty of the data, as is the case here.
Last but not least, comparing a given model with its Heteroscedastic version under the same uncertainty score, we also observe an accuracy improvement. This means that taking uncertainty into account in the training process, in our noisy problem, not only gives us an uncertainty score with which to reject uncertain samples, but also helps to predict better.
Error-Uncertainty score correlation. An interesting question is whether or not there exists any relationship between the errors of a model and the uncertainty scores it provides. In Figure 5 we show the correlation between the MAE of randomly selected points of the test set and the uncertainty score of the best Heteroscedastic model, expressed in logarithmic scale. To reduce the effect of clutter, we used a colourmap for each point that represents the density of the different parts of the Figure, as resulting from a Gaussian Kernel Density Estimate.
This figure exhibits two main regimes. On the one hand, for high uncertainty scores, we observe scattered samples (as indicated by their purple colour) where the errors range from very low to very high values and do not seem to align with the uncertainty scores. Upon inspection, many of these series correspond to what humans would consider “unpredictable”: e.g. an expense in month $T+1$ which was much larger than any of the observed expenses, or series without a clear pattern. On the other hand, we observe a prominent high-concentration area (the yellowish one) of low-error and low-uncertainty values, indicating that, by setting the threshold under a specific value of the uncertainty score, the system mostly selects low-error samples.
4 Related Work
When Deep Learning is used for classification, it is common to use a Softmax activation function in the last layer [5]. This yields a non-calibrated probability score which can be used heuristically as the confidence score or calibrated to a true probability [13]. However, in the case of regression, the output variables are not class labels and it is not possible to obtain such scores. On the other hand, [18] introduces the idea of combining different kinds of Aleatoric and Epistemic uncertainties. Nevertheless, as we saw in Table 1, the use of Epistemic uncertainty for our problem worsens the accuracy and the general performance of the model.
While there have been several proposals to deal with uncertainties in Deep Learning, they boil down to two main families. On the one hand, some approaches consider the uncertainty of the output. Typically, they construct a network architecture so that its output is not a point estimate, but a distribution [3]. On the other hand, some approaches consider the uncertainty of the model. These apply a Bayesian treatment to the optimisation of the parameters of a network ([10], [4], [14], [21]). In the present work we observe that, for a highly noisy problem, changing the loss function so as to maximise a likelihood function as in [3], together with the changes explained above, provides a significant improvement, which is crucial for identifying forecasts with a high degree of confidence and even improves the accuracy.
We are dealing with a problem that contains high levels of noise, and making predictions is therefore risky. For these two reasons, our goal was to find a solution that improves on the mean-variance solution, which can itself be a “challenging” solution. In order to do that, we combined several ideas proposed in [18] and [3] to create a Deep Learning model that takes the uncertainty into account and provides a better performance.
5 Conclusion
We explore a new solution for an industrial problem of forecasting real expenses of customers. Our solution is based on Deep Learning models for effectiveness and solves the challenge of uncertainty estimation by learning both a target output and its variance, and by performing maximum likelihood estimation of the resulting model, which contains one network for the target output and another for its variance. We show that this solution obtains better error-reject characteristics than other (traditional and Deep) principled models for regression uncertainty estimation, and outperforms the characteristic that would be obtained by the current industrial system in place. While Epistemic models such as Dropout or BBB did not improve the performance in this specific task, we are already working on combining them with our Aleatoric treatment to consider both types of uncertainty in the same model. This is considered future work. We also highlight that, while the present model seems to be able to detect confident predictions, it still lacks mechanisms to deal with the “unknown unknowns” problem; we believe that incorporating ideas such as those in [13] may help in future work.

Acknowledgements
We gratefully acknowledge the Industrial Doctorates Plan of Generalitat de Catalunya for funding part of this research. The UB acknowledges the support of NVIDIA Corporation with the donation of a Titan X Pascal GPU and recognizes that part of the research described in this chapter was partially funded by TIN201566951C2, SGR 1219. We also thank Alberto Rúbio and César de Pablo for insightful comments as well as BBVA Data and Analytics for sponsoring the industrial PhD.
References
 [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), https://www.tensorflow.org/, software available from tensorflow.org
 [2] Anderson-Cook, C.M.: Generalized additive models: An introduction with R (2007)
 [3] Bishop, C.M.: Mixture density networks (1994)
 [4] Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural networks. ICML (2015)
 [5] Bridle, J.S.: Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In: Neurocomputing, pp. 227–236. Springer (1990)
 [6] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
 [7] Chollet, F., et al.: Keras. https://keras.io (2015)
 [8] Ciprian, M., Baldassini, L., Peinado, L., Correas, T., Maestre, R., Rodriguez-Serrano, J.A., Pujol, O., Vitrià, J.: Evaluating uncertainty scores for deep regression networks in financial short time series forecasting. Workshop on Machine Learning for Spatiotemporal Forecasting, in NIPS (2016)
 [9] Der Kiureghian, A., Ditlevsen, O.: Aleatory or epistemic? does it matter? Structural Safety 31(2), 105–112 (2009)
 [10] Gal, Y., Ghahramani, Z.: Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: ICML. pp. 1050–1059 (2016)
 [11] Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: Continual prediction with lstm (1999)
 [12] Hastie, T.J., Tibshirani, R.J.: Generalized additive models. London: Chapman & Hall (1990)
 [13] Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR (2017)
 [14] Hernández-Lobato, J.M., Adams, R.: Probabilistic backpropagation for scalable learning of bayesian neural networks. In: ICML. pp. 1861–1869 (2015)
 [15] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
 [16] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural networks 2(5), 359–366 (1989)
 [17] Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy. International journal of forecasting 22(4), 679–688 (2006)
 [18] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: NIPS. pp. 5580–5590 (2017)
 [19] Mitrović, S., Singh, G.: Predicting branch visits and credit card upselling using temporal banking data. ECML/PKDD (2016)
 [20] Mutanen, T., Ahola, J., Nousiainen, S.: Customer churn prediction - a case study in retail banking. In: Proc. of ECML/PKDD Workshop on Practical Data Mining. pp. 13–19 (2006)
 [21] Rasmussen, C.E.: A practical monte carlo implementation of bayesian learning. In: Advances in Neural Information Processing Systems. pp. 598–604 (1996)
 [22] Wistuba, M., Duong-Trung, N., Schilling, N., Schmidt-Thieme, L.: Bank card usage prediction exploiting geolocation information. ECML-PKDD 2016 Discovery Challenge on Bank Card Usage Analysis (2016)