Uncertainty Modelling in Deep Networks: Forecasting Short and Noisy Series

07/24/2018 ∙ by Axel Brando, et al. ∙ Universitat de Barcelona ∙ BBVA Data & Analytics

Deep Learning is a consolidated, state-of-the-art Machine Learning tool to fit a function when provided with large data sets of examples. However, in regression tasks, the straightforward application of Deep Learning models provides a point estimate of the target. In addition, the model does not take into account the uncertainty of a prediction. This represents a great limitation for tasks where communicating an erroneous prediction carries a risk. In this paper we tackle a real-world problem of forecasting impending financial expenses and incomings of customers, while displaying predictable monetary amounts on a mobile app. In this context, we investigate if we would obtain an advantage by applying Deep Learning models with a Heteroscedastic model of the variance of a network's output. Experimentally, we achieve a higher accuracy than non-trivial baselines. More importantly, we introduce a mechanism to discard low-confidence predictions, which means that they will not be visible to users. This should help enhance the user experience of our product.

1 Introduction

Digital payments, mobile banking apps, and digital money management tools such as personal financial management apps now have a strong presence in the financial industry. There is an increasing demand for tools that make interaction more efficient, improve the user experience, allow better browsing across devices, and take or help take automatic decisions.

The Machine Learning community has shown early signs of interest in this domain. For instance, several public contests have been introduced since 2015. This notably includes several Kaggle challenges (https://www.kaggle.com/c/santander-product-recommendation, https://www.kaggle.com/c/sberbank-russian-housing-market, https://www.kaggle.com/c/bnp-paribas-cardif-claims-management) and the 2016 ECML Data Discovery Challenge. Some studies have addressed Machine Learning for financial products such as recommendation prediction [19], location prediction [22] or fraud detection [20].

Industrial context. Our present research has been carried out within the above-mentioned context of digital money management functions offered via our organization’s mobile banking app and website. This includes the latest “expense and income forecasting” tool which, by using large amounts of historical data, estimates customers’ expected expenses and incomings. On a monthly basis, our algorithm detects recurrent expenses (either specific operations or aggregated amounts of expenses within a specific category). From now on, we refer to both expenses and incomings simply as “expenses”, because our models treat income as “negative expenses”. The generated results are fed to the mobile app, which enables customers to anticipate these expenses and, hence, plan for the month ahead.

This function is currently in operation on our mobile app and website. It is available to 5M customers, and generates hundreds of thousands of monthly visits. Fig. 1 displays a screenshot.

Figure 1: Screenshots of BBVA’s mobile app showing expected incomes and expenses. Global calendar view (left) and expanded view of one of the forecasts (right).

The tool consists of several modules, some of which involve Machine Learning models. A specific problem solved by one of the modules is to estimate the amount of money a customer will spend in a specific financial category in the upcoming month. We have financial category labels attached to each operation. Categories include financial events, as well as everyday products and services such as ATM withdrawals, salary, or grocery shopping, among others (as we will explain hereafter). This problem can be expressed as a regression problem, where the input is a set of attributes of the customer history, while the output is the monetary amount to be anticipated. However, the problem presents several unique challenges. Firstly, as we work with monthly data, the time series are short, and limited to a couple of years; thus, we fail to capture long-term periodicities. Secondly, while personal expenses data can exhibit certain regular patterns, in most cases series can appear erratic or can manifest spurious spikes; indeed, it is natural that not all factors for predicting a future value are captured by past values.

Indeed, preliminary tests with classical time series methods such as the Holt-Winters procedure yielded poor results. One hypothesis is that classical time series methods perform inference on a per-series basis and therefore require a long history, as discussed in [17].

With the aim of evolving this solution, we asked ourselves whether Deep Learning methods can offer a competitive solution. Deep Learning algorithms have become the state-of-the-art in fields like computer vision or automatic translation due to their capacity to fit very complex functions to large data sets of input-output pairs $(x, y)$.

Our proposed solution. This article recounts our experience in solving one of the main limitations of Deep Learning methods for regression: they do not take into account the variability in the prediction. In fact, the learned function provides a point-wise estimate of the output target, while the model does not predict a probability distribution of the target or a range of possible values. In other words, these algorithms are typically incapable of assessing how confident they are in their predictions.

This paper tackles a real-world problem of forecasting approaching customer financial expenses in certain categories based on the historical data available. We pose the problem as a regression problem, in which we build features for a user history and fit a function that estimates the most likely expense in the subsequent time period (dependent variable), given a data set of user expenses. Previous studies have shown that Deep Networks provide lower error rates than other models [8].

At any rate, Neural Networks are forced to take a decision for all the cases. Rather than minimising a forecast error for all points, we seek mechanisms to detect a small fraction of predictions for each user where the forecast is confident. This is done for two reasons. Since notifying a user about an upcoming expense is a value-added feature, we ask ourselves whether it is viable to reject forecasts for which we are not certain. Furthermore, as a user may have dozens of expense categories, this is a way of selecting relevant impending expenses.

To tackle the issue of prediction with confidence, potential solutions in the literature combine the good properties of Deep Learning to generically estimate functions with the probabilistic treatments including Bayesian frameworks of Deep Learning ([18], [4], [3], [10]). Inspired by these solutions, we provide a formulation of Deep regression Networks which outputs the parameters of a certain distribution that corresponds to an estimated target and an input-dependent variance (i.e. Heteroscedastic) of the estimate; we fit such a network by maximum likelihood estimation. We evaluate the network by both its accuracy and by its capability to select “predictable” cases. For our purpose involving millions of noisy time series, we performed a comprehensive benchmark and observed that it outperforms non-trivial baselines and some of the approaches cited above.

2 Method

This section presents the details of the proposed method.

Figure 2:

Overview representation of the process to transform a typical Deep Learning model that has a single output given an input vector to a generic Aleatoric Deep Learning model with Gaussian distributions.

2.1 Deep Learning as a point estimation

A generic and practical solution to tackle the regression problem is Deep Learning as it is able to approximate any continuous function following the universal approximation theorem [16].

Straightforward implementations of Deep Learning models for regression do not typically consider a confidence interval (or score) of the estimated target. However, with our particular setting, it is critical that the model identifies the reliable forecasts, and can ignore forecasts where user-spending gets noisier. Furthermore, if we can find a correlation between the noisy time-series and the error of prediction, a strategy could be devised to improve the general performance of the system.

In this paper we denote by $\phi$ the function computed by the Neural Network, with $x$ a vector of inputs (the attributes) and $\hat{y} = \phi(x)$ the output, i.e. the forecasted amount. We will denote the set of all weights by $w$.

In particular, we will use two standard types of layers. On the one hand, we consider non-linear stacking of Dense layers. A Dense layer is a linear combination of the weights $W$ with the input of the layer $u$, passed through an activation function, act, i.e.

$$\mathrm{Dense}(u) = \mathrm{act}(W u + b),$$

where $b$ is the bias vector of the layer.

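On the other hand, we consider an LSTM layer ([15], [11]) which, like the other typical recurrent layers [6], maps a time-series with $T$ steps to an output that can have $T$ steps or fewer depending on our needs. In order to do this, for each of the steps, the input $z_t$ at step $t$ will be combined with the same weights and non-linearities for all the steps in the following way:

$$\begin{aligned}
f_t &= \mathrm{sig}(W_f z_t + U_f h_{t-1} + b_f) \\
i_t &= \mathrm{sig}(W_i z_t + U_i h_{t-1} + b_i) \\
o_t &= \mathrm{sig}(W_o z_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c z_t + U_c h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where $W$, $U$ and $b$ are the weights of the cell (shared for all the $T$ steps), $\odot$ is the element-wise product, and sig are Sigmoid functions.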

In the experimental section, we consider different Neural Network architectures by stacking Dense and/or LSTM layers. Fitting the network weights $w$ involves minimizing a loss function, which is typically done with the standard back-propagation procedure. Nowadays, several optimizers are implemented in standard Deep Learning libraries, which build on automatic differentiation after specifying the loss. All our Deep Learning models are implemented in Keras [7] with a TensorFlow [1] backend, and the specific loss functions are discussed below.

2.2 Types of uncertainty

With our system it will be critical to make highly accurate predictions and, at the same time, assess the confidence of the predictions. We use the predictions to alert the user to potential impending expenses. Because of this, the cost of making a wrong prediction is much higher than not making any prediction at all. Considering the bias-variance trade-off between the predicted function $\phi$ and the real function to be predicted $f$, the sources of the error could be thought of as:

$$\mathbb{E}\big[(y - \phi(x))^2\big] = \underbrace{\big(\mathbb{E}[\phi(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\phi(x) - \mathbb{E}[\phi(x)])^2\big]}_{\text{Variance}} + \text{irreducible error},$$

where $\mathbb{E}[Z]$ is the expected value of a random variable $Z$.

To achieve a mechanism to perform rejection of predictions within the framework of Deep Learning, we will introduce the notion of variance in the prediction. According to the Bayesian viewpoint proposed by [9], it is possible to characterise the concept of uncertainty into two categories depending on the origin of the noise. On the one hand, if the noise applies to the model parameters, we will refer to Epistemic uncertainty (e.g. Dropout, following [10], could be seen as a way to capture the Epistemic uncertainty). On the other hand, if the noise occurs directly in the output given the input, we will refer to it as Aleatoric uncertainty. Additionally, Aleatoric uncertainty can further be categorised into two more categories: Homoscedastic uncertainty, when the noise is constant for all the outputs (thus acting as a “measurement error”), or Heteroscedastic uncertainty when the noise of the output also depends explicitly on the specific input (this kind of uncertainty is useful to model effects such as occlusions / superpositions of factors or variance in the prediction for an input).

2.3 A generic Deep Learning regression Network with Aleatoric uncertainty management

This section describes the proposed framework to improve the accuracy of our forecasting problem by modelling Aleatoric uncertainty in Deep Networks.

The idea is to pose Neural Network learning as a probabilistic function learning. We follow a formulation of Mixture Density Network models [3], where, instead of minimising a typical loss function (e.g. mean squared error), we maximise the likelihood of the observed targets.

In [3] and [18], the authors define a likelihood function over the output of a Neural Network with a Normal distribution, $p(y \mid x) = \mathcal{N}(y; \phi(x), \sigma^2)$, where $\phi$ is the Neural Network function and $\sigma^2$ is the variance of the Normal distribution. However, nothing prevents us from using another distribution if it is more convenient for our very noisy problem. So we decided to use a Laplace distribution defined as

$$p(y \mid x) = \frac{1}{2\sigma} \exp\left(-\frac{|y - \phi(x)|}{\sigma}\right) \qquad (1)$$

Like the Normal distribution, it has two parameters (location and scale), but it avoids the squared difference and the squared scale in the denominator of the Normal distribution, which we found empirically to behave unstably in the initial steps of the Neural Network weights optimisation or when the absolute error is sizeable. Furthermore, because of the monotonic behaviour of the logarithm, maximising the likelihood is equivalent to minimising the negative logarithm of the likelihood (Eq. 1), i.e. our loss function, $\mathcal{L}$, to minimise will be as follows (up to the additive constant $N \log 2$)

$$\mathcal{L}(w, \sigma; x, y) = \sum_{i=1}^{N} \left( \log \sigma + \frac{|y_i - \phi(x_i)|}{\sigma} \right) \qquad (2)$$

where $w$ are the weights of the Neural Network to be optimised.

In line with the above argument, note that $\sigma$ captures the Aleatoric uncertainty. Therefore, this formulation applies to both the Homoscedastic and Heteroscedastic case. In the former case, $\sigma$ is a single parameter to be optimised.

In the latter, $\sigma(x)$ is a function that depends on the input. Our assumption is that the features needed to detect the variance behaviours of the output are not directly related to the features needed to forecast the predicted value. Thus, $\sigma(x)$ can itself be the result of another input-dependent Neural Network $\sigma(x) = \psi(x)$, and all parameters of $\phi$ and $\psi$ are optimised jointly. It is important to highlight that $\sigma$ is a strictly positive scale parameter, meaning that we must restrict the output of $\psi$ or the values of $\sigma$ to positive values.
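To make the loss concrete, the following is a minimal sketch in plain Python of the Laplace negative log-likelihood, summed over samples and with the constant log 2 dropped. The function names (`laplace_nll`, `homoscedastic_nll`) and the toy numbers are illustrative assumptions, not the implementation used in the experiments, which relies on Keras.

```python
import math


def laplace_nll(y_true, mu, sigma):
    """Sum of Laplace negative log-likelihoods, dropping the constant log(2).

    y_true: observed targets; mu: predicted locations; sigma: predicted
    scales (one per sample, all strictly positive).
    """
    if not (len(y_true) == len(mu) == len(sigma)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for y, m, s in zip(y_true, mu, sigma):
        if s <= 0:
            raise ValueError("scale must be strictly positive")
        total += math.log(s) + abs(y - m) / s
    return total


def homoscedastic_nll(y_true, mu, sigma):
    """Same loss with a single scale shared by every sample."""
    return laplace_nll(y_true, mu, [sigma] * len(y_true))


# Toy example (made-up numbers): one easy sample and one noisy sample.
y_true = [10.0, 200.0]
mu = [12.0, 150.0]
shared = homoscedastic_nll(y_true, mu, 1.0)       # same scale everywhere
per_input = laplace_nll(y_true, mu, [2.0, 50.0])  # larger scale on the noisy sample
```

For a single sample with absolute error $e$, the term $\log \sigma + e / \sigma$ is minimised at $\sigma = e$, so a scale fitted this way tends to track the expected absolute error, which is what makes it usable as an uncertainty score.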

3 Experimental settings and results

3.1 Problem setting

Our problem is to forecast upcoming expenses in personal financial records with a data set constructed with historical account data from the previous 24 months. The data was anonymized, e.g. customer IDs are removed from the series, and we do not deal with individual amounts, but with monthly aggregated amounts. All the experiments were carried out within our servers. The data consists of series of monetary expenses for certain customers in selected expense categories.

Figure 3: Clustering of the normalised points time-series by using the transformation. The grey lines are samples and the blue line is the centroid for each cluster.

We cast this problem as a rolling-window regression problem. Denoting the observed values of expenses in an individual series over a window of the last $T$ months as $z = (z_1, \dots, z_T)$, we extract attributes $x = x(z)$ from $z$. The problem is then to estimate the most likely value of the $(T+1)$th month from the attributes, i.e. fit a function $\phi$ for the problem $\hat{y} = \phi(x(z))$, with $y = z_{T+1}$, where we made explicit in the notation that the attributes depend implicitly on the raw inputs. To fit such a model we need a data set of $N$ instances of pairs $(x, y)$, i.e. $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$.
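As an illustration of the rolling-window construction, here is a minimal sketch in plain Python that turns one series into (window, next value) pairs. The function name `rolling_windows` and the toy series are assumptions made for the example; the attribute extraction step is omitted, so the raw window stands in for the attributes.

```python
def rolling_windows(series, window):
    """Build (inputs, target) pairs from a single series.

    Each input is `window` consecutive values and the target is the value
    that immediately follows them.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    pairs = []
    for start in range(len(series) - window):
        inputs = series[start:start + window]
        target = series[start + window]
        pairs.append((inputs, target))
    return pairs


# Toy monthly series (made-up amounts) with a window of 3 months.
pairs = rolling_windows([1, 2, 3, 4, 5], window=3)
```

A series shorter than or equal to the window yields no pairs, which is why short histories limit the amount of training data each series can contribute.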

To illustrate the nature and the variability of the series, in Fig. 3 we visualise the values of the raw series grouped by the clusters resulting from a $k$-means algorithm. Prior to the clustering, the series were centered by their mean and normalised by their standard deviation so that the emergent clusters indicate scale-invariant behaviours such as periodicity.

As seen in Fig. 3, the series present clear behaviour patterns. This leads us to believe that casting the problem as a supervised learning one, using a large data set, is adequate and would capture general recurrent patterns.

Still, it is apparent that the nature of these data sets presents some challenges, as perceived from the variability of certain clusters in Fig. 3, and in some of the broad distributions of the output variable. In many individual cases, an expense may result from erratic human actions or spurious events, or may depend on factors not captured by the past values of the series. One of our objectives is to have an uncertainty value of the prediction to detect those series, and to communicate forecasts only for those cases for which we are confident.

3.1.1 Data pre-processing details

We generate monthly time series by aggregating the data over the contract id + the transaction category label given by an automatic internal categorizer, which employs a set of strategies including text mining classifiers. While this process generates tens of millions of time series in production, this study uses a random sample of 2 million time series for the training set, and 1 million time series for the test set. In order to ensure a head-to-head comparison with the existing system, the random sample is taken only from the subset of series which enter the forecasting module (many other series are discarded by different heuristics during the production process). It is worth noting that most of these series have non-zero target values. From these series we construct the raw data $z$ and the targets $y$, from which we compute attributes.

As attributes, we use the values of the series, normalized as follows:

Where is a threshold value.

We also add the mean and standard deviation as attributes, which can be written as: .

The rationale for this choice is: (i) important financial behaviours such as periodicity tend to be scale-invariant, (ii) the mean recovers scale information (which is needed as the forecast is a real monetary value), (iii) the spread of the series could provide information for the uncertainty. We converged on these attributes after a few preliminary experiments.

3.2 Evaluation measures

We evaluate all the methods in terms of an Error vs. Reject characteristic in order to take into account both the accuracy and the ability to reject uncertain samples. Given the point estimates $\hat{y}_i$ and uncertainty scores $v_i$ of the test set, we can discard all cases with an uncertainty score above a threshold $\kappa$, and compute an error metric only over the set of accepted ones. Sweeping the value of $\kappa$ gives rise to an Error vs. Keep trade-off curve: at any operating point, one reads the fraction of points retained and the resulting error on those, thus measuring simultaneously the precision of a method and the quality of its uncertainty estimate.

To evaluate the error of a set of $N$ points, we use the well-known mean absolute error (MAE):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|,$$

which is the mean of absolute deviations between a forecast and its actual observed value. To specify explicitly that this MAE is computed only on the set of samples where the uncertainty score does not exceed $\kappa$, we use the following notation:

$$\mathrm{MAE}(\kappa) = \frac{\sum_{i=1}^{N} |y_i - \hat{y}_i| \, [v_i \le \kappa]}{\sum_{i=1}^{N} [v_i \le \kappa]} \qquad (3)$$

where $[\cdot]$ equals 1 when the condition holds and 0 otherwise. The kept fraction is simply $\mathrm{Keep}(\kappa) = \frac{1}{N} \sum_{i=1}^{N} [v_i \le \kappa]$. Note that $\mathrm{MAE}(\kappa) = \mathrm{MAE}$ when $\kappa$ is the maximum value of the uncertainty scores.
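The Error vs. Keep curve can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names and the toy numbers are assumptions, and a sample is kept when its score is at most the threshold.

```python
def mae(y_true, y_pred):
    """Mean absolute error over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def error_keep_point(y_true, y_pred, scores, threshold):
    """Return (kept fraction, MAE on kept samples) for one threshold.

    The MAE is None when no sample is kept.
    """
    kept = [(t, p) for t, p, v in zip(y_true, y_pred, scores) if v <= threshold]
    keep_fraction = len(kept) / len(y_true)
    if not kept:
        return keep_fraction, None
    error = sum(abs(t - p) for t, p in kept) / len(kept)
    return keep_fraction, error


def error_keep_curve(y_true, y_pred, scores):
    """Sweep the threshold over every distinct uncertainty score."""
    return [error_keep_point(y_true, y_pred, scores, k) for k in sorted(set(scores))]


# Toy test set (made-up numbers): the most uncertain sample has the largest error.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [11.0, 18.0, 35.0, 20.0]
scores = [0.1, 0.2, 0.5, 0.9]
curve = error_keep_curve(y_true, y_pred, scores)
```

When the uncertainty score is informative, as in this toy case, the error grows as more samples are kept; the last point of the curve always coincides with the plain MAE.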

Our goal in this curve is to select a threshold such that the maximum error is controlled, since large prediction errors would affect the user experience of the product. In other words, we prefer to withhold some predictions rather than to predict erroneously.

Hence, with the best model we will observe the behaviour of the MAE error and order the points to predict according to the uncertainty score for that given model.

In addition, we will add visualisations to reinforce the arguments of the advantages of using the models described in this paper that take into account Aleatoric uncertainty to tackle noisy problems.

3.3 Baselines and methods under evaluation

We provide details of the methods we evaluate as well as the baselines we compare against. Every method below outputs an estimate of the target along with an uncertainty score. We indicate in italics how each estimator and uncertainty score is denoted in Table 1. For methods which do not explicitly compute uncertainty scores, we will use the variance of the input series as a proxy.

Trivial baselines. In order to validate that the problem cannot be easily solved with simple forecasting heuristics such as a moving average, we first evaluate three simple predictors: (i) the mean of the series (mean), (ii) a constant prediction of zero (zero), and (iii) the value of the previous month (last). In the three cases, we use the variance of the input as the uncertainty score (var).
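The three trivial predictors and their shared uncertainty score can be sketched as follows. The function names are assumptions made for the example, and the population variance is used here as one possible reading of "the variance of the input".

```python
import statistics


def trivial_forecasts(z):
    """Forecast the next month with the three trivial baselines."""
    return {
        "mean": float(statistics.mean(z)),  # average of the observed window
        "zero": 0.0,                        # always predict no expense
        "last": float(z[-1]),               # repeat the previous month
    }


def variance_score(z):
    """Uncertainty proxy: population variance of the input series (assumption)."""
    return float(statistics.pvariance(z))


# Toy series of monthly amounts (made-up numbers).
z = [100.0, 120.0, 80.0, 100.0]
forecasts = trivial_forecasts(z)
score = variance_score(z)
```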

Random Forest. We also compare against the forecasting method which is currently implemented in the expense forecasting tool. The tool consists of a combination of modules. The forecasting module is based on random forest (RF*). Attributes include the values of the 24 months but also dozens of other carefully hand-designed attributes. The module outputs the forecasted amount and a discrete confidence label (low/medium/high), referred to as prec. For the sake of completeness, we also repeat the experiment using the variance of the input (var) as the confidence score.

Generalized Additive Model. We also compare against a traditional regression method able to output estimates and their distribution; specifically, we use Generalized Additive Models (GAM) [2], with the uncertainty score denoted as SE. A GAM is a generalized linear model with a linear predictor involving a sum of smooth functions of covariates [12], which takes the following form,

$$g(\mu_i) = A_i \theta + f_1(x_{1i}) + f_2(x_{2i}) + f_3(x_{3i}, x_{4i}) + \dots$$

where $\mu_i \equiv \mathbb{E}(Y_i)$, $Y_i$ is the response variable that follows some exponential family distribution, $A_i$ is a row of the model matrix for any strictly parametric model components, $\theta$ is the corresponding parameter vector, the $f_j$ are smooth functions of the covariates $x_k$, and $g$ is a known link function. We fit GAM in a simple way: . The parameters in the model are estimated by penalized iteratively re-weighted least squares (P-IRLS) using Generalized Cross Validation (GCV) to control over-fitting.

Deep Learning Models. Finally, we will compare a number of Deep Learning models with various strategies to obtain the forecasted amount and an uncertainty score.

Dense networks. We start with a plain Dense Network architecture (referred to as Dense). In particular, our final selected Dense model is composed of one layer of 128 neurons followed by another of 64 neurons, each with a ReLU activation, followed by a single output. We use the Mean Absolute Error (MAE) as loss function. Since a Dense regression Network does not provide a principled uncertainty estimate, again we use the variance of the series (var) as the uncertainty score.

Epistemic Models. We also evaluate Bayesian Deep Learning alternatives, which offer a principled estimate of the uncertainty score. These carry out an Epistemic treatment of the uncertainty (see Section 2.2), rather than our Aleatoric treatment. We consider two methods: (i) adding Dropout [10] (with a Dropout probability of 0.5), referred to as DenseDrop, and (ii) a Bayes-By-Backprop variation [4], denoted as DenseBBB. Both techniques model the distribution of the network weights, rather than weight values. This allows us to take samples from the network weights. Therefore, given a new input, we obtain a set of output samples rather than a single output; we take the mean of those samples as the final target estimate and their standard deviation as our uncertainty score. In Table 1, we denote the uncertainty score of Dropout and BBB as drop and BBB, respectively. For the sake of completeness, we also consider the simple case of the variance (var).
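The sampling procedure can be illustrated with a toy model. The sketch below is an assumption-laden stand-in, not the networks used in the experiments: it applies Dropout to the inputs of a fixed linear model, repeats the stochastic forward pass, and summarises the samples by their mean and (population) standard deviation. All names and default values are hypothetical.

```python
import random
import statistics


def stochastic_forward(x, weights, p_drop, rng):
    """One forward pass of a toy linear model with Dropout on its inputs.

    Kept inputs are rescaled by 1 / (1 - p_drop); p_drop must be below 1.
    """
    keep_scale = 1.0 / (1.0 - p_drop)
    out = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() >= p_drop:  # this input survives Dropout
            out += wi * xi * keep_scale
    return out


def predict_with_uncertainty(x, weights, p_drop=0.5, n_samples=100, seed=0):
    """Return (mean, standard deviation) over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = [stochastic_forward(x, weights, p_drop, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pstdev(samples)


# Toy input and weights (made-up numbers).
x = [1.0, 2.0, 3.0]
weights = [0.5, 0.25, 1.0]
estimate, uncertainty = predict_with_uncertainty(x, weights)
```

With `p_drop=0.0` every pass is identical, so the standard deviation collapses to zero: the score reflects only the randomness injected into the model, not the noise in the targets.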

Homoscedastic and Heteroscedastic Models. Here we consider fitting a network under the Aleatoric model (see Section 2.3), both for the Homoscedastic and Heteroscedastic cases. It is important to highlight that, in both cases, we restrict the scale $\sigma$ to positive values by applying a function $g$ to the value of $\sigma$ or to the output of $\psi$, defined as the translation of the ELU function plus 1,

$$g(s) = \mathrm{ELU}(\alpha, s) + 1 = \begin{cases} \alpha (e^{s} - 1) + 1 & \text{if } s < 0 \\ s + 1 & \text{if } s \ge 0 \end{cases}$$

which is strictly positive whenever $0 < \alpha \le 1$.

In addition, in the Heteroscedastic case, we used the same Neural Network architecture for the “variance” approximation $\psi$ as for the “mean” approximation $\phi$.
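A minimal sketch of the "ELU plus 1" positivity constraint in plain Python follows. The function name and the default `alpha=1.0` are assumptions for illustration; the negative branch is written as `alpha * exp(s) + (1 - alpha)`, which is algebraically the same as `alpha * (exp(s) - 1) + 1` but avoids rounding to exactly zero for moderately negative inputs.

```python
import math


def elu_plus_one(s, alpha=1.0):
    """Map any real value to a strictly positive scale (for 0 < alpha <= 1)."""
    if s >= 0:
        return s + 1.0
    # Equivalent to alpha * (exp(s) - 1) + 1, rearranged for numerical safety.
    return alpha * math.exp(s) + (1.0 - alpha)


# Raw (unconstrained) network outputs and the resulting positive scales.
raw_outputs = [-50.0, -5.0, -1.0, 0.0, 2.0]
scales = [elu_plus_one(s) for s in raw_outputs]
```

The function is continuous at zero and grows linearly for positive inputs, so large scales are not squashed the way they would be with, for instance, a Sigmoid.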

In the Homoscedastic case, since the variance is constant for all the samples, we have to resort to the variance of the series as the uncertainty score. In contrast, the Heteroscedastic variance estimate is input-dependent and can thus be used as the uncertainty score in Table 1.

All the Deep Learning models were trained with Early Stopping, using a held-out fraction of the training set as a validation set.

LSTM networks. To conclude, we repeat the experiment by replacing the Dense Networks with long short-term memory (LSTM) networks, which are more suited to sequential tasks. Here we consider both types of Aleatoric uncertainty (i.e. LSTMHom and LSTMHet). In this particular case, the architecture uses two LSTM layers of 128 neurons followed by a 128-neuron Dense layer, and finally a single output.

3.4 Results

Figure 4: Error-keep curve: MAE versus fraction of the samples kept obtained by selected forecasting methods, when cutting at different thresholds of their uncertainty scores.

Models comparison. In Figure 4 we present a visual comparison between the Error-Keep curves (Eq. 3) of RF* (indicating the points of low/medium/high confidence), RF* using var as confidence score, the GAM (which uses its own uncertainty estimate), and the best Dense and LSTM models (which, as will be detailed later, use the Heteroscedastic uncertainty estimates).

First we notice that the Heteroscedastic solutions are better than all the other previous solutions. This confirms our assumption that, for our given noisy problem, taking into account the variability of the output provides better accuracy at given retention rates.

Predictor + uncertainty K=25% K=41% K=50% K=75% K=99.5% K=100%
+
Zeros +
Last +
GAM +
GAM +
RF* + prec N/A N/A N/A
RF* +
Dense +
DenseDrop +
DenseDrop + Drop
DenseBBB +
DenseBBB + BBB
DenseHom +
DenseHet +
DenseHet +
LSTM +
LSTMHom +
LSTMHet +
LSTMHet +
Table 1: Errors (MAE) of each method at points of the Error-Keep curve corresponding to Keep=K. Each row corresponds to the combination of a predictor and an uncertainty score (described in the main text).

For a more detailed analysis, Table 1 shows the MAE values of all the compared methods for several cut-off points of the Error-Keep curve, namely for keep=25%,50%,75%,100%, and additionally for 41% and 99.5% (for direct comparison to the currently implemented RF* method, because these are the cut-off values obtained by the RF* method with confidence=high and confidence=medium). Note that we are more interested in low values of keep (typically below 50%), as they correspond to “selecting” the most confident samples; but we show all the percentages for completeness.

For Deep Learning models, to avoid a known issue of sensitivity with respect to the random seeds, we repeated all experiments 6 times with different random initialisations and report the mean and standard deviation over the 6 runs.

Table 1 shows several interesting aspects. First, we confirm that the best performing model is the LSTM network with the proposed Heteroscedastic uncertainty treatment, except at the points where keep ≥ 99.5%. However, these points are not of practical importance because there is virtually no rejection, and the errors are high for all the methods. Had we not considered LSTM networks as an option, the best performing model would still be a Dense Network with Heteroscedastic uncertainty. We also observed that the performance ranking of the incremental experiments between Dense and LSTM is consistent.

We also note that, for this particular task, the Aleatoric methods (both Homoscedastic and Heteroscedastic) perform better than Epistemic models such as Dropout or BBB. We believe this to be specific to the task at hand, where we have millions of short and noisy time series. We conclude that the variability of human spending, with complex patterns, erratic behaviours and intermittent spending, is captured more accurately by directly modelling an input-dependent variance with a complex function than by considering model uncertainty (which may be better suited for cases where the data is more scarce or the noise smoother).

Another interesting observation is that the Random Forest baseline used more attributes than the proposed solutions based on Deep Learning, which just used values of the historical time series. This confirms that fitting a Deep Network on a large data set is a preferred solution; even more when being able to model the uncertainty of the data, as is the case here.

Last but not least, comparing a given model with its Heteroscedastic version under the same uncertainty score, we also observe an accuracy improvement. This means that taking uncertainty into account in the training process, in our noisy problem, not only gives us an uncertainty score to reject uncertain samples but also helps to predict better.

Figure 5: Correlation between MAE of prediction and the real value and its uncertainty score of random selection of time series of the test set. The colours represent the density of the zones by using Gaussian kernels.

Error-Uncertainty score correlation. An interesting question is whether or not there exists any relationship between the errors of a model and the uncertainty scores it provides. In Figure 5 we show the correlation between the MAE of randomly selected points of the test set and the uncertainty score of the best Heteroscedastic model, expressed in logarithmic scale. To reduce the effect of clutter, we used a colour-map for each point that represents the density of the different parts of the Figure as resulting from a Gaussian Kernel Density Estimate.

This figure exhibits two main regimes. On the one hand, for high uncertainty scores, we observe scattered samples (considering their purple colour) where the errors range from very low to very high values and do not seem to align with the uncertainty scores. Upon inspection, many of these series correspond to what humans would consider “unpredictable”: e.g. an expense in month $T+1$ which was much larger than any of the observed expenses, or series without a clear pattern. On the other hand, we observe a prominent high-concentration area (the yellow-ish one) of low-error and low-uncertainty values, indicating that, by setting the threshold under a specific value of the uncertainty score, the system mostly selects low-error samples.

4 Related Work

In the case of using Deep Learning for classification, it is common to use a Softmax activation function in the last layer [5]. This yields a non-calibrated probability score which can be used heuristically as the confidence score or calibrated to a true probability [13]. However, in the case of regression, the output variables are not class labels and it is not possible to obtain such scores. On the other hand, [18] introduces the idea of combining different kinds of Aleatoric and Epistemic uncertainties. Nevertheless, as we saw in Table 1, the use of Epistemic uncertainty for our problem worsens the accuracy and the general performance of the model.

While there have been several proposals to deal with uncertainties in Deep Learning, they boil down to two main families. On the one hand, some approaches consider the uncertainty of the output. Typically, they construct a network architecture so that its output is not a point estimate, but a distribution [3]. On the other hand, some approaches consider the uncertainty of the model. These apply a Bayesian treatment to the optimisation of the parameters of a network ([10], [4], [14], [21]). In the present work we observe that, for a highly noisy problem, changing the loss function in order to maximise a likelihood function as in [3], together with the changes explained above, provides a significant improvement, which is crucial to identify forecasts with a high degree of confidence and even improves the accuracy.

We are dealing with a problem that contains high levels of noise, so making predictions is risky. For these two reasons, our goal was to find a solution that improves on the mean-variance solution, which can itself be “challenging”. To do so, we combined several ideas proposed in [18] and [3] to create a Deep Learning model that takes the uncertainty into account and provides better performance.

5 Conclusion

We explore a new solution for an industrial problem of forecasting real expenses of customers. Our solution is based on Deep Learning models for effectiveness and solves the challenge of uncertainty estimation by learning both a target output and its variance, and performing maximum likelihood estimation of the resulting model, which contains one network for the target output and another for its variance. We show that this solution obtains better error-reject characteristics than other (traditional and Deep) principled models for regression uncertainty estimation, and outperforms the characteristic that would be obtained by the current industrial system in place. While Epistemic models such as Dropout or BBB did not improve the performance in this specific task, we are already working on combining them with our Aleatoric treatment to consider both types of uncertainty in the same model; this is left as future work. We also highlight that, while the present model seems able to detect confident predictions, it still lacks mechanisms to deal with the “unknown unknowns” problem, and we believe that incorporating ideas such as those in [13] may help in future work.

Acknowledgements

We gratefully acknowledge the Industrial Doctorates Plan of Generalitat de Catalunya for funding part of this research. The UB acknowledges the support of NVIDIA Corporation with the donation of a Titan X Pascal GPU and recognizes that part of the research described in this chapter was partially funded by TIN2015-66951-C2, SGR 1219. We also thank Alberto Rúbio and César de Pablo for insightful comments as well as BBVA Data and Analytics for sponsoring the industrial PhD.

References

  • [1] Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng., X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015), https://www.tensorflow.org/, software available from tensorflow.org
  • [2] Anderson-Cook, C.M.: Generalized additive models: An introduction with r (2007)
  • [3] Bishop, C.M.: Mixture density networks (1994)
  • [4] Blundell, C., Cornebise, J., Kavukcuoglu, K., Wierstra, D.: Weight uncertainty in neural networks. ICML (2015)
  • [5] Bridle, J.S.: Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In: Neurocomputing, pp. 227–236. Springer (1990)
  • [6] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
  • [7] Chollet, F., et al.: Keras. https://keras.io (2015)
  • [8] Ciprian, M., Baldassini, L., Peinado, L., Correas, T., Maestre, R., J.A. Rodriguez-Serrano, Pujol, O., Vitrià, J.: Evaluating uncertainty scores for deep regression networks in financial short time series forecasting. Workshop on Machine Learning for Spatiotemporal Forecasting, in NIPS (2016)
  • [9] Der Kiureghian, A., Ditlevsen, O.: Aleatory or epistemic? does it matter? Structural Safety 31(2), 105–112 (2009)
  • [10] Gal, Y., Ghahramani, Z.: Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In: ICML. pp. 1050–1059 (2016)
  • [11] Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: Continual prediction with lstm (1999)
  • [12] Hastie, T.J., Tibshirani, R.J.: Generalized additive models. London: Chapman & Hall (1990)
  • [13] Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR (2017)
  • [14] Hernández-Lobato, J.M., Adams, R.: Probabilistic backpropagation for scalable learning of bayesian neural networks. In: ICML. pp. 1861–1869 (2015)
  • [15] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
  • [16] Hornik, K., Stinchcombe, M., White, H.: Multilayer feedforward networks are universal approximators. Neural networks 2(5), 359–366 (1989)
  • [17] Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy. International journal of forecasting 22(4), 679–688 (2006)
  • [18] Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: NIPS. pp. 5580–5590 (2017)
  • [19] Mitrović, S., Singh, G.: Predicting branch visits and credit card up-selling using temporal banking data. ECML/PKDD (2016)
  • [20] Mutanen, T., Ahola, J., Nousiainen, S.: Customer churn prediction-a case study in retail banking. In: Proc. of ECML/PKDD Workshop on Practical Data Mining. pp. 13–19 (2006)
  • [21] Rasmussen, C.E.: A practical monte carlo implementation of bayesian learning. In: Advances in Neural Information Processing Systems. pp. 598–604 (1996)
  • [22] Wistuba, M., Duong-Trung, N., Schilling, N., Schmidt-Thieme, L.: Bank card usage prediction exploiting geolocation information. ECML-PKDD 2016 Discovery Challenge on Bank Card Usage Analysis (2016)