1 Introduction
The ability to quantify the uncertainty of a model is one of the fundamental requirements in trusted, safe, and actionable AI (Arnold et al., 2019; Jiang et al., 2018; Begoli et al., 2019).
This paper focuses on uncertainty quantification in regression tasks, particularly in the context of deep neural networks (DNN). We define a sequential task as one involving an ordered series of input elements, represented by features, and an ordered series of outputs. In sequential regression tasks (SRT), the output elements are (possibly multivariate) real-valued variables. SRT occur in numerous applications, among others in weather modeling, environmental modeling, energy optimization, and medical and biotechnology applications. When the cost of making an incorrect prediction is particularly high, such as in human safety, medical decisions, or high-investment resource exploration, models without reliable uncertainty estimation are perceived as high risk and may not be adopted.

Uncertainty prediction in DNNs has been the subject of active research, spurred in particular by what has become known as the "Overconfidence Problem" of DNNs (Guo et al., 2017) and by their susceptibility to adversarial attacks (Madry et al., 2017). However, the bulk of this work is concerned with non-sequential classification tasks (see Section 2), leaving a noticeable gap for SRT. Furthermore, unlike with classification, the uncertainty evaluation methodology for SRT remains rather undeveloped.
In this paper we introduce a meta-modeling concept as an approach to achieving high-quality uncertainty quantification in DNNs for SRT. We demonstrate that it not only outperforms competitive baselines but also provides consistent results across a variety of drift scenarios. We believe the approach represents a powerful new addition to the modeling toolbox in general.
The novel contributions of this paper are summarized as follows:

- Application of the meta-modeling concept to SRT
- Development of a joint base-meta model and its comparison to white- and black-box alternatives
- Generation of asymmetric uncertainty bounds in DNNs
- A new evaluation methodology for SRT
</gr-replace>
2 Related Work
Classical statistics on time series offers an abundance of work dealing with uncertainty quantification (Papoulis and Saunders, 1989). Most notably in econometrics, a variety of heteroskedastic variance models have led to highly successful applications in financial market volatility analyses (Engle, 1982; Bollerslev, 1986; Mills, 1991). The Autoregressive Conditional Heteroskedastic, or ARCH, model (Engle, 1982) and its generalized version, GARCH (Bollerslev, 1986), are two such methods. They assume an autoregressive moving-average, or ARMA, model to underlie the variance series. Besides its stationarity assumptions, long-horizon forecasting using (G)ARCH is difficult (Starica et al., 2005; Chen and Politis, 2019), and the model is typically used in a one-step-ahead setting. We use GARCH as one of our baselines.

Bayesian modeling is a probabilistic approach that naturally integrates the notion of uncertainty. An illuminating study in image recognition (Kendall and Gal, 2017) describes an integration of two sources of uncertainty, namely the epistemic (due to the model) and the aleatoric (due to the data). The authors propose an approach utilizing a variational approximation of Bayesian Neural Networks and an implicit Gaussian model to quantify both types of variability in non-sequential classification and regression tasks. The study, however, does not offer a comparison of different methods generating uncertainty. Based on (Nix and Weigend, 1994), (Lakshminarayanan et al., 2017) also uses a Gaussian variance integrated in a negative log-likelihood training objective with the aim of improving the predictive performance of the base model, again in a non-sequential setting. Similar to (Kendall and Gal, 2017), the study does not focus on comparing the quality of the uncertainty to other methods. We use the implicit variance model of (Kendall and Gal, 2017; Oh et al., 2020; Lakshminarayanan et al., 2017), as well as the variational dropout method of (Gal and Ghahramani, 2016; Kendall and Gal, 2017), as baselines in our work. A meta-modeling approach was taken in (Chen et al., 2019), aiming at the task of instance filtering using white-box models. That work relates to ours through the meta-modeling concept but concentrates on classification in a non-sequential setting.
Besides its application in filtering, meta-modeling has been widely applied in the tasks of learning-to-learn and lifelong learning (Schmidhuber, 1987; Finn et al., 2019). Uncertainty under data drift conditions was assessed in a recent study (Snoek et al., 2019). The authors employ calibration-based metrics to examine various methods for uncertainty in classification tasks (image and text data) and conclude, among other things, that most methods' quality degrades with drift. Acknowledging drift as an important experimental aspect, our study takes it into account by defining matched and drifted scenarios. (Shen et al., 2018) described a multi-objective training of a DNN in wind power prediction, minimizing two types of cost related to coverage and bandwidth. We expand on these metrics in Section 3.3.1.
3 Method
3.1 Meta Modeling Approach
The basic concept of Meta Modeling (MM), depicted in Figure 1, involves a combination of two models: a base model, performing the main task (e.g., regression), and a meta model, learning to predict the base model's error behavior. Depending on the amount of information shared between the two, we distinguish several settings, namely (1) the base model is a black-box (BB), (2) the base is a white-box (WB; base parameters are accessible), and (3) the base and meta components are merged and trained jointly (JM). The advantages of WB and JM are obvious: rich information is available for the meta model to capture salient patterns and generate accurate predictions. On the other hand, the BB setting may offer utility in cases where the base model internals are not accessible.
While the MM concept applies to all tasks, including classification, detection, ranking, and regression, this work focuses exclusively on regression.
We now formalize the MM concept for sequential regression. Let $\hat{Y} = f_{\theta}(X)$ be the base model function parametrized by $\theta$, where $X$ and $Y$ represent sequences of input feature vectors and $D$-dimensional output vectors, with $\hat{Y}, Y \in \mathbb{R}^{T \times D}$. Let $g_{\phi}$ denote the meta model, parameterized by $\phi$, taking as input the predictions, the original features, and the parameters of the base to produce a sequence of error predictions, $\hat{R} = g_{\phi}(\hat{Y}, X, \theta)$. The parameters $\theta$ are obtained by solving an optimization problem, $\hat{\theta} = \arg\min_{\theta} \mathbb{E}\left[L_b(Y, f_{\theta}(X))\right]$, using a smooth loss function $L_b$, e.g., the Frobenius norm $\|Y - \hat{Y}\|_F$. Similarly, the parameters $\phi$ are determined via $\hat{\phi} = \arg\min_{\phi} \mathbb{E}\left[L_m(R, \hat{R})\right]$, involving a loss $L_r$ quantifying the target error (residual) $R$ from the base model, and $L_m$ quantifying the prediction error of the meta model. In general, $L_b$, $L_r$, and $L_m$ may differ. The expectations are estimated using an available dataset; our specific loss choices are described in Section 4.2. Given the loss functions above and the DNN setting, the base and the meta model can be integrated in a single network (JM). In this case the parameters are estimated jointly via

$\hat{\theta}, \hat{\phi} = \arg\min_{\theta, \phi} \; \mathbb{E}\left[ L_b(Y, \hat{Y}) + \lambda\, L_m(R, \hat{R}) \right]$   (1)
whereby dedicated output nodes of the network generate $\hat{Y}$ and $\hat{R}$, and $\lambda$ is a hyperparameter trading off the base with the meta loss. Thus, one part of the network tackles the base task, minimizing the base residual, while another models the residual as the eventual measure of uncertainty. As done in (Kendall and Gal, 2017), one can argue that the base objective minimizes the epistemic (parametric) uncertainty, while the meta objective captures the aleatoric uncertainty present in the data. Due to their interaction, the base loss is influenced by the estimated uncertainty, encouraging it to focus on feature-space regions with lower aleatoric uncertainty. Moreover, the base model is encouraged to encode the input in ways suitable for uncertainty quantification.
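A minimal NumPy sketch of the joint objective in Eq. (1); the function name, the squared-error losses, and the absolute-residual target are our illustration, not necessarily the paper's exact configuration:

```python
import numpy as np

def joint_loss(y, y_hat, r_hat, lam=0.5):
    """Sketch of the joint base-meta objective of Eq. (1).

    y, y_hat : (T, D) ground truth and base predictions
    r_hat    : (T, D) non-negative meta-model uncertainty predictions
    lam      : hypothetical trade-off weight (lambda in Eq. (1))
    """
    residual = np.abs(y - y_hat)                  # meta-model target |Y - Y_hat|
    base_loss = np.mean((y - y_hat) ** 2)         # L_b: squared error on the base task
    meta_loss = np.mean((residual - r_hat) ** 2)  # L_m: squared error on the residual
    return base_loss + lam * meta_loss
```

In a DNN framework the same expression would serve as the training objective, with `y_hat` and `r_hat` produced by the dedicated base and meta output nodes.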
Figure 2 shows an overview of the sequential DNN architecture applied throughout our study. It includes a base encoder-decoder pair and a meta decoder connected to them. Each of these contains a recurrent memory cell, the LSTM (Hochreiter and Schmidhuber, 1997). The role of the encoder is to process the sequential input, $X$, compress its information in a context vector, and pass it to the base decoder. The recurrent decoder produces the regression output in $T$ time steps. Evolving in time, both base LSTMs update their internal states (see Fig. 2), with the last encoder state serving as the starting context in the decoder. The decoder consumes its own prediction from the previous time step as input to generate the next output. This architecture has gained wide popularity in applications such as speech-to-text (Chiu et al., 2018; Tüske et al., 2019), text-to-speech (Sotelo et al., 2017), machine translation (Sutskever et al., 2014), and image captioning (Rennie et al., 2016). Following the MM concept, we attach an additional decoder (the meta decoder) via connections to the encoder and decoder outputs. The context vector is transformed by a fully connected layer (FCN in Figure 2), and both the decoder output as well as its internal state are fed into the meta component. As mentioned above, the meta decoder generates the uncertainty estimates, $\hat{R}$.

Given the architecture depicted in Figure 2, we summarize the three settings as follows:

- Joint Model (JM): all parameters are trained according to Eq. (1) with a chosen value of $\lambda$.

- White-Box model (WB): the base parameters $\theta$ are trained first, with $\lambda = 0$. Then, with $\theta$ fixed (non-trainable), the meta parameters $\phi$ are estimated.

- Black-Box model (BB): the meta decoder part of the model is removed, training $\theta$ only. Then, a separate encoder-decoder model is trained based solely on the input $X$ to predict the residual $R$.
3.1.1 Generating Symmetric and Asymmetric Bounds
The choice of the meta loss function, $L_m$, gives rise to two scenarios. If $L_m$ is an even function of the residual, e.g., the squared error, the predictions reflect symmetric uncertainty. If, on the other hand, $L_m$ takes the sign of the residual into account, it is possible to dedicate separate network nodes to produce lower and upper band estimates (the term "band" refers to the bound's positive deviate from the base prediction), $\hat{B}^{l}$ and $\hat{B}^{u}$, respectively, thus accomplishing asymmetric prediction. Let $R = Y - \hat{Y}$. For the asymmetric scenario the meta objective is modified as follows:

$L_m^{a} = \mathbb{E}\left[ \mathbb{1}(R \geq 0) \odot L_m(R, \hat{B}^{u}) + \mathbb{1}(R < 0) \odot L_m(-R, \hat{B}^{l}) \right]$   (2)
Overall, for a $D$-dimensional regression output at time $t$, the meta decoder will have $D$ uncertainty output nodes in the symmetric case and $2D$ outputs in the asymmetric case.
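A sketch of a sign-aware meta loss in the spirit of Eq. (2); the squared-error form and the variable names are our assumptions:

```python
import numpy as np

def asymmetric_meta_loss(y, y_hat, band_lo, band_up):
    """Sketch of a sign-aware meta objective in the spirit of Eq. (2).

    band_lo / band_up are the non-negative lower/upper band predictions
    produced by dedicated output nodes (names are illustrative).
    """
    r = y - y_hat
    up = (r >= 0)  # observation above the prediction: upper band is responsible
    loss_up = np.where(up, (r - band_up) ** 2, 0.0)    # supervise upper band on +residuals
    loss_lo = np.where(~up, (-r - band_lo) ** 2, 0.0)  # supervise lower band on -residuals
    return np.mean(loss_up + loss_lo)
```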
3.2 Baselines
3.2.1 Implicit Heteroskedastic Variance
(Lakshminarayanan et al., 2017; Kendall and Gal, 2017; Oh et al., 2020) applied a Gaussian model $\mathcal{N}(\mu, \sigma^2)$ to the output of a neural network predictor, where $\mu$ represents the prediction and $\sigma^2$ its uncertainty due to observational (aleatoric) noise. The model is trained to minimize the negative log-likelihood (NLL), with the variance being an implicit uncertainty parameter (in that it is trained indirectly) which is allowed to vary across the feature space (heteroskedasticity). We apply the Gaussian in the sequential setting by planting it onto the base decoder's output (replacing the meta decoder) and train the network using the NLL objective:

$L_{NLL} = \frac{1}{TD} \sum_{t=1}^{T} \sum_{d=1}^{D} \left( \frac{\log \sigma_{td}^2}{2} + \frac{(y_{td} - \mu_{td})^2}{2\sigma_{td}^2} \right)$   (3)

with $D$ output nodes modeling the regression variable, $\mu_{td}$, and separate $D$ output nodes modeling the variance, $\sigma_{td}^2$, at time $t$.
This approach resembles the meta-modeling concept somewhat but deviates in that the variance parameter is not supervised using the residual target; rather, it emerges as a byproduct of the Gaussian model.
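A minimal sketch of the Gaussian NLL objective of Eq. (3), with constant terms dropped; predicting the log-variance rather than the variance is our assumption, made for numerical stability:

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Per-element Gaussian negative log-likelihood (constant terms dropped).

    y       : observations
    mu      : network outputs for the regression variable
    log_var : network outputs for the log-variance (assumed parameterization)
    """
    var = np.exp(log_var)
    return np.mean(0.5 * log_var + 0.5 * (y - mu) ** 2 / var)
```

Training against this loss lets the variance grow where the squared error is irreducible, which is how the implicit heteroskedastic uncertainty arises.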
3.2.2 Variational Dropout
(Gal and Ghahramani, 2016) established a connection between dropout (Srivastava et al., 2014), i.e., the process of randomly omitting network connections, and approximate Bayesian inference. Their work suggests applying the dropout principle to the inputs, internal states, as well as outputs of the recurrent cells. However, unlike with traditional dropout, the method is applied both at training and at test time, and for each RNN sequence the dropout pattern is kept fixed.

We apply the variational dropout method to the base encoder and decoder. By performing multiple runs per test sequence, each with a different random dropout pattern, the base predictions are calculated as the mean and the base uncertainty as the variance over such runs.
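The Monte-Carlo estimation above can be sketched as follows; the model callable and its interface are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mc_dropout_predict(stochastic_model, x, n_runs=10, seed=0):
    """Sketch of variational-dropout inference.

    `stochastic_model` is assumed to be a callable applying a fresh random
    dropout pattern on each call (name and interface are illustrative).
    Returns the mean prediction and the per-element variance across runs.
    """
    rng = np.random.default_rng(seed)
    runs = np.stack([stochastic_model(x, rng) for _ in range(n_runs)])
    return runs.mean(axis=0), runs.var(axis=0)  # prediction, epistemic uncertainty
```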
The Bayesian method, along with the variational approximation, captures the parametric (epistemic) uncertainty of the model, hence it fundamentally differs from the Gaussian model as well as our proposed approach.
3.2.3 GARCH Variance
Introduced in (Bollerslev, 1986; Engle, 1982), the Generalized Autoregressive Conditional Heteroskedastic (GARCH) variance model belongs among the most popular statistical methods. A GARCH(p,q) assumes the series to follow an autoregressive moving average model and estimates the variance at time as a linear combination of past residual terms, , and previous variances, :
(4) 
The term represents a constant component of the variance. The parameters, are estimated via maximumlikelihood on a training set. The GARCH process relates to the concept in Figure 1 in that it acts as the metamodel predicting the squared residual. At training time, we take the residual using predictions from the base model, while at test time, the residual is approximated using an autoregressive component. We use the GARCH as a baseline only on one of the datasets for reasons discussed in Section 4.2.2.
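For illustration, a GARCH(1,1) instance of the recursion in Eq. (4); the initialization scheme is our assumption:

```python
import numpy as np

def garch_variance(resid, omega, alpha, beta, sigma2_0):
    """One-step-ahead GARCH(1,1) variance recursion of Eq. (4).

    resid    : sequence of residuals epsilon_t from the base predictor
    omega    : constant variance component
    alpha    : weight on the previous squared residual
    beta     : weight on the previous variance
    sigma2_0 : initial variance (initialization is our assumption)
    """
    sigma2 = [sigma2_0]
    for eps in resid:
        sigma2.append(omega + alpha * eps ** 2 + beta * sigma2[-1])
    return np.array(sigma2[1:])
```

In practice the three parameters would be fitted by maximum likelihood rather than set by hand.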
3.2.4 Naive Baselines
A consistent comparison of uncertainty methods is difficult because each method generates an uncertainty around different base predictions. Therefore, as a reference we also generate a constant symmetric band around each base predictor. A constant bound represents a homoskedastic process, a sensible choice in many well-behaved sequential regression problems, corresponding to a GARCH(0,0) model. We will use this reference point to compute a relative gain for each method, as explained in Section 3.4.
3.3 Evaluation Methodology
3.3.1 Core Metrics
We propose several metrics to assess the quality of the uncertainty predictions. Unlike with classification tasks, where standard calibration-based metrics apply (Snoek et al., 2019), we need to consider two aspects arising in regression, roughly speaking: (1) what is the extent of observations falling outside the uncertainty bounds (Type 1 cost), and (2) how excessive in width are the bounds (Type 2 cost)? An optimal bound captures all of the observations while being least excessive in terms of its bandwidth. (Shen et al., 2018) defined two measures reflecting these aspects (miss rate and bandwidth), and we expand on these. Let $\hat{B}^{l}_{td}$ and $\hat{B}^{u}_{td}$ denote the predicted lower and upper band, respectively, so that the bound at time $t$ in dimension $d$ is $[\hat{y}_{td} - \hat{B}^{l}_{td},\; \hat{y}_{td} + \hat{B}^{u}_{td}]$. We define the following metrics:

$\mathrm{MissRate} = \frac{1}{TD} \sum_{t=1}^{T} \sum_{d=1}^{D} \mathbb{1}\left[ y_{td} < \hat{y}_{td} - \hat{B}^{l}_{td} \;\vee\; y_{td} > \hat{y}_{td} + \hat{B}^{u}_{td} \right]$   (5)

$\mathrm{Bandwidth} = \frac{1}{TD} \sum_{t=1}^{T} \sum_{d=1}^{D} \left( \hat{B}^{l}_{td} + \hat{B}^{u}_{td} \right)$   (6)

$\mathrm{Excess} = \frac{1}{TD} \sum_{t=1}^{T} \sum_{d=1}^{D} \max\left(0,\; \min\left( y_{td} - \hat{y}_{td} + \hat{B}^{l}_{td},\; \hat{y}_{td} + \hat{B}^{u}_{td} - y_{td} \right)\right)$   (7)

$\mathrm{Deficit} = \frac{1}{TD} \sum_{t=1}^{T} \sum_{d=1}^{D} \max\left(0,\; \max\left( \hat{y}_{td} - \hat{B}^{l}_{td} - y_{td},\; y_{td} - \hat{y}_{td} - \hat{B}^{u}_{td} \right)\right)$   (8)
Figure 3 illustrates these metrics. The relative proportion of observations lying outside the bounds (miss rate) ignores the extent of the bound's shortfall; the Deficit, Eq. (8), captures this. The Type 2 cost is captured by the Bandwidth, Eq. (6). However, its range is indirectly influenced by the underlying variation in $Y$ and $\hat{Y}$. Therefore we propose the Excess measure, Eq. (7), which also reflects the Type 2 cost, but only the portion above the minimum bandwidth necessary to include the observation.

Note that the above definitions involve uniform averaging over the output dimensions. In some cases it may be desirable to use weighted averaging in order to equalize the impact of individual dimensions.
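A NumPy sketch of the four metrics; the exact forms of Excess and Deficit are our reading of the verbal definitions above (distance to the nearest bound when covered, distance past the violated bound when missed):

```python
import numpy as np

def uncertainty_metrics(y, y_hat, band_lo, band_up):
    """Sketch of the four core metrics for a (T, D) regression output.

    The bounds are y_hat - band_lo and y_hat + band_up, with non-negative bands.
    """
    lower, upper = y_hat - band_lo, y_hat + band_up
    missed = (y < lower) | (y > upper)
    miss_rate = np.mean(missed)                                           # Eq. (5)
    bandwidth = np.mean(band_lo + band_up)                                # Eq. (6)
    excess = np.mean(np.maximum(0.0, np.minimum(y - lower, upper - y)))   # Eq. (7)
    deficit = np.mean(np.maximum(0.0, np.maximum(lower - y, y - upper)))  # Eq. (8)
    return miss_rate, bandwidth, excess, deficit
```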
3.3.2 Calibration
In general, DNNs offer few guarantees about the behavior of their output. It is known that DNNs tend to produce miscalibrated classification probabilities (Guo et al., 2017). In order to evaluate the regression uncertainty across models, it is necessary to establish a common operating point (OP). We achieve this via a scaling calibration. For symmetric bounds, we assume that $Y - \hat{Y} = E \odot \hat{R}$, where $E$ is a random i.i.d. matrix and $\hat{R}$ is the non-negative uncertainty prediction. Using a held-out dataset, we can obtain an empirical distribution of $E$ in each output dimension. It is then possible to find the scale value $s$ for a desired quantile $q$, e.g., $q = 0.95$, and construct the prediction bound at test time as $\hat{Y} \pm s \hat{R}$. Assuming $E$ is stationary, in expectation this bound will contain the desired proportion of the observations. This scaling is applied in our evaluation to compare the excess, bandwidth, and deficit at fixed miss rates, as well as to set a minimum-cost OP (see Section 3.4). An algorithm to find a scale factor for a desired value of any of the four metrics is given in the Appendix.
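The scaling calibration can be sketched as follows; the function name and the guard against zero-valued bands are our additions:

```python
import numpy as np

def calibration_scale(y, y_hat, r_hat, quantile=0.95):
    """Sketch of the scaling calibration for symmetric bounds.

    Computes an empirical quantile of the normalized residual
    |Y - Y_hat| / R_hat on a held-out set; the returned scale s yields
    bounds y_hat +/- s * r_hat covering roughly `quantile` of observations.
    """
    e = np.abs(y - y_hat) / np.maximum(r_hat, 1e-12)  # guard against zero bands
    return np.quantile(e, quantile)
```

At test time one would then report metrics for the scaled bounds `y_hat - s * r_hat` and `y_hat + s * r_hat`.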
3.4 Metrics Used in Reporting
Recall that an OP refers to a value of a desired metric measured at a fixed value of a second metric. The following OP-based measures are used in reporting: (1) Excess, Deficit, and Bandwidth at a fixed Miss rate, averaged over a range of Miss-rate values, and (2) the minimum Excess-Deficit cost, taken over all calibration scales.

As alluded to in Section 3.2.4, for each system and measure $m$, the symmetric constant-band baseline value $m_0$ is also created and the relative percent gain with respect to this reference calculated: $100 \cdot (m_0 - m)/m_0$. As the main measure for symmetric results we choose the average of Excess and Deficit over the above-listed Miss-rate range, and the minimum cost. When reporting asymmetric results we also give the Bandwidth average over the same Miss-rate range.

Finally, the error rate of the base predictor is calculated as $\mathrm{Err} = \frac{1}{T}\sum_{t=1}^{T} \|y_t - \hat{y}_t\| / \|y_t\|$, where $y_t$ and $\hat{y}_t$ are the $t$-th row vectors of $Y$ and $\hat{Y}$.
4 Experiments
4.1 Datasets
Two sequential regression datasets, namely the Metro Interstate Traffic Volume (MITV) dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/00492/) and the SPE9 Reservoir Production Rates (SPE9PR) dataset (https://developer.ibm.com/exchanges/data/all/oil-reservoir-simulations/), were experimented with. Both originate from real-world applications, involve sequential input/output variables, and provide scenarios with varying degrees of difficulty.
Table 1: Overall dataset statistics (partition sizes in samples).

Dataset | Total Samples | Input Features | Output Dim | Time Resol. | TRAIN | DEV  | DEV2 | TEST | TEST (drift)
MITV    | 48204         | 8              | 1          | 1 hr        | 33744 | 4820 | 4820 | 4820 | n/a
SPE9PR  | 28000         | 248            | 4          | 90 days     | 24000 | 1000 | 1000 | 1000 | 1000
4.1.1 Metro Interstate Traffic Volume (MITV)
The dataset is a collection of hourly westbound traffic volume measurements on Interstate 94, reported by the Minnesota DoT ATR station 301, between the years 2012 and 2018. These measurements are aligned with hourly weather features (provided by OpenWeatherMap) as well as holiday information, also part of the dataset. The target of regression is the hourly traffic volume. This dataset was released in May 2019.
The MITV input features were preprocessed to convert all categorical features to trainable vector embeddings, as outlined in Figure 2. All realvalued features as well as the regression output were standardized before modeling (with the test predictions restored to their original range before calculating final metrics). Overall dataset statistics are listed in Table 1, further processing steps are given in Section 4.2, and complete preprocessing details listed in the Appendix.
4.1.2 SPE9 Reservoir Production Rates (SPE9PR)
This dataset originates from an application of oil reservoir modeling. A reservoir model (RM) is a space-discretized approximation of a geological region subject to modeling. Given a sequence of drilling actions (input), a physics-based PDE solver (simulator) is applied to the RM to generate sequences of future production rates (oil, gas, water production), typically over long horizons (Killough, 1995). The objective is to train a DNN to accurately predict outputs on unseen input sequences. We used the publicly available SPE9 RM (https://github.com/OPM/opm-data/blob/master/spe9/SPE9.DATA), considered a reference for benchmarking reservoir simulation in the industry, and an open-source simulator (https://opm-project.org/) to produce 28,000 simulations, each with 100 randomized actions (varying type, location, and control parameters of a well) inducing production rate sequences over a span of 25 years, in 90-day increments, i.e., 100 time steps. Furthermore, the RM was partitioned into two regions, A and B. While most of the actions are located in region A, we also generated 1000 sequences with actions located in region B, thus creating a large degree of mismatch between training and test. The test condition in region B will be referred to as the "drift" scenario.

Basic preprocessing, including standardization and vector embedding of categorical variables, was performed, resulting in 268-dimensional input feature vectors. The output variable is a 4-dimensional vector capturing 3 production rates (oil, gas, water) and 1 injection rate (water). Additional detail is given in the Appendix. We are releasing this dataset to the ML community.
4.2 Training Procedure
Each dataset was partitioned into TRAIN, DEV, DEV2, and TEST sets (with SPE9PR also providing a TEST-drift set), as listed in Table 1. While the TRAIN/DEV partitions served basic training, the DEV2 was used in determining certain hyperparameters and metric operating points (calibration). The TEST sets were used to produce the reported metrics.

While a single sample in the SPE9PR represents a sequence of 100 steps, the MITV data come as a single contiguous sequence. After partitioning, each MITV partition was processed by a sliding window of length 36 hours (in 1-hour steps), resulting in a series of overlapping subsequences (their count determined by the partition size) to feed the encoder-decoder model. When testing on the MITV, DNN predictions from such sliding windows were recombined into a contiguous prediction sequence by skipping overlapping blocks of windows.
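The MITV windowing step can be sketched as follows; the window width and step come from the text, while the function name is illustrative:

```python
import numpy as np

def sliding_windows(series, width=36, step=1):
    """Cut a contiguous series into overlapping subsequences of `width` steps
    (36 hours in 1-hour steps for MITV), ready to feed an encoder-decoder."""
    return np.stack([series[i:i + width]
                     for i in range(0, len(series) - width + 1, step)])
```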
4.2.1 Common Steps
The base encoderdecoder network (see Figure 2) is trained using the Adam optimizer (Kingma and Ba, 2014) with a varying initial learning rate, lr, in two stages:

1. Train all parameters using the TRAIN set with lr=0.001; early stopping is determined on the DEV set. The ground-truth output from the previous time step is used for all decoder inputs (see Figure 2).

2. Build on the model from the previous step, using TRAIN with lr=0.0002, but switch to using the previous decoder output as input in the current step, referred to as emulation (Bengio et al., 2015). Only for the MITV setup: continue using the ground truth for the first 12 time steps before switching to emulation mode for the remaining 24 steps, thus treating the first 12 hours as an observed period for every 24-hour prediction period.
All other hyperparameter values are listed in detail in the Appendix. Their values originate from a separate study related to the SPE9 reservoir modeling (Navrátil et al., 2019).
4.2.2 Individual Systems
Joint Model, Symmetric (JMS), and Asymmetric (JMA):
The common training steps are performed using the objective in Eq. (1) first. Then, the joint training continues for as long as the objective improves on DEV. In a final step, the model switches to using DEV as the training set until no improvement on TRAIN is seen. This step helps the meta model see patterns from the base model outside its training set. A similar procedure is followed for the JMA, except using Eq. (2).
WhiteBox Model, Symmetric (WBMS):
The basic training is performed with $\lambda = 0$. Next, only the meta model parameters are estimated using the DEV/TRAIN sets.
BlackBox Model, Symmetric (BBMS):
Base training is performed. The base model then processes the DEV set to generate the residual targets. A separate encoder-decoder model is trained on the input features to predict these residuals.
Joint Model with Variance (JMV):
The common training steps are performed with the meta decoder replaced by the Gaussian variance output trained under the NLL objective in Eq. (3) (see Section 3.2.1).
Dropout Model Symmetric (DOMS):
Dropout with a rate of 0.25 (inputs, outputs) and 0.1 (internal states), determined as best performing on the DEV2 set, was applied in both the base encoder and the decoder. The two common steps were performed. At test time, the model was run 10 times per each test sequence to obtain the mean prediction and standard deviation.
Garch:
The GARCH model from Section 3.2.3 is used, with the lag values determined using an autocorrelation chart showing where the autocorrelation attenuates. We only apply this baseline to the MITV dataset, as it provides a contiguous time series. The model parameters were trained using the DEV2 partition.
Throughout the experiments, the size of each LSTM cell was kept fixed at 32 for the base encoder/decoder, and at 16 for the meta decoder. The base sizing was largely driven by our preliminary study showing that it suffices to provide accurate base predictions.
4.3 Testing Procedure
With the MITV dataset, we want to simulate a scenario in which, given the past 12 hours' worth of traffic observations, the model is to make a forecast for the next 24 hours. This is achieved by spanning a 36-hour window and giving the decoder access to the first 12 hours of ground truth (see Step 2 in Section 4.2.1). Note that when recombining the windowed predictions, only the net-forecast 24-hour segments are taken. Furthermore, in order to gauge the model behavior in a mismatched scenario, we also test the MITV models without providing those first 12 hours of observation. Loosely speaking, this emulates a "drift" condition, as the model was trained to rely on the ground-truth information but now receives only estimates of the same.
No special postprocessing is necessary on the SPE9PR predictions; the metrics are averaged over all simulations, time steps, and output dimensions.
4.4 Results with Symmetric Bounds
Table 2 compares the proposed symmetric-bounds systems (JMS, WBMS, BBMS) with the baselines (JMV, DOMS, GARCH). The relative error of the base predictor is given for each condition (Err). The uncertainty quality reported in Table 2 is the average gain in excess-deficit metrics, as defined in Section 3.4. Columns labeled "opt" contain measurements made at an operating point (OP) determined on the test set itself, while those labeled "xval" use an OP from a held-out (DEV2) set. While "xval" reflects generalization of the calibration, "opt" values are interesting as they reveal the potential of each method. Based on a paired permutation test (Dwass, 1957), the differences are mutually significant except for a small number of entries.
From Table 2 we make the following observations: (1) The JMS model dominates all other models across all conditions. The fact that it outperforms the WBMS indicates there is a benefit to the joint training setup, as conjectured earlier. (2) The WBMS dramatically outperforms the BBMS model, which remains only as good as a constant band for the MITV data. It appears hard to reliably predict residuals from the input features alone. (3) The most competitive baseline is the JMV model. As discussed in Section 3.2.1, the JMV shares some similarity with the meta-modeling approach. (4) The JMS and WBMS models perform particularly well in the strong drift scenario (SPE9PR), suggesting that white-box features play an essential role in achieving generalization. Finally, (5) the DOMS model works moderately well on the MITV data but provides no benefit in the SPE9PR, which could be due to the aleatoric uncertainty playing a dominant role in this dataset. In almost all cases, however, the averaging of base predictions in DOMS results in the lowest error rates of the base predictor.
Representative samples of JMS and JMV uncertainty bounds are shown in Figure 4 (MITV) and Figure 5 (SPE9PR). They illustrate a clear trend in the results, namely that the JMS model (similar behavior is seen with WBMS) is better able to cover the actual observation, particularly when the base prediction makes large errors.
Table 2: Results with symmetric bounds: base error (Err) and excess-deficit % gains at the "opt" and "xval" operating points.

       |      MITV match      |      MITV drift      |     SPE9PR match     |     SPE9PR drift
System | Err    opt    xval   | Err    opt    xval   | Err    opt    xval   | Err    opt    xval
JMS    | .155   57.8   59.3   | .187   52.5   55.6   | .165   46.5   46.6   | .291   56.6   50.5
WBMS   | .159   53.8   55.3   | .190   39.4   42.3   | .162   44.8   44.9   | .313   54.5   43.8
JMV    | .167   20.4   25.5   | .179   20.0   17.5   | .159   45.3   45.5   | .334    6.5    4.4
DOMS   | .144   13.2   12.8   | .155   16.6   17.3   | .177    1.3    1.4   | .279    0.2    5.2
BBMS   | .153    0.4    1.2   | .188    8.9    3.0   | .170   31.9   30.3   | .326   11.6   12.8
GARCH  | .155    1.8    3.1   | .187   14.7    4.0   | n/a    n/a    n/a    | n/a    n/a    n/a
Table 3: Comparison of the asymmetric (JMA) and best symmetric (JMS) models in terms of Bandwidth and Excess-Deficit % gains.

Evaluation             | MITV match | MITV drift | SPE9PR match | SPE9PR drift
JMA Base Error         | .158       | .200       | .168         | .320

Bandwidth
JMA, opt. % gain       | 44.1       | 37.0       | 35.5         | 54.3
JMA, xval % gain       | 33.1       | 26.4       | 35.5         |  8.5
JMS, opt. % gain       | 37.5       | 31.6       | 29.1         | 34.3
JMS, xval % gain       | 35.2       | 27.2       | 28.7         |  4.4

Excess-Deficit
JMA, opt. % gain       | 45.3       | 29.0       | 30.9         | 50.8
JMA, xval % gain       | 40.4       | 28.6       | 30.8         | 24.2
JMS, opt. % gain       | 57.8       | 52.5       | 46.5         | 56.6
JMS, xval % gain       | 59.3       | 55.6       | 46.6         | 50.5
4.5 Results with Asymmetric Bounds
Generating asymmetric bounds is an intriguing new aspect of DNN-based meta models. Using the JMA model, we first recorded the accuracy with which the asymmetric output agrees in sign (orientation) with the observed base discrepancy. Averaged over the two datasets, this accuracy is 83.3%, and 91.1% when filtering out samples with small deviations from the ground truth (noise).

The promise of asymmetric bounds lies in their potential to reduce the bandwidth cost. Since the Excess and Deficit metrics ignore the absolute bandwidth, we also evaluate the JMA model using the Bandwidth metric (Eq. (6)), averaged over the same operating points, as shown in Table 3. The table compares the JMA model to the best symmetric model, JMS. The JMS model outperforms JMA in all scenarios on Excess-Deficit; however, compared on the Bandwidth metric, the JMA dominates its JMS counterpart, benefiting from its orientation capability. The JMA model may be at a disadvantage in cases where it produces an incorrect orientation, causing excessively large bounds in the incorrect direction due to scaling. A calibration specific to the lower and upper bounds might mitigate this problem. Upon visual inspection, the output of the JMA is appreciably better in bandwidth: Figure 6 shows samples on both datasets. In most instances the bounds behave as one would expect, expending the bulk of the bandwidth in the correct direction. An interesting question is whether it is possible to utilize the asymmetric output as a correction on the base predictor. Our preliminary investigation shows that a naive combination leads to degradation in the base error; this question, however, remains of interest for future work.
5 Conclusions
In this work we demonstrated that meta-modeling (MM) provides a powerful new framework for uncertainty prediction. Through a systematic evaluation of the proposed MM variants we report considerable relative gains over a constant reference baseline and show that they not only outperform all competitive baselines but also show stability across drift scenarios. A jointly trained model integrating the base with a meta component fares best, followed by a white-box setup, indicating that trainable white-box features play an essential role in the task. Besides symmetric uncertainty, we also investigated generating asymmetric bounds using dedicated network nodes and showed their benefit in reducing the uncertainty bandwidth. We believe these results open an exciting new research avenue for uncertainty quantification in sequential regression.
Appendix A Algorithm to Find a Scaling Factor
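The algorithm itself is not reproduced in this excerpt. As a hedged, illustrative stand-in (not necessarily the method of this appendix), one simple way to obtain a scaling factor for a target miss rate is to take a quantile of the ratio between observed residuals and predicted uncertainties:

```python
import numpy as np

def scaling_factor(residual, u, miss_rate=0.1):
    """Illustrative sketch: choose a scalar s such that a fraction
    `miss_rate` of targets falls outside the scaled symmetric interval
    [pred - s*u, pred + s*u]. With per-sample uncertainties u, this is the
    (1 - miss_rate) quantile of |residual| / u. The actual algorithm in
    this appendix may differ; names and approach are assumptions.
    """
    ratios = np.abs(np.asarray(residual, dtype=float)) / np.asarray(u, dtype=float)
    return float(np.quantile(ratios, 1.0 - miss_rate))
```
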
Appendix B Additional Implementation Details
b.1 MITV Pre-Processing
As described in Section 4.1.1 and Table 1, the MITV dataset comes with 8 input features, 3 of which are categorical. Here we list the relevant parsing and encoding steps used in our setup. The raw time stamp information was parsed to extract additional features such as day of the week, day of the month, year-day fraction, etc. Table 4 shows the corresponding list. Standardization was performed on both the input and the output, as per Table 4, whereby the model predictions were transformed back to their original range before calculating final metrics.
Feature Name  Range  Categorical  Embedding Dimension  Standardized  Final Dimension 
INPUT  
day_of_month  Y  3  N  3  
day_of_week  Y  3  N  3  
month  Y  3  N  3  
frac_yday  N    Y  1  
weather_type  Y  3  N  3  
holiday_type  Y  3  N  3  
temperature  N    Y  1  
rain_1h  N    Y  1  
snow_1h  N    Y  1  
clouds_all  N    Y  1  
Total  20  
OUTPUT  
traffic_volume  N    Y  1  
Total  1 
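The time-stamp parsing and standardization steps described above can be sketched as follows; the raw time-stamp format and the function names are assumptions for illustration:

```python
import datetime as dt
import numpy as np

def parse_timestamp(ts: str):
    """Derive the calendar features of Table 4 from a raw time stamp.
    Feature names follow the table; the raw format is an assumption."""
    t = dt.datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    year_start = dt.datetime(t.year, 1, 1)
    year_end = dt.datetime(t.year + 1, 1, 1)
    return {
        "day_of_month": t.day,        # categorical, embedded by the model
        "day_of_week": t.weekday(),   # categorical, embedded by the model
        "month": t.month,             # categorical, embedded by the model
        "frac_yday": (t - year_start) / (year_end - year_start),  # in [0, 1)
    }

def standardize(x, mean, std):
    """Z-score standardization applied to numeric inputs and the output;
    predictions are later mapped back to the original range via x * std + mean."""
    return (np.asarray(x, dtype=float) - mean) / std
```
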
b.2 SPE9PR Pre-Processing
Table 5 lists details on the SPE9PR features (also refer to Section 4.1.2 and Table 1). The SPE9PR dataset contains input sequences of actions and output sequences of production rates. An action (feature type_of_well), at a particular time, represents a decision whether to drill, and if so, what type of well to drill (an injector or a producer well), or not to drill (encoded by "0"), hence the cardinality is 3. In case of a drill decision, further specifications apply, namely the x- and y-location on the surface of the reservoir, local geological features at the site, and well control parameters. There are 15 vertical cells in the SPE9 model, each coming with 3 geological features (rel. permeability, rel. porosity, rock type), so the local geology is a 45-dimensional feature vector at a particular location. Finally, every well drilled so far may be controlled by a parameter called "Bottom-Hole Pressure" (BHP). Since we provision up to 100 wells of each of the two types, a 200-dimensional vector arises containing BHP values for these wells at any given time. Standardization was performed on both the input and the output as specified in Table 5, whereby the model predictions were transformed back to their original range before calculating and reporting final metrics.
Feature Name  Range  Categorical  Embedding Dimension  Standardized  Final Dimension 
INPUT  
type_of_well  Y  3  N  3  
location_x  Y  10  N  10  
location_y  Y  10  N  10  
vertical_geology  N    Y  45  
per_well_control  N    Y  200  
Total  258  
OUTPUT  
oil_prod_field_rate  N    Y  1  
gas_prod_field_rate  N    Y  1  
water_prod_field_rate  N    Y  1  
water_inj_field_rate  N    Y  1  
Total  4 
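As a minimal sketch of the per-time-step input record described above (field names follow Table 5; the encoding details are assumptions, and the categorical embeddings of the table are learned inside the model, not built here):

```python
import numpy as np

# Dimensions from the text: 15 vertical cells x 3 geological features = 45,
# and 2 well types x 100 provisioned wells = 200 BHP control values.
N_CELLS, N_GEO, N_WELLS_PER_TYPE = 15, 3, 100

def make_action(type_of_well=0, x=0, y=0, geology=None, bhp=None):
    """One SPE9PR input time step before embedding.

    type_of_well: 0 = no drill, 1 = injector, 2 = producer (cardinality 3).
    x, y index the drilling location on the reservoir surface; `geology`
    holds the 45 local geological values; `bhp` the 200 per-well control
    pressures. The exact field layout is illustrative.
    """
    geology = np.zeros(N_CELLS * N_GEO) if geology is None else np.asarray(geology)
    bhp = np.zeros(2 * N_WELLS_PER_TYPE) if bhp is None else np.asarray(bhp)
    assert geology.shape == (45,) and bhp.shape == (200,)
    # categorical fields stay as integer ids here; the network embeds them
    return {"type_of_well": type_of_well, "location_x": x, "location_y": y,
            "vertical_geology": geology, "per_well_control": bhp}
```
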
b.3 Training Setup
b.3.1 Hyperparameters
Hyperparameters were determined in two ways: (1) the learning rate, regularization, batch size, and LSTM size were adopted from an unrelated experimental study performed on a modified SPE9 reservoir (Anonymized); (2) we used DEV2 to determine the dropout rates in the DOMS model. The value was chosen ad hoc (as a midpoint between pure base and pure meta loss) without further optimization.
Hyperparameter  Where used  Value  Comment 
Learning rate  all  0.001  Stage 1 training 
Learning rate  all  0.0002  Stage 2 training, see Section 4.2.1 
Batch size  all  100  
penalty coefficient  all, except DOMS  0.0001  
penalty coefficient  DOMS  0.0  
Dropout  DOMS  0.25/0.1/0.25  LSTM Input/State/Output (encoder and decoder) 
Base LSTM size  all  32  
Meta LSTM size  meta models  16  
in Eq. (1)  1.0/0.5/0.0  see Section 4.2.2 
b.4 Implementation Notes
All DNNs were implemented in TensorFlow 1.11. Training was done on a Tesla K80 GPU, with total training time ranging between 3 (MITV) and 24 (SPE9PR) hours. The GARCH Python implementation provided in the arch library was used.
Appendix C Additional Results
c.1 Individual Metrics
For a more detailed view of the averages in Table 2, we show a split by the individual metrics in Table 7 and, for SPE9PR, which has a total of four output variables, a split by the individual variables in Table 8.
Model  Deficit@0.01  Deficit@0.05  Deficit@0.1  Excess@0.01  Excess@0.05  Excess@0.1  MinCost  Average 

, MITV ”NonDrift” Scenario  
JMS  61.3  74.8  75.6  64.5  54.6  40.1  33.9  57.8 
WBMS  48.0  73.8  74.3  62.7  50.7  35.1  32.3  53.8 
JMV  33.9  31.9  32.5  35.8  31.4  28.2  16.8  20.4 
DOMS  28.1  24.8  23.1  11.5  3.3  1.6  2.8  13.1 
BBMS  9.7  27.9  16.3  18.8  18.7  18.7  0.8  0.4 
, MITV ”NonDrift” Scenario  
JMS  75.9  78.1  77.4  58.0  51.9  39.6  33.9  59.3 
WBMS  67.4  79.6  77.7  57.7  42.7  29.8  32.3  55.3 
JMV  4.4  38.9  35.8  29.1  27.1  26.3  16.8  25.5 
DOMS  78.2  63.9  62.3  35.7  32.4  49.2  2.8  12.8 
BBMS  61.6  21.5  15.4  61.0  12.5  15.9  0.9  1.2 
, MITV ”Drift” Scenario  
JMS  40.5  78.0  77.5  57.7  45.0  31.9  36.7  52.5 
WBMS  18.4  68.8  70.8  43.6  32.8  15.6  25.8  39.4 
JMV  3.8  29.0  30.9  23.0  25.5  21.6  13.8  20.0 
DOMS  23.4  23.7  24.7  11.6  14.2  11.4  7.3  16.6 
BBMS  38.2  5.7  8.7  11.8  12.2  14.1  0.2  8.9 
, MITV ”Drift” Scenario  
JMS  71.7  81.6  80.1  53.2  39.1  27.1  36.7  55.6 
WBMS  67.9  79.2  78.5  22.9  19.4  2.0  25.8  42.3 
JMV  27.2  28.5  32.7  29.1  25.9  19.6  13.8  17.5 
DOMS  38.4  18.9  23.4  2.8  16.6  13.3  7.4  17.3 
BBMS  6.2  13.2  13.1  17.5  17.4  18.6  0.3  3.0 
, SPE9PR ”NonDrift” Scenario  
JMS  55.7  55.8  56.8  46.1  42.5  37.7  30.8  46.5 
WBMS  50.9  54.5  55.6  47.4  41.5  35.4  28.5  44.8 
JMV  50.5  54.2  53.7  51.8  43.3  37.0  26.4  45.3 
DOMS  7.1  10.2  10.4  6.9  5.8  6.8  1.0  1.3 
BBMS  51.0  52.6  50.0  25.1  18.2  10.6  15.7  31.9 
, SPE9PR ”NonDrift” Scenario  
JMS  56.8  56.5  57.0  45.7  42.0  37.4  30.8  46.6 
WBMS  50.7  54.9  56.2  47.5  41.2  34.9  28.5  44.8 
JMV  51.4  56.0  55.2  51.6  42.3  35.5  26.4  45.5 
DOMS  7.7  9.5  10.1  7.0  5.3  6.4  1.0  1.4 
BBMS  76.4  61.2  54.8  11.5  10.6  5.0  15.7  30.3 
, SPE9PR ”Drift” Scenario  
JMS  65.4  69.0  68.4  52.6  50.0  46.3  44.5  56.6 
WBMS  63.7  68.0  66.9  53.5  48.1  41.8  39.5  54.5 
JMV  29.2  22.7  21.9  33.9  42.8  52.9  10.3  6.5 
DOMS  1.4  9.4  12.0  10.7  4.5  8.5  2.7  0.1 
BBMS  19.4  33.8  31.0  7.6  1.1  3.0  6.7  11.6 
, SPE9PR ”Drift” Scenario  
JMS  95.0  85.5  78.7  22.8  18.8  14.4  38.5  50.5 
WBMS  94.7  84.1  77.3  16.8  5.0  5.9  34.9  43.8 
JMV  40.0  25.2  12.2  48.3  31.6  16.9  11.3  4.4 
DOMS  16.4  10.1  9.1  23.4  23.5  26.9  1.8  5.2 
BBMS  40.5  9.4  7.9  0.8  17.3  9.0  5.0  12.8 
Excess-Deficit  
Model  OPR  WPR  GPR  WIN  Average  OPR  WPR  GPR  WIN  Average 
, SPE9PR ”NonDrift” Scenario  
JMS  0.12  0.28  0.17  0.09  0.17  18.25  66.45  53.60  47.61  46.48 
WBMS  0.12  0.28  0.17  0.09  0.16  17.78  62.08  52.43  47.03  44.83 
JMV  0.10  0.29  0.16  0.09  0.16  6.67  74.50  43.74  56.24  45.29 
DOMS  0.12  0.31  0.18  0.11  0.18  6.88  0.68  21.64  10.16  1.32 
BBMS  0.12  0.30  0.16  0.10  0.17  3.85  74.44  16.89  40.08  31.89 
, SPE9PR ”NonDrift” Scenario  
JMS  0.12  0.28  0.17  0.09  0.17  19.13  66.98  52.18  48.19  46.62 
WBMS  0.12  0.28  0.17  0.09  0.16  18.32  62.60  51.07  47.39  44.85 
JMV  0.10  0.29  0.16  0.09  0.16  6.92  75.74  43.12  56.23  45.50 
DOMS  0.12  0.31  0.18  0.11  0.18  6.56  0.46  21.94  10.27  1.39 
BBMS  0.12  0.30  0.16  0.10  0.17  3.84  76.80  7.71  40.63  30.33 
, SPE9PR ”Drift” Scenario  
JMS  0.28  0.30  0.45  0.13  0.29  49.97  24.13  119.73  32.54  56.59 
WBMS  0.31  0.34  0.45  0.14  0.31  51.49  27.09  102.64  36.77  54.50 
JMV  0.30  0.32  0.51  0.20  0.33  1.12  6.18  26.99  47.96  6.51 
DOMS  0.28  0.28  0.45  0.11  0.28  20.24  6.97  22.32  4.31  0.15 
BBMS  0.32  0.31  0.54  0.13  0.33  4.18  1.04  26.64  25.13  11.64 
, SPE9PR ”Drift” Scenario  
JMS  0.28  0.30  0.45  0.13  0.29  61.81  39.10  74.23  27.01  50.54 
WBMS  0.31  0.34  0.45  0.14  0.31  59.11  34.21  60.51  21.51  43.84 
JMV  0.30  0.32  0.51  0.20  0.33  2.15  9.15  19.51  13.28  4.38 
DOMS  0.28  0.28  0.45  0.11  0.28  0.81  27.32  15.73  10.09  5.22 
BBMS  0.32  0.31  0.54  0.13  0.33  2.55  4.67  20.74  23.36  12.83 
c.2 Additional Visualizations
In addition to the sample visualizations shown in Section 2 for the JMS, JMV, and JMA systems, here we show the same sections of the data and visualize the output of all systems. Figures 7 and 8 show the first simulation in the test set of the SPE9PR dataset, with all its output components, for the drift and non-drift conditions, respectively. Figures 9 and 10 show the output on the MITV drift and non-drift conditions, respectively. For each model, a miss rate of 0.1 across the entire test set was used in the visualizations.