1 Introduction
Machine learning models are well-calibrated if the probability associated with the predicted class reflects its correctness likelihood relative to the ground truth. The output probabilities of modern neural networks are often poorly calibrated (Guo et al., 2017). For instance, typical neural networks with a softmax activation tend to assign high probabilities to out-of-distribution samples (Gal and Ghahramani, 2016b). Providing uncertainty estimates is important for model interpretability as it allows users to assess the extent to which they can trust a given prediction (Jiang et al., 2018). Moreover, well-calibrated output probabilities are crucial in several use cases. For instance, when monitoring medical time-series data (see Figure 3(a)), hospital staff should also be alerted when there is a low-confidence prediction concerning a patient's health status.
Bayesian neural networks (BNNs), which place a prior distribution on the model's parameters, are a popular approach to modeling uncertainty. BNNs often require more parameters and approximate inference, and depend crucially on the choice of prior (Gal, 2016; Lakshminarayanan et al., 2017). Applying dropout both during training and inference can be interpreted as a BNN and provides a more efficient method for uncertainty quantification (Gal and Ghahramani, 2016b). The dropout probability, however, needs to be tuned and, therefore, leads to a trade-off between predictive error and calibration error.
Sidestepping the challenges of Bayesian NNs, we propose an orthogonal approach to quantify the uncertainty in recurrent neural networks (RNNs). At each time step, based on the current hidden (and cell) state, the model computes a probability distribution over a finite set of states. The next state of the RNN is then drawn from this distribution. We use the Gumbel softmax trick (Gumbel, 1954; Kendall and Gal, 2017; Jang et al., 2017) to perform Monte-Carlo gradient estimation. Inspired by the effectiveness of temperature scaling (Guo et al., 2017), which is usually applied to trained models, we learn the temperature $\tau$ of the Gumbel softmax distribution during training to control the concentration of the state transition distribution. Learning $\tau$ as a parameter can be seen as entropy regularization (Szegedy et al., 2016; Pereyra et al., 2017; Jang et al., 2017). The resulting model, which we name ST-τ, defines for every input sequence a probability distribution over state-transition paths similar to a probabilistic state machine. To estimate the model's uncertainty for a prediction, ST-τ is run multiple times to compute mean and variance of the prediction probabilities.
We explore the behavior of ST-τ in a variety of tasks and settings. First, we show that ST-τ can learn deterministic and probabilistic automata from data. Second, we demonstrate on real-world classification tasks that ST-τ learns well-calibrated models. Third, we show that ST-τ is competitive in out-of-distribution detection tasks. Fourth, in a reinforcement learning task, we find that ST-τ is able to trade off exploration and exploitation behavior better than existing methods. The out-of-distribution detection and reinforcement learning tasks in particular are not amenable to post-hoc calibration approaches (Guo et al., 2017) and, therefore, require a method such as ours that is able to calibrate the probabilities during training.
2 Uncertainty in Recurrent Neural Networks
2.1 Background
An RNN is a function $f$ defined through a neural network with parameters $\mathbf{w}$ that is applied over time steps: at time step $t$, it reuses the hidden state $\mathbf{h}_{t-1}$ of the previous time step and the current input $\mathbf{x}_t$ to compute a new state $\mathbf{h}_t$, $f: (\mathbf{h}_{t-1}, \mathbf{x}_t) \rightarrow \mathbf{h}_t$. Some RNN variants such as LSTMs have memory cells $\mathbf{c}_t$ and apply the function $f: (\mathbf{h}_{t-1}, \mathbf{c}_{t-1}, \mathbf{x}_t) \rightarrow \mathbf{h}_t$ at each step. A vanilla RNN maps two identical input sequences to the same state and it is therefore not possible to measure the uncertainty of a prediction by running inference multiple times. Furthermore, it is known that passing $\mathbf{h}_T$ through a softmax transformation leads to overconfident predictions on out-of-distribution samples and poorly calibrated probabilities (Guo et al., 2017). In a Bayesian RNN the weight matrices are drawn from a distribution and, therefore, the output is an average of an infinite number of models. Unlike vanilla RNNs, Bayesian RNNs are stochastic and it is possible to compute the average and variance for a prediction. Using a prior to integrate out the parameters during training also leads to a regularization effect. However, there are two major and often debated challenges of BNNs: the right choice of prior and the efficient approximation of the posterior.
With this paper, we sidestep these challenges and model the uncertainty of an RNN through probabilistic state transitions between a finite number of $k$ learnable states $\mathbf{s}_1, \ldots, \mathbf{s}_k$. Given a state $\mathbf{h}_t$, we compute a stochastic probability distribution over the learnable states. Hence, for the same state and input, the RNN might move to different states in different runs. Instead of integrating over possible weights, as in the case of BNNs, we sum over all possible state sequences and weigh the classification probabilities by the probabilities of these sequences. Figure 4 illustrates the proposed approach and contrasts it with vanilla and Bayesian RNNs. The proposed method combines two building blocks. The first is state-regularization (Wang and Niepert, 2019) as a way to compute a probability distribution over a finite set of states in an RNN. State-regularization, however, is deterministic and therefore we utilize the second building block, the Gumbel softmax trick (Gumbel, 1954; Maddison et al., 2017; Jang et al., 2017), to sample from a categorical distribution. Combining the two blocks allows us to create a stochastic-state RNN which can model uncertainty. Before we formulate our method, we first introduce the two necessary building blocks.
Deterministic State-Regularized RNNs.
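State regularization (Wang and Niepert, 2019) extends RNNs by dividing the computation of the hidden state $\mathbf{h}_t$ into two components. The first component is an intermediate vector $\mathbf{u}_t \in \mathbb{R}^d$ computed in the same manner as the standard recurrent component, $\mathbf{u}_t = f(\mathbf{h}_{t-1}, \mathbf{c}_{t-1}, \mathbf{x}_t)$. The second component models probabilistic state transitions between a finite set of $k$ learnable states $\mathbf{s}_1, \ldots, \mathbf{s}_k$, where $\mathbf{s}_i \in \mathbb{R}^d$, $i \in [1, k]$, and which can also be written as a matrix $\mathbf{S} \in \mathbb{R}^{d \times k}$. $\mathbf{S}$ is randomly initialized and learnt during backpropagation like any other network weight. At time step $t$, given $\mathbf{u}_t$, the transition over next possible states is computed by $\boldsymbol{\theta}_t = \varphi(\mathbf{S}, \mathbf{u}_t)$, where $\boldsymbol{\theta}_t = \{\theta_{t,1}, \ldots, \theta_{t,k}\}$ and $\varphi$ is some predefined function. In Wang and Niepert (2019), $\varphi$ was a matrix-vector product followed by a softmax function that ensures $\sum_{i=1}^{k} \theta_{t,i} = 1$. The hidden state $\mathbf{h}_t$ is then computed by
$$\mathbf{h}_t = g(\boldsymbol{\theta}_t) \cdot \mathbf{S}^{\top}, \quad \mathbf{h}_t \in \mathbb{R}^d, \qquad (1)$$
where $g(\cdot)$ is another function, e.g. to compute the average. Because Equation (1) is deterministic, it cannot capture and estimate epistemic uncertainty.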
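The deterministic transition described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `state_regularized_step`, the list-of-lists layout (one row per learnable state) and the toy values are hypothetical, and the averaging choice for the second function follows the "e.g. to compute the average" remark rather than any particular implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def state_regularized_step(u, states):
    """Deterministic transition for one time step.

    u      -- intermediate vector of length d (output of the recurrent cell)
    states -- list of k learnable state vectors, each of length d
    """
    # Matrix-vector product (one score per state), followed by a softmax.
    scores = [sum(s_j * u_j for s_j, u_j in zip(s, u)) for s in states]
    theta = softmax(scores)
    # New hidden state: probability-weighted average of the learnable states.
    h = [sum(t * s[j] for t, s in zip(theta, states)) for j in range(len(u))]
    return theta, h


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # k = 3, d = 2 (toy values)
    u = [0.3, -0.2]
    print(state_regularized_step(u, states))
    # Calling the function again with the same inputs gives the same result,
    # which is why repeated inference cannot reveal any variance.
    print(state_regularized_step(u, states))
```

Because nothing in this step is sampled, two runs on the same input are identical, which is the limitation the stochastic variant addresses.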
Monte-Carlo Estimator with Gumbel Trick.
The Gumbel softmax trick is an instance of a path-wise Monte-Carlo gradient estimator (Gumbel, 1954; Maddison et al., 2017; Jang et al., 2017). With the Gumbel trick, it is possible to draw samples $z$ from a categorical distribution given by parameters $\boldsymbol{\theta}$, that is, $z = \text{one\_hot}(\arg\max_i [\gamma_i + \log \theta_i])$, $i \in [1 \ldots k]$, where $k$ is the number of categories and $\gamma_i$ are i.i.d. samples from the Gumbel(0, 1) distribution, that is, $\gamma = -\log(-\log(u))$, $u \sim \text{uniform}(0, 1)$. Because the $\arg\max$ operator breaks end-to-end differentiability, the categorical distribution $z$ can be approximated using the differentiable softmax function (Jang et al., 2017; Maddison et al., 2017). This enables us to draw a $k$-dimensional sample vector $\boldsymbol{\alpha} \in \Delta^{k-1}$, where $\Delta^{k-1}$ is the $(k-1)$-dimensional probability simplex.
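A minimal sketch of both sampling variants, using only the Python standard library, may help to make the trick concrete. The function names and the toy distribution are hypothetical; the formulas are the standard Gumbel-max and Gumbel softmax constructions and assume strictly positive category probabilities.

```python
import math
import random


def sample_gumbel(rng):
    """One Gumbel(0, 1) sample: -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, where log is undefined
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(theta, rng):
    """Exact categorical sample: the index maximizing gamma_i + log(theta_i)."""
    scores = [sample_gumbel(rng) + math.log(p) for p in theta]
    return max(range(len(theta)), key=lambda i: scores[i])


def gumbel_softmax_sample(theta, tau, rng):
    """Relaxed sample on the simplex: softmax((log(theta_i) + gamma_i) / tau)."""
    scores = [(math.log(p) + sample_gumbel(rng)) / tau for p in theta]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.1, 0.6, 0.3]  # toy categorical distribution
    n = 20000
    counts = [0] * len(theta)
    for _ in range(n):
        counts[gumbel_max_sample(theta, rng)] += 1
    print("empirical frequencies:", [c / n for c in counts])
    print("relaxed sample:", gumbel_softmax_sample(theta, 0.5, rng))
```

The empirical frequencies of the Gumbel-max samples approach `theta`, while each relaxed sample is a full probability vector that remains differentiable with respect to `theta` and `tau` in an autodiff framework.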
2.2 Stochastic Finite-State RNNs (ST-τ)
Our goal is to make state transitions stochastic and utilize them to measure uncertainty: given an input sequence, the uncertainty is modeled via the probability distribution over possible state-transition paths (see right half of Figure 4). We can achieve this by setting $\varphi$ to be a matrix-vector product and using $g$ to sample from a Gumbel softmax distribution with temperature parameter $\tau$. Applying Monte Carlo estimation, at each time step $t$, we sample a distribution over state transition probabilities $\boldsymbol{\alpha}_t$ from the Gumbel softmax distribution with current parameter $\tau$, where each state transition has the probability
$$\alpha_{t,i} = \frac{\exp\left((\log(\theta_{t,i}) + \gamma_i)/\tau\right)}{\sum_{j=1}^{k} \exp\left((\log(\theta_{t,j}) + \gamma_j)/\tau\right)}, \quad i \in [1 \ldots k]. \qquad (2)$$
The resulting $\boldsymbol{\alpha}_t$ can be seen as a probability distribution that judges how important each learnable state $\mathbf{s}_i$ is. The new hidden state $\mathbf{h}_t$ can now be formed either as an average, $\mathbf{h}_t = \sum_{i=1}^{k} \alpha_{t,i} \mathbf{s}_i$ (the "soft" Gumbel estimator), or as a one-hot vector, $\mathbf{h}_t = \mathbf{s}_{\arg\max_i \alpha_{t,i}}$. For the latter, gradients can be estimated using the straight-through estimator. Empirically, we found the average to work better. By sampling from the Gumbel softmax distribution at each time step, the model is stochastic and it is possible to measure variance across predictions and, therefore, to estimate epistemic uncertainty. For more theoretical details we refer the reader to Section 2.3.
The parameter $\tau$ of the Gumbel softmax distribution is learned during training (Jang et al., 2017). This allows us to directly adapt probabilistic RNNs to the inherent uncertainty of the data. Intuitively, the parameter $\tau$ influences the concentration of the categorical distribution, that is, the larger $\tau$, the more uniform the distribution. Since we influence the state transition uncertainty with the learned temperature $\tau$, we refer to our model as ST-τ. We provide an ablation experiment of learning $\tau$ versus keeping it fixed in Appendix E.
Figure 5 illustrates the proposed model. Given the previous hidden state $\mathbf{h}_{t-1}$ of an RNN, first an intermediate hidden state $\mathbf{u}_t$ is computed using a standard RNN cell. Next, the intermediate representation $\mathbf{u}_t$ is multiplied with $k$ learnable states arranged as a matrix $\mathbf{S}$, resulting in $\boldsymbol{\theta}_t$. Based on $\boldsymbol{\theta}_t$, samples are drawn from a Gumbel softmax distribution with learnable temperature parameter $\tau$. The sampled probability distribution represents the certainty the model has in moving to the other states. Running the model on the same input several times (drawing Monte-Carlo samples) allows us to estimate the uncertainty of the ST-τ model.
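The pipeline just described can be sketched end to end in plain Python. Everything concrete here is a hypothetical stand-in chosen for illustration: the tanh "cell", the toy states and readout weights, and the function names. The sketch also treats the matrix-vector scores directly as unnormalized log-probabilities inside the Gumbel softmax, and it keeps the temperature fixed instead of learning it.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel(rng):
    """One Gumbel(0, 1) sample: -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, where log is undefined
        u = rng.random()
    return -math.log(-math.log(u))


def st_tau_step(h_prev, x, states, tau, rng):
    """One stochastic transition; returns the new hidden state and alpha_t."""
    # Hypothetical stand-in for the recurrent cell: u_t = tanh(h_{t-1} + x_t).
    u = [math.tanh(h + xi) for h, xi in zip(h_prev, x)]
    # Matrix-vector product of the k learnable states with u_t.
    scores = [sum(sj * uj for sj, uj in zip(s, u)) for s in states]
    # Gumbel softmax sample with temperature tau (scores used as log-weights).
    alpha = softmax([(sc + gumbel(rng)) / tau for sc in scores])
    # "Soft" estimator: alpha-weighted average of the learnable states.
    h = [sum(a * s[j] for a, s in zip(alpha, states)) for j in range(len(u))]
    return h, alpha


def predict_once(xs, states, readout, tau, rng):
    """Run the sequence once and return class probabilities."""
    h = [0.0] * len(states[0])
    for x in xs:
        h, _ = st_tau_step(h, x, states, tau, rng)
    return softmax([sum(w * hj for w, hj in zip(row, h)) for row in readout])


def mc_predict(xs, states, readout, tau, runs=10, seed=0):
    """Monte-Carlo estimate: per-class mean and variance over several runs."""
    rng = random.Random(seed)
    samples = [predict_once(xs, states, readout, tau, rng) for _ in range(runs)]
    n_cls = len(readout)
    mean = [sum(s[c] for s in samples) / runs for c in range(n_cls)]
    var = [sum((s[c] - mean[c]) ** 2 for s in samples) / runs
           for c in range(n_cls)]
    return mean, var


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # k = 3 states, d = 2
    readout = [[2.0, -1.0], [-1.0, 2.0]]             # 2 classes
    xs = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]      # toy input sequence
    mean, var = mc_predict(xs, states, readout, tau=1.0, runs=10)
    print("mean:", mean)
    print("variance:", var)
```

The mean plays the role of the prediction and the variance across runs serves as the uncertainty signal; in a trained model the states, readout and temperature would all be learned parameters.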
2.3 Aleatoric and Epistemic Uncertainty
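Let $\mathcal{Y}$ be a set of class labels and $\mathcal{D}$ be a set of training samples. For a classification problem and a given ST-τ model with states $\{\mathbf{s}_1, \ldots, \mathbf{s}_k\}$, we can write for every $y \in \mathcal{Y}$
$$p(y \mid \mathbf{x} = \langle \mathbf{x}_1, \ldots, \mathbf{x}_T \rangle) = \sum_{\langle \mathbf{h}_1, \ldots, \mathbf{h}_T \rangle \in \Psi} p(\langle \mathbf{h}_1, \ldots, \mathbf{h}_T \rangle \mid \mathbf{x}) \, q(y \mid \mathbf{h}_T), \qquad (3)$$
where $\mathbf{h}_i \in \{\mathbf{s}_1, \ldots, \mathbf{s}_k\}$ and the sum is over all possible paths (state sequences) $\Psi$ of length $T$. Moreover, $p(\psi \mid \mathbf{x})$ is the probability of path $\psi = \langle \mathbf{h}_1, \ldots, \mathbf{h}_T \rangle \in \Psi$ given input sequence $\mathbf{x}$ and $q(y \mid \mathbf{h}_T)$ is the probability of class $y$ given that we are in state $\mathbf{h}_T$. Instead of integrating over possible weights, as in the case of BNNs, with ST-τ we integrate (sum) over all possible paths and weigh the class probabilities by the path probabilities. The above model implicitly defines a probabilistic ensemble of several deterministic models, each represented by a particular path. As mentioned in a recent paper about aleatoric and epistemic uncertainty in ML (Hüllermeier and Waegeman, 2019): "the variance of the predictions produced by an ensemble is a good indicator of the (epistemic) uncertainty in a prediction."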
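To make the path sum tangible, the following sketch enumerates every state sequence of a tiny probabilistic state machine and weighs the class probabilities of the final state by the path probability. It assumes, purely for illustration, Markovian transitions with one hypothetical transition matrix per input symbol and a fixed start state; the numbers are made up. A forward recursion is included as a cross-check because it computes the same quantity without enumerating paths.

```python
import itertools


def path_sum_prediction(trans, start, inputs, q):
    """Class probabilities as a sum over all state paths.

    trans[a][i][j] -- probability of moving from state i to j on input a
    q[j][c]        -- probability of class c given final state j
    inputs         -- non-empty sequence of input symbols
    """
    k, n_cls = len(q), len(q[0])
    pred = [0.0] * n_cls
    for path in itertools.product(range(k), repeat=len(inputs)):
        prob, prev = 1.0, start
        for a, s in zip(inputs, path):
            prob *= trans[a][prev][s]  # probability of this path so far
            prev = s
        for c in range(n_cls):
            pred[c] += prob * q[path[-1]][c]
    return pred


def forward_prediction(trans, start, inputs, q):
    """Same quantity via a forward recursion over state beliefs."""
    k = len(q)
    belief = [1.0 if i == start else 0.0 for i in range(k)]
    for a in inputs:
        belief = [sum(belief[i] * trans[a][i][j] for i in range(k))
                  for j in range(k)]
    return [sum(belief[j] * q[j][c] for j in range(k))
            for c in range(len(q[0]))]


if __name__ == "__main__":
    trans = {
        "a": [[0.9, 0.1], [0.2, 0.8]],
        "b": [[0.5, 0.5], [0.0, 1.0]],
    }
    q = [[0.8, 0.2], [0.1, 0.9]]
    inputs = ["a", "b", "b"]
    print(path_sum_prediction(trans, 0, inputs, q))
    print(forward_prediction(trans, 0, inputs, q))
```

The enumeration grows as $k^T$ and is only practical for toy sizes, which is one reason to estimate the sum by sampling paths instead.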
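Let us now make this intuition more concrete using recently proposed measures of aleatoric and epistemic uncertainty (Depeweg et al., 2018), further discussed in Hüllermeier and Waegeman (2019). In Equation (19) of Hüllermeier and Waegeman (2019) the total uncertainty is defined as the entropy of the predictive posterior distribution
$$H[p(y \mid \mathbf{x})] = -\sum_{y \in \mathcal{Y}} p(y \mid \mathbf{x}) \log_2 p(y \mid \mathbf{x}).$$
The above term includes both aleatoric and epistemic uncertainty (Depeweg et al., 2018; Hüllermeier and Waegeman, 2019). Now, in the context of Bayesian NNs, where we have a distribution over the weights $\mathbf{w}$ of a neural network, the expectation of the entropies with respect to said distribution is the aleatoric uncertainty (Equation 20 in Hüllermeier and Waegeman (2019)):
$$\mathbb{E}_{p(\mathbf{w} \mid \mathcal{D})} H[p(y \mid \mathbf{w}, \mathbf{x})] = -\int p(\mathbf{w} \mid \mathcal{D}) \left( \sum_{y \in \mathcal{Y}} p(y \mid \mathbf{w}, \mathbf{x}) \log_2 p(y \mid \mathbf{w}, \mathbf{x}) \right) d\mathbf{w}.$$
Fixing the parameter weights to particular values eliminates the epistemic uncertainty. Finally, the epistemic uncertainty is obtained as the difference of the total and aleatoric uncertainty (Equation 21 in Hüllermeier and Waegeman (2019)):
$$u_e(\mathbf{x}) := H[p(y \mid \mathbf{x})] - \mathbb{E}_{p(\mathbf{w} \mid \mathcal{D})} H[p(y \mid \mathbf{w}, \mathbf{x})].$$
Now, let us return to finite-state probabilistic RNNs. Here, the aleatoric uncertainty is the expectation of the entropies with respect to the distribution over the possible paths $\Psi$:
$$\mathbb{E}_{p(\psi \mid \mathbf{x})} H[p(y \mid \psi, \mathbf{x})] = -\sum_{\psi \in \Psi} p(\psi \mid \mathbf{x}) \left( \sum_{y \in \mathcal{Y}} p(y \mid \psi, \mathbf{x}) \log_2 p(y \mid \psi, \mathbf{x}) \right),$$
where $p(y \mid \psi, \mathbf{x})$ is the probability of class $y$ conditioned on $\mathbf{x}$ and a particular path $\psi$. The epistemic uncertainty for probabilistic finite-state RNNs can then be computed by
$$u_e(\mathbf{x}) := H[p(y \mid \mathbf{x})] - \mathbb{E}_{p(\psi \mid \mathbf{x})} H[p(y \mid \psi, \mathbf{x})].$$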
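Probabilistic finite-state RNNs capture epistemic uncertainty when the equation above is non-zero. As an example let us take a look at the two ST-τ models given in Figure 6. Here, we have for input $\mathbf{x}$ two class labels $y_1$ and $y_2$, three states, and two paths. In both cases, the total uncertainty $H[p(y \mid \mathbf{x})]$ is the same, since both models induce the same predictive distribution.
Looking at the term for the aleatoric uncertainty, for the ST-τ depicted on the left side the expected entropy $\mathbb{E}_{p(\psi \mid \mathbf{x})} H[p(y \mid \psi, \mathbf{x})]$ equals the total uncertainty. In contrast, for the ST-τ depicted on the right side the expected entropy is smaller than the total uncertainty. Consequently, the left side ST-τ has an epistemic uncertainty of $u_e(\mathbf{x}) = 0$ but the right side ST-τ exhibits an epistemic uncertainty of $u_e(\mathbf{x}) > 0$.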
This example illustrates three main observations. First, we can represent epistemic uncertainty through distributions over possible paths. Second, the more spiky the transition distributions, the more deterministic the behavior of the ST-τ model, and the more confident it becomes in its prediction by shrinking the reducible source of uncertainty (epistemic uncertainty). Third, both models are equally calibrated as their predictive probabilities are, in expectation, identical for the same inputs. Hence, ST-τ is not merely calibrating predictive probabilities but also captures epistemic uncertainty. Finally, we want to stress the connection between the parameter $\tau$ (the temperature) and the degree of (epistemic) uncertainty of the model. For small $\tau$ the ST-τ model behavior is more deterministic and, therefore, has a lower degree of epistemic uncertainty. For instance, the ST-τ on the left in Figure 6 has zero epistemic uncertainty because all transition probabilities are deterministic. Empirically, we find that the temperature and, therefore, the epistemic uncertainty often reduces during training, leaving the irreducible uncertainty (aleatoric) to be the main source of uncertainty.
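The decomposition can be checked numerically with a short sketch. The two toy models below are hypothetical and are not the values from Figure 6; they are chosen so that both yield the same predictive distribution while placing the uncertainty in different terms. Entropies are measured in bits.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def decompose(path_probs, class_probs_per_path):
    """Return (total, aleatoric, epistemic) uncertainty.

    path_probs[i]              -- probability of path i given the input
    class_probs_per_path[i][c] -- probability of class c given path i
    """
    n_cls = len(class_probs_per_path[0])
    predictive = [
        sum(pp * cp[c] for pp, cp in zip(path_probs, class_probs_per_path))
        for c in range(n_cls)
    ]
    total = entropy(predictive)
    aleatoric = sum(pp * entropy(cp)
                    for pp, cp in zip(path_probs, class_probs_per_path))
    return total, aleatoric, total - aleatoric


if __name__ == "__main__":
    # Model A (hypothetical): deterministic transitions, uncertain class output.
    print(decompose([1.0, 0.0], [[0.3, 0.7], [0.3, 0.7]]))
    # Model B (hypothetical): stochastic transitions, certain class per path.
    print(decompose([0.3, 0.7], [[1.0, 0.0], [0.0, 1.0]]))
```

Both models predict the classes with probabilities 0.3 and 0.7, so the total uncertainty is identical, yet model A attributes all of it to the aleatoric term and model B attributes all of it to the epistemic term.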
3 Related Work
Uncertainty. Uncertainty quantification for safety-critical applications (Krzywinski and Altman, 2013) has been explored for deep neural nets in the context of Bayesian learning (Blundell et al., 2015; Gal and Ghahramani, 2016b; Kendall and Gal, 2017; Kwon et al., 2018). Bayes by Backprop (BBB) (Blundell et al., 2015) is a variational inference scheme for learning a distribution over weights $w$ in neural networks and assumes that the weights are distributed normally, that is, $w \sim \mathcal{N}(\mu, \sigma^2)$. The principles of Bayesian neural networks (BNNs) have been applied to RNNs and shown to result in superior performance compared to vanilla RNNs in natural language processing (NLP) tasks (Fortunato et al., 2017). However, BNNs come with a high computational cost because we need to learn $\mu$ and $\sigma$ for each weight in the network, effectively doubling the number of parameters. Furthermore, the prior might not be optimal and approximate inference could lead to inaccurate estimates (Kuleshov et al., 2018). Dropout (Hinton et al., 2012; Srivastava et al., 2014) can be seen as a variational approximation of a Gaussian Process (Gal and Ghahramani, 2016a, b). By leaving dropout activated at prediction time, it can be used to measure uncertainty. However, the dropout probability needs to be tuned, which leads to a trade-off between predictive error and calibration error (see Figure 32 in the Appendix for an empirical example). Deep ensembles (Lakshminarayanan et al., 2017) offer a non-Bayesian approach to measure uncertainty by training multiple separate networks and ensembling them. Similar to BNNs, however, deep ensembles require more resources as several different RNNs need to be trained. We show that ST-τ is competitive with deep ensembles without the resource overhead. Recent work (Hwang et al., 2020) describes a sample-free uncertainty estimation for Gated Recurrent Units (SP-GRU) (Chung et al., 2014), which estimates uncertainty by performing forward propagation in a series of deterministic linear and nonlinear transformations with exponential family distributions. In contrast, ST-τ estimates uncertainties through the stochastic transitions between two consecutive recurrent states.
Calibration. Platt scaling (Platt et al., 1999) is a calibration method for binary classification settings and has been extended to multi-class problems (Zadrozny and Elkan, 2002) and to structured prediction settings (Kuleshov and Liang, 2015). Guo et al. (2017) extended the method to calibrate modern deep neural networks, particularly networks with a large number of layers. In their setup, a temperature parameter for the final softmax layer is adjusted only after training. In contrast, our method learns the temperature during training and, therefore, the two processes are not decoupled. In some tasks such as time-series prediction or RL it is crucial to calibrate during training and not post-hoc.
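For contrast with learning the temperature during training, post-hoc temperature scaling can be sketched as follows. This is a simplified illustration, not the procedure of Guo et al. (2017) verbatim: it selects a single temperature by grid search over held-out logits (a hypothetical choice; a gradient-based optimizer is the more common option), and the validation data are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits divided by a positive temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax_with_temperature(logits, temperature)[y])
    return total / len(labels)


def fit_temperature(logit_rows, labels, grid):
    """Pick the temperature from `grid` with the lowest validation NLL."""
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


if __name__ == "__main__":
    # Toy validation set: the model is about 98% confident in class 0,
    # but class 0 is correct only three times out of four.
    val_logits = [[4.0, 0.0]] * 4
    val_labels = [0, 0, 0, 1]
    grid = [0.5 + 0.25 * i for i in range(19)]  # 0.5, 0.75, ..., 5.0
    best = fit_temperature(val_logits, val_labels, grid)
    print("selected temperature:", best)
    print("confidence before:", softmax_with_temperature([4.0, 0.0], 1.0)[0])
    print("confidence after:", softmax_with_temperature([4.0, 0.0], best)[0])
```

Dividing the logits by a temperature above one softens the distribution without changing the predicted class, which is why this kind of scaling leaves accuracy untouched and only affects calibration.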
Deterministic & Probabilistic Automata Extraction. Deterministic Finite Automata (DFA) have been used to make the behavior of RNNs more transparent. DFAs can be extracted from RNNs after an RNN is trained by applying clustering algorithms like $k$-means to the extracted hidden states (Wang et al., 2018) or by applying the exact learning algorithm L$^*$ (Weiss et al., 2018). Post-hoc extraction, however, might not recover faithful DFAs. Instead, Wang and Niepert (2019) proposed state-regularized RNNs where the finite set of states is learned alongside the RNN by using probabilistic state transitions. Building on this, we use the Gumbel softmax trick to model stochastic state transitions, allowing us to learn probabilistic automata (PAs) (Rabin, 1963) from data.
Hidden Markov Models (HMMs) & State-Space Models (SSMs) & RNNs. HMMs are transducer-style probabilistic automata, simpler and more transparent models than RNNs. Bridle (1990) explored how an HMM can be interpreted as an RNN by using full likelihood scoring for each word model. Krakovna and Doshi-Velez (2016) studied various combinations of HMMs and RNNs to increase the interpretability of RNNs. There have also been ideas on incorporating RNNs into HMMs to capture complex dynamics (Dai et al., 2017; Doerr et al., 2018). Another related line of work is SSMs, e.g., rSLDS (Linderman et al., 2017), Kalman VAE (Fraccaro et al., 2017), PlaNet (Hafner et al., 2019) and RKN (Becker et al., 2019). They can be extended and viewed as another way to inject stochasticity into RNN-based architectures. In contrast, ST-τ models stochastic finite-state transition mechanisms end-to-end in conjunction with modern gradient estimators to directly quantify and calibrate uncertainty and the underlying probabilistic system. This enables ST-τ to approximate and extract the probabilistic dynamics in RNNs.
4 Experiments
The experiments are grouped into five categories. First, we show that it is possible to use ST-τ to learn deterministic and probabilistic automata from language data (Sec. 4.1). This demonstrates that ST-τ can capture and recover the stochastic behavior of both deterministic and stochastic languages. Second, we demonstrate on classification tasks (Sec. 4.2) that ST-τ performs better than or similar to existing baselines both in terms of predictive quality and model calibration. Third, we compare ST-τ with existing baselines using out-of-distribution detection tasks (Sec. 4.3). Fourth, we conduct reinforcement learning experiments where we show that the learned parameter $\tau$ can calibrate the exploration-exploitation trade-off during learning (Appendix D), leading to a lower sample complexity. Fifth, we report results on a regression task (Appendix C).
Models.
We compare the proposed ST-τ method to four existing models. First, a vanilla LSTM (LSTM). Second, a Bayesian RNN (BBB) (Blundell et al., 2015; Fortunato et al., 2017), where each network weight $w_i$ is sampled from a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, $w_i = \mu + \sigma \cdot \epsilon$, with $\sigma = \log(1 + \exp(\rho))$ and $\epsilon \sim \mathcal{N}(0, 1)$. To estimate model uncertainty, we keep the same sampling strategy as employed during training (rather than using $w_i = \mu$). Third, an RNN that employs Variational Dropout (VD) (Gal and Ghahramani, 2016a). In variational dropout, the same dropout mask is applied at all time steps for one input. To use this method to measure uncertainty, we keep the same dropout probability at prediction time (Gal and Ghahramani, 2016a). Fourth, a deep ensemble of an LSTM base model (Lakshminarayanan et al., 2017). For ST-τ we compute the new hidden state using the soft version of the Gumbel softmax estimator (Jang et al., 2017). All models contain only a single LSTM layer, are implemented in TensorFlow (Abadi et al., 2015), and use the Adam (Kingma and Ba, 2015) optimizer with initial learning rate .
4.1 Deterministic & Probabilistic Automata Extraction
We aim to investigate the extent to which ST-τ can learn deterministic and probabilistic automata from sequences generated by regular and stochastic languages. Since the underlying languages are known and the data is generated using these languages, we can exactly assess whether ST-τ can recover these languages. We refer the reader to Appendix A for a definition of deterministic finite automata (DFA) and probabilistic automata (PA). The sets of languages recognized by DFAs and PAs are referred to as, respectively, regular and stochastic languages. For the extraction experiments we use the GRU cell as it does not have a cell state. This allows us to read the Markovian transition probabilities for each state-input symbol pair directly from the trained ST-τ.
Regular Languages. We conduct experiments on the regular language defined by Tomita grammar 3. This language consists of any string without an odd number of consecutive 0's after an odd number of consecutive 1's (Tomita, 1982). After initializing $\tau$, we train an ST-τ model with $k$ states to represent this grammar and then extract DFAs (see Appendix A.4 for the extraction algorithm). In principle, as long as $k$ is at least the number of classes, the model learns to select a finite set of (meaningful) states to represent a language, as shown in Appendix Figure 8. ST-τ is able to learn that the underlying language is deterministic, learning the temperature parameter accordingly, and the extraction produces the correct underlying DFA. The details are discussed in the appendix.
Stochastic Languages. We explore whether it is possible to recover a PA from an ST-τ model trained on the data generated by a given PA. While probabilistic deterministic finite automata (PDFAs) have been extracted previously (Weiss et al., 2019), to the best of our knowledge, this is the first work to directly learn a PA, which is more expressive than a PDFA (Denis and Esposito, 2004), by extraction from a trained RNN. We generate training data for two stochastic languages defined by the PAs shown in Figure 11 (a) and (c). Using this data, we train an ST-τ with a GRU and $k$ states, and we directly use the Gumbel softmax distribution to approximate the probability transitions of the underlying PA (see Appendix A.5 for more details and the extraction algorithm). Figures 11 (b, d) depict the extracted PAs. For both stochastic languages the extracted PAs indicate that ST-τ is able to learn the probabilistic dynamics of the ground-truth PAs.
| Model | BIH PE | BIH ECE | BIH MCE | IMDB PE | IMDB ECE | IMDB MCE |
|---|---|---|---|---|---|---|
| LSTM | 1.40 | 0.78 | 35.51 | 10.42 | 3.64 | 11.24 |
| Ensembles | 1.51 | 0.72 | 31.49 | 10.56 | 3.45 | 12.16 |
| BBB | 4.69 | 0.54 | 12.44 | 10.84 | 2.10 | 6.15 |
| VD | 1.51 | 0.80 | 24.71 | 10.56 | 3.41 | 14.08 |
| ST-τ | 2.12 | 0.45 | 23.11 | 10.95 | 0.89 | 3.70 |
| ST-τ | 2.11 | 0.40 | 21.73 | 11.16 | 3.38 | 9.09 |

Table 1: Predictive error (PE), expected calibration error (ECE) and maximum calibration error (MCE) on the BIH and IMDB datasets. The two ST-τ rows correspond to the two settings of the number of states $k$. ST-τ offers the best and most reliable trade-off between predictive error and calibration errors. Furthermore, it does not require more parameters, unlike BBB (double) or deep ensembles (an order of magnitude more), nor does a hyperparameter have to be tuned as in VD. Stochastic predictions are averaged across 10 independent runs and their variance is reported. Best and second best results are marked in bold and underlined (PE: bold models are significantly different). An ablation experiment with post-hoc temperature scaling is in Appendix B.3.1.
4.2 Model Calibration
We evaluate ST-τ's prediction and calibration quality on two classification tasks. The first task is heartbeat classification with 5 classes where we use the MIT-BIH arrhythmia dataset (Goldberger et al., 2000; Moody and Mark, 2001). It consists of half-hour excerpts of electrocardiogram (ECG) recordings. To preprocess the recordings we follow Kachuee et al. (2018) (Part III.A). The second task is sentiment analysis where natural language text is given as input and the problem is binary sentiment classification. We use the IMDB dataset (Maas et al., 2011), which contains reviews of movies that are to be classified as being positive or negative. For further dataset and hyperparameter details, please see Appendix B.2.
We compute the output of each model 10 times and report mean and variance. For the deep ensemble, we train 10 distinct LSTM models. For VD, we tuned the dropout rate, and for BBB we tuned its hyperparameters, choosing the best setup by the lowest predictive error achieved on validation data. For ST-τ, we evaluate both setting the number of states to the number of output classes ($k = 5$ for BIH and $k = 2$ for IMDB) and setting it to a fixed value of $k$. We initialize the temperature $\tau$ to a fixed value and use a dense layer before the softmax classification. For more details see Appendix, Table 3. With a perfectly calibrated model, the probability of the output equals the confidence of the model. However, many neural networks do not behave this way (Guo et al., 2017). To assess the extent to which a model is calibrated, we use reliability diagrams as well as Expected Calibration Error (ECE) and Maximum Calibration Error (MCE). Please see Appendix B.1 for details. For VD, the best dropout probability on the validation set is 0.05. Lower is better for all metrics. For PE, all models are marked bold if there is no significant difference to the best model.
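As a rough illustration of the two calibration metrics, the sketch below computes ECE and MCE with equal-width confidence bins, following the commonly used definitions from Guo et al. (2017). Whether Appendix B.1 uses exactly this binning is an assumption, and the function name, the default of 10 bins and the toy data are hypothetical.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) for predicted confidences and 0/1 correctness flags.

    Bin m collects predictions with confidence in [m / n_bins, (m + 1) / n_bins);
    a confidence of exactly 1.0 is placed in the last bin.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue  # empty bins do not contribute
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        gap = abs(accuracy - avg_conf)
        ece += len(members) / n * gap  # weighted average gap
        mce = max(mce, gap)            # worst-case gap
    return ece, mce


if __name__ == "__main__":
    # Toy predictions: four at 95% confidence (three correct) and
    # two at 65% confidence (one correct).
    confidences = [0.95, 0.95, 0.95, 0.95, 0.65, 0.65]
    correct = [1, 1, 1, 0, 1, 0]
    print(calibration_errors(confidences, correct))
```

For a stochastic model, the confidences passed in would typically be the maximum of the class probabilities averaged over the Monte-Carlo runs, which is again an assumption of this sketch.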
Results. The results are summarized in Table 1. For the BIH dataset, the vanilla LSTM achieves the smallest PE with a significant difference to all other models according to an approximate randomization test (Noreen, 1989). It cannot, however, measure uncertainty and suffers from higher ECE and MCE. Similarly, VD exhibits a large MCE. The situation is reversed for BBB, where we find a vastly higher PE but a lower MCE. In contrast, ST-τ achieves good results overall: PE is only slightly higher than for the best model (LSTM) while it achieves the lowest ECE and the second lowest MCE. The calibration plots of Figure 16 show that ST-τ is well-calibrated in comparison to the other models. For the IMDB dataset, ST-τ has a slightly higher PE than the best models, but has the lowest ECE and MCE, offering a good trade-off. The calibration plots of IMDB can be found in the Appendix, Figure 29. In addition to achieving consistently competitive results across all metrics, ST-τ has further advantages compared to the other methods. The deep ensemble multiplies the number of parameters by the number of model copies used. BBB requires the doubling of parameters and a carefully chosen prior, whereas ST-τ only requires a slight increase in the number of parameters compared to a vanilla LSTM. VD requires the tuning of the hyperparameter for the dropout probability, which leads to a trade-off between predictive and calibration errors (see Appendix B.3.2).
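| Method | Score | Customer (Out) Accuracy | Customer (Out) O-AUPR | Customer (Out) O-AUROC | Movie (Out) Accuracy | Movie (Out) O-AUPR | Movie (Out) O-AUROC |
|---|---|---|---|---|---|---|---|
| LSTM | max. prob. | 87.9 | 72.5 | 77.3 | 88.1 | 66.7 | 71.6 |
| VD 0.8 | max. prob. | 88.5 | 74.8 | 80.6 | 87.5 | 69.3 | 74.7 |
| BBB | max. prob. | 87.6 | 67.4 | 72.0 | 87.6 | 67.1 | 71.9 |
| ST-τ | max. prob. | 88.3 | 80.1 | 84.5 | 88.1 | 75.1 | 81.0 |
| VD 0.8 | variance | 88.5 | 67.8 | 76.5 | 87.5 | 63.8 | 71.8 |
| BBB | variance | 87.6 | 76.0 | 75.4 | 87.6 | 76.8 | 75.6 |
| ST-τ | variance | 86.5 | 78.9 | 82.8 | 85.9 | 74.0 | 78.7 |
| ST-τ | variance | 88.3 | 65.0 | 76.5 | 88.1 | 64.1 | 75.1 |
| Ensembles | max. prob. | 88.6 | 78.9 | 84.4 | 88.3 | 74.5 | 79.6 |
| Ensembles | variance | 88.6 | 79.7 | 84.0 | 88.3 | 75.8 | 79.9 |

Table 2: Results of the out-of-distribution (OOD) detection with IMDB as the in-distribution dataset, averaged over 10 runs for VD, BBB and ST-τ (ensembles are based on 10 models), using max-probability based scores (top), variance of max-probability based scores (middle) and ensembles (bottom). The two variance-based ST-τ rows correspond to the two settings of the number of states. ST-τ exhibits very competitive performance.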
4.3 Out-of-Distribution Detection
We explore the ability of ST-τ to estimate uncertainty by making it detect out-of-distribution (OOD) samples following prior work (Hendrycks and Gimpel, 2017). The in-distribution dataset is IMDB and we use two OOD datasets: the Customer Review test dataset (Hu and Liu, 2004) and the Movie Review test dataset (Pang et al., 2002), which consist of, respectively, 500 and 1,000 samples. As in Hendrycks and Gimpel (2017), we use the evaluation metrics AUROC and AUPR. Additionally, we report the accuracy on in-distribution samples. For VD we select the dropout rate and for BBB the best hyperparameter setting, based on best AUROC and AUPR. For ST-τ we evaluate two settings of the number of states $k$. Besides using the maximum probability of the softmax (MP) as a baseline (Hendrycks and Gimpel, 2017), we also consider the variance of the maximum probability (Var-MP) across runs. The number of in-domain samples from IMDB (Maas et al., 2011) is set to be the same as the number of out-of-domain samples. Hence, a random baseline should achieve 50% AUROC and AUPR.
Results. Table 2 lists the results. ST-τ and deep ensembles are the best methods in terms of OOD detection and achieve better results for both MP and Var-MP. The MP results for ST-τ are among the best, showing that the proposed method is able to estimate out-of-distribution uncertainty. We consider these results encouraging, especially considering that we only tuned the number of learnable finite states in ST-τ. Interestingly, a larger number of states improves the variance-based out-of-distribution detection of ST-τ. In summary, ST-τ is highly competitive in the OOD task.
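The two detection scores and the AUROC metric can be sketched as follows. Several details are assumptions made for illustration: OOD samples are treated as the positive class, the MP score is taken as one minus the mean maximum probability over runs, and the AUROC is computed by direct pairwise comparison, which is only practical for small sample sets. All function names and probabilities are hypothetical.

```python
def mp_score(prob_runs):
    """OOD score from the maximum softmax probability, averaged over runs.

    prob_runs is a list of class-probability vectors, one per stochastic run.
    Higher scores mean "more likely out-of-distribution".
    """
    maxes = [max(p) for p in prob_runs]
    return 1.0 - sum(maxes) / len(maxes)


def var_mp_score(prob_runs):
    """OOD score from the variance of the maximum probability across runs."""
    maxes = [max(p) for p in prob_runs]
    mean = sum(maxes) / len(maxes)
    return sum((m - mean) ** 2 for m in maxes) / len(maxes)


def auroc(ood_scores, in_scores):
    """Probability that a random OOD sample scores higher than a random
    in-distribution sample (ties count one half)."""
    wins = 0.0
    for s_out in ood_scores:
        for s_in in in_scores:
            if s_out > s_in:
                wins += 1.0
            elif s_out == s_in:
                wins += 0.5
    return wins / (len(ood_scores) * len(in_scores))


if __name__ == "__main__":
    # Two runs per sample, two classes (toy numbers).
    in_dist = [[[0.90, 0.10], [0.92, 0.08]], [[0.85, 0.15], [0.80, 0.20]]]
    ood = [[[0.60, 0.40], [0.40, 0.60]], [[0.55, 0.45], [0.70, 0.30]]]
    print("AUROC (MP):", auroc([mp_score(r) for r in ood],
                               [mp_score(r) for r in in_dist]))
    print("AUROC (Var-MP):", auroc([var_mp_score(r) for r in ood],
                                   [var_mp_score(r) for r in in_dist]))
```

With equally many in-distribution and OOD samples, scores that carry no information give an AUROC of about 0.5, matching the random baseline mentioned above.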
5 Discussion and Conclusion
We proposed ST-τ, a novel method to model uncertainty in recurrent neural networks. ST-τ achieves competitive results relative to other strong baselines (VD, BBB, Deep Ensembles), while circumventing some of their disadvantages, e.g., extensive hyperparameter tuning and a doubled number of parameters. ST-τ provides a novel mechanism to capture the uncertainty from (sequential) data over time steps. The key characteristic which distinguishes ST-τ from baseline methods is its ability to model discrete and stochastic state transitions using modern gradient estimators at each time step.
References
TensorFlow: large-scale machine learning on heterogeneous distributed systems. Cited by: §4.
Reinforcement learning with long short-term memory. In Advances in Neural Information Processing Systems (NIPS), Cited by: Appendix D.
Recurrent Kalman networks: factorized inference in high-dimensional deep feature spaces. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, CA, USA, Proceedings of Machine Learning Research, Vol. 97, pp. 544–552. Cited by: §3.
Weight Uncertainty in Neural Networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Cited by: §3, §4.
Alphanets: a recurrent ‘neural’ network architecture with a hidden Markov model interpretation. Speech Communication 9 (1), pp. 83–92. Cited by: §3.
OpenAI Gym. External Links: arXiv:1606.01540 Cited by: Appendix D.
Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §3.
Recurrent hidden semi-Markov model. In International Conference on Learning Representations (ICLR), Cited by: §3.
The comparison and evaluation of forecasters. The Statistician. Cited by: §B.1.
Learning classes of probabilistic automata. In International Conference on Computational Learning Theory, pp. 124–139. Cited by: §4.1.
Decomposition of uncertainty in Bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pp. 1184–1193. Cited by: §2.3.
Probabilistic recurrent state-space models. In International Conference on Machine Learning, pp. 1280–1289. Cited by: §3.
Bayesian Recurrent Neural Networks. CoRR abs/1704.02798. External Links: 1704.02798 Cited by: §3, §4.
A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pp. 3601–3610. Cited by: §3.
A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. In Advances in Neural Information Processing Systems (NIPS), Cited by: §3, §4.
Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of The 33rd International Conference on Machine Learning (PMLR), New York, New York, USA. Cited by: §1, §1, §3.
Uncertainty in Deep Learning. Ph.D. Thesis, University of Cambridge. Cited by: §1.
PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. E215–20. External Links: ISSN 0009-7322 Cited by: §4.2.
Co-evolving recurrent neurons learn deep memory POMDPs. In Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pp. 491–498. Cited by: Appendix D.
Statistical Theory of Extreme Values and Some Practical Applications. A Series of Lectures. Number 33. US Govt. Print. Office. Cited by: §1, §2.1, §2.1.
On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Cited by: Appendix B, §1, §1, §1, §2.1, §3, §4.2.
Learning latent dynamics for planning from pixels. In International Conference on Machine Learning, pp. 2555–2565. Cited by: §3.
A baseline for detecting misclassified and out-of-distribution examples in neural networks. ICLR. Cited by: §4.3.
Improving neural networks by preventing co-adaptation of feature detectors. CoRR abs/1207.0580. External Links: 1207.0580 Cited by: §3.
Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 168–177. Cited by: §4.3.
Aleatoric and epistemic uncertainty in machine learning: a tutorial introduction. arXiv preprint arXiv:1910.09457. Cited by: §2.3, §2.3.
Sampling-free uncertainty estimation in gated recurrent units with applications to normative modeling in neuroimaging. In Uncertainty in Artificial Intelligence, pp. 809–819. Cited by: §3.
Categorical Reparameterization with Gumbel-Softmax. In 5th International Conference on Learning Representations (ICLR), Cited by: Figure 39, §1, §2.1, §2.1, §2.2, §4.
To trust or not to trust a classifier. In Advances in neural information processing systems, pp. 5541–5552. Cited by: §1.
ECG Heartbeat Classification: A Deep Transferable Representation. In IEEE International Conference on Healthcare Informatics (ICHI), Cited by: §B.2, §4.2.
What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1, §3.
Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.
Increasing the interpretability of recurrent neural networks using hidden Markov models. arXiv preprint arXiv:1606.05320. Cited by: §3.
Points of significance: Importance of being uncertain. Nature Methods 10 (9), pp. 809–810. External Links: ISSN 1548-7091 Cited by: §3.
Accurate Uncertainties for Deep Learning Using Calibrated Regression. In Proceedings of the 35th International Conference on Machine Learning (ICML), Cited by: §3.
Calibrated structured prediction. In Advances in Neural Information Processing Systems, pp. 3474–3482. Cited by: §3.
Uncertainty quantification using Bayesian neural networks in classification: Application to ischemic stroke lesion segmentation. In International Conference on Medical Imaging with Deep Learning (MIDL), Cited by: §3.
Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1, §3, §4.
Bayesian learning and inference in recurrent switching linear dynamical systems. In Artificial Intelligence and Statistics, pp. 914–922. Cited by: §3.
Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL), Cited by: §4.2, §4.3.
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In 5th International Conference on Learning Representations (ICLR), Cited by: §2.1, §2.1.
Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning, pp. 4314–4323. Cited by: Appendix D.
The impact of the MIT-BIH Arrhythmia Database. IEEE Engineering in Medicine and Biology Magazine 20 (3), pp. 45–50. External Links: ISSN 0739-5175 Cited by: §4.2.
Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), Cited by: §B.1.
Predicting Good Probabilities with Supervised Learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML), Cited by: §B.1.
Computer Intensive Methods for Testing Hypotheses: An Introduction. Wiley, New York. Cited by: §4.2.
Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002), pp. 79–86. Cited by: §4.3.
Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548. Cited by: Figure 39, §1.
Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10 (3), pp. 61–74. Cited by: §3.
Probabilistic automata. Information and Control 6 (3), pp. 230–245. Cited by: §A.3, §3.
Reinforcement learning with recurrent neural networks. Cited by: Appendix D.
Knowledge Extraction and Recurrent Neural Networks: An Analysis of an Elman Network Trained on a Natural Language Learning Task. In Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning, Cited by: §A.1, §A.4.
Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. Cited by: §3.
Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: Figure 39, §1.
Dynamic construction of finite automata from examples using hill-climbing. In Proceedings of the Fourth Annual Conference of the Cognitive Science Society, pp. 105–108. Cited by: §4.1.
State-Regularized Recurrent Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Cited by: §A.1, §2.1, §2.1, §3.
An Empirical Evaluation of Rule Extraction from Recurrent Neural Networks. Neural Computation 30 (9), pp. 2568–2591. Cited by: §A.4, §3.
Extracting Automata from Recurrent Neural Networks Using Queries and Counterexamples. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research. Cited by: §3.
Learning Deterministic Weighted Automata with Queries and Counterexamples. In Advances in Neural Information Processing Systems 32, pp. 8558–8569. Cited by: §4.1.
Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 694–699. Cited by: §3.
Appendix A Deterministic & Probabilistic Automata Extraction
For the ST models we use an embedding layer and an LSTM layer (both with 100 hidden units) and a dense layer that accepts the last hidden state output and has two output neurons (accept, reject). The training objective is to minimize a cross-entropy loss for a binary classification problem (accept or reject).
A.1 Definition of Deterministic Finite Automata
A Deterministic Finite Automaton (DFA) is a 5-tuple $(\mathcal{Q}, \Sigma, \delta, q_0, F)$ consisting of a finite set of states $\mathcal{Q}$; a finite set of input tokens $\Sigma$ (called the input alphabet); a transition function $\delta: \mathcal{Q} \times \Sigma \rightarrow \mathcal{Q}$; a start state $q_0 \in \mathcal{Q}$; and a set of accept states $F \subseteq \mathcal{Q}$. Given a DFA and an input, it is possible to follow how an accept or reject state is reached. DFAs can be extracted from RNNs to offer insights into the workings of the RNN, making it more interpretable. SR-RNNs (Wang and Niepert, 2019) extract DFAs from RNNs by counting the number of transitions that have occurred between a state and its subsequent states, given a certain input (Schellhammer et al., 1998). However, this extraction method is deterministic and cannot give any uncertainty estimates for the extracted DFA. By adding stochasticity using the Gumbel softmax distribution, we can additionally offer uncertainty measures for the state transitions.
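The count-based extraction idea can be illustrated with a minimal Python sketch. It assumes that the observed transitions are already available as (state, symbol, next state) triples; the function names and the toy observations are hypothetical and are not taken from the cited methods.

```python
from collections import Counter, defaultdict


def extract_dfa_by_counts(transitions):
    """Build a deterministic transition table from observed transitions.

    `transitions` is an iterable of (state, symbol, next_state) triples.
    For every (state, symbol) pair, the most frequently observed next
    state is kept.
    """
    counts = defaultdict(Counter)
    for state, symbol, next_state in transitions:
        counts[(state, symbol)][next_state] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in counts.items()}


def accepts(delta, start_state, accept_states, tokens):
    """Run the DFA on `tokens`; an undefined transition rejects the input."""
    state = start_state
    for token in tokens:
        if (state, token) not in delta:
            return False
        state = delta[(state, token)]
    return state in accept_states


# Hypothetical observations, for illustration only.
observed = [(0, "1", 1), (0, "1", 1), (0, "1", 0), (0, "0", 0), (1, "0", 0), (1, "1", 1)]
delta = extract_dfa_by_counts(observed)
print(accepts(delta, start_state=0, accept_states={1}, tokens="11"))
```

Because only the majority next state is kept, the information that state 0 moved to state 0 in one of three observations for input "1" is discarded, which is the limitation the paragraph above points out.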
A.2 Extracting Automata for Regular Languages
With this experiment, we want to explore two features of ST. First, we want to understand how the temperature $\tau$ changes as training progresses (see Figure 20 (a)). At the beginning of training, $\tau$ first increases, allowing the model to explore the state transitions and select the states that will represent the corresponding grammar (in our example, the model selects 5 out of 10 states to represent Tomita 3, see Figure 20 (b, c)). Later, $\tau$ decreases and the transition uncertainty is calibrated to have tighter bounds, so that the transitions become more deterministic. Second, we want to see whether ST can model the uncertainty in transitions and adaptively learn to calibrate the state transition distributions. For this, we extract DFAs at two different iterations (see Figure 20 (b, c)). After 50k iterations, a correct DFA can be extracted. However, the transitions are not well calibrated. The ideal transition should be deterministic and have a transition probability close to 1. For example, at state 7, for input “1”, the model transitions to the correct state 9 only 53% of the time. In contrast, after 250k iterations, the transitions are well calibrated and all transitions are almost deterministic. At the same time, $\tau$ has decreased, indicating that the model has become more certain.
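The effect of the temperature on the concentration of Gumbel softmax samples can be checked numerically. The following is a minimal NumPy sketch, not the model's implementation: the logits, the sample count, and the helper names are assumptions chosen for illustration.

```python
import numpy as np


def gumbel_softmax_sample(logits, tau, rng):
    """Draw one relaxed one-hot sample from the Gumbel softmax distribution."""
    uniform = rng.uniform(low=1e-10, high=1.0, size=logits.shape)
    gumbel_noise = -np.log(-np.log(uniform))
    scaled = (logits + gumbel_noise) / tau
    scaled = scaled - scaled.max()  # subtract the maximum for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()


def mean_max_probability(logits, tau, n_samples=2000, seed=0):
    """Average of the largest entry per sample; values near 1 mean near one-hot samples."""
    rng = np.random.default_rng(seed)
    samples = [gumbel_softmax_sample(logits, tau, rng) for _ in range(n_samples)]
    return float(np.mean([sample.max() for sample in samples]))


logits = np.array([1.0, 0.5, 0.0])  # hypothetical scores for three states
for tau in (5.0, 1.0, 0.1):
    print(tau, round(mean_max_probability(logits, tau), 3))
```

A low temperature yields samples close to one-hot vectors, while a high temperature yields samples close to uniform, which matches the exploratory behaviour described for the early phase of training.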
A.3 Definition of Probabilistic Automata
The stochasticity in ST also allows us to extract Probabilistic Automata (PA) (Rabin, 1963), which have a stronger representational power than DFAs. A PA is defined as a tuple $(\mathcal{Q}, \Sigma, \delta, q_0, F)$ consisting of a finite set of states $\mathcal{Q}$; a finite set of input tokens $\Sigma$; a transition function $\delta: \mathcal{Q} \times \Sigma \rightarrow P(\mathcal{Q})$, where each transition to a particular state carries a transition probability and $P(\mathcal{Q})$ denotes the power set of $\mathcal{Q}$; a start state $q_0$; and a set of accept states $F \subseteq \mathcal{Q}$.
A.4 Extracting DFAs with Uncertainty Information
Let $p(s_j \mid s_i, x_t)$ be the probability of the transition to state $s_j$, given the current state $s_i$ and the current input symbol $x_t$ at a given time step $t$. We pass each training sample through the model and record the transition probability for each input symbol. The DFA extraction algorithms in (Schellhammer et al., 1998; Wang et al., 2018) are based on the count of transitions. In contrast, our extraction algorithm utilizes the transition probability, as described in Algorithm 1.
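As a rough illustration of using recorded transition probabilities instead of counts, the sketch below averages the recorded probability vectors for each (state, symbol) pair and reports the most probable next state together with its mean probability. This is an assumed simplification, not a reproduction of Algorithm 1; the record format and all names are hypothetical.

```python
from collections import defaultdict

import numpy as np


def extract_transitions_with_uncertainty(records, num_states):
    """Aggregate recorded transition probabilities per (state, symbol) pair.

    `records` is an iterable of (state, symbol, probs), where `probs` is the
    recorded probability vector over next states for that step.
    Returns {(state, symbol): (most_probable_next_state, mean_probability)}.
    """
    sums = defaultdict(lambda: np.zeros(num_states))
    visits = defaultdict(int)
    for state, symbol, probs in records:
        sums[(state, symbol)] += np.asarray(probs, dtype=float)
        visits[(state, symbol)] += 1
    table = {}
    for key, total in sums.items():
        mean_probs = total / visits[key]
        next_state = int(np.argmax(mean_probs))
        table[key] = (next_state, float(mean_probs[next_state]))
    return table


# Hypothetical recorded probabilities, for illustration only.
records = [
    (0, "1", [0.1, 0.9]),
    (0, "1", [0.3, 0.7]),
    (1, "0", [0.6, 0.4]),
]
table = extract_transitions_with_uncertainty(records, num_states=2)
print(table)
```

Keeping the mean probability next to each extracted transition is what allows statements such as "the model transitions to the correct state only 53% of the time".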
A.5 Extracting Probabilistic Automata
We first define two probabilistic automata, as shown in Figure 11(a)(c), to generate samples from them. (Footnote 1: We generate samples without combining identical samples. For instance, consider a sequence drawn from PA2 which has probability 0.7 to be rejected and probability 0.3 to be accepted. In this case, we generate 10 training samples with the sequence, 7 of which are labeled “0” (reject) and 3 of which are labeled “1” (accept) in expectation.) We generated 10,170 samples for stochastic language 1 (abbreviated SL1) with PA1 (Figure 11(a)) and sample length . For SL2 with PA2, we generated 20,460 samples with sample length . The learning procedure is described in Algorithm 2. We use the Gumbel softmax distribution in ST to approximate the probability distribution of the next state . To achieve this goal, we set the number of states to an even number and force the first half of the states to be “reject” states and the second half to be “accept” states. This ensures that the model represents “reject” and “accept” with the same number of states.
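The sample generation described in the footnote can be sketched as follows. The function name, the example sequence, and the fixed seed are assumptions for illustration; only the idea of drawing 10 labeled copies per sequence according to its acceptance probability comes from the text.

```python
import random


def generate_labeled_samples(sequence, accept_probability, n_copies=10, seed=0):
    """Create `n_copies` training samples for one sequence.

    Each copy is labeled 1 (accept) with probability `accept_probability`
    and 0 (reject) otherwise, so identical sequences are not merged.
    """
    rng = random.Random(seed)
    return [
        (sequence, 1 if rng.random() < accept_probability else 0)
        for _ in range(n_copies)
    ]


# Hypothetical sequence with acceptance probability 0.3.
samples = generate_labeled_samples("0110", accept_probability=0.3)
print(samples)
```

With an acceptance probability of 0.3, about 3 of the 10 copies are labeled "1" in expectation; an individual draw can deviate from that.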
In the main part of the paper we report results when setting ST to have states for SL2. Here, we additionally present the results for in Figure 24, which yields comparable results, showing that ST is not overly sensitive to the number of states. It is, however, helpful to have the same number of accept and reject states.
Appendix B Model Calibration
We address the problem of supervised multi-class sequence classification with recurrent neural networks. We follow the definitions and evaluation metrics of (Guo et al., 2017). Given an input $X$ and a ground-truth label $Y \in \{1, \dots, K\}$, a probabilistic classifier $h$ outputs $h(X) = (\hat{Y}, \hat{P})$, where $\hat{Y}$ and $\hat{P}$ represent the predicted class label and its confidence (the probability of correctness) over the $K$ classes.
Definition of Calibration A model is perfectly calibrated if the confidence estimate equals the true probability, that is, $\mathbb{P}(\hat{Y} = Y \mid \hat{P} = p) = p$, $\forall p \in [0, 1]$.
B.1 Evaluation of Calibration
Reliability Diagrams (DeGroot and Fienberg, 1983; Niculescu-Mizil and Caruana, 2005) visualise whether a model is over- or underconfident by grouping predictions into bins according to their prediction probability. The predictions are grouped into $M$ interval bins (each of size $1/M$) and the accuracy of the samples with respect to the ground-truth label in each bin $B_m$ is computed as:
$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbf{1}(\hat{y}_i = y_i), \tag{4}$$
where $i$ indexes all examples that fall into bin $B_m$. Let $\hat{p}_i$ be the probability for sample $i$, then the average confidence is defined as
$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i. \tag{5}$$
A model is perfectly calibrated if $\mathrm{acc}(B_m) = \mathrm{conf}(B_m)$ for all $m$; in a diagram, the bins would then follow the identity function. Any deviation from this represents miscalibration.
Based on the accuracy and confidence measures, two calibration error metrics have been introduced (Naeini et al., 2015).
Expected Calibration Error (ECE). Besides the reliability diagrams, ECE is a convenient scalar summary statistic of calibration. It computes the difference between model accuracy and confidence as a weighted average across bins,
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|, \tag{6}$$
where $n$ is the total number of samples.
Maximum Calibration Error (MCE) is particularly important in high-risk applications where reliable confidence measures are absolutely necessary. It measures the worst-case deviation between accuracy and confidence,
$$\mathrm{MCE} = \max_{m \in \{1, \dots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|. \tag{7}$$
For a perfectly calibrated classifier, ECE and MCE are both equal to 0.
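The bin-wise accuracy, confidence, ECE, and MCE described above can be computed with a few lines of NumPy. This is a minimal sketch under the assumption of equal-width bins that are open on the left and closed on the right; the function name and the toy inputs are illustrative.

```python
import numpy as np


def calibration_errors(confidences, predictions, labels, num_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins (lower, upper]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    n = len(confidences)
    ece, mce = 0.0, 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lower) & (confidences <= upper)
        if not in_bin.any():
            continue  # empty bins contribute nothing
        acc = correct[in_bin].mean()       # bin accuracy
        conf = confidences[in_bin].mean()  # bin average confidence
        gap = abs(acc - conf)
        ece += in_bin.sum() / n * gap      # weight by the fraction of samples in the bin
        mce = max(mce, gap)
    return float(ece), float(mce)


# Toy example: two overconfident and two underconfident predictions.
ece, mce = calibration_errors(
    confidences=[0.95, 0.95, 0.55, 0.55],
    predictions=[1, 1, 0, 0],
    labels=[1, 0, 0, 0],
)
print(ece, mce)
```

In the toy example the high-confidence bin has accuracy 0.5 at confidence 0.95 and the low-confidence bin has accuracy 1.0 at confidence 0.55, so both gaps are 0.45 and ECE and MCE coincide.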
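Table 2: Overview of the datasets (split sizes in thousands of samples).
Dataset | IMDB | BIH
Train (k) | 23 | 78
Validation (k) | 2 | 8
Test (k) | 25 | 21
Max Length | 2,956 | 187
# Classes | 2 | 5
Type | Language | ECG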
Table 3: Hyperparameters used in the experiments.
Hyperparameters | IMDB | BIH
Hidden dim. | 256 | 128
Learning rate | 0.001 | 0.001
Batch size | 8 | 256
Validation rate | 1 | 1
Maximum validations | 20 | 50
ST # states | 2 | 5
BBB | 0.0 | 0.01
BBB | 3 | 3
VD Prob. | 0.1 | 0.05
Table 4: Results with temperature scaling.
Model | PE | ECE | MCE
BIH
LSTM | 1.40 | 0.30 | 13.02
Ensemble | 1.51 | 0.22 | 21.90
BBB | 4.69 | 0.36 | 10.43
VD 0.05 | 1.51 | 0.27 | 23.60
ST | 2.12 | 0.45 | 15.38
IMDB
LSTM | 10.42 | 1.19 | 5.82
Ensemble | 10.56 | 1.26 | 5.87
BBB | 10.84 | 0.63 | 2.69
VD 0.1 | 10.56 | 2.34 | 10.86
ST | 10.95 | 1.00 | 3.84

B.2 Dataset and Hyperparameter Details
For the experiments of Section 4.2, we provide an overview of the datasets in Table 2 and give details on the hyperparameters used in the experiments in Table 3. On BIH, we use the training/test split of (Kachuee et al., 2018); however, we additionally split off 10% of the training data to use as a validation set. On IMDB, the original dataset consists of 25k training and 25k test samples; we split off 2k samples from the training set as a validation set. The word dictionary size is limited to 5,000.
B.3 Additional Experiments
Here we report three groups of additional or ablation experiments: (1) all baseline methods and ST with post-training temperature scaling applied directly; (2) the trade-off between predictive and calibration performance for different dropout rates in variational dropout (VD); (3) additional calibration plots for the IMDB dataset.
B.3.1 Experiments with Temperature Scaling
For the classification calibration experiments, post-hoc temperature scaling can also be used to calibrate models. Note, however, that post-hoc temperature scaling cannot be used when a model needs to be calibrated during training, for example in tasks such as DFA or PA extraction or in reinforcement learning.
Table 4 reports the results with temperature scaling, where the temperature is tuned on the validation set. Among the models that can estimate uncertainty (Ensemble, BBB, VD), ST is very competitive, for example, the second best on both the BIH and the IMDB dataset. While LSTM can achieve better predictive performance and sometimes better calibration performance, LSTM is not able to provide any uncertainty information.
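Post-hoc temperature scaling divides the logits of a trained model by a single scalar before the softmax and picks that scalar on held-out data. The sketch below assumes a simple grid search that minimizes the negative log-likelihood on the validation set; the grid, the function names, and the toy logits are assumptions, not the experimental setup of the paper.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def negative_log_likelihood(logits, labels, temperature):
    """Mean NLL of the true labels after dividing the logits by `temperature`."""
    probs = softmax(logits / temperature)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))


def tune_temperature(val_logits, val_labels, candidates=None):
    """Pick the candidate temperature with the lowest validation NLL."""
    if candidates is None:
        candidates = np.linspace(0.5, 5.0, 46)  # assumed grid, step 0.1
    losses = [negative_log_likelihood(val_logits, val_labels, t) for t in candidates]
    return float(candidates[int(np.argmin(losses))])


# Toy validation set: very confident logits, but only 3 of 4 predictions are correct.
val_logits = np.array([[4.0, 0.0], [4.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
val_labels = np.array([0, 0, 1, 1])
best_t = tune_temperature(val_logits, val_labels)
print(best_t)
```

Dividing by a positive temperature never changes the predicted class, so the predictive error is unaffected and only the confidence is rescaled. In the toy example the selected temperature is larger than 1, which softens the overconfident predictions.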
B.3.2 Experiments with Different Dropout Rates
To gain a better understanding of how crucial the dropout hyperparameter of VD is, we investigate the effect of the dropout probability during prediction with VD. We perform an experiment where we vary the dropout probability in the range of with increments of . In Figures 32 (a, b) we plot the results for BIH and IMDB, respectively, reporting PE, ECE and MCE for the various VD settings as well as for ST.
On both datasets, for VD, the point of lowest PE does not necessarily coincide with the points of lowest MCE and/or ECE. For example, on BIH, VD achieves the lowest PE when dropout is switched off (0.0), but then uncertainty cannot be measured. On the other hand, choosing the next lowest PE results in a high MCE. In contrast, ST directly achieves good results for all three metrics. Similarly, on IMDB, the point where VD has the lowest PE is also the point where it has the highest MCE. In conclusion, VD requires careful tuning, which involves choosing a trade-off between the different metrics, while ST achieves good results directly, without any tuning.
B.3.3 Calibration Plots for the IMDB Dataset
Figure 29 shows the calibration plots for IMDB. For the binary classification task, BBB achieves the best calibration performance and VD achieves the best predictive performance. It should be noted that ST achieves the best trade-off between predictive and calibration performance without doubling the number of parameters (as in BBB) and without tuning the dropout rate (as in VD).
Appendix C Regression
Calibration plays an important role in probabilistic forecasting. We consider a time-series forecasting regression task using the individual household electric power consumption dataset (http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption). The data is preprocessed following https://www.kaggle.com/amirrezaeian/timeseriesdataanalysisusinglstmtutorial. The goal of the task is to predict the global active power at the current time $t$ given the measurement and other features of the previous time step. The data are resampled to hourly time steps (the original data are given in minutes), which leads to 34,588 samples. We split them 25,000/2,000/7,588 for training/validation/test. One LSTM layer with 100 hidden units is used for the baselines (LSTM, BBB and VD) and for ST (the number of states is set to 10). The evaluation metric is the mean squared error (MSE), and at test time we use the model with the lowest MSE on the validation set.
Figure 33 presents the performance. LSTM, VD and ST perform very well at this task, achieving an MSE close to zero. BBB performs worse than the other methods. For uncertainty estimation, BBB, VD and ST are able to provide uncertainty information alongside the predictive score. BBB gives high uncertainty, while VD and ST are more confident in their predictions, with ST offering the tightest uncertainty bounds.
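Uncertainty bounds of this kind are typically obtained by running a stochastic model several times on the same input and summarizing the outputs. The sketch below assumes a generic callable with stochastic outputs; the toy model, the noise level, and the number of runs are hypothetical stand-ins and not the trained networks from the experiments.

```python
import numpy as np


def predictive_mean_and_std(stochastic_predict, inputs, num_runs=10):
    """Run a stochastic predictor `num_runs` times and summarize the outputs."""
    runs = np.stack([stochastic_predict(inputs) for _ in range(num_runs)])
    return runs.mean(axis=0), runs.std(axis=0)


rng = np.random.default_rng(0)


def toy_stochastic_model(x):
    """Hypothetical stand-in for a model whose forward pass involves sampling."""
    return 2.0 * x + rng.normal(0.0, 0.1, size=x.shape)


inputs = np.array([0.0, 0.5, 1.0])
mean, std = predictive_mean_and_std(toy_stochastic_model, inputs, num_runs=50)
print(mean, std)
```

The mean serves as the prediction and the standard deviation as the width of the uncertainty band around it; a tighter band corresponds to a more confident model.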
Appendix D Reinforcement Learning
We explore the ability of ST to learn the exploration-exploitation trade-off in reinforcement learning. To demonstrate this empirically, we evaluate ST in the control environment cartpole (Figure D) of OpenAI Gym (Brockman et al., 2016).
The objective of the task is to train an agent to balance a pole for 1 second (which equals 50 time steps) given environment observations . To keep the pole in balance, the agent has to move the pole either left or right, that is, the possible actions are left and right. If the chosen action keeps the pole in balance, the agent receives a reward of and continues; otherwise a reward of is given and the run stops.
The environment can be formulated as a reinforcement learning problem $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, where $\mathcal{S}$ is a set of states. Each state consists of $x$: cart position, $\dot{x}$: cart velocity, $\theta$: angle position, and $\dot{\theta}$: angle velocity. $\mathcal{A}$ is the set of actions, $\mathcal{P}$ the state transition probability, and $\mathcal{R}$ the reward. We consider two different setups. In the first setup, the cartpole environment is a fully observable Markov Decision Process (MDP setup), where the agent has full access to the state, that is, the observation is $o = (x, \dot{x}, \theta, \dot{\theta})$. In the second and more difficult setup (POMDP), the environment is partially observable, where $o = (x, \theta)$: the agent cannot observe the cart velocity and the angular velocity. It has to learn to approximate them using its recurrent connections, and thus needs to retain long-term history (Bakker, 2002; Gomez and Schmidhuber, 2005; Schäfer, 2008). The various RNN-based models are trained to output a distribution over actions at each time step $t$, that is, $\pi(a_t \mid o_{1:t}) \in \Pi$, where $\Pi$ is the set of policies. For selecting the next action, we consider two policies: (1) sampling: $a_t \sim \pi(a_t \mid o_{1:t})$, and (2) greedy: $a_t = \arg\max_{a} \pi(a \mid o_{1:t})$. For all baselines, we use one LSTM layer and one dense layer with softmax to return the distribution over actions. Each layer has 100 hidden units. For VD we tuned the dropout rate and for BBB and . For ST we simply set the initial temperature value to , the number of possible states to the number of actions (2), and the next action is directly selected based on the Gumbel softmax distribution over states.
Results are presented in Figure 38. An important criterion for evaluating RL agents is the sample complexity (Malik et al., 2019), that is, the number of interactions between agent and environment before a sufficiently good reward is obtained. For both environment setups and both policy types, ST achieves a higher averaged cumulative reward at a lower sample complexity. Moving from the sampling to the greedy policy, the performance of ST drops slightly. This is easily explained by the inherent sampling process due to the Gumbel softmax. Interestingly, it seems to be this sampling process that allows ST to exhibit a lower sample complexity in the sampling setups compared to LSTM and VD. In contrast, LSTM and VD show poor performance for the greedy policy setups because they can no longer explore by sampling from the softmax. BBB consistently performs worse than the other methods, and we conjecture that this is due to the much larger number of parameters of this model, leading to a worse sample complexity. Moving from the MDP to the POMDP, the average cumulative reward naturally drops, as the agents receive less information, but ST again exhibits the best performance for both policy types.
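The difference between the sampling and the greedy policy can be made concrete with a small sketch. It assumes that the model has already produced a probability vector over the two actions; the function name and the example probabilities are hypothetical.

```python
import numpy as np


def select_action(action_probs, policy, rng=None):
    """Choose an action index from a distribution over actions.

    policy="greedy" takes the most probable action;
    policy="sampling" draws an action according to the distribution.
    """
    action_probs = np.asarray(action_probs, dtype=float)
    if policy == "greedy":
        return int(np.argmax(action_probs))
    if policy == "sampling":
        if rng is None:
            rng = np.random.default_rng()
        return int(rng.choice(len(action_probs), p=action_probs))
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical distribution over the actions (left, right).
probs = [0.4, 0.6]
rng = np.random.default_rng(0)
sampled = [select_action(probs, "sampling", rng) for _ in range(5000)]
print(select_action(probs, "greedy"), sum(sampled) / len(sampled))
```

The greedy policy always returns the same action for a given distribution, whereas the sampling policy still visits the less probable action, which is the exploration effect discussed above.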
Appendix E Ablation Study on Out-of-Distribution Detection
Figure 39 depicts the results of an ablation study focusing on the number of states and on whether or not the temperature was learned. The results are for the two IMDB out-of-distribution (OOD) detection tasks from Section 4.3. The results indicate that a smaller number of states is sufficient to capture the ST model’s uncertainty on out-of-distribution data. ST shows the best results especially when the temperature parameter is learned during training (the green, solid line). Increasing the number of states of ST gives the model more capacity to model uncertainty. For 50 states, fixing the temperature to a constant value works better but does not reach the accuracy of ST models with fewer states and a learned temperature.