Bayesian LSTMs in medicine

06/05/2017 ∙ by Jos van der Westhuizen, et al. ∙ University of Cambridge

The medical field stands to see significant benefits from the recent advances in deep learning. Knowing the uncertainty in the decision made by any machine learning algorithm is of utmost importance for medical practitioners. This study demonstrates the utility of using Bayesian LSTMs for classification of medical time series. Four medical time series datasets are used to show the accuracy improvement Bayesian LSTMs provide over standard LSTMs. Moreover, we show cherry-picked examples of confident and uncertain classifications of the medical time series. With simple modifications of the common practice for deep learning, significant improvements can be made for the medical practitioner and patient.




1 Introduction

Life and death decisions are commonplace in the medical domain. When making medical decisions, doctors mostly evaluate multiple parameters and make decisions based on a complex mixture of intuition and assumptions. Machine learning has demonstrated groundbreaking performance in recent studies (Krizhevsky et al., 2012; Mnih et al., 2015; Silver et al., 2016; Goodfellow et al., 2014) and shows promise as an augmentation to aid doctors in day-to-day care (Clifton et al., 2015; Lipton et al., 2015; Zhang, 2016). One of the most promising current techniques is deep learning. For the specific class of temporal data that is ubiquitous in medicine, the branch of deep neural networks called Recurrent Neural Networks (RNNs) has yielded some of the best results (Lipton et al., 2015; Choi et al., 2015; Jagannatha and Yu, 2016; Harutyunyan et al., 2017).

Although RNNs and other temporal models have shown much promise in analyzing sequential medical data, these models do not provide practitioners with a certainty measure for their decisions. Doctors therefore have no quantitative measure of the weight they should place on the decisions made by their computational assistants. Clinicians typically determine the course of treatment given the current health status of the patient as well as some internal estimate of the outcome of possible future treatments. The effect of treatments for a given patient is non-deterministic (uncertain), and predicting the effect of a series of treatments over time compounds the uncertainty (Bennett and Hauser, 2013). Uncertainty in medical decisions is therefore of paramount importance.

Bayesian probability theory offers a mathematically grounded technique to reason about model uncertainty (Gal and Ghahramani, 2016). However, these Bayesian techniques are often accompanied by a prohibitive computational cost. Previous research has explored the benefits of Bayesian techniques in medicine (Temko et al., 2011; Kononenko, 2001; Meyfroidt et al., 2009; Murphy, 2012; Mani et al., 2014; Ghassemi et al., 2015; Guiza Grandas et al., 2006). However, these proposals do not harness the representative power exhibited by deep learning (Ongenae et al., 2013). Our work follows that of Gal (2015) to show that deep learning tools can be used as Bayesian models without changing either the models or the optimization.

But do we not get confidence measures from the probabilities produced by the softmax function at the end of most neural networks? The probabilities obtained from Bayesian approaches are significantly different from the "probabilities" obtained from the softmax classifier (Kendall et al., 2015). The softmax function provides estimates of the relative probabilities between classes, but not an overall measure of the model's uncertainty (Gal and Ghahramani, 2016).

Our work demonstrates two key benefits of employing Bayesian deep learning: (i) an increase in the classification accuracy of medical signals, and (ii) a measure of confidence in the model decisions. Although conventional Bayesian approaches are computationally expensive, the implementation proposed here would enable online classification in a clinical setting.

2 Related work

Lipton et al. (2015) made use of LSTMs to diagnose patients with 128 different codes (one code for each medical condition). Similarly, Choi et al. (2015) made use of gated recurrent units to predict medication and diagnosis codes. Both of these studies demonstrate the efficacy of LSTMs for sequential medical data, albeit on low-resolution (<0.0003 Hz) signals.

Bayesian Neural Networks (NNs) are a class of NNs that are able to model uncertainty (Denker and Lecun, 1990; MacKay, 1992). These models provide a variance (uncertainty) for their predictions by learning distributions over the weights. They are often computationally expensive, increasing the number of model parameters without significantly increasing model capacity (Kendall et al., 2015). Conventional Bayesian NNs mostly employ variational inference to approximate the posterior (Graves, 2011).

Dropout is a regularization technique commonly used in NNs to prevent overfitting and co-adaptation of features (Srivastava et al., 2014). The technique entails removing a random subset of units within a network during each iteration of stochastic gradient descent. The standard approach is to rescale the weights at test time by multiplying the learned weights by the probability of the weights being present during training, known as weight averaging.

Rather conveniently, dropout can be used as approximate Bayesian inference over the weights of a network (Gal and Ghahramani, 2015), mitigating the computational complexity of Bayesian NNs. This is achieved by sampling from the network with random units removed at test time. Thus the NN does not require any additional parameters, and a Bernoulli distribution is imposed over the weights. The samples can be considered Monte Carlo samples obtained from the posterior distribution over models, giving rise to the name Monte Carlo (MC) dropout. Using RNNs with MC dropout has seen success in Gal (2015) and in Sennrich et al. (2016).
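As a minimal sketch of MC dropout (NumPy, with hypothetical layer sizes and randomly initialized weights standing in for a trained model), the Bernoulli masks stay active at test time, and the spread of the stochastic forward passes serves as the uncertainty estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trained weights for a single dense softmax layer (illustration only).
W = rng.standard_normal((16, 3)) * 0.5

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def stochastic_forward(x, keep_prob=0.8):
    # Dropout stays ON at test time: sample a fresh Bernoulli mask over units.
    mask = rng.binomial(1, keep_prob, size=x.shape) / keep_prob
    return softmax((x * mask) @ W)

x = rng.standard_normal(16)
samples = np.stack([stochastic_forward(x) for _ in range(100)])  # 100 MC samples
mean_probs = samples.mean(axis=0)  # MC-dropout class probabilities
uncertainty = samples.std(axis=0)  # spread across samples = model uncertainty
```

Averaging the samples approximates the predictive distribution; a per-class standard deviation near zero indicates a confident classification.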

Long Short-Term Memory (LSTM) RNNs are easier to train and perform better than standard RNNs (Hochreiter and Schmidhuber, 1997). Here we aim to demonstrate the efficacy of Bayesian LSTMs in medicine to improve accuracy and decrease the uncertainty in the final decisions that doctors make. Fortunato et al. (2017) proposed a technique for obtaining uncertainty estimates using an adaptation of Bayes by Backprop (Graves, 2011). Although that technique yields accuracies superior to those of the technique in Gal (2015), we choose to employ the techniques proposed by the latter, which require a smaller adaptation of commonly used practice.

The Physionet/Computing in Cardiology 2016 Challenge provides an appropriate dataset for benchmarking the performance of LSTMs (Liu et al., 2016; Clifford et al., 2016). This comprehensive dataset was recently collected, is multi-center, and has multiple reported performance scores. The dataset comprises 4,430 heart sound recordings lasting from several seconds to over 100 s, sampled at 2 kHz. The data have long- and short-term features paramount for classification of the signal. Moreover, as detailed in Springer et al. (2016), accurate classification of these signals is vital in developing communities. Among the top performing techniques in the official challenge were convolutional NNs, an ensemble of support vector machines, regularized NNs, and random forests.

Harutyunyan et al. (2017) proposed an easy-to-use benchmark system for medical data that is based on the Medical Information Mart for Intensive Care (MIMIC-III). The benchmark includes four different medical tasks based on low-resolution data. However, because more information is available in medical signals collected at higher resolutions, we feel it is important to also benchmark temporal models on the latter.

3 Methods

The LSTM implemented is based on the model described in Hochreiter and Schmidhuber (1997) and implemented in Tensorflow (Abadi et al., 2015). Each cell in the LSTM has input, output, and forget gates, $i$, $o$, and $f$, and an input modulation gate $g$:

$$i = \sigma(x_t W_i + h_{t-1} U_i + b_i), \qquad f = \sigma(x_t W_f + h_{t-1} U_f + b_f),$$
$$o = \sigma(x_t W_o + h_{t-1} U_o + b_o), \qquad g = \tanh(x_t W_g + h_{t-1} U_g + b_g),$$
$$c_t = f \circ c_{t-1} + i \circ g, \qquad h_t = o \circ \tanh(c_t). \tag{1}$$

The internal state $c_t$ is referred to as the cell and is updated additively. The non-linear sigmoid activation is represented by $\sigma$, and the $W$ and $U$ matrices are the input and hidden weight matrices respectively, with biases $b$. We re-parameterize the model to have a single weight matrix $\mathbf{W}_l$ for layer $l$. For a specific layer, the input to each gate's non-linearity is then computed by the single matrix multiplication

$$[x_t, h_{t-1}] \, \mathbf{W}_l, \tag{2}$$

with the resulting vector partitioned into the sum terms for input to the non-linearities in Equation 1. This results in a single distribution being placed over one weight matrix when applying dropout. The implication of the single weight matrix is a faster forward-pass with slightly diminished results (Gal, 2016).
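A sketch of one LSTM step under this single-matrix re-parameterization (NumPy; the sizes and the gate ordering within the matrix are hypothetical conventions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step using a single weight matrix W of shape (D + K, 4K)."""
    K = h_prev.shape[0]
    z = np.concatenate([x_t, h_prev]) @ W + b  # single matrix multiplication
    i = sigmoid(z[0 * K:1 * K])  # input gate
    f = sigmoid(z[1 * K:2 * K])  # forget gate
    o = sigmoid(z[2 * K:3 * K])  # output gate
    g = np.tanh(z[3 * K:4 * K])  # input modulation gate
    c_t = f * c_prev + i * g     # additive cell update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(1)
D, K = 8, 4                      # hypothetical input and hidden sizes
W = rng.standard_normal((D + K, 4 * K)) * 0.1
b = np.zeros(4 * K)
h, c = np.zeros(K), np.zeros(K)
for t in range(5):               # run a short sequence
    h, c = lstm_step(rng.standard_normal(D), h, c, W, b)
```

The single product is partitioned into the four gate pre-activations, so one dropout distribution over `W` covers all gates at once.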

3.1 Bayesian LSTM

We perform approximate inference in a Bayesian LSTM (Gal and Ghahramani, 2016) by using dropout (Srivastava et al., 2014); dropout can then be considered a way of obtaining samples from the posterior distribution over models. This technique is linked to variational inference in a Bayesian NN with Bernoulli distributions over the network's weights (Gal and Ghahramani, 2016). We leverage this method to perform Bayesian inference with LSTMs.

We are interested in finding the posterior distribution of the LSTM weights, $\omega$, given the observed labels $\mathbf{Y}$ and data $\mathbf{X}$:

$$p(\omega \mid \mathbf{X}, \mathbf{Y}). \tag{3}$$

This posterior distribution is not tractable in general, and we use variational inference to approximate it (Kendall et al., 2015; Gal and Ghahramani, 2016; Denker and Lecun, 1990; Graves, 2011). This allows us to learn a distribution $q(\omega)$ over the network's weights by minimizing the reverse Kullback-Leibler (KL) divergence between this approximating distribution and the full posterior,

$$\mathrm{KL}\left(q(\omega) \,\|\, p(\omega \mid \mathbf{X}, \mathbf{Y})\right), \tag{4}$$

where $q(\omega)$ is a distribution over matrices whose columns are randomly set to zero. For the LSTM, these matrices $\mathbf{W}_l$ (Equation 2) are all the weights of a single layer. $q(\omega)$ can be defined as:

$$\mathbf{W}_l = \mathbf{M}_l \cdot \mathrm{diag}\left([z_{l,j}]_{j=1}^{K_l}\right), \qquad z_{l,j} \sim \mathrm{Bernoulli}(1 - p_l), \tag{5}$$

given some probabilities $p_l$ and matrices $\mathbf{M}_l$ as variational parameters. The binary variable $z_{l,j} = 0$ corresponds to the output of unit $j$ in layer $l$ being dropped. Note that we can left multiply the matrices by a similar diagonal matrix in Equation 5 to apply dropout over the rows (unit inputs).
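Sampling a weight matrix from this approximating distribution amounts to right-multiplying the variational parameter matrix by a diagonal Bernoulli matrix, zeroing whole columns (a NumPy sketch with hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(2)

M = rng.standard_normal((6, 4))       # variational parameter matrix M_l
p = 0.25                              # dropout probability

z = rng.binomial(1, 1.0 - p, size=4)  # z_j = 0 drops column j
W_hat = M @ np.diag(z)                # sampled weight matrix: whole columns zeroed

# Left-multiplying by a similar diagonal mask drops rows (unit inputs) instead.
row_mask = rng.binomial(1, 1.0 - p, size=6)
W_rows = np.diag(row_mask) @ M
```

Each surviving column is exactly the corresponding column of the variational parameter matrix, which is why the variational parameters coincide with the usual LSTM weights.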

Given the LSTM definitions in Equation 1, we can re-write the operation (omitting biases for brevity) as a function $f_h$:

$$h_t = f_h(x_t, h_{t-1}) = o \circ \tanh(c_t), \tag{6}$$

where $c_{t-1}$ is the hidden unit memory from the previous time step and $h_t$ is determined by a recursive function on $h_{t-1}$. The output can be defined as $f_y(h_T) = h_T \mathbf{W}_y + b_y$. This LSTM can be viewed as a probabilistic model by regarding the weights, $\omega = \{\mathbf{W}_l\} \cup \{\mathbf{W}_y\}$, as random variables (following normal prior distributions). The functions are written as $f_h^{\omega}$ and $f_y^{\omega}$ to emphasize the dependence on $\omega$. Approximating the posterior distribution with $q(\omega)$, we have for each sum term of the log-likelihood:

$$\int q(\omega) \, \log p\left(\mathbf{y} \mid f_y^{\omega}(h_T)\right) \mathrm{d}\omega,$$

with $h_T = f_h^{\omega}(x_T, h_{T-1})$. We approximate this via MC integration with a single sample $\hat{\omega} \sim q(\omega)$:

$$\log p\left(\mathbf{y} \mid f_y^{\hat{\omega}}(h_T)\right),$$

resulting in an unbiased estimator of each sum term. Our minimization objective then becomes:

$$\mathcal{L} = -\sum_{i=1}^{N} \log p\left(\mathbf{y}_i \mid f_y^{\hat{\omega}_i}(h_T^i)\right) + \mathrm{KL}\left(q(\omega) \,\|\, p(\omega)\right). \tag{7}$$
From Equation 5, we define our approximating distribution to factorize over the weight matrices and their columns in $\omega$ (Gal, 2015). For each layer $l$ and every weight matrix column $\mathbf{w}_{l,k}$, the approximating distribution is:

$$q(\mathbf{w}_{l,k}) = p_l \, \mathcal{N}(\mathbf{w}_{l,k}; 0, \sigma^2 I) + (1 - p_l) \, \mathcal{N}(\mathbf{w}_{l,k}; \mathbf{m}_{l,k}, \sigma^2 I), \tag{8}$$

with variational parameter $\mathbf{m}_{l,k}$ (a column vector), small $\sigma^2$, and the dropout probability $p_l$ provided in advance. We optimize over the variational parameters of the random weight matrices; these correspond to the LSTM weight matrices in the standard view. The KL term in Equation 7 can be approximated as an $L_2$ regularization term over the variational parameters, summing over each weight matrix in our model (each composed of weight vectors $\mathbf{m}_{l,k}$) (Gal and Ghahramani, 2016).

Evaluating the model output $f_y^{\hat{\omega}}(\cdot)$ with a sample $\hat{\omega} \sim q(\omega)$ corresponds to randomly zeroing (masking) columns in each weight matrix during the forward pass, i.e., performing dropout. Further, our objective is identical to that of the standard LSTM. In the LSTM setting with a sequence input, each weight matrix row is randomly masked.

Predictions can be approximated using the standard forward pass for LSTMs, i.e., propagating the mean of each layer to the next (the standard dropout approximation), or by approximating the predictive posterior for a new input $\mathbf{x}^*$,

$$p(\mathbf{y}^* \mid \mathbf{x}^*) \approx \frac{1}{T} \sum_{t=1}^{T} p\left(\mathbf{y}^* \mid f_y^{\hat{\omega}_t}(h_T^*)\right),$$

with $\hat{\omega}_t \sim q(\omega)$, i.e., by performing dropout at test time and averaging the results (MC dropout).

Gal (2015) emphasizes that for each sequence a single realization $\hat{\omega} \sim q(\omega)$ is sampled, and that every element in the sequence is passed through a function $f_h^{\hat{\omega}}$ with the same parameters $\hat{\omega}$. This is referred to as variational dropout. Intuitively, having the same dropout mask for every sequence element makes sense from both a recurrent and a Monte Carlo integration perspective. However, empirically we found that naive dropout, with a different sample $\hat{\omega}_t$ at each time step, still improves the classification performance when using MC dropout compared to the standard dropout approximation.

When sampling a different $\hat{\omega}_t$ for each recursion (i.e., each time step in $\mathbf{x}$) in Equation 7, the function $f_h$ is no longer strictly recursive: at each level of the recursion a different function $f_h^{\hat{\omega}_t}$ is applied to the corresponding element of $\mathbf{x}$. However, if an optimum is reached during training, each sample would produce a similar function $f_h$, making Equation 7 approximately recursive. With naive dropout the minimization objective becomes

$$\mathcal{L}_{\text{naive}} = -\sum_{i=1}^{N} \log p\left(\mathbf{y}_i \mid f_y^{\hat{\omega}_i}(h_T^i)\right) + \mathrm{KL}\left(q(\omega) \,\|\, p(\omega)\right), \qquad h_t^i = f_h^{\hat{\omega}_{i,t}}(x_t^i, h_{t-1}^i), \tag{9}$$

where $\hat{\omega}_{i,t}$ represents an arbitrary dropout mask for the linear mapping defined earlier and $T$ represents the number of elements in $\mathbf{x}$. The first term in Equation 9 pushes the posterior towards a Dirac delta function in order to have the function $f_h$ be the same at each time step.
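The distinction between the two mask-sampling schemes can be sketched directly (NumPy; the sequence length, hidden size, and keep probability are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
T, K = 5, 4        # sequence length and hidden size (hypothetical)
keep_prob = 0.8

# Variational dropout: one mask sampled per sequence, reused at every time step.
variational_masks = np.tile(rng.binomial(1, keep_prob, size=K), (T, 1))

# Naive dropout: a fresh mask sampled independently at every time step.
naive_masks = rng.binomial(1, keep_prob, size=(T, K))

# Every row of the variational masks is identical; the naive rows generally differ.
assert (variational_masks == variational_masks[0]).all()
```

With variational dropout the same function is applied at every recursion step; with naive dropout the recursion is only approximately self-consistent, as discussed above.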

The difference between the variational and naive dropout approaches is depicted in Figure 1. The distributions of the hidden outputs (Equation 6) after dropout (sampled parameters) are plotted over 150 epochs for a model trained on the MNIST dataset described in Section 3.2. The graphs show the percentiles of the hidden layer outputs over all time steps for the same arbitrary input sample at each epoch. Although both approaches result in similar performance (Table 1), the converged hidden output distributions are quite different. In accordance with the hypothesis above, the naive approach results in a narrow distribution on the first layer, with a standard deviation of 0.1224 compared to that of the variational approach (0.2818). The second layers in both approaches seem to counter the distributions of the first layers: the wide range of parameter exploration in the first layer of the variational approach has a concurrently narrow band of exploration in the second layer. During experimentation, it was found that the distribution of the variational approach is the same for any training simulation, whereas the distributions over time for the naive approach vary between different training simulations.

(a) Naive dropout; (b) Variational dropout
Figure 1: Hidden unit output distributions for the naive and variational dropout approaches. From top to bottom, the lines represent the maximum, intermediate percentiles, and the minimum. The output values exceed the (-1, 1) range because the Tensorflow implementation of dropout scales the weights by the inverse keep probability during training (Abadi et al., 2015).

Intuitively, variational dropout should be easier to train than the naive approach because the naive approach is not strictly recursive during the initial stages of training. The inherent leakiness of the LSTM memory (Neil et al., 2016) could be one reason why LSTMs nevertheless converge during training with naive dropout: the leakiness of the network results in bad samples from the posterior being leaked (forgotten) over time.

3.2 Experimental implementation

We demonstrate the efficacy of Bayesian LSTMs by means of 5 datasets described in the following sections. The same LSTM implementation was used for each dataset, with a different architecture per dataset (see the following sections for details). The outputs of the last hidden layer were linearly mapped to the output dimension, and the resulting vectors were average pooled before being passed through the softmax function. A validation set was used in each case for early stopping of training. Dropout was applied only to the input and output LSTM connections. Optimization was performed with Adam (Kingma and Ba, 2014), a learning rate of 0.01, and a minibatch size of 256. The standard and Bayesian LSTMs referred to hereafter are the same models; for the Bayesian LSTM, MC dropout was used during testing to provide a measure of uncertainty.
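The output pipeline described above (linear map of the last hidden layer, average pooling over time, then softmax) can be sketched as follows (NumPy, with hypothetical dimensions and random values standing in for trained parameters):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(4)
T, K, C = 10, 8, 3                   # time steps, hidden units, classes (hypothetical)

H = rng.standard_normal((T, K))      # outputs of the last LSTM hidden layer
W_out = rng.standard_normal((K, C)) * 0.1
b_out = np.zeros(C)

logits = H @ W_out + b_out           # linear map to the output dimension
pooled = logits.mean(axis=0)         # average pool over the sequence
probs = softmax(pooled)              # class probabilities
```

Pooling before the softmax lets every time step contribute to the final classification of the segment.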

3.2.1 MNIST

The MNIST handwritten digit dataset (LeCun et al., 1998) provided by Tensorflow (Abadi et al., 2015) was processed in scanline order (Cooijmans et al., 2016). The model architecture was 2 hidden layers with 128 units each, and a dropout probability of 0.2 was used.

3.2.2 MIT-BIH arrhythmia dataset

This dataset contains 48 half-hour excerpts of electrocardiogram (ECG) recordings from 47 patients (Moody and Mark, 2001; Goldberger et al., 2000). The 5 heartbeat classes selected from the database were: normal beat, right bundle branch block beat, left bundle branch block beat, paced beat, and premature ventricular contraction. Single heart beats were extracted using the Pan-Tompkins algorithm (Pan and Tompkins, 1985), which has a reported accuracy of 0.99 on this dataset. The resulting dataset contained 106,848 samples of 216 time steps at 360 Hz. A random split of 50:40:10 (train:test:validation) was used. A model with a single hidden layer of 128 units and a dropout probability of 0.3 was used.

3.2.3 Physionet/Computing in Cardiology Challenge 2016

Of the 4,430 phonocardiogram (PCG) recordings in this dataset (see Section 2), 3,126 were provided for training. The 301 validation samples (selected by the challenge organizers) were extracted from the training dataset. Each PCG signal was normalized independently to have a zero mean and unit standard deviation. Thereafter each signal was decimated to a frequency of 1 kHz. Owing to LSTMs not being able to handle long sequences (Neil et al., 2016), we segmented the signals into samples with a length of at most 1000 time steps.

The data were provided with 2 classes: normal and abnormal heart sounds. During online evaluation for the challenge, the models are allowed to classify a signal into a third class, noisy, which incurs a lower penalty on the model's score than an incorrect classification. To determine the class of a signal we first averaged the softmax probabilities over all the segments of the signal. For the standard LSTM we then classified a signal as noisy if the averaged softmax probabilities were between 0.45 and 0.55. For the Bayesian LSTM the signal was classified as noisy if the standard deviation (averaged over all the signal's segments) was higher than 0.13.
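This decision rule can be sketched as a small helper (the function name and inputs are hypothetical; the thresholds are those stated in the text):

```python
import numpy as np

def classify_signal(segment_probs, segment_stds=None):
    """Classify one PCG signal from its per-segment outputs.

    segment_probs: per-segment softmax probability of the 'abnormal' class.
    segment_stds:  per-segment MC-dropout standard deviations (Bayesian LSTM);
                   None for the standard LSTM.
    """
    p = np.mean(segment_probs)        # average softmax over all segments
    if segment_stds is not None:      # Bayesian LSTM rule: high uncertainty -> noisy
        if np.mean(segment_stds) > 0.13:
            return "noisy"
    elif 0.45 < p < 0.55:             # standard LSTM rule: near-0.5 average -> noisy
        return "noisy"
    return "abnormal" if p >= 0.5 else "normal"

# usage
assert classify_signal([0.9, 0.8]) == "abnormal"
assert classify_signal([0.48, 0.5]) == "noisy"
assert classify_signal([0.48, 0.5], segment_stds=[0.2, 0.1]) == "noisy"
```

Note that the Bayesian rule replaces the softmax band with an uncertainty threshold, so a confident Bayesian prediction near 0.5 is still assigned a class.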

The online submission imposed a strong computational constraint on the model: the virtual machine used for scoring had a single CPU core and 2 GB of RAM. A model with 2 hidden layers of 128 units and a dropout probability of 0.25 was used. Model performance was evaluated by means of an online submission that returns a score based on specificity and sensitivity (Clifford et al., 2016).

3.2.4 Neonatal intensive care unit dataset

This dataset contains the first 48 hours of vital signs for 3 neonatal intensive care unit (NICU) patients collected as part of the study by Sortica da Costa et al. (2017). The signals used for analysis were ECG, blood pressure, and oxygen saturation. The data were segmented into samples with a length of 200 time steps at 60 Hz, resulting in a total of 134,812 samples from 3 different classes: normal, dying, and intraventricular hemorrhage. Oxygen saturation values are one-second averages, and clinicians were consulted to establish scaling factors that bring the inputs into the approximate range 0 to 1. The employed model had a single hidden layer of 64 units and a dropout probability of 0.1. A random split of 50:40:10 was used.

3.2.5 Traumatic brain injury dataset

Data were collected from traumatic brain injury (TBI) patients as part of a larger study directed by the Department of Clinical Neurosciences at Addenbrooke's Hospital. The dataset contains 19 variables recorded for 101 patients, of whom 34 were female; ages ranged from 15 to 76. The dynamic variables comprised 5 s averaged values for intracranial pressure (ICP), cerebral perfusion pressure, arterial blood pressure, heart rate, respiratory rate, and systolic and diastolic blood pressure; the 5 s amplitudes of arterial blood pressure, respiratory rate, and respiratory pulse; the minimum and maximum of ICP over the 5 s; the peak-to-peak timing values for arterial blood pressure and ICP; the slow wave ICP; and the pressure-reactivity index values (Czosnyka et al., 1997). The static variables are age and gender. The duration of the recorded signals ranged from 1 hour to 12 days. The patients were classified according to the Glasgow Outcome Scale (GOS) (Jennett and Bond, 1975), which assigns a number between 1 and 5 based on the patient's health status 6 months after admission to the intensive care unit, with 5 being a good outcome and 1 being death. This dataset only contained patients with a GOS of 1 or 5. A random split of 50:40:10 was used. The model had a single hidden layer of 128 units and a dropout probability of 0.4. This dataset has a lower resolution than those introduced earlier and is used to demonstrate that the Bayesian approach is also beneficial for lower resolution longitudinal data.

4 Results

Table 1 summarizes the results for the datasets analyzed in this study. The values shown are averages over 10 runs. For the Bayesian LSTM, 100 samples were used for MC dropout. Using MC dropout at test time improved the model accuracy on all the datasets, even though naive dropout was employed. In brackets we show the accuracies yielded by the variational dropout approach on the MNIST and MIT-BIH datasets. The variational approach significantly improved the accuracies for the MIT-BIH dataset, but yielded lower accuracies for the MNIST dataset. For the best model on the Physionet dataset, the sensitivity and specificity values obtained were 0.675 and 0.880 for the standard LSTM, and 0.707 and 0.889 for the Bayesian LSTM, respectively.

Dataset           Standard LSTM      Bayesian LSTM
MNIST             0.9889 (0.987)     0.9891 (0.9879)
Physionet 2016*   0.778              0.798
MIT-BIH           0.9815 (0.98463)   0.9834 (0.98468)
NICU              0.9972             0.9979
TBI               0.9449             0.9521

Table 1: Model accuracies. *Online score, not accuracy. Values in brackets are the accuracies using variational dropout.

As mentioned before, using a Bayesian LSTM for the classification of medical time series provides the imperative benefit of a confidence measure alongside the estimated class. In Figure 2 we juxtapose medical signals from the datasets analyzed in this study that the Bayesian LSTM classified confidently and uncertainly. It should be noted that standard LSTMs produce only the estimated class as output. The figure shows that the model is uncertain when the signals look abnormal or noisy. The uncertainty value indicates when practitioners should further investigate signals, and could help researchers understand how LSTM models work.

(b) Physionet PCG; (d) NICU; (e) TBI
Figure 2: Examples of confident classifications (top row) and uncertain classifications (bottom row) by the Bayesian LSTM on the different datasets. The medical samples displayed have been normalized and segmented. The NICU samples comprise the ECG, blood pressure (BP), and oxygen saturation (SpO2) signals. Refer to Section 3.2.5 for details about the TBI signals.

5 Discussion

The model yielded performance slightly below the benchmark performance on the Physionet 2016 challenge dataset (Clifford et al., 2016). We believe that LSTMs have the capacity to compete with the benchmark models for the Physionet 2016 Challenge. However, LSTMs are known to perform poorly on signals longer than 1000 time steps (Neil et al., 2016), so the original signals had to be split into subsegments of 1000 steps each. When splitting medical signals such as these, a subsegment of the original signal could be indicative of a different class, and assigning it the same label as the original signal will confuse the model during training. Moreover, the strong computational constraints (a single CPU core and a processing time limit) imposed by the competition do not allow for an LSTM model with sufficient explanatory power. The model performance on the MNIST dataset is similar to that found in Cooijmans et al. (2016) and Zhang et al. (2016), 0.989 and 0.981 respectively.

As a practical guide to the implementation of dropout, our study found that ideal keep probabilities should be larger than 0.8 (dropout probability < 0.2) for LSTMs. LSTMs were found to converge to poor optima, and even to overfit strongly, when using keep probabilities around 0.5. Although higher keep probabilities result in weaker Bayesian uncertainties for the proposed implementations, the yielded variance still provides a sufficient measure of confidence. Moreover, MC dropout is more computationally expensive than standard weight averaging, but because the samples are independent it is highly parallelizable.

6 Conclusion

This study showed that a simple adaptation of the conventional deep learning technique for time series can (i) provide a vital additional output for quantifying model decisions, and (ii) improve model accuracy. Furthermore, we showed examples of applying this simple LSTM adaptation to medical data, where the contribution of a model confidence measure is greatly beneficial. In this work we focused only on epistemic uncertainty: model uncertainty which can be explained away given enough data (Kendall and Gal, 2017). Methods for quantifying aleatoric uncertainty, i.e. uncertainty inherent in the observations, could also provide valuable benefits to the medical machine learning field.


  • Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., S. Corrado, G., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X. (2015). TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from
  • Bennett and Hauser (2013) Bennett, C. C. and Hauser, K. (2013). Artificial intelligence framework for simulating clinical decision-making: A Markov decision process approach. Artificial Intelligence in Medicine, 57(1):9–19.
  • Choi et al. (2015) Choi, E., Bahadori, M. T., Schuetz, A., Stewart, W. F., Denny, J. C., Malin, B. A., and Sun, J. (2015). Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. arXiv:1511.05942 [cs].
  • Clifford et al. (2016) Clifford, G. D., Liu, C., Moody, B., Springer, D., Silva, I., Li, Q., and Mark, R. G. (2016). Classification of normal/abnormal heart sound recordings: The physionet/computing in cardiology challenge 2016. Proceedings of the Computing in Cardiology, pages 609–612.
  • Clifton et al. (2015) Clifton, D. A., Niehaus, K. E., Charlton, P., and Colopy, G. W. (2015). Health Informatics via Machine Learning for the Clinical Management of Patients. Yearbook of medical Informatics, 20(1):38–43.
  • Cooijmans et al. (2016) Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, Ç., and Courville, A. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
  • Czosnyka et al. (1997) Czosnyka, M., Smielewski, P., Kirkpatrick, P., Laing, R. J., Menon, D., and Pickard, J. D. (1997). Continuous assessment of the cerebral vasomotor reactivity in head injury. Neurosurgery, 41(1):11–19.
  • Denker and Lecun (1990) Denker, J. S. and Lecun, Y. (1990). Transforming neural-net output levels to probability distributions. In Proceedings of the 3rd International Conference on Neural Information Processing Systems, pages 853–859. Morgan Kaufmann Publishers Inc.
  • Fortunato et al. (2017) Fortunato, M., Blundell, C., and Vinyals, O. (2017). Bayesian Recurrent Neural Networks. arXiv:1704.02798 [cs, stat].
  • Gal (2015) Gal, Y. (2015). A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv:1512.05287 [stat].
  • Gal (2016) Gal, Y. (2016). Uncertainty in Deep Learning. PhD thesis, PhD thesis, University of Cambridge.
  • Gal and Ghahramani (2015) Gal, Y. and Ghahramani, Z. (2015). Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv:1506.02158 [cs, stat].
  • Gal and Ghahramani (2016) Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In International Conference on Machine Learning, pages 1050–1059.
  • Ghassemi et al. (2015) Ghassemi, M., Pimentel, M. A., Naumann, T., Brennan, T., Clifton, D. A., Szolovits, P., and Feng, M. (2015). A Multivariate Timeseries Modeling Approach to Severity of Illness Assessment and Forecasting in ICU with Sparse, Heterogeneous Clinical Data. In AAAI, pages 446–453.
  • Goldberger et al. (2000) Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., and Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet Components of a New Research Resource for Complex Physiologic Signals. Circulation, 101(23):e215–e220.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative Adversarial Nets. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc.
  • Graves (2011) Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348–2356.
  • Guiza Grandas et al. (2006) Guiza Grandas, F., Ramon, J., and Blockeel, H. (2006). Gaussian processes for prediction in intensive care. In Gaussian Processes in Practice Workshop, pages 1–4.
  • Harutyunyan et al. (2017) Harutyunyan, H., Khachatrian, H., Kale, D. C., and Galstyan, A. (2017). Multitask Learning and Benchmarking with Clinical Time Series Data. arXiv:1703.07771 [cs, stat].
  • Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
  • Jagannatha and Yu (2016) Jagannatha, A. N. and Yu, H. (2016). Bidirectional RNN for Medical Event Detection in Electronic Health Records. In Proceedings of NAACL-HLT, pages 473–482.
  • Jennett and Bond (1975) Jennett, B. and Bond, M. (1975). Assessment of outcome after severe brain damage: A practical scale. The Lancet, 305(7905):480–484.
  • Kendall et al. (2015) Kendall, A., Badrinarayanan, V., and Cipolla, R. (2015). Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding. arXiv:1511.02680 [cs].
  • Kendall and Gal (2017) Kendall, A. and Gal, Y. (2017). What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? arXiv:1703.04977 [cs].
  • Kingma and Ba (2014) Kingma, D. and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs].
  • Kononenko (2001) Kononenko, I. (2001). Machine learning for medical diagnosis: history, state of the art and perspective. Artificial Intelligence in medicine, 23(1):89–109.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In Pereira, F., Burges, C. J. C., Bottou, L., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc.
  • LeCun et al. (1998) LeCun, Y., Cortes, C., and Burges, C. J. (1998). The MNIST database of handwritten digits.
  • Lipton et al. (2015) Lipton, Z. C., Kale, D. C., Elkan, C., and Wetzell, R. (2015). Learning to Diagnose with LSTM Recurrent Neural Networks. arXiv:1511.03677 [cs].
  • Liu et al. (2016) Liu, C., Springer, D., Li, Q., Moody, B., Juan, R. A., Chorro, F. J., Castells, F., Roig, J. M., Silva, I., Johnson, A. E., et al. (2016). An open access database for the evaluation of heart sound algorithms. Physiological Measurement, 37(12):2181.
  • MacKay (1992) MacKay, D. J. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472.
  • Mani et al. (2014) Mani, S., Ozdas, A., Aliferis, C., Varol, H. A., Chen, Q., Carnevale, R., Chen, Y., Romano-Keeler, J., Nian, H., and Weitkamp, J.-H. (2014). Medical decision support using machine learning for early detection of late-onset neonatal sepsis. Journal of the American Medical Informatics Association, 21(2):326–336.
  • Meyfroidt et al. (2009) Meyfroidt, G., Güiza, F., Ramon, J., and Bruynooghe, M. (2009). Machine learning techniques to examine large patient databases. Best Practice & Research Clinical Anaesthesiology, 23(1):127–143.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
  • Moody and Mark (2001) Moody, G. B. and Mark, R. G. (2001). The impact of the MIT-BIH Arrhythmia Database. IEEE Engineering in Medicine and Biology Magazine, 20(3):45–50.
  • Murphy (2012) Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
  • Neil et al. (2016) Neil, D., Pfeiffer, M., and Liu, S.-C. (2016). Phased LSTM: Accelerating Recurrent Network Training for Long or Event-based Sequences. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems 29, pages 3882–3890. Curran Associates, Inc.
  • Ongenae et al. (2013) Ongenae, F., Van Looy, S., Verstraeten, D., Verplancke, T., Benoit, D., De Turck, F., Dhaene, T., Schrauwen, B., and Decruyenaere, J. (2013). Time series classification for the prediction of dialysis in critically ill patients using echo state networks. Engineering Applications of Artificial Intelligence, 26(3):984–996.
  • Pan and Tompkins (1985) Pan, J. and Tompkins, W. J. (1985). A Real-Time QRS Detection Algorithm. IEEE Transactions on Biomedical Engineering, BME-32(3):230–236.
  • Sennrich et al. (2016) Sennrich, R., Haddow, B., and Birch, A. (2016). Edinburgh neural machine translation systems for WMT 16. arXiv preprint arXiv:1606.02891.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
  • Sortica da Costa et al. (2017) Sortica da Costa, C., Placek, M. M., Czosnyka, M., Cabella, B., Kasprowicz, M., Austin, T., and Smielewski, P. (2017). Complexity of brain signals is associated with outcome in preterm infants. Journal of Cerebral Blood Flow & Metabolism, page 0271678X16687314.
  • Springer et al. (2016) Springer, D. B., Brennan, T., Ntusi, N., Abdelrahman, H. Y., Zühlke, L. J., Mayosi, B. M., Tarassenko, L., and Clifford, G. D. (2016). Automated signal quality assessment of mobile phone-recorded heart sound signals. Journal of Medical Engineering & Technology, 40(7-8):342–355.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res., 15(1):1929–1958.
  • Temko et al. (2011) Temko, A., Thomas, E., Marnane, W., Lightbody, G., and Boylan, G. (2011). EEG-based neonatal seizure detection with Support Vector Machines. Clinical Neurophysiology, 122(3):464–473.
  • Zhang et al. (2016) Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R. R., and Bengio, Y. (2016). Architectural complexity measures of recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1822–1830.
  • Zhang (2016) Zhang, Z. (2016). When doctors meet with AlphaGo: Potential application of machine learning to clinical medicine. Ann Transl Med, 4(6).