AI Assisted Annotator using Reinforcement Learning

Healthcare data suffers from both noise and a lack of ground truth, and the cost of the data increases as it is cleaned and annotated. Unlike in other domains, annotation of medical data, which is critical to accurate ground truth and ultimately to patient outcomes, requires medical domain expertise. In this work, we report on the use of reinforcement learning to mimic the decision-making process of expert annotators for medical events, in order to automate annotation and labelling. The reinforcement agent learns to annotate alarm data based on annotations made by an expert, and our method shows promising results on medical alarm datasets. We trained DQN and A2C agents using data from monitoring devices annotated by an expert. Initial results from these RL agents learning the expert annotation behavior are promising. The A2C agent learns the sparse events in a given state better than the DQN agent and therefore chooses the correct action more often. To the best of our knowledge, this is the first application of reinforcement learning to the automation of medical event annotation, which has far-reaching practical use.


1 Introduction

The healthcare domain has seen a dramatic shift toward machine learning and computational methods in recent years with the rise of deep learning and the growing availability of medical data. Data is the fundamental currency for solving many healthcare problems using computational methods. While volumes of medical data are becoming increasingly available, such big data has its own unique challenges: medical data suffers from issues of privacy, sparsity, noise, quality, missing data, heterogeneity, and limited availability of ground truth [1, 2, 3, 4]. Compared with traditional machine learning, deep learning (DL) methods are scalable and efficient at learning data patterns when provided with sufficient data. Medical devices such as anesthesia machines, ventilators, and monitoring systems are a rich source of data that helps in processing, identifying, and alerting on events, which in turn serve as the basis for optimal decision making. Alarm fatigue is a well-known issue with medical alarms, caused by the threshold-based alarm classification used in all medical monitoring systems [5, 6]. To build smart alarm systems and use AI algorithms in practice, we must address the problem of false alarms, which requires a significant volume of correctly annotated data. Obtaining trustworthy annotations of medical monitoring data is expensive, time consuming, and requires domain expertise [6].

Reinforcement Learning (RL) is a computational, goal-focused approach to learning from interactions that has gained a great deal of attention in the last five years [7]. Most techniques that use RL are Model-Free (MF) approaches, which make no assumptions about the environment and require no data samples beyond the agent's own experience to learn a policy, as seen in game playing that reached human-level performance [8]. Such approaches are flexible and learn complex policies effectively, but require many trials and long training times to converge. We report on the development and implementation of a novel approach that uses RL to mimic a human expert in annotating alarms generated by monitoring devices as either true critical alarms or non-alarms. The remainder of the paper is organized as follows: Section 2 reviews related RL work in the medical domain and in false alarm detection. Section 3 describes the data, preprocessing, methods, and algorithms used in this work. Section 4 gives the details of the experimental setup and results. Section 5 covers the discussion and future work.

2 Related Work

Recent progress in RL and deep RL techniques has paved the way for medical data applications. Electronic Health Record (EHR) and Electronic Medical Record (EMR) systems have matured over the past decades, yet the data they generate has not been tapped to its full potential. Many traditional approaches to detecting false alarms depend on feature engineering and the availability of ground-truth (labelled) data [5, 9, 10, 11, 12]. Although these traditional approaches achieve reasonable model performance, they have significant limitations, such as a narrow focus on one alarm/signal type (e.g., arrhythmia), and therefore cannot scale and generalize to other alarm types. Distant supervision methods alleviate the limited ground-truth problem but fall short in scaling to large volumes and generalizing to a wide variety of tasks [6]. Applications of RL that learn from experience via decision making, such as ventilation weaning protocols and customized drug administration strategies, have proven effective [13, 14, 15, 16]. Applying RL to complex real-world problems, however, where the state and action spaces are high dimensional and the resulting learning must generalize to new experiences, remains computationally intensive and challenging.

The first major advance in deep RL combined RL with deep neural networks to handle complex state-action spaces: agents playing Atari 2600 games reached professional human-level scores without any prior domain knowledge of the games [8]. The Deep Q-Network (DQN) algorithm combined reinforcement learning with a deep neural network to produce an agent that can learn a complex state-action space with high accuracy. In DQN, deep convolutional networks are used as function approximators to estimate the optimal action-value function [8]. The divergence problems that arise from nonlinear function approximators such as neural networks are addressed in two ways. First, experience replay randomizes the training data, removing the correlations between sequential samples (bringing them closer to i.i.d., independently and identically distributed). Second, the target Q-function is updated at regular intervals rather than at every time step, as was done in earlier work. More recent RL work, Advantage Actor-Critic (A2C), learns approximations of both the policy and the value function, and the agent uses the value function as a critic to update the action policy [7, 17]. The advantage value in A2C measures how much better a specific action is than the average action at a given state. We see the added value of the advantage function when comparing A2C to DQN in our results. In the current work, we report on a generic, data-driven, AI-assisted annotation RL framework that can be applied to any medical events.

3 Methods

RL is a prominent branch of AI, centered around an agent that senses, observes, and interacts with an environment. The environment, in turn, rewards or penalizes the agent as it works toward a specific goal. RL is especially helpful for automating tasks that require goal-oriented action and sequential decision making by a human.

To keep the RL problem space simple, we merged all the expert-annotated ground truth into two classes: alarms and non-alarms. We trained simple DQN and A2C agents to classify alarms as true alarms or non-alarms based on the state, represented by the patient physiological signals generated by monitoring devices, as seen in Figure 1.

Figure 1: Overview of the data-driven RL annotation framework for medical events

Our proposed RL approach can learn from the decision making of the domain expert without any assumptions about the system or any domain expertise. Once the RL agent reaches reasonable performance, it can replace the human expert for annotating the data, with a human kept in the loop to validate the annotations output by the agent.

3.1 Datasets

Good quality datasets containing annotations are critical to the development of deep learning models. We used data from the multi-phasic Push Electronic Relay for Smart Alarms for End User Situational Awareness (PERSEUS) program, hosted by Brown University’s digital archive. The data was generated by patient monitoring devices in a 15-bed urgent care area of an adult Emergency Department (ED) at a regional referral medical facility and level I trauma center. The PERSEUS dataset, containing 12 months of data, is in its original .json format, de-identified and publicly available [18]. Data from each monitoring device is recorded in a single file covering a 24-hour period. The following signals are recorded in each file:

  • Electrocardiogram waveform (single-lead EKG, Lead II) at 250Hz

  • Pulse oximetry waveform (PPG) at 125Hz

  • Vital signs (heart rate (HR), respiratory rate (RR), systolic blood pressure (SBP), diastolic blood pressure (DBP), mean arterial blood pressure (MAP) and peripheral capillary oxygen saturation (SPO2))

  • Alarm messages (institution-specified alarms)

As part of the PERSEUS program, Kobayashi et al. (2018) developed annotated subsets for experimental (non-clinical) research, known as the Adjudicated/Annotated Telemetry signals for Medically Important and Clinically Significant events (ATOMICS), which are used in this research [18]. Three non-consecutive weeks of red alarm data were annotated by Kobayashi et al. (2018) for clinical significance and severity, as listed below.

  • Clinical significance

    • Clinically significant (improvement or deterioration)

    • No clinical significance

    • Indeterminate clinical significance

  • Clinical severity

    • Emergent

    • Urgent

    • Non-urgent

    • Indeterminate

The annotations of red alarms are based on the EKG, PPG/SPO2, and BP signals from the monitoring devices. The subset data streams consist of 10-minute slices surrounding each alarm event, with 5 minutes of data before and 5 minutes after the alarm. We used the ATOMICS-1 dataset for training the various RL agents and the ATOMICS-2 dataset for testing the RL agents developed in this research.
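For illustration, the short sketch below extracts one such 10-minute window from a timestamp-indexed vitals table. The function name and data layout are assumptions for illustration only, not the ATOMICS file schema (the ATOMICS subsets already ship pre-sliced streams).

```python
import pandas as pd

def alarm_window(vitals: pd.DataFrame, alarm_time: pd.Timestamp,
                 pre: str = "5min", post: str = "5min") -> pd.DataFrame:
    """Return the slice of vitals from 5 minutes before to 5 minutes
    after a single alarm event (10 minutes total).

    Assumes `vitals` is indexed by a DatetimeIndex.
    """
    start = alarm_time - pd.Timedelta(pre)
    end = alarm_time + pd.Timedelta(post)
    return vitals.loc[start:end]
```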

3.2 Preprocessing data

All data is preprocessed and resampled at second and millisecond resolution, and missing values in the resampled data are imputed with the mean and forward filled. The ATOMICS-1 and ATOMICS-2 data for the 15 bedside monitors, with vitals, annotations, and alarms, are pre-processed, and the alarms and annotations are converted to one-hot encodings. The annotations are divided into two categories of actions (alarm/non-alarm) to simplify the problem space: all clinically significant and severe alarms (emergent, urgent) are categorized as alarms, and the indeterminate and non-urgent events as non-alarms. The three pre-processed datasets, vitals, alarms, and annotations, are then merged using a left join to obtain a flattened file structure for training.
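A minimal sketch of this preprocessing pipeline is shown below, assuming pandas DataFrames indexed by timestamp. The column names ("severity", "alarm_type") and the exact imputation order are illustrative assumptions, not the schema of the PERSEUS/ATOMICS files.

```python
import pandas as pd

def preprocess(vitals: pd.DataFrame, alarms: pd.DataFrame,
               annotations: pd.DataFrame, freq: str = "1s") -> pd.DataFrame:
    # Resample the vitals to a fixed frequency, aggregate by mean,
    # and forward-fill the remaining gaps.
    vitals = vitals.resample(freq).mean().ffill()

    # Collapse expert annotations into two actions:
    # 1 = alarm (emergent/urgent), 0 = non-alarm (indeterminate/non-urgent).
    annotations = annotations.copy()
    annotations["action"] = annotations["severity"].isin(
        ["emergent", "urgent"]).astype(int)

    # One-hot encode the institution-specified alarm messages.
    alarms = pd.get_dummies(alarms, columns=["alarm_type"])

    # Left-join alarms and annotations onto the vitals timeline
    # to obtain one flat table for training.
    return (vitals.join(alarms, how="left")
                  .join(annotations[["action"]], how="left")
                  .fillna(0))
```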

The data is highly imbalanced and sparse in critical, clinically significant alarm events: each critical alarm is surrounded by 600 non-alarm events before and 600 after, giving an imbalance ratio of 1:1200 (for every alarm there are 1,200 non-alarms). To rectify this imbalance, we used the following downsampling strategies: n-0, n-1, n-3, n-5, n-10, and mixed. In n-0 downsampling we keep only the alarm data. In n-1 we keep 1 non-alarm before and after each alarm; similarly, in n-3, n-5, and n-10 we keep 3, 5, and 10 surrounding non-alarms, respectively. Mixed (0, 1, 3, 5, 10) is a random sampling of all of these strategies combined.
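The following sketch illustrates one way to implement the n-k strategies on the flattened table, assuming the binary "action" column from the preprocessing sketch above; the exact implementation used in this work may differ.

```python
import numpy as np
import pandas as pd

def downsample(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """n-k downsampling: keep each alarm row plus the n non-alarm rows
    immediately before and after it (n-0, n-1, n-3, n-5, n-10)."""
    alarm_rows = np.flatnonzero(df["action"].to_numpy() == 1)
    keep = set()
    for i in alarm_rows:
        keep.update(range(max(0, i - n), min(len(df), i + n + 1)))
    return df.iloc[sorted(keep)]

# The mixed strategy draws n at random from {0, 1, 3, 5, 10} for each alarm.
```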

3.3 Problem formulation

A Markov Decision Process (MDP) for our alarm annotation RL problem is defined by:

  • A finite state space S: at each time step t, the environment is in a state s_t, a vector of the six physiological variables described above, and transitions to the next state s_{t+1}.

  • An action space A: at each time step the agent takes an action a_t, which influences s_{t+1}. The actions are alarm and non-alarm (1, 0).

  • A scalar reward r_t: 1 for a correctly identified non-alarm, 10 for a correctly identified critical, clinically significant alarm, and zero for a wrong choice.

The goal of the RL agent is to maximize its expected cumulative reward, using the known examples to learn an optimal policy.
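A minimal sketch of this MDP as an environment the agent can step through is given below; it treats each row of the flattened, expert-annotated table as one time step and uses the reward values above. The class and array layout are illustrative assumptions.

```python
import numpy as np

class AlarmAnnotationEnv:
    """Sketch of the annotation MDP: one row of the flattened,
    expert-annotated table per time step."""

    def __init__(self, states: np.ndarray, expert_actions: np.ndarray):
        self.states = states                  # (T, 6): HR, RR, SBP, DBP, MAP, SPO2
        self.expert_actions = expert_actions  # (T,): 1 = alarm, 0 = non-alarm
        self.t = 0

    def reset(self) -> np.ndarray:
        self.t = 0
        return self.states[self.t]

    def step(self, action: int):
        expert = self.expert_actions[self.t]
        if action == expert:                  # agent agrees with the expert
            reward = 10.0 if expert == 1 else 1.0
        else:                                 # wrong choice
            reward = 0.0
        self.t += 1
        done = self.t >= len(self.states)
        next_state = self.states[-1] if done else self.states[self.t]
        return next_state, reward, done
```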

3.4 Learning an optimal policy

Learning the best mapping between states and actions (the Q-function) is the essence of reinforcement learning. Optimal actions are learned primarily by two families of methods: value based and policy based. We first modeled a simple Q-table to learn the behavior of the expert annotator for this environment. Although the Q-table achieves 100% accurate results, it is limited to small state spaces and is not a scalable solution for a big-data medical state space. We therefore use a deep neural network to approximate the Q-function, a more scalable and generalizable solution, using both value-based and policy-based approaches. We modeled two function approximators, a DQN with experience replay (value based [8]) and an Actor-Critic network pair (policy based [7, 17]), to model the expert behavior in annotating the critical alarm events. The DQN network takes the six physiological variables described above as input and outputs a Q-value for each action (non-alarm, alarm), approximating the optimal action-value function in Equation 1. The parameters are updated after every 10 steps of training within each epoch, with a batch size of 8, a learning rate of 0.001, and the Adam optimizer. Action selection (a) for both methods is ε-greedy, with ε starting at 1 and annealed to 0.01 with a decay factor of 0.99975.

Q^*(s, a) = \max_{\pi} \mathbb{E}\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, a_t = a, \pi \right] (1)

The optimal policy for the DQN method after k iterations is given by:

\pi_k(s) = \arg\max_{a \in A} Q_k(s, a) (2)
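For concreteness, a minimal DQN sketch with experience replay and the ε-greedy schedule above is given below (PyTorch). The network sizes, discount factor, and replay capacity are assumptions, since they are not specified here, and for brevity the bootstrap target uses the online network rather than a separate, periodically updated target network.

```python
import random
from collections import deque
import torch
import torch.nn as nn

# From the text: batch size 8, lr 0.001, Adam, epsilon annealed
# 1.0 -> 0.01 with decay 0.99975. Sizes, gamma, and replay capacity are assumptions.
q_net = nn.Sequential(nn.Linear(6, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 2))       # one Q-value per action (non-alarm, alarm)
optimizer = torch.optim.Adam(q_net.parameters(), lr=0.001)
replay = deque(maxlen=10_000)
eps, eps_min, eps_decay, gamma = 1.0, 0.01, 0.99975, 0.99

def select_action(state: torch.Tensor) -> int:
    """Epsilon-greedy action selection with annealed exploration."""
    global eps
    eps = max(eps_min, eps * eps_decay)
    if random.random() < eps:
        return random.randrange(2)
    with torch.no_grad():
        return int(q_net(state).argmax().item())

def replay_update(batch_size: int = 8) -> None:
    """One experience-replay step toward the target in Equation 1.
    For brevity the online network also provides the bootstrap target."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.stack([torch.as_tensor(x, dtype=torch.float32)
                                      for x in col]) for col in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * q_net(s2).max(1).values * (1.0 - done)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```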

The second method we used to train our agent is A2C, which has two networks: an actor network that learns the advantage value of taking an action in a given state s, as shown in Equation 3, and a critic network that learns the goodness of the action, analogous to the DQN value in Equation 1.

A(s_t, a_t) = Q(s_t, a_t) - V(s_t) (3)

The optimal policy for the A2C method after k iterations is given by:

\pi_k(a \mid s) = \pi(a \mid s; \theta_k), \qquad \theta_{k+1} = \theta_k + \alpha \, \nabla_{\theta} \log \pi(a_t \mid s_t; \theta_k) \, A(s_t, a_t) (4)

Both the actor and critic networks are updated every 10 time steps within an epoch, with a batch size of 8. Both use the Adam optimizer, with a learning rate of 0.001 for the actor network and 0.005 for the critic network. The DQN agent tends to lean toward the dominant class, since its learning concentrates on the maximizing action value. The A2C network, on the other hand, generalizes its learning across actions and state values independently, resulting in better performance than the DQN network.
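A corresponding sketch of the actor-critic update, using the learning rates and update cadence above, is shown below; the network sizes and discount factor are again assumptions.

```python
import torch
import torch.nn as nn

# Learning rates and update cadence from the text; sizes and gamma are assumptions.
actor = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 2))   # action logits
critic = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 1))  # state value V(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=0.001)
critic_opt = torch.optim.Adam(critic.parameters(), lr=0.005)
gamma = 0.99

def a2c_update(states, actions, rewards, next_states, dones) -> None:
    """One actor-critic update over a small batch (every 10 time steps,
    batch size 8). Tensors: states (B, 6), actions (B,) int64,
    rewards (B,), next_states (B, 6), dones (B,) as 0/1 floats."""
    values = critic(states).squeeze(1)
    with torch.no_grad():
        targets = rewards + gamma * critic(next_states).squeeze(1) * (1.0 - dones)
    advantages = targets - values            # Equation 3: A(s, a) = Q(s, a) - V(s)

    # Critic: regress V(s) toward the bootstrapped target.
    critic_loss = advantages.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: policy-gradient step weighted by the detached advantage (Equation 4).
    log_probs = torch.distributions.Categorical(logits=actor(states)).log_prob(actions)
    actor_loss = -(log_probs * advantages.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```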

4 Experiments and Results

We report the salient features of experimental design and results from simulations using the proposed AI assisted annotator framework in this section.

4.1 Experimental Design

All experiments were run on a MacBook Air with an Intel Core i5 1.8 GHz processor and 8GB of RAM. The DQN and A2C agents were trained on ATOMICS-1 data, and all agent evaluations used ATOMICS-2 data. Table 1 summarizes the ground truth of the datasets used in these experiments. The weighted F1 score is used to compare the performance of agents trained for different numbers of epochs. Figure 2 shows agent performance measured by F1 score; the x-axis represents the training epochs at which the agents were evaluated. We found the Adam optimizer to be more stable than RMSProp, as seen in Figure 2(a), and therefore used Adam for the rest of the experiments.

Figure 2: RL agent comparison. (a) Comparison of the Adam and RMSProp optimizers for the DQN and A2C networks. (b) Comparison of agents trained on different downsampling ranges, by F1 score, tested on n-10 sampled alarms (10 non-alarms surrounding each alarm), excluding the agent trained on n-10.
Data subset                True Alarms   Non-Alarms   Total events
ATOMICS-1 (training)       437           406          843
ATOMICS-2, n-0 (testing)   756           468          1224
ATOMICS-2, n-10 (testing)  756           23035        23791
Table 1: Summary of the ATOMICS datasets used for training and testing.

4.2 Results

The A2C agent performs better than DQN in our initial results. Although DQN is stable at around a 0.70 F1 score across training epochs, it fails to detect the sparse critical alarm events and settles into a local minimum of reward maximization. All results discussed in this section are therefore based on A2C agents, which outperform DQN in both AUC and F1-score. The A2C agents generalize better when the training data uses mixed downsampling ranges rather than a single downsampling frequency, as seen in Figure 2(b) for the first 1000 epochs. The training curves for the first 200 epochs in Figure 3(a) show the average score per episode during training; the score is the cumulative reward gained by the agent by the end of each epoch. The agent learns from the reinforcement signal steadily.

Figure 3: Two examples of A2C training curves tracking the agent’s average score. (a) Average score per episode using n-1 downsampling. (b) Average score per episode using n-mixed downsampling.

There is no directly comparable prior work that uses RL to annotate false alarms for generic critical alarms; most prior work [5, 6] focuses either on a specific alarm type or on traditional machine learning approaches. The results shown in Table 2 are for the best agents trained on the ATOMICS-1 dataset and tested on the ATOMICS-2 dataset. In this work we focus on learning the decision making of an expert annotator to discern non-alarms from critical alarms, and our initial results show that we achieve a 72.9% weighted F1-score against the expert annotator's labels. Agents perform better when they are exposed to many new states; we see the best performance with mixed event downsampling and a millisecond data frequency.
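As a worked check of the reported numbers, the snippet below recomputes sensitivity, specificity, and the class-weighted F1-score from the confusion-matrix counts of the best n-mixed, millisecond-frequency agent in Table 2.

```python
def alarm_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Sensitivity, specificity, and class-weighted F1 from confusion counts."""
    sens = tp / (tp + fn)                   # recall on true alarms
    spec = tn / (tn + fp)                   # recall on non-alarms
    f1_alarm = 2 * tp / (2 * tp + fp + fn)  # per-class F1, alarm class
    f1_non = 2 * tn / (2 * tn + fn + fp)    # per-class F1, non-alarm class
    n_alarm, n_non = tp + fn, tn + fp
    f1_weighted = (f1_alarm * n_alarm + f1_non * n_non) / (n_alarm + n_non)
    return {"sensitivity": sens, "specificity": spec, "f1_weighted": f1_weighted}

# Best n-mixed, millisecond-frequency row of Table 2:
print(alarm_metrics(tp=594, fn=163, fp=168, tn=300))
# -> sensitivity ~ 0.784, specificity ~ 0.641, weighted F1 ~ 0.729
```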

5 Discussion and Future Directions

Healthcare applications of RL are still sparse, because the medical domain is complex and requires domain expertise. The high volumes of event data generated by medical machines such as anesthesia machines, ventilators, and monitoring systems are a rich source for AI applications, but annotating this volume of medical event data requires domain expertise and is both time consuming and expensive. Supervised and semi-supervised approaches to false alarm detection require feature engineering and domain expertise to scale and generalize, which is data intensive and costly. In this work, we propose an RL approach that mimics medical domain expertise to annotate critical alarms and automates such annotation work with good accuracy. We find the RL approach to be data efficient, scalable, and generalizable for annotation tasks, which are usually very costly in the healthcare domain.

In the distantly supervised approach of [6], the authors report a sensitivity of 63% with a specificity of 95% for their best models, but their method is limited to false alarms caused by artifacts and technical errors alone. We analyzed our best agent with high sensitivity as the target and measured the resulting reduction in false alarms: operating at a sensitivity of 71.2%, our best agent would reduce the false alarm rate by 64.1%. The feature-based false alarm detection work of [5] surpasses our results, with a sensitivity of 95.7% and a specificity of 83.9%, but requires extensive feature engineering and domain expertise to scale and generalize. Our best agent achieves 88.5% sensitivity in detecting true alarms and 64.1% specificity in identifying false alarms, relative to the domain expert, after analyzing only one week's worth of data. We are confident that the RL agent can improve on these initial results with more training data, as it is exposed to more new states.

Our initial results are promising, and in future work we would like to extend this approach to specific alarm types (emergent, urgent, indeterminate) and to prediction tasks. The limitations of this work are: i) training and testing of the RL agent are limited to one week of data; ii) there is no similar prior work using RL for a fair comparison of results; iii) the alarm detection task is limited to two classes. Our contributions are: i) A2C performs and generalizes better than DQN; ii) the Adam optimizer is more stable than RMSProp in our experiments; iii) mixed sampling ranges work better for the RL state representation than a single downsampling range; iv) our approach is data efficient, scalable to multiple tasks, and less compute intensive. Furthermore, such methods could soon pave the way to many practical non-clinical applications, improving the annotation process, lowering its cost, and generating more labelled data for healthcare applications.

Training sample range                               Sampling frequency  TP   FN   FP   TN   AUC    F1-score  Sensitivity  Specificity
n-1 (1 non-alarm surrounding alarm)                 milliseconds        599  158  215  253  0.665  0.691     0.791        0.540
n-1                                                 milliseconds        602  155  233  251  0.665  0.691     0.795        0.536
n-1                                                 milliseconds        599  158  217  251  0.663  0.689     0.791        0.536
n-1                                                 seconds             659  97   220  248  0.700  0.731     0.871        0.529
n-1                                                 seconds             612  144  197  271  0.694  0.717     0.809        0.579
n-1                                                 seconds             617  139  206  262  0.687  0.713     0.816        0.559
n-mixed (0,1,3,5,10 non-alarms surrounding alarm)   milliseconds        670  87   256  212  0.669  0.703     0.885        0.452
n-mixed                                             milliseconds        606  151  196  272  0.690  0.713     0.800        0.581
n-mixed                                             milliseconds        594  163  168  300  0.712  0.729     0.784        0.641
n-mixed                                             seconds             625  131  274  194  0.620  0.653     0.826        0.414
n-mixed                                             seconds             595  161  199  269  0.680  0.703     0.787        0.574
n-mixed                                             seconds             586  170  178  290  0.697  0.715     0.775        0.619
Table 2: Summary of results for the various agents tested on the ATOMICS-2 alarms dataset. Best results are highlighted in bold.

Acknowledgments

We thank Michael Potter and Jiahui Guan for helpful comments and suggestions.

References

  • [1] Lina Zhou, Shimei Pan, Jianwu Wang, and Athanasios V Vasilakos. Machine learning on big data: Opportunities and challenges. Neurocomputing, 237:350–361, 2017.
  • [2] Cao Xiao, Edward Choi, and Jimeng Sun. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. In JAMIA, 2018.
  • [3] Marzyeh Ghassemi, Tristan Naumann, Peter Schulam, Andrew L Beam, and Rajesh Ranganath. Opportunities in machine learning for healthcare. arXiv preprint arXiv:1806.00388, 2018.
  • [4] Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24, 2019.
  • [5] Xing Wang, Yifeng Gao, Jessica Lin, Huzefa Rangwala, and Ranjeev Mittu. A machine learning approach to false alarm detection for critical arrhythmia alarms. In 2015 IEEE 14th international conference on machine learning and applications (ICMLA), pages 202–207. IEEE, 2015.
  • [6] Patrick Schwab, Emanuela Keller, Carl Muroi, David J Mack, Christian Strässle, and Walter Karlen. Not to cry wolf: Distantly supervised multitask learning in critical care. arXiv preprint arXiv:1802.05027, 2018.
  • [7] Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 2. MIT press Cambridge, 1998.
  • [8] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, and Joel Veness. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [9] Omid Sayadi and Mohammad B Shamsollahi. Life-threatening arrhythmia verification in icu patients using the joint cardiovascular dynamical model and a bayesian filter. IEEE Transactions on Biomedical Engineering, 58(10):2748–2757, 2011.
  • [10] Gari D Clifford, Ikaro Silva, Benjamin Moody, Qiao Li, Danesh Kella, Abdullah Shahin, Tristan Kooistra, Diane Perry, and Roger G Mark. The physionet/computing in cardiology challenge 2015: reducing false arrhythmia alarms in the icu. In 2015 Computing in Cardiology Conference (CinC), pages 273–276. IEEE, 2015.
  • [11] F Plesinger, P Klimes, J Halamek, and P Jurak. Taming of the monitors: reducing false alarms in intensive care units. Physiological measurement, 37(8):1313, 2016.
  • [12] Rebeca Salas-Boni, Yong Bai, Patricia Rae Eileen Harris, Barbara J Drew, and Xiao Hu. False ventricular tachycardia alarm suppression in the icu based on the discrete wavelet transform in the ecg signal. Journal of electrocardiology, 47(6):775–780, 2014.
  • [13] Niranjani Prasad, Li-Fang Cheng, Corey Chivers, Michael Draugelis, and Barbara E Engelhardt. A reinforcement learning approach to weaning of mechanical ventilation in intensive care units. arXiv preprint arXiv:1704.06300, 2017.
  • [14] Pablo Escandell-Montero, Milena Chermisi, Jose M Martinez-Martinez, Juan Gomez-Sanchis, Carlo Barbieri, Emilio Soria-Olivas, Flavio Mari, Joan Vila-Francés, Andrea Stopper, Emanuele Gatti, et al. Optimization of anemia treatment in hemodialysis patients via reinforcement learning. Artificial intelligence in medicine, 62(1):47–60, 2014.
  • [15] Shamim Nemati, Mohammad M Ghassemi, and Gari D Clifford. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pages 2978–2981. IEEE, 2016.
  • [16] Regina Padmanabhan, Nader Meskin, and Wassim M Haddad. Closed-loop control of anesthesia and mean arterial pressure using reinforcement learning. Biomedical Signal Processing and Control, 22:54–64, 2015.
  • [17] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937, 2016.
  • [18] Leo Kobayashi, Adewole Oyalowo, Uday Agrawal, Shyue-Ling Chen, Wael Asaad, Xiao Hu, Kenneth A Loparo, Gregory D Jay, and Derek L Merck. Development and deployment of an open, modular, near-real-time patient monitor datastream conduit toolkit to enable healthcare multimodal data fusion in a live emergency department setting for experimental bedside clinical informatics research. IEEE Sensors Letters, 3(1):1–4, 2018.