Recently, both suicide fatalities and attempts have increased among US adolescents. Between 2009 and 2014, suicide deaths among youths aged 10-19 increased 17%, and nonfatal suicide attempts that resulted in an emergency department (ED) visit soared 65% (Centers for Disease Control and Prevention, 2013; Pittsenbarger and Mannix, 2014). At the same time, the predictive capability of clinicians and standard statistical models is only slightly better than chance (Franklin et al., 2017). Given the ever-increasing volume of electronic health records (EHRs), it is natural to ask whether machine learning methods might be capable of improving the accuracy of suicide attempt prediction. If so, such methods might one day be incorporated into clinical tools that flag high-risk patients, potentially improving their lives.
Such enriched data enables the use of methods from, e.g., natural language processing and ordinal regression. Outside of suicide research, deep learning has enjoyed recent success in the medical informatics literature (Che et al., 2016; Ravì et al., 2017), further motivating the present work.
We have found only two prior papers in which neural network methods of any kind have been used in suicide research. In Iliou et al. (2016), a combination of ad hoc feature selection methods and machine learning methods (including multilayer perceptrons) is applied to patient data. In Nguyen et al. (2016), deep neural nets with dropout are applied to patient data. However, both prior studies (Iliou et al., 2016; Nguyen et al., 2016) focus exclusively on patients with a prior history of depression and/or other mental health conditions. In the latter study, the set of predictors includes a suicide risk assessment score for each patient (Tran et al., 2013; Nguyen et al., 2016). The problem with focusing on such data sets is that more than 50% of suicide decedents have no record of psychiatric disorders, psychiatric care-seeking, or suicidality (Husky et al., 2012; Chock et al., 2015; Ahmedani et al., 2014). The present work is the first to use electronic health records from a broad subset of the population, unrestricted by prior conditions/diagnoses, to build neural network models to predict suicide attempts.
This study was approved by UC Merced’s Institutional Review Board. We used nonpublic versions of California emergency department encounter and hospital admissions data from 2006 through 2010. The California Office of Statewide Health Planning and Development (OSHPD) provided anonymized individual-level patient encounter data from all California-licensed hospital facilities, including general acute care, acute psychiatric, chemical dependency recovery, and psychiatric health facilities, but excluding federal hospitals. ED and inpatient data were screened by the OSHPD’s automated data entry and reporting software program (MIRCal); data fields with error rates of 0.1% or higher were returned to the hospitals for correction (Office of Statewide Healthcare Planning and Development, 2017a, b). Patients with missing age were excluded.
The study dataset consists of ED and inpatient records for all adolescent patients aged 10 to 19 years who had a valid unique identifier (encrypted social security number or SSN) and a California residential zip code in 2010 (64.0% of all records for this age group). Unique identifiers were used to link multiple ED visits/episodes per patient over time, including encounters prior (2006-2009) to the adolescent’s first recorded 2010 visit. A total of 522,056 unique, CA-resident adolescents are available for these analyses. Of these, 5,490 adolescent patients presented in 2010 to an ED with a suicide attempt code, i.e., a primary International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM (Medicode, 1996)) External Cause of Injury code (E-code) of E950.0-959.x (Crosby et al., 2011).
Note that prior neural network suicide prediction models were trained using data from far fewer patients than the present work (Iliou et al., 2016; Nguyen et al., 2016). Significantly, these and many other machine learning studies restrict their attention to data where all patients have a history of mental illness. Our long-term goal is to build models that can compute individualized risk of suicide attempt/fatality for all ED patients; naturally, this includes those with and without a history of any particular illness. Hence our data set is much larger in size and scope than in prior studies.
Note that we do not have linked death records. While we know that the data includes both fatal and non-fatal suicide attempts, with rare exceptions we cannot tell from the present data set whether a particular attempt was fatal.
In this work, we treat each ED/hospital visit in 2006-2009 by each patient as a separate observation. There are 772,923 such visits in the full data set. To each visit, we assign an outcome variable of either 1 (presented in 2010 with a suicide attempt code, as defined above) or 0 (absence of a suicide attempt code in 2010). For each visit, the raw predictors include the patient’s sex, age, race, insurance category, zip code, county of residence, etc. For a full list, please see Table 2
in the Appendix. Importantly, our data does not include textual annotations from medical professionals, survey responses, or numerical mental health evaluations. The vast majority of predictors are categorical; in these cases, we apply one-hot encoding.
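As an illustration, one-hot encoding of a categorical predictor can be sketched as follows; the level names below are hypothetical stand-ins, not the actual OSHPD coding.

```python
def one_hot(value, levels):
    """Return a 0/1 list with a single 1 at the index of `value` in `levels`."""
    vec = [0] * len(levels)
    vec[levels.index(value)] = 1
    return vec

# Hypothetical level names for the 4-level "sex" predictor listed in Table 2.
SEX_LEVELS = ["male", "female", "other", "unknown"]
print(one_hot("female", SEX_LEVELS))  # -> [0, 1, 0, 0]
```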
The only visit variables that require special treatment are the diagnostic codes associated with the visit. Each visit is assigned one primary and up to six other Clinical Classifications Software (CCS) diagnostic codes. The CCS grouping system aggregates >14,000 ICD-9-CM diagnoses into 285 discrete, mutually exclusive, clinically meaningful category codes (e.g., “anxiety disorders”) that are useful for identifying patients in broad diagnosis groupings. In our analysis, for each visit, we aggregate the (up to seven) one-hot encodings of all CCS diagnostic codes. The result is a vector that tracks up to seven diagnoses received in one visit.
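A sketch of this aggregation, assuming the CCS categories have been mapped to indices 0-284 (the function name is ours):

```python
import numpy as np

NUM_CCS = 285  # number of CCS category codes, per the text

def visit_diagnosis_vector(ccs_indices, num_ccs=NUM_CCS):
    """Sum the one-hot encodings of a visit's CCS codes (one primary plus
    up to six others) into a single vector of length num_ccs."""
    v = np.zeros(num_ccs, dtype=int)
    for idx in ccs_indices:
        v[idx] += 1
    return v

# A visit with primary diagnosis index 4 and secondary indices 82 and 82:
vec = visit_diagnosis_vector([4, 82, 82])
```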
Within the 2006-2009 data corpus, patients may visit one or more times. To track a patient’s history, we include for a visit at time t the cumulative sum of all diagnosis vectors (computed as above) recorded at times less than or equal to t. We also record in a new column how many times the patient has been observed in the data set up to and including time t. By virtue of this cumulative encoding, each patient’s visit is transformed into a vector of fixed length. The end result is a full data matrix with 772,923 rows, one per visit.
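The cumulative encoding can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def cumulative_history(visit_vectors):
    """For a patient's per-visit diagnosis vectors in chronological order,
    return, for each visit, a pair (cumulative diagnosis vector up to and
    including that visit, running visit count)."""
    rows, total = [], np.zeros_like(visit_vectors[0])
    for count, v in enumerate(visit_vectors, start=1):
        total = total + v
        rows.append((total.copy(), count))
    return rows

# Two visits: first with diagnosis 0, second with diagnosis 1.
history = cumulative_history([np.array([1, 0]), np.array([0, 1])])
```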
We randomly reshuffle the rows of the data matrix and split the result into a pretraining set (80% of the rows) and a test set (the remaining 20%). We record the column means and variances of the pretraining set, discard columns with zero variance (i.e., unobserved diagnostic codes), and normalize the remaining columns to have zero mean and unit variance.
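One way to implement this normalization, fitting statistics on the pretraining set only and reusing them elsewhere (all names are ours):

```python
import numpy as np

def fit_scaler(X_pre):
    """Compute column means/stds on the pretraining set and a mask of
    columns with nonzero variance (i.e., observed diagnostic codes)."""
    mu, var = X_pre.mean(axis=0), X_pre.var(axis=0)
    keep = var > 0
    return mu[keep], np.sqrt(var[keep]), keep

def transform(X, mu, sd, keep):
    """Drop zero-variance columns and standardize the rest."""
    return (X[:, keep] - mu) / sd

# Columns 1 and 2 are constant and get dropped.
X_pre = np.array([[1.0, 5.0, 0.0], [3.0, 5.0, 0.0]])
mu, sd, keep = fit_scaler(X_pre)
Z = transform(X_pre, mu, sd, keep)
```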
Since only a small fraction of the outcome values in the pretraining set are equal to 1, we have an imbalanced classification problem. We sample with replacement from the rows of the pretraining set associated with the minority (suicide attempt) class to create a bootstrapped data set with balanced classes. The resulting data set has equal numbers of rows from each class.
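A sketch of this bootstrap oversampling step, assuming labels 1 = attempt and 0 = no attempt (names are ours):

```python
import numpy as np

def balance_by_bootstrap(X, y, seed=0):
    """Oversample the minority (label 1) class with replacement until it
    matches the majority class in size."""
    rng = np.random.default_rng(seed)
    idx_min, idx_maj = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    boot = rng.choice(idx_min, size=idx_maj.size, replace=True)
    idx = np.concatenate([idx_maj, boot])
    return X[idx], y[idx]

X = np.arange(10).reshape(10, 1)
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
Xb, yb = balance_by_bootstrap(X, y)
```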
As described above, deep neural networks have been underutilized in suicide prediction research. Hence we focus on neural network classifiers consisting of feedforward networks (equivalently, multilayer perceptrons) with dense, all-to-all connections between layers. As the mathematical details of such models are standard, we do not include them here; the interested reader should consult the Appendix. Here we mention the following salient facts.
First, we focus on three networks: NN2, with two hidden layers; NN4, with four hidden layers; and NN8, with eight hidden layers of a common width except for the first. We chose the NN8 architecture so that its total number of weights is close to the total number of weights in the NN4 model.
Second, we use the scaled exponential linear unit (SELU), selu(x) = λx for x > 0 and λα(exp(x) − 1) for x ≤ 0, with constants λ ≈ 1.0507 and α ≈ 1.6733 (Klambauer et al., 2017). This activation function has been engineered to approximately preserve zero-mean, unit-variance normalization across many layers while avoiding vanishing/exploding gradients (Klambauer et al., 2017). The use of the SELU activation renders batch normalization unnecessary (Goodfellow et al., 2016).
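In numpy, the SELU can be written as (constants rounded from Klambauer et al., 2017):

```python
import numpy as np

LAMBDA, ALPHA = 1.0507, 1.6733  # rounded SELU constants

def selu(x):
    """Scaled exponential linear unit, applied element-wise."""
    x = np.asarray(x, dtype=float)
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# For large negative inputs, selu saturates at -LAMBDA * ALPHA ≈ -1.758.
```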
Finally, the only regularization technique we use is a variant of early stopping (Goodfellow et al., 2016). From the bootstrapped data set, we make a further split into the training set (80%) and a validation set (20%). We monitor the accuracy, sensitivity, and specificity on both the training and validation sets, and halt optimization when we detect insufficient improvement in our metrics on the validation set. Importantly, the test set is not used in this procedure.
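A minimal sketch of such an early-stopping criterion; the patience and tolerance values below are illustrative, not those used in the paper.

```python
def should_stop(val_history, patience=3, min_delta=1e-4):
    """Stop when the monitored validation metric (higher is better) has not
    improved by at least min_delta over the last `patience` epochs."""
    if len(val_history) <= patience:
        return False
    best_before = max(val_history[:-patience])
    recent_best = max(val_history[-patience:])
    return recent_best < best_before + min_delta
```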
| NN2 | 0.697 | 0.982 | 0.503 | 0.956 | (Tran et al., 2013)-ML | 0.37 | 0.27 |
| NN4 | 0.703 | 0.980 | 0.489 | 0.958 | (Walsh et al., 2017) | 0.96 | 0.75 | 0.83 |
| NN8 | 0.697 | 0.936 | 0.220 | 0.919 | (Tran et al., 2013)-Clin. | 0.081 | 0.129 |
4 Results and Discussion
Table 1 shows results for the three trained networks: NN2, NN4, and NN8 (described above). All results described in this section are test set results. That is, we use the trained network to generate predictions for the test set predictors, after normalizing and discarding columns as described above. Note that the prevalence of suicide attempts in the test set is close to that of the training set.
In the upper-left corner of Table 1, we see the results for the entire test set. NN4 has the best results in categories other than sensitivity; in this metric, the NN8 model is the clear winner. Interestingly, the NN8 network is much more willing than NN2 or NN4 to issue a prediction of 1, i.e., that a given visit is associated with a future suicide attempt. With NN2 or NN4, poor choices of hyperparameters lead to models that issue the same prediction for every visit.
In the remainder of the left half of Table 1, we have kept the trained models fixed but now evaluated their performance on subsets of the test set. Specifically, we set a threshold v on the minimum number of visits in a patient’s record; in these results, we simply do not issue a prediction if the patient has made fewer than v visits, including the present one. As one would expect, the performance of all models improves as the amount of data we have about a given patient increases. The prevalence of suicide attempts in these subsets also increases with v, but it is clear that the increases in sensitivity and precision (or positive predictive value) are not solely due to this increased prevalence. The AUC values for the models are highly encouraging.
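The per-subset metrics can be computed from a confusion matrix; a minimal sketch, treating 1 as a predicted/actual attempt:

```python
def confusion_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, precision) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    prec = tp / (tp + fp) if tp + fp else float("nan")
    return sens, spec, prec
```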
In the right half of Table 1, we have broken down the test set results by the presence of prior CCS diagnostic codes: 662 corresponds to a prior diagnosis of self-injury, 651/657 to mood and anxiety disorders, 659 to schizophrenia and psychotic disorders, and 660/661 to alcohol- and substance-related disorders. Here we find several models with much higher sensitivity and precision (or positive predictive value) than on the test set as a whole. This is to be expected; past history of these particular diagnostic codes is known to be associated with future suicide attempts. Correspondingly, the prevalence of suicide attempts is higher in each of these subsets.
In the lower-right corner of Table 1, we have quoted results from the literature. Though these studies used completely different data sets than ours, the comparisons are still interesting. While our NN models outperform both machine and human predictions from Tran et al. (2013), they do not achieve the sensitivity or precision of Walsh et al. (2017); this latter study restricts its modeling efforts to patients with a history of self-injury.
Our results motivate at least four areas for future work. First, we have not unpacked the features learned by the network, i.e., the mapping from the inputs to the activations of the last hidden layer. An understanding of these features will help us interpret the NN models and gauge what they have learned; it may eventually lead to the development of interpretable risk factors for predicting suicide attempts. Second, we have merely scratched the surface of neural network architectures and regularization techniques, leaving unexplored recent developments designed to capture the temporal nature of EHRs (Krishnan et al., 2017; Che et al., 2017) and techniques such as dropout (Srivastava et al., 2014), which may improve generalization for deeper networks. Third, while data access/privacy issues may prevent us from running our NN models on data analyzed previously (Tran et al., 2013; Walsh et al., 2017), it may be possible to run others’ models on our data. Fourth, we can reframe the problem as one of quantifying how similar a given patient is to patients from the two outcome classes. In this way, we may be able to circumvent the class imbalance while also effectively modeling group effects.
We acknowledge computational support from the MERCED cluster, supported by National Science Foundation grant ACI-1429783. Research reported in this publication was supported by the National Institutes of Health under award number R15MH113108-01. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
- Abadi et al.  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, and others. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016. URL https://arxiv.org/abs/1603.04467.
- Ahmedani et al.  Brian K. Ahmedani, Gregory E. Simon, Christine Stewart, Arne Beck, Beth E. Waitzfelder, Rebecca Rossom, Frances Lynch, Ashli Owen-Smith, Enid M. Hunkeler, Ursula Whiteside, Belinda H. Operskalski, M. Justin Coffey, and Leif I. Solberg. Health care contacts in the year before suicide death. Journal of General Internal Medicine, 29(6):870–877, 2014.
- Centers for Disease Control and Prevention  Centers for Disease Control and Prevention. Web-based Injury Statistics Query and Reporting System (WISQARS). Natl Cent Inj Prev Control Centers Dis Control Prev., 2013. URL https://www.cdc.gov/injury/wisqars/fatal.html.
- Che et al.  Zhengping Che, Sanjay Purushotham, Robinder Khemani, and Yan Liu. Interpretable deep models for ICU outcome prediction. In AMIA Annual Symposium Proceedings, volume 2016, page 371. American Medical Informatics Association, 2016. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333206/.
- Che et al.  Zhengping Che, Yu Cheng, Zhaonan Sun, and Yan Liu. Exploiting Convolutional Neural Network for Risk Prediction with Medical Feature Embedding. arXiv:1701.07474 [cs, stat], January 2017. URL http://arxiv.org/abs/1701.07474. arXiv: 1701.07474.
- Chock et al.  Megan M. Chock, Tanner J. Bommersbach, Jennifer L. Geske, and J. Michael Bostwick. Patterns of health care usage in the year before suicide: A population-based case-control study. Mayo Clinic Proceedings, 90(11):1475–1481, 2015.
- Crosby et al.  A. E. Crosby, L. Ortega, and C. Melanson. Self-Directed Violence Surveillance: Uniform Definitions and Recommended Data Elements, Version 1.0. Atlanta, GA: Centers for Disease Control and Prevention, 2011.
- Delgado-Gomez et al.  David Delgado-Gomez, Hilario Blasco-Fontecilla, AnaLucia A. Alegria, Teresa Legido-Gil, Antonio Artes-Rodriguez, and Enrique Baca-Garcia. Improving the accuracy of suicide attempter classification. Artificial Intelligence in Medicine, 52(3):165–168, July 2011. ISSN 09333657. doi: 10.1016/j.artmed.2011.05.004. URL http://linkinghub.elsevier.com/retrieve/pii/S0933365711000595.
- Delgado-Gomez et al.  David Delgado-Gomez, Hilario Blasco-Fontecilla, Federico Sukno, Maria Socorro Ramos-Plasencia, and Enrique Baca-Garcia. Suicide attempters classification: Toward predictive models of suicidal behavior. Neurocomputing, 92:3–8, September 2012. ISSN 09252312. doi: 10.1016/j.neucom.2011.08.033. URL http://linkinghub.elsevier.com/retrieve/pii/S0925231212000835.
- Franklin et al.  Joseph C. Franklin, Jessica D. Ribeiro, Kathryn R. Fox, Kate H. Bentley, Evan M. Kleiman, Xieyining Huang, Katherine M. Musacchio, Adam C. Jaroszewski, Bernard P. Chang, and Matthew K. Nock. Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychological Bulletin, 143(2):187–232, 2017. ISSN 1939-1455, 0033-2909. doi: 10.1037/bul0000084. URL http://doi.apa.org/getdoi.cfm?doi=10.1037/bul0000084.
- Goodfellow et al.  I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Adaptive Computation and Machine Learning Series. MIT Press, 2016. ISBN 978-0-262-03561-3.
- Haerian et al.  Krystl Haerian, Hojjat Salmasian, and Carol Friedman. Methods for identifying suicide or suicidal ideation in EHRs. In AMIA Annual Symposium Proceedings, volume 2012, page 1244. American Medical Informatics Association, 2012. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3540459/.
- Husky et al.  Mathilde M. Husky, Mark Olfson, Jian-ping He, Matthew K. Nock, Sonja Alsemgeest Swanson, and Kathleen Ries Merikangas. Twelve-month suicidal symptoms and use of services among adolescents: Results from the national comorbidity survey. Psychiatric Services, 63(10):989–996, 2012.
- Iliou et al.  Theodoros Iliou, Georgia Konstantopoulou, Mandani Ntekouli, Dimitrios Lymberopoulos, Konstantinos Assimakopoulos, Dimitrios Galiatsatos, and George Anastassopoulos. Machine Learning Preprocessing Method for Suicide Prediction. In Lazaros Iliadis and Ilias Maglogiannis, editors, Artificial Intelligence Applications and Innovations, volume 475, pages 53–60. Springer International Publishing, Cham, 2016. URL http://link.springer.com/10.1007/978-3-319-44944-9_5.
- Klambauer et al.  Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-Normalizing Neural Networks. Advances in Neural Information Processing Systems (NIPS), June 2017. URL http://arxiv.org/abs/1706.02515. arXiv: 1706.02515.
- Krishnan et al.  Rahul G. Krishnan, Uri Shalit, and David Sontag. Structured Inference Networks for Nonlinear State Space Models. In Thirty-First AAAI Conference on Artificial Intelligence, February 2017. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14215.
- Medicode  Medicode. ICD-9-CM: International classification of diseases. 9th revision, clinical modification edition, 1996.
- Nguyen et al.  Thuong Nguyen, Truyen Tran, Shivapratap Gopakumar, Dinh Phung, and Svetha Venkatesh. An evaluation of randomized machine learning methods for redundant data: Predicting short and medium-term suicide risk from administrative records and risk assessments. arXiv preprint arXiv:1605.01116, 2016. URL https://arxiv.org/abs/1605.01116.
- Office of Statewide Healthcare Planning and Development [2017a] Office of Statewide Healthcare Planning and Development. MIRCal Edit Flag Description Guide: Emergency Department and Ambulatory Surgery Data, volume 95811. Sacramento, CA, 2017a.
- Office of Statewide Healthcare Planning and Development [2017b] Office of Statewide Healthcare Planning and Development. MIRCal Edit Flag Description Guide: Inpatient Data. Sacramento, CA, 2017b.
- Pittsenbarger and Mannix  Zachary E. Pittsenbarger and Rebekah Mannix. Trends in Pediatric Visits to the Emergency Department for Psychiatric Illnesses. Academic Emergency Medicine, 21(1):25–30, January 2014. ISSN 1553-2712. doi: 10.1111/acem.12282. URL http://onlinelibrary.wiley.com/doi/10.1111/acem.12282/abstract.
- Ravì et al.  Daniele Ravì, Charence Wong, Fani Deligianni, Melissa Berthelot, Javier Andreu-Perez, Benny Lo, and Guang-Zhong Yang. Deep learning for health informatics. IEEE Journal of Biomedical and Health Informatics, 21(1):4–21, 2017.
- Ruiz et al.  Francisco Ruiz, Isabel Valera, Carlos Blanco, and Fernando Perez-Cruz. Bayesian nonparametric modeling of suicide attempts. In Advances in Neural Information Processing Systems (NIPS), pages 1853–1861, 2012. URL http://papers.nips.cc/paper/4826-bayesian-nonparametric-modeling-of-suicide-attempts.
- Srivastava et al.  Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. URL http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf.
- Tran et al.  Truyen Tran, Dinh Phung, Wei Luo, Richard Harvey, Michael Berk, and Svetha Venkatesh. An integrated framework for suicide risk prediction. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1410–1418. ACM, 2013. URL http://dl.acm.org/citation.cfm?id=2488196.
- Walsh et al.  Colin G. Walsh, Jessica D. Ribeiro, and Joseph C. Franklin. Predicting risk of suicide attempts over time through machine learning. Clinical Psychological Science, 5(3):457–469, 2017.
- Wilson et al.  Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The Marginal Value of Adaptive Gradient Methods in Machine Learning. arXiv:1705.08292 [cs, stat], May 2017. URL http://arxiv.org/abs/1705.08292. arXiv: 1705.08292.
- Zaher and Buckingham  Nawal A. Zaher and Christopher D. Buckingham. Moderating the influence of current intention to improve suicide risk prediction. In AMIA Annual Symposium Proceedings, volume 2016, page 1274. American Medical Informatics Association, 2016. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5333240/.
Mathematical Description of the Model
To illustrate the architectures of our networks, we present Figure 1. In this diagram, information flows from left to right, from the green input nodes to the red output nodes. Hidden layer $\ell$ receives a vector of inputs $h^{(\ell-1)}$ from the layer to its left. Suppose there are $n_\ell$ units or nodes in hidden layer $\ell$; then the trainable parameters for this layer consist of $W^{(\ell)}$, an $n_{\ell-1} \times n_\ell$ weight matrix, and $b^{(\ell)}$, a bias vector. With these parameters, hidden layer $\ell$ outputs
$$h^{(\ell)} = \phi\left( (W^{(\ell)})^T h^{(\ell-1)} + b^{(\ell)} \right),$$
where $^T$ denotes transpose and the activation function $\phi$ (the SELU described above) is applied element-wise.
With the convention that $h^{(0)} = x$, where $x$ is the input vector, we have completely described propagation up to the final hidden layer. Suppose there are $L$ total hidden layers, so that the output of the final hidden layer is $h^{(L)}$. Then the output layer computes $\hat{y} = \sigma(w^T h^{(L)} + b)$ with trainable weight vector $w$ and scalar bias $b$. Here $\sigma(z) = 1/(1+e^{-z})$ is the logistic sigmoid.
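A numpy sketch of this forward pass, with illustrative layer widths and random parameters (not the paper's):

```python
import numpy as np

LAMBDA, ALPHA = 1.0507, 1.6733  # rounded SELU constants

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, hidden, w, b):
    """hidden: list of (W, b) pairs, W of shape (n_in, n_out);
    w, b: output-layer weight vector and scalar bias."""
    h = x
    for W_l, b_l in hidden:
        h = selu(W_l.T @ h + b_l)  # transpose convention as in the text
    return sigmoid(w @ h + b)      # probability of a future attempt

rng = np.random.default_rng(0)
hidden = [(rng.normal(size=(5, 3)), np.zeros(3)),
          (rng.normal(size=(3, 3)), np.zeros(3))]
p = forward(rng.normal(size=5), hidden, rng.normal(size=3), 0.0)
```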
Hence our neural network computes the probability $\hat{y} = P(y = 1 \mid x, \theta)$, where $\theta$ stands for all trainable parameters (weights and biases). To train the model, we find the $\theta$ that maximizes the log likelihood on the training data,
$$\mathcal{L}(\theta) = \sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right].$$
To maximize $\mathcal{L}(\theta)$, we employ either stochastic gradient descent (SGD), for NN8, or SGD with momentum (Wilson et al., 2017), for NN2 and NN4. Gradients are computed using backpropagation. In either case, we employ linear step-size decay; when we use momentum, we hold the momentum parameter fixed. Weights for layer $\ell$ are randomly initialized to have mean zero and a prescribed standard deviation; biases are initialized to zero.
Table of Predictors
| Predictor | Type |
| --- | --- |
| facility ID number | numeric |
| sex | categorical (4 levels) |
| race | categorical (7 levels) |
| insurance category | categorical (6 levels) |
| disposition | categorical (5 levels) |
| urban | categorical (3 levels) |
| disposition (ED) | categorical (22 levels) |
| facility county (ED) | categorical (55 levels) |
| payer (ED) | categorical (20 levels) |
| CCS diagnostic code | categorical (253 levels) |
| number of visits (up to and including present visit) | numeric |