The field of precision healthcare aims to improve the provision of care through precise and personalised prediction, prevention, and intervention.
In recent years, advances in deep learning (DL), a subfield of machine learning (ML), have led to great progress towards personalised predictions in cardiovascular medicine, radiology, neurology, dermatology, ophthalmology, and pathology, to name a few. For instance, Ardila et al. introduced a DL model that can predict the risk of lung cancer from a patient's tomography images with a striking 94.4% accuracy; Poplin et al. showed that DL can predict a range of cardiovascular risk factors from just a retinal fundus photograph; and the list continues (more examples can be found in the reviews by Topol and Esteva et al.). A key contributing factor to this success, in addition to the developments in DL algorithms, was the massive influx of large multimodal biomedical data, including but not limited to, mega cohorts such as UK Biobank, and routinely-collected health data such as electronic health records (EHR).
In recent years, the adoption of EHR systems has increased greatly; the percentage of hospitals that have adopted EHR systems now exceeds 84% in the US and 94% in the UK [7, 8]. As a result, the EHR systems of a national (and/or a large) medical organisation are now likely to capture data from millions of individuals over many years (sometimes decades). Each individual's EHR can link data from many sources (e.g., doctor visits and hospital episodes) and hence contain "concepts" such as diagnoses, interventions, lab tests, clinical narratives, and more. Each instance of a concept can mean a single data point or many; just a single hospitalisation, for instance, can generate thousands of data points for an individual, whereas a diagnosis can be a single data point (i.e., an ICD code). This makes large-scale EHR a uniquely rich source of insight and an unrivalled resource for training data-hungry ML models.
In traditional research on EHR (including studies using ML), individuals are represented to models as a vector of attributes, or "features". This approach relies on experts' ability to define the appropriate features and design the model's structure (i.e., answering questions such as "what are the key features for this prediction?" or "which features should interact with one another?"). Recent developments in deep learning, however, have provided us with models that can learn useful representations (e.g., of individuals, concepts, or an entire record) from raw or minimally-processed data, with minimal need for expert guidance. This happens through a sequence of layers, each employing a large number of simple linear and nonlinear operations to map its inputs to a representation; the progress from layer to layer is expected to result in a final representation in which the data points form distinguishable patterns.
As one of the earliest works on applying deep learning to EHR, Liang et al
showed that deep neural networks can outperform SVMs and decision trees paired with manual feature engineering, over a number of prediction tasks on a number of different datasets. In another early work in this space, Tran et al
proposed the use of restricted Boltzmann machines (RBM) for learning a distributed representation of EHR, which was shown to outperform manual feature extraction when predicting the risk of suicide from individuals' EHR. In a similar approach, Miotto et al
employed a stack of denoising autoencoders (SDA) instead of RBM, and showed that it outperforms many popular feature extraction and feature transformation approaches (e.g., PCA, ICA and Gaussian mixture models) for providing classifiers with useful features to predict the onset of a number of diseases from EHR.
These early works on the application of DL to EHR did not take into account the subtleties of EHR data (e.g., the irregularity of inter-visit intervals, and the temporal order of events, to name a few). In an attempt to address this, Nguyen et al
introduced a convolutional neural network (CNN) model called Deepr (Deep record) for predicting the probability of readmission; they treated one’s medical history as a sequence of concepts (e.g., diagnosis and medication) and inserted a special word between each pair of consecutive visits to denote the time difference between them. In another similar attempt, Choi et al.
introduced a shallow recurrent neural network (RNN) model to predict the diagnoses and medications that are likely to occur in the subsequent visit. Both these works employed some form of embedding to map the non-numeric medical concepts to an algebraic space in which the sequence models can operate.
One of the improvements that was next introduced to the DL models of EHR aimed to enable them to capture the long-term dependencies among events (e.g., key diagnoses such as diabetes can stay a risk factor over a person’s life, even decades after their first occurrence; certain surgeries may prohibit certain future interventions). Pham et al 
introduced a Long Short-Term Memory (LSTM) architecture with an attention mechanism, called DeepCare, which outperformed standard ML techniques, LSTM, and plain RNN in tasks such as predicting the onset of diabetes. In a similar development, Choi et al proposed a model based on a reverse-time attention mechanism to consolidate past influential visits using an end-to-end RNN model named RETAIN for the prediction of heart failure. RETAIN outperformed most of the models at the time of its publication and provided a strong baseline for medical deep learning research.
In this study, given the success of deep sequence models and attention mechanisms in past DL research for EHR, we aim to build on some of the latest developments in deep learning and natural language processing (NLP), more specifically the Transformer architecture, while taking into account various EHR-specific challenges, to provide improved accuracy for the prediction of future diagnoses. We named our model BEHRT (i.e., BERT for EHR), due to the architectural similarities it has with (and our original inspiration from) BERT, one of the most powerful Transformer-based architectures in NLP.
2 Materials and Methods
In this section, after providing an introduction to our EHR data, we will introduce the earlier works that inspired BEHRT, as well as the novel features that this new architecture contributes to the field.
In this study, we used CPRD (Clinical Practice Research Datalink) [19, 17]; it contains longitudinal primary care data from a network of 674 general practitioner (GP) practices in the UK, which is linked to secondary care (i.e., hospital episode statistics, or HES) and other health and administrative databases (e.g., the Office for National Statistics' death registration). Around 1 in 10 GP practices (and nearly 7% of the population) in the UK contribute data to CPRD; it covers 35 million patients, among whom nearly 10 million are currently registered. CPRD is broadly representative of the population by age, sex, and ethnicity. It has been extensively validated and is considered the most comprehensive longitudinal primary care database, with several large-scale epidemiological reports [21, 22, 19] adding to its credibility.
HES, on the other hand, contains data on hospitalisations, outpatient visits, and accident and emergency attendances for all admissions to National Health Service (NHS) hospitals in England. Approximately 75% of the CPRD GP practices in England (58% of all UK CPRD GP practices) participate in patient-level record linkage with HES, which is performed by the Health and Social Care Information Centre. In this study, we only considered the data from GP practices that consented to (and hence have) record linkage with HES. The importance of primary care at the centre of the national health system in the UK, the additional linkages, and all the aforementioned properties make CPRD one of the most suitable EHR datasets in the world for data-driven clinical/medical discovery and machine learning.
2.2 Preprocessing of CPRD
We started our preprocessing with 8 million patients; we only included patients that are eligible for linkage to HES and meet CPRD's quality standard (i.e., using the flags and time windows that CPRD provides to indicate the quality of one's EHR). Furthermore, to only keep the records that have enough history to be useful for prediction, we only kept individuals who have at least 5 visits in their EHR. At the end of this process, we were left with 1.6 million patients to train and evaluate BEHRT on. More details on the steps we took and the number of patients after each one of them can be seen in Fig 1.
In CPRD, diseases are classified using the Med Code (which can be simply mapped to the NHS's standard Read Code) and ICD-10 schemes, for primary and hospital care, respectively. In the ICD-10 universe, one can define diseases at the level of granularity that is appropriate for the analysis of interest by simply choosing the level of the hierarchy to operate at; for instance, operating at the ICD-10 chapter level will lead to 22 diseases, while operating at the full ICD-10 code level will lead to around 10,000 diseases. With Med Code, however, such a hierarchy does not exist, and hence one needs to carry out an exhaustive disease review in order to define the diseases of interest for a given analysis. To alleviate this extra data-processing burden, we first mapped Med Codes to Read Codes (using the mapping provided by CPRD) and excluded the procedure codes. After that, we mapped both ICD-10 codes (at level 4) and Read Codes to Caliber codes, an expert-checked mapping dictionary from University College London. Eventually, this resulted in a total of 301 codes for diagnoses. We denote the list of all these diseases as D = {d_1, d_2, ..., d_301}, where d_i denotes the ith disease code.
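As a toy illustration of this two-step mapping (not the authors' pipeline; the dictionaries and codes below are made-up stand-ins for the CPRD-provided Med Code-to-Read Code mapping and the UCL Caliber dictionary), the lookup could be sketched as:

```python
# Toy illustration of the two-step diagnosis-code mapping; these
# dictionaries are made-up stand-ins, not the real CPRD/Caliber tables.
med_to_read = {"m100": "G30..", "m200": "C10.."}          # Med Code -> Read Code
read_to_caliber = {"G30..": "Myocardial Infarction",      # Read Code -> Caliber
                   "C10..": "Diabetes"}
icd10_to_caliber = {"I21.9": "Myocardial Infarction"}     # ICD-10 (level 4) -> Caliber

def to_caliber(code, scheme):
    """Map a raw code to a Caliber diagnosis; None if unmapped."""
    if scheme == "med":
        return read_to_caliber.get(med_to_read.get(code))
    if scheme == "icd10":
        return icd10_to_caliber.get(code)
    return None

print(to_caliber("m100", "med"))     # Myocardial Infarction
print(to_caliber("I21.9", "icd10"))  # Myocardial Infarction
```

Note how the same Caliber code can be reached from either scheme, which is what consolidates primary and hospital care diagnoses into one vocabulary.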
For each patient, the medical history consists of a sequence of visits to GPs and hospitals; each visit can contain concepts such as diagnoses, medications, measurements and more. In this study, however, we only consider the diagnoses; we denote patient p's EHR as V^p = (V^p_1, V^p_2, ..., V^p_{n_p}), where n_p denotes the number of visits in patient p's EHR, and V^p_j contains the diagnoses in the jth visit, which can be viewed as a list of diagnoses (i.e., V^p_j = [d_1, d_2, ...]). In order to prepare the data for BEHRT, we order the visits (and hence diseases) temporally, and introduce a token to denote the start of the medical history (i.e., [CLS]) and a token for the space between visits (i.e., [SEP]), which results in a new sequence, S^p, that from now on is how we denote each patient's EHR. This process is illustrated in Figure 2.
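A minimal sketch of this sequence construction, assuming BERT-style [CLS]/[SEP] tokens as the start-of-history and between-visit markers (the function name and toy diagnoses are illustrative, not the authors' code):

```python
# Illustrative sketch: flattening a patient's temporally ordered visits
# into one token sequence, with [CLS] marking the start of the medical
# history and [SEP] separating consecutive visits.
def build_sequence(visits):
    """visits: list of visits, each a list of diagnosis codes."""
    tokens = ["[CLS]"]
    for i, visit in enumerate(visits):
        tokens.extend(visit)
        if i < len(visits) - 1:
            tokens.append("[SEP]")
    return tokens

# Example: three visits for one hypothetical patient.
visits = [["Hypertension"], ["Diabetes", "Obesity"], ["Epilepsy"]]
print(build_sequence(visits))
# -> ['[CLS]', 'Hypertension', '[SEP]', 'Diabetes', 'Obesity', '[SEP]', 'Epilepsy']
```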
2.3 BEHRT: A Transformer-based Model for EHR
In this study, we aim to use a given patient's past EHR to predict his/her future diagnoses (if any), as a multi-task classification problem; this will result in a single predictive model that scales across a range of diseases (as opposed to needing to train one predictive model per disease). Modelling of EHR sequences requires dealing with four key challenges: (C.1) complex and nonlinear interactions among past, present and future concepts; (C.2) long-term dependencies among concepts (e.g., diseases occurring early in the history of a patient affecting events far later in the future); (C.3) difficulties of representing multiple heterogeneous concepts of variable sizes and forms to the model; and (C.4) the irregular intervals between consecutive visits.
Similarities between sequences in EHR and natural language have led to the successful use of techniques such as BoW, Skip-gram, RNN, and attention [15, 28] (à la their NLP usage) for learning complex EHR representations in the past. In this study, we take our inspiration from the striking success of Transformers, and more specifically, of a Transformer-based architecture known as BERT. By depicting diagnoses as words, the content of each visit as a sentence, and a patient's entire medical history as a document, we facilitate the use of multi-headed self-attention, positional encoding, and masked language models (MLM) for EHR, under a new architecture we call BEHRT.
We refer readers to the original papers [29, 18] for an exhaustive background description of both the Transformer and BERT. Figure 3B illustrates BEHRT's architecture, which is designed to pre-train deep bidirectional representations of medical concepts by jointly conditioning on both left and right contexts in all layers. The pre-trained representations can be simply employed for a wide range of downstream tasks, e.g., prediction of the next diseases, and disease phenomapping. Such bidirectional contextual awareness of BEHRT's representations is a big advantage when dealing with EHR, where, due to variabilities in the practice of care and/or simply due to random events, the order in which certain diseases happen can be reversed, or the time interval between two diagnoses can be shorter or longer than actually recorded.
BEHRT has many structural advantages over previous methods for modelling EHR data. Firstly, we use feedforward neural networks to model time-sensitive EHR data by examining the various forms of sequential order within the data (e.g., age, order of visits), instead of the RNNs and CNNs that were explored in the past [13, 16]. Recurrent neural networks are notoriously hard to train, due to their exploding and vanishing gradient problems; these issues hamper their ability to learn, particularly when dealing with long sequences. Convolutional neural networks, on the other hand, capture only a limited amount of information with the convolutional kernels in their lower layers, and need to expand their receptive field through additional layers in a hierarchical architecture. BEHRT's feedforward structure alleviates the exploding and vanishing gradient problems and captures information by considering the full sequence at once; this also makes training more efficient, as the data are processed in parallel rather than in sequence (unlike in an RNN).
The embedding layer in BEHRT, as shown in Figure 3, learns the evolution of one's EHR through a combination of four embeddings: disease, "position", age, and "segment". This combination enables BEHRT to define a representation that can capture one's EHR in as much detail as possible. Disease codes are of course important in informing the model of the future state of one's health; that is, there are many common disease trajectories and multi-morbidity patterns, such that knowing one's past diseases can improve the accuracy of the prediction. Positional encodings are either trainable or pre-determined encodings that capture the relative position of words in a sequence. Pre-determined encodings are used in this paper to avoid the weak learning of positional embeddings caused by the imbalanced distribution of medical sequence lengths. Telling the network a disease's position enables it to capture the positional interactions among diseases. Given the feedforward architecture of our network, positional encodings play a key role in filling the gap resulting from the lack of a recurrent structure, which was previously the most common/successful approach for learning from sequences. For BEHRT, we followed the same position-encoding rule proposed for the original Transformer [29].
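The fixed sinusoidal encoding of the original Transformer, which this paragraph refers to, can be sketched as follows (a from-scratch illustration; the dimension here is a toy value, not BEHRT's hidden size of 288):

```python
import math

# Minimal sketch of the fixed sinusoidal positional encoding of the
# original Transformer: even dimensions use sine, odd dimensions cosine,
# with wavelengths forming a geometric progression.
def positional_encoding(position, d_model):
    """Return the d_model-dimensional encoding for one position."""
    enc = []
    for i in range(d_model):
        angle = position / (10000 ** ((i // 2 * 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc

pe0 = positional_encoding(0, 4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0] -- position 0 gives sin(0)/cos(0) pairs
```

Because the encoding is deterministic, rare sequence lengths are encoded just as well as common ones, which is the motivation the text gives for preferring it over a trainable positional embedding.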
Age and segment are two embeddings that did not exist in the original BERT implementation for NLP and are unique to BEHRT; they are an attempt to empower it to deal with the challenges we mentioned earlier. The segment embedding can be either A or B; its purpose is to use two trainable vectors to provide extra information for visit separation. Age, of course, is known to be a key risk factor in epidemiology. By embedding age and linking it to each visit/diagnosis, we provide the model with a sense of time: both the time between events, and a universal notion of time for when things happened that is comparable across patients.
Through a unique combination of the four aforementioned embeddings, we not only provide the model with disease sequences, but also give it a precise sense of the timing of events and data about the delivery of care. In other words, the model has the ability to learn from the diagnosis history, the ages at which diagnoses happened, and the pattern in which the patient was visited. All of these, when combined, can provide a picture of one's health that traditionally we might have sought to paint through additional features extracted from EHR data. Of course, we do not advocate against using the full richness of the EHR; however, the complexity of our architecture, even when paired with a simple subset of EHR, can still provide an accurate prediction of one's future health. Furthermore, BEHRT's flexible architecture enables the use of additional concepts, e.g., by simply adding a fifth or sixth (or more) embedding to the existing four.
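As a hypothetical sketch of how such per-token embeddings could be combined, BERT-style models typically sum them element-wise; the lookup tables, dimension and values below are toy stand-ins, not BEHRT's learned parameters:

```python
# Hypothetical element-wise sum of the four per-token embeddings
# (disease, position, age, segment); toy tables and dimension, not
# BEHRT's learned parameters (its actual hidden size is 288).
D = 4  # toy embedding dimension

disease_emb  = {"Hypertension": [0.1] * D, "Diabetes": [0.2] * D}
position_emb = {0: [0.0] * D, 1: [0.5] * D}   # could also be sinusoidal
age_emb      = {50: [0.01] * D, 51: [0.02] * D}
segment_emb  = {"A": [0.001] * D, "B": [0.002] * D}

def token_embedding(disease, position, age, segment):
    """Sum the four embedding vectors for a single token."""
    parts = [disease_emb[disease], position_emb[position],
             age_emb[age], segment_emb[segment]]
    return [sum(vals) for vals in zip(*parts)]

x = token_embedding("Hypertension", 0, 50, "A")
# each component is 0.1 + 0.0 + 0.01 + 0.001, i.e. roughly 0.111
```

Adding a fifth concept (say, medication) would amount to one more lookup table summed into `parts`, which is the "vector addition" flexibility the text describes.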
2.4 Pre-training BEHRT using MLM
In EHR, just like language, it is intuitively reasonable to believe that a deep bidirectional model is more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model. Therefore, we pre-trained BEHRT using the same approach as the original BERT paper , using MLM. That is, we chose 15% of the disease words at random, and modified them according to the following probabilities:
- 80% of the time, replace the word with [MASK];
- 10% of the time, replace it with a random disease word;
- 10% of the time, keep it unchanged.
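The masking rule above can be sketched as follows (an illustrative implementation, not the authors' code; the special-token handling, vocabulary and seeding are assumptions):

```python
import random

# Illustrative sketch of the 15% / 80-10-10 MLM masking rule.
def mask_tokens(tokens, vocab, rng):
    masked, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or rng.random() >= 0.15:
            masked.append(tok)      # not selected: left untouched
            labels.append(None)     # no prediction target here
            continue
        labels.append(tok)          # selected: model must recover this word
        r = rng.random()
        if r < 0.8:
            masked.append("[MASK]")           # 80%: replace with [MASK]
        elif r < 0.9:
            masked.append(rng.choice(vocab))  # 10%: random disease word
        else:
            masked.append(tok)                # 10%: keep unchanged
    return masked, labels

rng = random.Random(0)
tokens = ["Hypertension", "Diabetes", "Epilepsy"] * 20
masked, labels = mask_tokens(tokens, ["Asthma", "Angina"], rng)
assert len(masked) == len(tokens)
```

The `labels` list records which positions carry a loss during pre-training; all other positions contribute no MLM loss.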
Under this setting, BEHRT does not know which of the disease words are masked, so it stores a contextual representation of all of the disease words. Additionally, the small prevalence of change (i.e., only 15% of all disease words) will not hamper the model's ability to understand the EHR language. Lastly, the replacement of disease words acts as noise injected into the model; it distracts the model from learning the true left and right context, and instead forces the model to fight through the noise and continue learning the overall disease trajectories. The pre-training MLM task was evaluated using the precision score, which calculates the ratio of true positives over the number of predicted positive samples (precision calculated at a threshold of 0.5). The average is calculated over all labels and over all patients.
2.5 Disease Prediction
In order to provide a comprehensive evaluation of BEHRT, we assess its learning on three predictive tasks: prediction of the concepts in the next visit (T1), prediction of the occurrence of diseases in the next 6 months (T2), and prediction of the occurrence of diseases in the next 12 months (T3). In order to train our model and assess the goodness of its predictions across these tasks, we first randomly allocated the patients into three groups of train, validation and test (containing 80%, 10% and 10% of the patients, respectively). To define the training examples (i.e., input-output pairs) for T1, we randomly choose an index j for each patient and form (V_1, ..., V_j) and Y_{j+1}, as input and output, respectively, where Y_{j+1} is a multi-hot vector of length |D|, with 1 for the diseases that exist in V_{j+1}. Note that, for each patient, we have only one input-output pair.
For both T2 and T3, the formation of the input and label is slightly modified. First, patients that do not have 6 or 12 months (for T2 and T3, respectively) worth of EHR (with or without a visit) after V_j are not included in these analyses. Second, j is chosen randomly from {1, ..., j_max}, where j_max denotes the highest index after which there is 6 or 12 months (for T2 and T3, respectively) worth of EHR (with or without a visit). Lastly, Y_{6m} and Y_{12m} are multi-hot vectors of length |D|, with 1 for the concepts/diseases that exist in the next 6 and 12 months, respectively. As a result of this final filtering of patients, we had 699K, 391K, and 342K patients for T1, T2, and T3, respectively.
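The multi-hot label construction described for these tasks can be illustrated with a toy vocabulary (in the paper, |D| is the full set of Caliber diagnosis codes; the diseases below are placeholders):

```python
# Illustrative construction of a multi-hot label vector: length |D|,
# with 1 at the index of each disease observed in the target window.
DISEASES = ["Hypertension", "Diabetes", "Epilepsy", "Asthma"]  # toy |D| = 4
INDEX = {d: i for i, d in enumerate(DISEASES)}

def multi_hot(observed):
    """observed: diseases in the next visit (T1) or next 6/12 months (T2/T3)."""
    y = [0] * len(DISEASES)
    for d in observed:
        y[INDEX[d]] = 1
    return y

print(multi_hot(["Diabetes", "Asthma"]))  # [0, 1, 0, 1]
```

Framing the label this way is what makes the problem multi-label: one sigmoid output per disease, rather than a single softmax over mutually exclusive classes.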
We denote the model's prediction for patient p in the aforementioned tasks as Ŷ^p, where the ith entry is the model's predicted probability of that person having disease d_i. The evaluation metrics we used to compare the Y^p and Ŷ^p vectors are the area under the ROC curve (AUROC) and the average precision score (APS); the latter is a weighted mean of the precision and recall values achieved at different thresholds. We calculated the APS and AUROC for each patient first, and then averaged the resulting APS and AUROC scores across all patients; this average is the key metric when comparing BEHRT with other state-of-the-art architectures in the field. The methods we used here for APS and AUROC are described in [33, 34].
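As a from-scratch illustration of the average precision score for a single patient's label vector (a simplified stand-in for a library routine such as scikit-learn's `average_precision_score`; it does not handle tied scores):

```python
# From-scratch average precision: step through predictions in decreasing
# score order and accumulate precision weighted by the recall increment,
# AP = sum_n (R_n - R_{n-1}) * P_n.
def average_precision(y_true, y_score):
    total_pos = sum(y_true)
    if total_pos == 0:
        return 0.0
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            tp += 1
            recall = tp / total_pos   # R_n
            precision = tp / rank     # P_n
            ap += (recall - prev_recall) * precision
            prev_recall = recall
    return ap

# A perfect ranking of the two positive labels gives AP = 1.0:
print(average_precision([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))  # 1.0
```

Averaging this quantity over all patients gives the per-patient APS reported in the comparisons.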
3 Results

To begin with, we used Bayesian Optimisation to find the optimal hyperparameters for the MLM pre-training. The main hyperparameters here are the number of layers, the number of attention heads, the hidden size, and the intermediate size (i.e., the size of the neural network layer titled the "intermediate layer"). This process resulted in an optimal architecture with 6 layers, 12 attention heads, an intermediate layer size of 512, and a hidden size of 288; the model's performance on the MLM task described in Section 2.4 was 0.6597 in precision score. Further details can be found in Appendix A.
In order for BEHRT's numerical processes to be applicable to EHR, we first need to map non-numeric concepts such as diagnoses to a vector space (i.e., a disease embedding). Therefore, we start the results by showing the performance of our pre-training process in embedding the diseases, where we mapped each of the 301 diseases into a 288-dimensional vector. Note that, for evaluating an embedding technique, there is not a single gold-standard metric, even in NLP, where the literature has a longer history and hence is more mature. In this study, our assessment is based on two techniques: visual investigation (i.e., in comparison with medical knowledge), and evaluation in a prediction task. For the former, we used t-SNE to reduce the dimensionality of the disease vectors to two; the results are shown in Figure 4. Based on the resulting patterns in the lower dimension, we can see that diseases that are known to co-occur and/or belong to the same clinical groups are grouped together.
A reassuring pattern that can be seen in Figure 4 is the natural stratification of gender-specific diseases. For instance, diseases that are unique to women (e.g., Endometriosis, Dysmenorrhea, Menorrhagia, ...) are quite distant from those that are unique to men (e.g., Erectile Dysfunction, Primary Malignancy of Prostate, ...). Such patterns seem to suggest that our disease embedding built an understanding of the context in which diagnoses happen, and hence can infer factors, such as gender, that it is not explicitly fed.
Furthermore, the colours in Figure 4 represent the original Caliber disease chapters (see the legend in the main subplot). As can be seen, natural clusters are formed that in most cases consist of diseases of the same chapter (i.e., the same colour). Some of these clusters, however, are correlated with but not identical to these chapters; for instance, many eye and adnexa diseases appear amongst nervous system diseases, and many nervous system diseases in turn appear among musculoskeletal diseases. Overall, this map can be seen as diseases' correspondence to one another based on 1.6 million people's EHR, and it seems safe to say that the embedding makes sense and hence passes the visual evaluation test.
Another interesting property of BEHRT is its self-attention mechanism; it gives our model the ability to find relationships among events that go beyond temporal/sequence adjacency. This self-attention mechanism is able to unearth deeper and more complex relationships between a disease in one visit and other surrounding diagnoses. We visualise this self-attention by plotting each medical history against itself, showing how each disease relates to the others around it for each patient. The results for a couple of example patients are shown in Figure 5. Note that, since BEHRT is bidirectional, the self-attention mechanism captures non-temporal/non-directional relationships among diseases.
For patient A, for example, the self-attention mechanism has found strong connections between Rheumatoid Arthritis and Enthesopathies and synovial disorders (far in the future of the patient's record). This is a great example of where attention can go beyond recent events and find long-range dependencies among diseases. Note that, as described earlier and illustrated in Figure 3, the sequence we model is a combination of four embeddings (disease, plus age, segment and position) that go through layers of transformations to form a latent-space abstraction. While in Figure 5 we labelled the cells with disease names, a more precise labelling would be diseases in their context (e.g., at a given age).
After the MLM pre-training, BEHRT can be considered a universal EHR feature extractor that, with a small amount of additional training, can be employed for a range of downstream tasks. In this work, the downstream task of choice is the multi-disease prediction problem that we described in Section 2.5. To train a predictor, we feed the output from BEHRT to a single feed-forward classifier layer and train it three separate times, once for each of the three tasks (T1-T3) described in Section 2.5. The evaluation of the model's performance is shown in Table 1, which demonstrates BEHRT's superior predictive power compared to two of the most successful approaches in the literature (i.e., RETAIN and Deepr). We used Bayesian Optimisation to find the optimal hyperparameters for RETAIN and Deepr before assessing them in terms of APS and AUROC scores. More details on their hyperparameter search and optimisation can be found in Appendix A.
| Model Name | Next Visit (APS / AUROC) | Next 6M (APS / AUROC) | Next 12M (APS / AUROC) |
|------------|--------------------------|-----------------------|------------------------|
| BEHRT      | 0.462 / 0.954            | 0.525 / 0.958         | 0.506 / 0.955          |
| Deepr      | 0.360 / 0.942            | 0.393 / 0.943         | 0.393 / 0.943          |
| RETAIN     | 0.382 / 0.921            | 0.417 / 0.927         | 0.413 / 0.928          |
Besides comparing the APS, which provides an average view across all patients, all diseases and all thresholds, we are also interested in analysing the model's performance in predicting each disease. To do so for a given disease d_i, we only considered the ith location in the Y and Ŷ vectors and calculated the AUROC and APS scores, as well as the occurrence ratio, for comparison. The results for T2 (i.e., the next-6-months prediction task) are shown in Figure 6. For visual convenience, we did not include rare diseases with a prevalence of less than 1% in our data. The results show that the model is able to make predictions with relatively high precision and recall for diseases such as Epilepsy (0.016), Primary Malignancy of Prostate (0.011), Polymyalgia Rheumatica (0.013), Hypo or hyperthyroidism (0.047) and Depression (0.0768). A numerical summary of this analysis can be found in Appendix B. Furthermore, a comparison of the general APS/AUROC trends across the three models (BEHRT, RETAIN, and Deepr) can be found in Appendix C.
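The per-disease evaluation described above, gathering the ith entry of every patient's label and prediction vectors and scoring them, can be sketched with a rank-based AUROC (a simplified stand-in for a library routine such as scikit-learn's `roc_auc_score`; it assumes no tied scores):

```python
# Rank-based (Mann-Whitney) AUROC for one disease across patients;
# simplified: no handling of tied prediction scores.
def auroc(y_true, y_score):
    order = sorted(range(len(y_score)), key=lambda k: y_score[k])
    ranks = {k: r + 1 for r, k in enumerate(order)}   # 1-based ranks
    pos = [k for k in range(len(y_true)) if y_true[k]]
    n_pos, n_neg = len(pos), len(y_true) - len(pos)
    rank_sum = sum(ranks[k] for k in pos)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def per_disease_auroc(Y, Y_hat, i):
    """Score disease i from per-patient label (Y) and prediction (Y_hat) vectors."""
    return auroc([y[i] for y in Y], [s[i] for s in Y_hat])

print(auroc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.2]))  # 1.0
```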
4 Conclusions and Future Work
In this paper, we introduced a novel deep neural model for EHR called BEHRT, which can be pre-trained on a large dataset and then, with a small amount of fine-tuning, achieve striking performance on a wide range of downstream tasks. We demonstrated this property of the model by training and testing it on CPRD, one of the largest linked primary care EHR datasets. Based on our results, BEHRT outperformed the best deep EHR models in the literature over a range of diseases (i.e., in multi-label prediction of diagnoses in the near future), by at least 8% (absolute improvement in APS) in any given task.
BEHRT offers a flexible architecture that is capable of capturing more modalities of EHR data. In this paper, we designed and tested a BEHRT model that relied on four key embeddings: disease, age, segment and position. Through this mix, the model not only has the ability to learn about past diseases and their relationships with one another (and hence their effect on the next likely disease), but also gains insights about the underlying generating process of the EHR; we can refer to this as the practice of care. In other words, the model will engineer features (i.e., complex representations) that are capable of capturing concepts such as "this patient had diseases X and Y at young ages, and suddenly the frequency of visits increased and new diagnoses appeared, which can lead to a high chance of the next disease being Z". In future work, one can add to the four concepts we employed and bring medications, tests and interventions into the model with minimal architectural changes; only a vector addition in Figure 3.
Our primary objective in this study was to provide the field with accurate predictive models for the prediction of future diseases. However, BEHRT provides multiple byproducts that can be useful on their own and/or as foundational preprocessing for future works. For instance, the disease embeddings resulting from BEHRT can provide great insight into how various diseases relate to each other; the model goes beyond simple disease co-occurrence and rather learns to score the closeness of diseases based on their trajectories in a large population of patients. Furthermore, the disease correspondence that results from BEHRT's attention mechanism has been shown to be a useful tool for illustrating the disease trajectories of multi-morbid patients; not only does it show how diseases co-occur, but it also shows the influence of certain diseases in the past on future diseases of interest. These correspondences are not strictly temporal but rather contextual. As future work, we aim to provide these attention-visualisation tools to medical researchers to help them better understand the contextual meaning of a diagnosis amidst the other diagnoses of a patient. Through this tool, medical researchers can even craft medical-history timelines based on certain diseases or patterns and, in a way, query our BEHRT model and visualiser to perhaps uncover novel disease contexts.
In the future, we wish to make improvements to our model and use ensembles of BEHRT and its variations for better predictive power. Furthermore, we also plan to bring in more medical features, such as treatment records and more demographic information (region, ethnicity, etc.). Also, as discussed before, since Caliber consolidates specific codes into 301 disease codes and fails to map many other codes, we wish to look into more stable mappings that preserve the comprehensiveness of CPRD and reduce noise in our dataset. In addition, we wish to embark on a deep dive into a single disease, perhaps heart failure or hypertension, to use BEHRT's diagnostic power for specific disease prediction.
This research was funded by the Oxford Martin School (OMS) and supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The views expressed are those of the authors and not necessarily those of the OMS, the UK National Health Service (NHS), the NIHR or the Department of Health and Social Care. This work uses data provided by patients and collected by the NHS as part of their care and support and would not have been possible without access to this data. The NIHR recognises and values the role of patient data, securely accessed and stored, both in underpinning and leading to improvements in research and care. We also thank Wayne Dorrington for his work in creating figures for this paper (Figure 2 and Figure 3).
-  Diego Ardila, Atilla P Kiraly, Sujeeth Bharadwaj, Bokyung Choi, Joshua J Reicher, Lily Peng, Daniel Tse, Mozziyar Etemadi, Wenxing Ye, Greg Corrado, and David P Naidich. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, 25(June), 2019.
-  Ryan Poplin, Avinash V Varadarajan, Katy Blumer, Yun Liu, Michael V McConnell, Greg S Corrado, Lily Peng, and Dale R Webster. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering, 2(3):158–164, 2018.
-  Eric J Topol. High-performance medicine: the convergence of human and artificial intelligence. Nature Medicine, 25(January), 2019.
-  Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. A guide to deep learning in healthcare. Nature medicine, 25(1):24–29, 2019.
-  Cathie Sudlow, John Gallacher, Naomi Allen, Valerie Beral, Paul Burton, John Danesh, Paul Downey, Paul Elliott, Jane Green, Martin Landray, Bette Liu, Paul Matthews, Giok Ong, Jill Pell, Alan Silman, Alan Young, Tim Sprosen, Tim Peakman, and Rory Collins. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLOS Medicine, 12(3):e1001779, 2015.
-  Benjamin Shickel, Patrick Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, 2018.
-  ONC Annual Meeting. Electronic Public Health Reporting. 2018. Available at: https://www.healthit.gov/sites/default/files/2018-12/ElectronicPublicHealthReporting.pdf.
-  Sonal Parasrampuria and Jawanna Henry. Hospitals’ Use of Electronic Health Records Data, 2015-2017. ONC Data Brief, No. 46, 2019.
-  Fatemeh Rahimian, Gholamreza Salimi-Khorshidi, Amir H Payberah, Jenny Tran, Roberto Ayala Solares, Francesca Raimondi, Milad Nazarzadeh, Dexter Canoy, and Kazem Rahimi. Predicting the risk of emergency admission with machine learning: Development and validation using linked electronic health records. PLoS Medicine, 15(11):1–18, 2018.
-  Znaonui Liang, Gang Zhang, Jimmy Xiangji Huang, and Qmming Vivian Hu. Deep learning for healthcare decision making with EMRs. Proceedings - 2014 IEEE International Conference on Bioinformatics and Biomedicine, IEEE BIBM 2014, pages 556–559, 2014.
-  Truyen Tran, Tu Dinh Nguyen, Dinh Phung, and Svetha Venkatesh. Learning vector representation of medical objects via EMR-driven nonnegative restricted Boltzmann machines (eNRBM). Journal of Biomedical Informatics, 2015.
-  Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Scientific Reports, 6(May):1–10, 2016.
-  Phuoc Nguyen, Truyen Tran, Nilmini Wickramasinghe, and Svetha Venkatesh. Deepr: A Convolutional Net for Medical Records. IEEE Journal of Biomedical and Health Informatics, 21(1):22–30, may 2017.
-  Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. JMLR workshop and conference proceedings, 56:301–318, 2016.
-  Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. DeepCare: A deep dynamic memory model for predictive medicine. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9652 LNAI(i):30–41, 2016.
-  Edward Choi, Mohammad Taha Bahadori, Joshua A. Kulas, Andy Schuetz, Walter F. Stewart, and Jimeng Sun. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. arXiv, 2016.
-  Jose Roberto Ayala Solares, Francesca Elisa Diletta Raimondi, Yajie Zhu, Fatemeh Rahimian, Dexter Canoy, Jenny Tran, Ana Catarina Pinho Gomes, Amir Payberah, Mariagrazia Zottoli, Milad Nazarzadeh, Nathalie Conrad, Kazem Rahimi, and Gholamreza Salimi-Khorshidi. Deep Learning for Electronic Health Records: A Comparative Review of Multiple Deep Neural Architectures. preprint, page 57, 2019.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 2018.
-  Emily Herrett, Arlene M Gallagher, Krishnan Bhaskaran, Harriet Forbes, Rohini Mathur, Tjeerd Van Staa, and Liam Smeeth. Data Resource Profile: Clinical Practice Research Datalink (CPRD). International Journal of Epidemiology, 44(3):827–836, 2015.
-  T Walley and A Mantgani. The UK General Practice Research Database. The Lancet, 350(9084):1097–1099, 1997.
-  Connor A Emdin, Simon G Anderson, Thomas Callender, Nathalie Conrad, Gholamreza Salimi-Khorshidi, Hamid Mohseni, Mark Woodward, and Kazem Rahimi. Usual blood pressure, peripheral arterial disease, and vascular risk: Cohort study of 4.2 million adults. BMJ (Online), 2015.
-  Connor A. Emdin, Simon G. Anderson, Gholamreza Salimi-Khorshidi, Mark Woodward, Stephen MacMahon, Terrence Dwyer, and Kazem Rahimi. Usual blood pressure, atrial fibrillation and vascular risk: Evidence from 4.3 million adults. International Journal of Epidemiology, 2017.
-  F. Lee, H. R. S. Patel, and M. Emberton. The 'top 10' urological procedures: A study of hospital episodes statistics 1998–99. BJU International, 2002.
-  Hamid Mohseni, Amit Kiran, Reza Khorshidi, and Kazem Rahimi. Influenza vaccination and risk of hospitalization in patients with heart failure: A self-controlled case series study. European Heart Journal, 2017.
-  NHS. Read Codes. Available at: https://digital.nhs.uk/services/terminology-and-classifications/read-codes.
-  WHO. ICD-10 online versions. Available at https://icd.who.int/browse10/2016/e.
-  Valerie Kuan, Spiros Denaxas, Arturo Gonzalez-Izquierdo, Kenan Direk, Osman Bhatti, Shanaz Husain, Shailen Sutaria, Melanie Hingorani, Dorothea Nitsch, Constantinos A Parisinos, R Thomas Lumbers, Rohini Mathur, Reecha Sofat, Juan P Casas, Ian C K Wong, and Harry Hemingway. A chronological map of 308 physical and mental health conditions from 4 million individuals in the English National Health Service. The Lancet Digital Health, 1(2):e63–e77, 2019.
-  Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arXiv, 2014.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762 [cs], 2017.
-  Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training Recurrent Neural Networks. arXiv, 2012.
-  The Academy of Medical Sciences. Multimorbidity: a priority for global health research. The Academy of Medical Sciences, pages 1–127, 2018.
-  David M W Powers. Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. arXiv, 2007.
-  Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 2006.
-  Mu Zhu. Recall, precision and average precision. Department of Statistics and Actuarial Science, …, 2004.
-  Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian Optimization of Machine Learning Algorithms. NIPS, 2012.
-  Bin Wang, Angela Wang, Fenxiao Chen, Yuncheng Wang, and C. Jay Kuo. Evaluating Word Embedding Models: Methods and Experimental Results. arXiv, pages 1–13, 2019.
-  Laurens Van Der Maaten and Geoffrey Hinton. Visualizing Data using t-SNE. JMLR, 9:2579–2605, 2008.
-  Jesse Vig. Visualizing Attention in Transformer-Based Language Representation Models. arXiv, pages 2–7, 2019.
Appendix A Hyperparameter Tuning
This appendix reports the hyperparameter tuning results. Table 2 shows the results of the hyperparameter search for MLM training, for which we performed Bayesian optimisation to find the optimal model parameters.
We also performed hyperparameter searches for the Deepr (Table 3) and RETAIN (Table 4) models to ensure a fair comparison of model performance with BEHRT.
Table 2: BEHRT (MLM) hyperparameter search

| Iteration | Hidden Size | Layers | Attention Heads | Intermediate Size | Precision |
|---|---|---|---|---|---|

Table 3: Deepr hyperparameter search

| Iteration | Filters | Kernel Size | FC I | FC II | FC III | Dropout I | Dropout II | Dropout III | Learning Rate | Average Precision |
|---|---|---|---|---|---|---|---|---|---|---|

Table 4: RETAIN hyperparameter search

| Iteration | Embedding Size | Recurrent Size | Dropout Embedding | Dropout Context | L2 | Average Precision |
|---|---|---|---|---|---|---|
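The Bayesian optimisation used for these searches can be sketched as a Gaussian-process loop with an expected-improvement acquisition over a discrete candidate grid (a minimal sketch in the spirit of Snoek et al.; the objective function and candidate values below are placeholders, not the actual search space):

```python
import itertools
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def dummy_objective(hidden_size, layers, heads):
    # Placeholder for "train the model, return validation (average) precision".
    return 1.0 / (1.0 + abs(hidden_size - 256) / 256 + abs(layers - 6) / 6)

# Hypothetical candidate grid: (hidden size, layers, attention heads).
grid = np.array(list(itertools.product([64, 128, 256, 512],
                                       [2, 4, 6, 8],
                                       [4, 8, 12])), dtype=float)

rng = np.random.default_rng(0)
idx = list(rng.choice(len(grid), size=3, replace=False))  # initial random trials
scores = [dummy_objective(*grid[i]) for i in idx]

for _ in range(10):
    # Fit a GP surrogate to the trials observed so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(grid[idx], scores)
    mu, sigma = gp.predict(grid, return_std=True)
    best = max(scores)
    sigma = np.maximum(sigma, 1e-9)                        # avoid division by zero
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)   # expected improvement
    ei[np.asarray(idx)] = -np.inf                          # never re-run a trial
    nxt = int(np.argmax(ei))
    idx.append(nxt)
    scores.append(dummy_objective(*grid[nxt]))

best_i = idx[int(np.argmax(scores))]
print("best configuration:", grid[best_i], "score:", max(scores))
```

Each iteration spends one (expensive) model training where the surrogate expects the largest improvement, which is why a handful of iterations can outperform a much larger random search.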
Appendix B Disease-wise Model Performance
Here we show disease-wise BEHRT performance in terms of AUROC and APS. We display codes with an occurrence ratio of at least 0.01; for each, the table below lists the Caliber code, the APS and AUROC, the disease description, the occurrence ratio, and the chapter.
| Caliber Code | APS | AUROC | Disease | Occurrence Ratio | Chapter |
|---|---|---|---|---|---|
| 92 | 0.066765 | 0.723828 | Gastritis and duodenitis | 0.011198 | Diseases of the digestive system |
| 66 | 0.108185 | 0.797633 | Diaphragmatic hernia | 0.011490 | Diseases of the digestive system |
| 100 | 0.118093 | 0.742646 | Hearing loss | 0.021964 | Diseases of the ear and mastoid process |
| 273 | 0.132567 | 0.773249 | Spondylosis | 0.013459 | Diseases of the musculoskeletal system and connective tissue |
| 189 | 0.142594 | 0.844596 | Pleural effusion | 0.010229 | Diseases of the respiratory system |
| 172 | 0.162186 | 0.798654 | Other anaemias | 0.023303 | Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism |
| 29 | 0.163355 | 0.822707 | Bacterial Diseases (excl TB) | 0.023979 | Certain infectious and parasitic diseases |
| 130 | 0.167457 | 0.804545 | Iron deficiency anaemia | 0.020780 | Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism |
| 295 | 0.169674 | 0.836610 | Urinary Tract Infections | 0.022534 | Diseases of the genitourinary system |
| 69 | 0.170852 | 0.811759 | Diverticular disease of intestine (acute and chronic) | 0.015966 | Diseases of the digestive system |
| 8 | 0.177170 | 0.810068 | Allergic and chronic rhinitis | 0.023241 | Diseases of the respiratory system |
| 170 | 0.181182 | 0.847938 | Osteoporosis | 0.013982 | Diseases of the musculoskeletal system and connective tissue |
| 71 | 0.184000 | 0.790655 | Dyslipidaemia | 0.026010 | Endocrine, nutritional and metabolic diseases |
| 168 | 0.184629 | 0.799592 | Oesophagitis and oesophageal ulcer | 0.022426 | Diseases of the digestive system |
| 217 | 0.203218 | 0.855593 | Primary Malignancy Other Skin and subcutaneous tissue | 0.012013 | Neoplasms |
| 93 | 0.206455 | 0.776367 | Gastro-oesophageal reflux disease | 0.026271 | Diseases of the digestive system |
| 290 | 0.208524 | 0.814481 | Type 1 Diabetes Mellitus, Type 2 Diabetes Mellitus, and Diabetes Mellitus – other or not specified | 0.021226 | Endocrine, nutritional and metabolic diseases |
| 3 | 0.219637 | 0.869366 | Actinic keratosis | 0.012490 | Diseases of the skin and subcutaneous tissue |
| 131 | 0.220017 | 0.874319 | Irritable bowel syndrome | 0.011182 | Diseases of the digestive system |
| 294 | 0.223863 | 0.824114 | Urinary Incontinence | 0.020057 | Diseases of the genitourinary system |
| 169 | 0.234444 | 0.785766 | Osteoarthritis (excl spine) | 0.043714 | Diseases of the musculoskeletal system and connective tissue |
| 176 | 0.245906 | 0.834062 | Other or unspecified infectious organisms | 0.030963 | Diseases of the respiratory system |
| 95 | 0.249208 | 0.894444 | Glaucoma | 0.011367 | Diseases of the eye and adnexa |
| 109 | 0.250573 | 0.885547 | Hyperplasia of prostate | 0.020842 | Diseases of the genitourinary system |
| 184 | 0.264687 | 0.879325 | Peripheral arterial disease | 0.010951 | Diseases of the circulatory system |
| 79 | 0.265863 | 0.762939 | Enthesopathies & synovial disorders | 0.047036 | Diseases of the musculoskeletal system and connective tissue |
| 81 | 0.267187 | 0.905812 | Erectile dysfunction | 0.017873 | Mental and behavioural disorders |
| 140 | 0.268504 | 0.867094 | Lower Respiratory Tract Infections | 0.023518 | Certain infectious and parasitic diseases |
| 63 | 0.271027 | 0.753816 | Dermatitis (atopic/contact/other/unspecified) | 0.049051 | Diseases of the skin and subcutaneous tissue |
| 142 | 0.292598 | 0.893802 | Macular degeneration | 0.010752 | Diseases of the eye and adnexa |
| 274 | 0.296798 | 0.889236 | Stable Angina | 0.032039 | Diseases of the circulatory system |
| 57 | 0.301041 | 0.900088 | Coronary heart disease not otherwise specified | 0.035177 | Diseases of the circulatory system |
| 275 | 0.307238 | 0.911618 | Stroke Not otherwise specified (NOS) | 0.023395 | Diseases of the nervous system |
| 45 | 0.319433 | 0.863447 | Cataract | 0.042099 | Diseases of the eye and adnexa |
| 1 | 0.319972 | 0.845171 | Abdominal Hernia | 0.019180 | Diseases of the digestive system |
| 44 | 0.325143 | 0.843480 | Carpal tunnel syndrome | 0.012013 | Diseases of the nervous system |
| 101 | 0.334902 | 0.912117 | Heart failure | 0.024918 | Diseases of the circulatory system |
| 164 | 0.335131 | 0.879670 | Obesity | 0.017442 | Endocrine, nutritional and metabolic diseases |
| 22 | 0.348523 | 0.885055 | Asthma | 0.026133 | Diseases of the respiratory system |
| 97 | 0.349361 | 0.882694 | Gout | 0.018058 | Diseases of the musculoskeletal system and connective tissue |
| 65 | 0.350132 | 0.942604 | Diabetic ophthalmic complications | 0.018919 | Endocrine, nutritional and metabolic diseases |
| 147 | 0.368465 | 0.911541 | Migraine | 0.012028 | Diseases of the nervous system |
| 229 | 0.398842 | 0.904686 | Psoriasis | 0.011751 | Diseases of the musculoskeletal system and connective tissue |
| 17 | 0.410899 | 0.858498 | Anxiety disorders | 0.041914 | Mental and behavioural disorders |
| 146 | 0.433645 | 0.969406 | Menorrhagia and polymenorrhoea | 0.015504 | Diseases of the genitourinary system |
| 113 | 0.489456 | 0.905032 | Hypo or hyperthyroidism | 0.047897 | Endocrine, nutritional and metabolic diseases |
| 302 | 0.491672 | 0.855823 | Vitamin B12 deficiency anaemia | 0.014489 | Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism |
| 51 | 0.501496 | 0.923082 | Chronic obstructive pulmonary disease (COPD) | 0.036869 | Diseases of the respiratory system |
| 23 | 0.514881 | 0.901268 | Atrial Fibrillation and flutter | 0.077629 | Diseases of the circulatory system |
| 110 | 0.531597 | 0.819527 | Hypertension | 0.200618 | Diseases of the circulatory system |
| 61 | 0.542223 | 0.950442 | Dementia | 0.024656 | Mental and behavioural disorders |
| 62 | 0.553561 | 0.877904 | Depression | 0.076876 | Mental and behavioural disorders |
| 85 | 0.573880 | 0.934049 | Female genital prolapse | 0.015781 | Diseases of the genitourinary system |
| 220 | 0.575574 | 0.964776 | Primary Malignancy Prostate | 0.011844 | Neoplasms |
| 6 | 0.583305 | 0.952656 | Alcohol Problems | 0.014535 | Mental and behavioural disorders |
| 194 | 0.647243 | 0.955062 | Polymyalgia Rheumatica | 0.013213 | Diseases of the musculoskeletal system and connective tissue |
| 80 | 0.648763 | 0.977907 | Epilepsy | 0.016104 | Diseases of the nervous system |
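As a hedged illustration of how per-disease APS and AUROC figures like these can be computed with scikit-learn (the labels and scores below are synthetic stand-ins, not the study data):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(42)
n_patients, n_diseases = 1000, 3

# Synthetic next-visit labels and predicted probabilities; the scores are
# correlated with the labels so the metrics come out non-trivial.
y_true = rng.integers(0, 2, size=(n_patients, n_diseases))
y_score = y_true * 0.3 + rng.random((n_patients, n_diseases)) * 0.5

for d in range(n_diseases):
    aps = average_precision_score(y_true[:, d], y_score[:, d])
    auroc = roc_auc_score(y_true[:, d], y_score[:, d])
    occurrence = y_true[:, d].mean()
    print(f"disease {d}: APS={aps:.3f}  AUROC={auroc:.3f}  occurrence={occurrence:.3f}")
```

Note that APS, unlike AUROC, is sensitive to the occurrence ratio, which is why both columns appear in the table above.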
Appendix C Comparison of APS/AUROC across Models
In Figure 7, we see the general trend across the three models: BEHRT's predictions lie, for the most part, in the upper-right quadrant of the graph, indicating higher APS and AUROC than the other two models.
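The quadrant reading of such a figure can also be summarised numerically; a small sketch (the metric pairs below are illustrative, not taken from the results) counting the diseases on which one model beats another on both APS and AUROC:

```python
# Illustrative (APS, AUROC) pairs per disease for two models; not real results.
behrt = {"hypertension": (0.53, 0.82), "asthma": (0.35, 0.89), "gout": (0.35, 0.88)}
deepr = {"hypertension": (0.48, 0.79), "asthma": (0.30, 0.86), "gout": (0.36, 0.85)}

# A disease counts as "dominated" when BEHRT wins on both metrics at once.
dominated = [d for d in behrt
             if behrt[d][0] > deepr[d][0] and behrt[d][1] > deepr[d][1]]
print(dominated)
```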