MedGPT: Medical Concept Prediction from Clinical Narratives

07/07/2021 ∙ by Zeljko Kraljevic, et al. ∙ King's College London

The data available in Electronic Health Records (EHRs) provides the opportunity to transform care, and the best way to provide better care for one patient is through learning from the data available on all other patients. Temporal modelling of a patient's medical history, which takes into account the sequence of past events, can be used to predict future events such as a diagnosis of a new disorder or complication of a previous or existing disorder. While most prediction approaches rely mainly on the structured data in EHRs or on a subset of single-domain predictions and outcomes, we present MedGPT, a novel transformer-based pipeline that uses Named Entity Recognition and Linking tools (i.e. MedCAT) to structure and organize the free text portion of EHRs and anticipate a range of future medical events (initially disorders). Since a large portion of EHR data is in text form, such an approach benefits from a granular and detailed view of a patient while introducing modest additional noise. MedGPT effectively deals with the noise and the added granularity, and achieves a precision of 0.344, 0.552 and 0.640 (vs LSTM 0.329, 0.538 and 0.633) when predicting the top 1, 3 and 5 candidate future disorders on real world hospital data from King's College Hospital, London, UK (~600k patients). We also show that our model captures medical knowledge by testing it on an experimental medical multiple choice question answering task, and by examining the attentional focus of the model using gradient-based saliency methods.






1. Introduction and Related Work

Electronic Health Records (EHRs) hold detailed longitudinal information about each patient's health status and disease progression, the majority of which is stored within unstructured text. Temporal models utilising such data could be used to predict future events such as a diagnosis, illness trajectory, risk of procedural complications, or medication side-effects. The majority of previous work on prediction or forecasting uses structured datasets or the structured data in EHRs. Recently, numerous attempts have been made using BERT-based models. Examples include BEHRT (Li et al., 2020), which uses a limited subset of disorders (301 in total) available in the structured portion of EHRs. BEHRT is limited to predictions of disorders occurring in the next patient hospital visit or a specific predefined time frame and consequently requires that the information is grouped into patient visits. In addition, we note that the approach is a multi-label approach, which can cause difficulties as the number of concepts to be predicted increases. Another example is G-BERT (Shang et al., 2019). The inputs for this model are all single-visit samples, which are insufficient to capture long-term contextual information in the EHR. As with BEHRT, only the structured data is used. Next, Med-BERT (Rasmy et al., 2020) is trained on structured diagnosis data, coded using the International Classification of Diseases. The model is not directly trained on the target task of predicting a new disorder, but fine-tuned after the standard Masked Language Modeling (MLM) task. The model is evaluated on a small subset of disorders, which may be insufficient for estimating general performance. Apart from BERT-based models, we also note Long Short-Term Memory (LSTM) models, like the one proposed by Steinberg et al. (2020). Similar to the other models, they only use structured data and fine-tune their model for the prediction of a limited set of future events.

Most deep learning models concerned with predicting a wide range of future medical events tend to focus on the structured portion of EHRs. A large proportion of real world data, however, is unstructured and uncurated, and often has information contained in richer clinical narratives (Jackson et al., 2018). Transforming unstructured data into a machine-processable, sequential, structured form provides a framework for forecasting the next medical concept in a sequence, with a significant increase in granularity and applicability.

In this work we reuse the free text data within the EHR and build a general purpose model that is directly trained on the target task. This work, to some extent, follows the approach outlined in GPTv3 (Brown et al., 2020) and is similar to architectures where different tasks are implicit in the dataset. For example, one GPTv3 model can generate HTML code, answer questions, write stories and much more without any fine-tuning.

2. Methods

MedGPT is a transformer-based pipeline for medical concept forecasting from clinical narratives. It is built on top of the GPTv2 (Radford et al., 2018) architecture, which allows us to do causal language modeling (CLM). EHR data is sequentially ordered in time and this sequential order is important (Singh et al., 2015). As such, Masked Language Modeling (MLM) approaches like BERT (Devlin et al., 2019) are not a good fit, because when predicting the masked token BERT models can also look into the future. Formally, the task at hand can be defined as follows: given a corpus of patients, where each patient is defined as a sequence of tokens and each token is a medically relevant and temporally defined piece of patient data, our objective is the standard language modeling objective:
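For reference, the standard causal language modeling objective can be written as below. The notation is an assumption made here for illustration: $U$ is the corpus of patients, $i$ indexes patients, and $w^{i}_{j}$ is the $j$-th token in the sequence of patient $i$.

```latex
\[
  L(U) = \sum_{i} \sum_{j} \log P\left(w^{i}_{j} \mid w^{i}_{j-1}, w^{i}_{j-2}, \ldots, w^{i}_{0}\right)
\]
```

Training maximizes $L(U)$, so each token is predicted only from the tokens that precede it in the patient's timeline.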

Note that in this work each of the tokens represents either the patient’s age or a SNOMED-CT disorder concept that relates to the patient and is not negated.

2.1. Named Entity Recognition and Linking

The Medical Concept Annotation Toolkit (MedCAT) (Kraljevic et al., 2020) was used to extract disorder concepts from free text and link them to the SNOMED-CT concept database. MedCAT is a set of decoupled technologies for developing Information Extraction (IE) pipelines for varied health informatics use cases. It uses self-supervised learning to train a Named Entity Recognition and Linking (NER+L) model for any concept database (in our case SNOMED-CT) and demonstrates state-of-the-art performance. In addition to NER+L, MedCAT also supports concept contextualization with supervised training, e.g. Negation detection (is the concept negated or not).

We applied a concept filter to MedCAT and used it to extract only concepts that are marked as disorders in SNOMED-CT (in total 77265 disorders). For the meta-annotations we trained (supervised) two models: Negation and Subject (is the extracted concept related to the patient or someone/something else).
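A minimal sketch of how such a filter could be applied to extracted annotations is shown below. The record layout, the field names (`is_disorder`, `negation`, `subject`), the label values and the concept identifiers are all hypothetical and are not MedCAT's actual output schema.

```python
# Hypothetical annotation records; field names and ids are illustrative only.
annotations = [
    {"id": "D1", "name": "Diabetes mellitus", "is_disorder": True,
     "negation": "Affirmed", "subject": "Patient"},
    {"id": "D2", "name": "Epilepsy", "is_disorder": True,
     "negation": "Negated", "subject": "Patient"},
    {"id": "D3", "name": "Ischemic heart disease", "is_disorder": True,
     "negation": "Affirmed", "subject": "Other"},
    {"id": "X1", "name": "Blood pressure taking", "is_disorder": False,
     "negation": "Affirmed", "subject": "Patient"},
]


def keep_annotation(ann):
    """Keep disorders that are affirmed and that refer to the patient."""
    return (
        ann["is_disorder"]
        and ann["negation"] == "Affirmed"
        and ann["subject"] == "Patient"
    )


kept = [ann for ann in annotations if keep_annotation(ann)]
print([ann["name"] for ann in kept])
```

Only the first record passes all three checks: the second is negated, the third refers to someone other than the patient, and the fourth is not a disorder.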

MedCATtrainer (Searle et al., 2019) was used to train the two supervised models for contextualization and to provide supervised ‘top up’ training for the unsupervised NER+L model. We picked the top 100 most frequent disorders in the dataset and sampled 2 documents for each in which they occur. We also sampled 100 random documents from the whole dataset to avoid biasing the training to the most frequent disorders. These 300 documents were first annotated by MedCAT and then manually validated for missing and incorrect annotations, in total 12668 annotations. Annotations were done by AS and ZK predominantly, with clinical annotators helping out. Annotators followed annotation guidelines to keep consistency with additional periodic annotation counter-checking to resolve uncertainties.

2.2. Exploratory Analysis of Modifications for Transformer Based Models

Transformer-based models are currently one of the most popular architectures for deep learning, and as a result a significant number of modifications have been proposed to improve their performance. We performed an exploratory analysis of 8 different approaches on top of the base GPTv2 model.

The modifications tested were:

1) Memory Transformers (Burtsev and Sapunov, 2020) - Mikhail S. Burtsev et al. introduced memory tokens and showed that adding trainable memory to selectively store local as well as global representations of a sequence is a promising direction to improve the Transformer model. We used the base version of the memory transformer with 20 memory tokens.
2) Residual Attention (He et al., 2020) - Ruining He et al. showed a simple Residual Attention Layer Transformer architecture that significantly outperformed canonical Transformers on a spectrum of tasks.
3) ReZero (Bachlechner et al., 2020) - Thomas Bachlechner et al. showed that a simple architecture change of gating residual connections using a single zero-initialized parameter improved the convergence speed for deep networks.
4) Talking Heads Attention (Shazeer et al., 2020) - Noam Shazeer et al. introduced a variation on multi-head attention which included linear projections across the attention-heads dimension, immediately before and after the softmax operation.
5) Sparse Transformers (Zhao et al., 2019) - Guangxiang Zhao et al. demonstrated that it is possible to improve the concentration of attention on the global context through an explicit selection of the most relevant segments.
6) Rotary embeddings (Su et al., 2021) - Jianlin Su et al. proposed to encode absolute positional information with a rotation matrix and incorporate explicit relative position dependency in the self-attention formulation. The approach showed promising results on long sequences.
7) GLU (Shazeer, 2020) - Noam Shazeer showed that Gated Linear Units yield quality improvements over standard ReLU or GELU activation functions in transformer-based models.
8) Word2Vec - We tested one additional approach where we initialized the transformer token embeddings with pre-calculated embeddings from MedCAT.
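To make the rotary embedding idea concrete, the sketch below rotates consecutive pairs of vector components by a position-dependent angle and checks that the query-key dot product depends only on the relative offset between positions. It is a plain-Python illustration of the general technique, not the implementation used in this work; the function names and the example vectors are made up.

```python
import math


def rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair of components by position * base**(-i/dim)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]   # made-up query vector
k = [1.0, 0.4, -0.7, 0.2]   # made-up key vector

# Both pairs of positions are 3 steps apart, so the scores should agree.
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 13), rotate(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is rotated by an angle proportional to the absolute position, the attention score between two tokens ends up being a function of their distance only, which is the relative position dependency described in item 6.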

Implementations of the modifications were either taken from HuggingFace Transformers (Wolf et al., 2019), x-transformers or from the repository published by the authors where applicable.

2.3. Dataset Preparation

Two EHR datasets were used: King’s College Hospital (KCH) NHS Foundation Trust, UK and MIMIC-III (Johnson et al., 2016). No preprocessing or filtering was done on the MIMIC-III dataset of clinical notes and all 2083179 free text documents were used directly. At KCH we collected a total of 18436789 documents (all clinical activity on EHR from 1999 till January 2021), retrieved from the EHR using CogStack (Jackson et al., 2018). We removed all documents that were of bad quality (OCR-related problems) or where there may be ambiguity in presence of disorder (e.g. incomplete triage checklists, questionnaires and forms). After this filtering step we were left with 13084498 documents, each of which had a timestamp representing the time when the document was written. Some documents are continuous - meaning more information is added to them over time, these were split into fragments where each was defined with a time-of-writing. The project operated under London South East Research Ethics Committee (reference 18/LO/2048) approval granted to the King’s Electronic Records Research Interface (KERRI); specific approval in using NLP on unstructured clinical text for extraction of standardised biomedical Codes for patient record structuring was reviewed with expert patient input on a virtual committee with Caldicott Guardian oversight.

Using MedCAT we extracted, from both datasets, SNOMED-CT concepts representing disorders that are attributed to the patient and are not negated (based on MedCAT meta-annotations), as described above. The concepts were then grouped by patient and only the first occurrence of a concept was kept. Another filtering step was applied to increase the confidence of a diagnosis, namely a concept was kept only if it appeared at least twice in the patient's EHR. This was then used to create a sequence of pathology for each patient, as shown in Figure 1. Each concept was prepended with the patient's age, but only if the age had changed since the last disorder in the sequence for that patient.
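The sequence construction described above can be sketched as follows. The input format, a list of `(timestamp, age, concept)` tuples for one patient, and the function name are assumptions made for illustration.

```python
from collections import Counter


def build_sequence(mentions, min_mentions=2):
    """Build one patient's token sequence from (timestamp, age, concept) mentions.

    A concept is kept only if it is mentioned at least `min_mentions` times,
    only its first occurrence is used, and an age token is emitted whenever
    the age differs from the age of the previously kept disorder.
    """
    counts = Counter(concept for _, _, concept in mentions)
    sequence, seen, last_age = [], set(), None
    for _, age, concept in sorted(mentions):
        if concept in seen or counts[concept] < min_mentions:
            continue
        seen.add(concept)
        if age != last_age:
            sequence.append(str(age))
            last_age = age
        sequence.append(concept)
    return sequence


mentions = [
    (1, 40, "Diabetes mellitus"),
    (2, 40, "Hypertensive disorder"),
    (3, 40, "Hypertensive disorder"),
    (4, 41, "Diabetes mellitus"),
    (5, 43, "Chronic kidney disease"),
    (6, 43, "Chronic kidney disease"),
    (7, 43, "Pneumonia"),
]
sequence = build_sequence(mentions)
print(sequence)
```

In this made-up example the single mention of pneumonia is dropped, and the age token 40 appears once because the first two kept disorders share the same age.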

Without any filtering we had 1121218 patients at KCH and 42339 at MIMIC-III. After removal of all disorders with frequency < 100 and all patients that have < 5 tokens, we were left with 582548 and 33975 patients respectively. For this work we also limited the length of each sample/patient to 50 tokens (both KCH and MIMIC-III have more than 98% of patients with fewer than 50 tokens). The resulting dataset was then split into a train/test set with an 80/20 ratio. The train set was further split into a train/validation set with a 90/10 ratio. The validation set was used for hyperparameter tuning. All presented scores are calculated on the test set. All sampling was random.
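A rough sketch of this filtering and splitting is given below. Several details are assumptions: disorder frequency is counted as token occurrences across all patients, age tokens are recognised as digit-only strings, the steps are applied in the order shown, and a fixed random seed is used. Age tokens left without a following disorder after filtering are not cleaned up in this simplified version.

```python
import random
from collections import Counter


def filter_and_split(patients, min_freq=100, min_tokens=5, max_tokens=50, seed=0):
    """patients: dict mapping patient id -> token sequence (ages and disorders)."""
    freq = Counter(
        tok for seq in patients.values() for tok in seq if not tok.isdigit()
    )
    kept = {}
    for pid, seq in patients.items():
        # Drop rare disorders, keep age tokens untouched.
        seq = [tok for tok in seq if tok.isdigit() or freq[tok] >= min_freq]
        if len(seq) >= min_tokens:
            kept[pid] = seq[:max_tokens]

    ids = sorted(kept)
    random.Random(seed).shuffle(ids)
    n_test = round(0.2 * len(ids))           # 80/20 train/test
    test_ids, rest = ids[:n_test], ids[n_test:]
    n_val = round(0.1 * len(rest))           # 90/10 train/validation
    val_ids, train_ids = rest[:n_val], rest[n_val:]
    return kept, train_ids, val_ids, test_ids


# Synthetic data: 100 identical patients plus one with a rare disorder.
patients = {f"p{i}": ["40", "A", "B", "C", "D"] for i in range(100)}
patients["rare"] = ["40", "A", "B", "C", "RARE"]
kept, train_ids, val_ids, test_ids = filter_and_split(patients)
print(len(kept), len(train_ids), len(val_ids), len(test_ids))
```

The patient with the rare disorder falls below the minimum length once that disorder is removed, so only the 100 synthetic patients remain to be split.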

3. Results

The performance of MedCAT on our annotated dataset consisting of 12668 annotations was for NER+L, for the meta-annotation Subject and for the meta-annotation Negation . All metrics were calculated on the test set (10% of the total 12668 annotations).

The MedGPT transformer model is built on top of the GPTv2 architecture. To find the optimal parameters for the base GPTv2 model on our dataset we used Population Based Training (Jaderberg et al., 2017). The best result was achieved with n_layers=6, n_attention_heads=2, embedding_dim=300, weight_decay=0.14, lr=4.46e-5, batch_size=32 and warmup_steps=15.

3.1. Exploratory Analysis of Modifications for Transformer Based Models

To improve the base GPTv2 model, we undertook an exploratory analysis of 8 recent improvements for transformer-based models. As the baseline we used the best model obtained after hyperparameter tuning of the base GPTv2 model. Note that we did not do any additional hyperparameter tuning in this step, but only used the parameters obtained for the base model. The two most promising approaches were in the end combined to achieve the best results (see Table 1).

Dataset: KCH
Model                 P@1     P@3     P@5     H 10+   H 20+
Base GPT              0.342   0.550   0.639   0.380   0.386
Memory Tokens 20      0.341   0.547   0.636   0.376   0.381
Residual Attention    0.337   0.543   0.632   0.370   0.375
ReZero                0.307   0.50    0.603   0.327   0.333
Talking Heads         0.342   0.550   0.638   0.381   0.387
Sparse Top 8          0.341   0.548   0.638   0.380   0.384
Rotary                0.342   0.548   0.639   0.382   0.389
GLU                   0.343   0.550   0.640   0.383   0.389
Word2Vec              0.342   0.550   0.640   0.380   0.386
GLU + Rotary          0.344   0.551   0.640   0.383   0.389
Table 1. Precision for next disorder prediction calculated on the dataset from King’s College Hospital. Here P@N means that out of N candidates predicted by the model at least one is correct, and H N+ is precision calculated only for disorders appearing at position N or later in a patient’s timeline.
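As an illustration of the top-N precision reported in the table, the sketch below counts a prediction as a hit when the true next disorder is among the first k ranked candidates. Treating a single next disorder as the only correct answer is an assumption, and the data is made up.

```python
def precision_at_k(ranked_predictions, targets, k):
    """Fraction of cases where the target is among the top-k ranked candidates."""
    hits = sum(
        1
        for ranked, target in zip(ranked_predictions, targets)
        if target in ranked[:k]
    )
    return hits / len(targets)


# Made-up ranked candidate lists and true next disorders.
ranked_predictions = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "a", "b"],
    ["a", "c", "b"],
]
targets = ["a", "a", "b", "d"]

for k in (1, 2, 3):
    print(k, precision_at_k(ranked_predictions, targets, k))
```

Precision can only stay the same or increase as k grows, which matches the pattern of the P@1, P@3 and P@5 columns.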

3.2. MedGPT Next Disorder Prediction

The MedGPT model, which consists of the GPTv2 base model with the GLU+Rotary extension, is tested on two datasets, KCH and MIMIC-III, for the task of predicting the next disorder in a patient’s timeline. We compared our model to a standard Bag of Concepts (BoC) approach with an SVM classifier, as well as a Long Short-Term Memory (LSTM) network (Tables 2 and 3).

            MIMIC-III                 KCH
Model       P@1     P@3     P@5       P@1     P@3     P@5
BoC SVM     0.331   -       -         0.215   -       -
LSTM        0.419   0.657   0.746     0.329   0.538   0.633
MedGPT      0.443   0.681   0.770     0.344   0.551   0.640
Table 2. Precision for next disorder prediction calculated on patients from MIMIC-III and King’s College Hospital. Here P@N means that out of the N candidates predicted by the model at least one of them is correct.
Dataset: MIMIC-III
Model                 H 0+     H 10+    H 20+    H 30+
LSTM                  0.419    0.394    0.367    0.338
BoC SVM               0.331    0.335    0.319    0.293
MedGPT                0.443    0.431    0.411    0.392
Support (test set)    6795     4244     2048     1068

Dataset: KCH
Model                 H 0+     H 10+    H 20+    H 30+
LSTM                  0.329    0.367    0.371    0.365
BoC SVM               0.215    0.250    0.248    0.233
MedGPT                0.344    0.383    0.389    0.386
Support (test set)    116510   58323    22867    11925
Table 3. Precision calculated only for disorders appearing at position 0+, 10+, 20+ or 30+ in a patient’s timeline. This shows the performance of the models with respect to different amounts of historical information.
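The position-stratified metric can be sketched as below. Top-1 precision is assumed, since the H 0+ values coincide with the P@1 values reported for the same models. How positions are counted is also an assumption, and the example records are made up.

```python
def precision_from_position(examples, min_position):
    """Top-1 precision and support for targets at position >= min_position.

    examples: iterable of (position, top1_prediction, target) triples.
    Returns (precision, support), or (None, 0) if no example qualifies.
    """
    subset = [(pred, target) for pos, pred, target in examples if pos >= min_position]
    if not subset:
        return None, 0
    hits = sum(1 for pred, target in subset if pred == target)
    return hits / len(subset), len(subset)


# Made-up (position, prediction, target) records.
examples = [
    (0, "a", "a"),
    (3, "b", "c"),
    (10, "c", "c"),
    (12, "d", "a"),
    (20, "e", "e"),
    (25, "f", "f"),
]

for threshold in (0, 10, 20, 30):
    print(threshold, precision_from_position(examples, threshold))
```

The support shrinks as the threshold grows, mirroring the "Support (test set)" rows, so scores at higher thresholds rest on fewer cases.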

3.3. Qualitative Analysis and Interpretability

To explore the model’s capabilities, a senior clinician crafted 4 Multiple Choice Questions (MCQs) designed to challenge the model. We present the model with an imaginary patient up to a given time point and ask an MCQ (in all cases there is a medical explanation why one answer should be more likely than the others). As MedGPT calculates the probability of all tokens in the vocabulary when predicting the next disorder, we can display the probability of the disorders in question. Note that the probabilities for the options were normalized and that it is possible that they are not the top predictions by the model. Figure 2 shows the 4 examples and the decision made by MedGPT. Even though the model was not directly trained on a ranking task, its predictions align with expert clinical knowledge.
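One plausible way to normalize the option probabilities is to divide each option's next-token probability by the sum over the options, as sketched below. Whether this is the exact normalization used is an assumption, and the probability values are made up for illustration.

```python
def normalize_options(next_token_probs, options):
    """Renormalize next-token probabilities so the listed options sum to 1."""
    total = sum(next_token_probs[option] for option in options)
    return {option: next_token_probs[option] / total for option in options}


# Made-up next-token probabilities over a (partial) vocabulary.
next_token_probs = {
    "T1 Diabetes Mellitus": 0.03,
    "T2 Diabetes Mellitus": 0.01,
    "Hypertensive disorder": 0.20,
}
options = ["T1 Diabetes Mellitus", "T2 Diabetes Mellitus"]

normalized = normalize_options(next_token_probs, options)
print(normalized)
```

In this toy case neither option is the model's top prediction overall, yet after renormalization the first option clearly dominates the second, which is the situation the note above describes.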

Example 1: As a first test, a high-level term of Diabetes Mellitus was provided in the context of a subtype-specific complication, and we tried to see if it could determine which category of diabetes mellitus is most associated with ketoacidosis. This is a simple binary task which it performed well, consistent with medical literature.

Example 2: For this, a choice of a common condition (diabetic nephropathy) was provided as a distracting choice to a more uncommon scenario (congenital cystic kidney disease). In this scenario, the background (cerebral aneurysm) provided the contextual cue for the rarer diagnosis which MedGPT successfully discerned.

Example 3: To test longer-range attention, this scenario introduced the main pertinent disorders early in the sequence (Psychotic disorder, Bipolar disorder, Schizoaffective disorder). Several other disorders (seizure, epilepsy, ischemic heart disease, hypertensive disorder) were introduced late in the sequence to see if this would disrupt the most likely prediction. MedGPT also successfully handled the necessary indirect inference, as the drug treatments which cause the diagnosis were not explicitly stated.

Example 4: Similar to Example 3, attention in the presence of distractors was tested by intermixing historical diseases. Primary Sclerosing Cholangitis is the most relevant premorbid diagnosis as it is associated with inflammatory bowel diseases (Crohn’s Disease and Ulcerative Colitis), and MedGPT was able to distinguish this from several other common conditions.

Example 1
Patient history: 40 -> Ketoacidosis in Diabetes Mellitus -> Diabetes Mellitus -> Hypertension
Multiple choice options: T1 Diabetes Mellitus; T2 Diabetes Mellitus

Example 2
Patient history: 38 -> Hypertensive disorder -> 41 -> Chronic kidney disease -> 43 -> Subarachnoid haemorrhage -> 44 -> Cerebral aneurysm -> 46 -> Microscopic haematuria -> 48
Multiple choice options: Congenital cystic kidney disease; Renal Artery Stenosis; Diabetic Nephropathy

Example 3
Patient history: 21 -> Psychotic disorder -> 24 -> Bipolar disorder -> 28 -> Schizoaffective disorder -> 35 -> Depressive disorder -> 42 -> Hypertensive disorder -> 44 -> Seizure disorder -> 49 -> Epilepsy -> 55 -> Ischemic heart disease -> 58 -> Mild cognitive disorder -> 59
Multiple choice options: Parkinsonism caused by drug; Drug-induced tardive dyskinesia; Vascular dementia; Alzheimer’s disease; Parkinson’s disease

Example 4
Patient history: 41 -> Gastroesophageal reflux disease -> 46 -> Cholestatic jaundice syndrome -> Pancreatitis -> 47 -> Primary sclerosing cholangitis -> 51 -> Acute diarrhea -> 53 -> Pancreatitis -> 55 -> Hemorrhagic diarrhea -> 56
Multiple choice options: Crohn’s disease; Ulcerative Colitis; Rectal Adenocarcinoma

Figure 2. Qualitative analysis of MedGPT on 4 multiple choice questions. In all cases the model was given the patient history and asked to predict the probability of each of the options. The P column denotes the normalized probability assigned by MedGPT.

To understand why a certain disorder was forecasted we used gradient-based saliency methods (Atanasova et al., 2020). This method allowed us to calculate how important each input token was for the prediction of the next disorder in sequence. We show the potential of the model in Figure 1. Age (49) was the most salient input token followed by the tokens ketoacidosis and ulcer of foot. These disorder tokens are clearly very relevant as the ketoacidosis concept contextualises to patients with diabetes mellitus and the ulcer suggests sub-clinical nerve disease which is common in patients with diabetes. Age is likely to be relevant in most forecasting scenarios. Such weights allow probing the forecast with both qualitative and quantitative interpretations. The potential to provide explainability makes this distinct from current commercial apps using structured question trees that have modest real-world performance metrics (Ćirković, 2020).

4. Conclusion

We demonstrate that unstructured clinical narratives can be used for temporal modelling and forecasting of medical history through a multi-stage process where unstructured data is first parameterised using NER+L into a standardised ontology, then a second step of Transformer-based NLP is applied for forecasting. MedGPT shows potential to overcome real-world scenarios where electronic health records exist at various levels of maturity in data standardisation.

Additionally, MedGPT’s promising ability to choose from a set of differential diagnoses without further training suggests that it has captured relational associations between medical disorders and is able to focus attention on salient parts of the chronology to do so. This will be tested more extensively in future work.


References

  • P. Atanasova, J. G. Simonsen, C. Lioma, and I. Augenstein (2020) A diagnostic study of explainability techniques for text classification. CoRR abs/2009.13295.
  • T. Bachlechner, B. P. Majumder, H. H. Mao, G. W. Cottrell, and J. J. McAuley (2020) ReZero is all you need: fast convergence at large depth. CoRR abs/2003.04887.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. CoRR abs/2005.14165.
  • M. S. Burtsev and G. V. Sapunov (2020) Memory transformer. CoRR abs/2006.11527.
  • A. Ćirković (2020) Evaluation of four artificial intelligence–assisted self-diagnosis apps on three diagnoses: two-year follow-up study. J Med Internet Res 22 (12), pp. e18097.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • R. He, A. Ravula, B. Kanagal, and J. Ainslie (2020) RealFormer: transformer likes residual attention. CoRR abs/2012.11747.
  • R. Jackson, I. Kartoglu, C. Stringer, G. Gorrell, A. Roberts, X. Song, H. Wu, A. Agrawal, K. Lui, T. Groza, D. Lewsley, D. Northwood, A. Folarin, R. Stewart, and R. Dobson (2018) CogStack - experiences of deploying integrated information retrieval and extraction services in a large national health service foundation trust hospital. BMC Medical Informatics and Decision Making 18 (1), pp. 47.
  • M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu (2017) Population based training of neural networks. CoRR abs/1711.09846.
  • A. E.W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific Data 3 (1), pp. 160035.
  • Z. Kraljevic, T. Searle, A. Shek, L. Roguski, K. Noor, D. Bean, A. Mascio, L. Zhu, A. A. Folarin, A. Roberts, R. Bendayan, M. P. Richardson, R. Stewart, A. D. Shah, W. K. Wong, Z. M. Ibrahim, J. T. Teo, and R. J. B. Dobson (2020) Multi-domain clinical natural language processing with medcat: the medical concept annotation toolkit. CoRR abs/2010.01165.
  • Y. Li, S. Rao, J. R. A. Solares, A. Hassaine, R. Ramakrishnan, D. Canoy, Y. Zhu, K. Rahimi, and G. Salimi-Khorshidi (2020) BEHRT: transformer for electronic health records. Scientific Reports 10 (1), pp. 7155.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2018) Language models are unsupervised multitask learners.
  • L. Rasmy, Y. Xiang, Z. Xie, C. Tao, and D. Zhi (2020) Med-bert: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction. CoRR abs/2005.12833.
  • T. Searle, Z. Kraljevic, R. Bendayan, D. Bean, and R. J. B. Dobson (2019) MedCATTrainer: A biomedical free text annotation interface with active learning and research use case specific customisation. CoRR abs/1907.07322.
  • J. Shang, T. Ma, C. Xiao, and J. Sun (2019) Pre-training of graph augmented transformers for medication recommendation. CoRR abs/1906.00346.
  • N. Shazeer, Z. Lan, Y. Cheng, N. Ding, and L. Hou (2020) Talking-heads attention. CoRR abs/2003.02436.
  • N. Shazeer (2020) GLU variants improve transformer. CoRR abs/2002.05202.
  • A. Singh, G. Nadkarni, O. Gottesman, S. B. Ellis, E. P. Bottinger, and J. V. Guttag (2015) Incorporating temporal EHR data in predictive models for risk stratification of renal function deterioration. Journal of Biomedical Informatics 53, pp. 220–228.
  • E. Steinberg, K. Jung, J. A. Fries, C. K. Corbin, S. R. Pfohl, and N. H. Shah (2020) Language models are an effective patient representation learning technique for electronic health record data. arXiv:2001.05295.
  • J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu (2021) RoFormer: enhanced transformer with rotary position embedding. CoRR abs/2104.09864.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. CoRR abs/1910.03771.
  • G. Zhao, J. Lin, Z. Zhang, X. Ren, Q. Su, and X. Sun (2019) Explicit sparse transformer: concentrated attention through explicit selection. CoRR abs/1912.11637.