In this work, we modify and apply self-supervision techniques to the domain of medical health insurance claims. We model patients’ healthcare claims history analogously to free-text narratives, and use pre-training to introduce ‘prior knowledge’ that is later utilized for patient outcome predictions on a challenging task: predicting Covid-19 hospitalization, given a patient’s pre-Covid-19 insurance claims history. Results suggest that pre-training on insurance claims not only produces better prediction performance, but, more importantly, improves the model’s ‘clinical trustworthiness’ and model stability/reliability.
, etc.) has led to continuously improving state-of-the-art results on numerous Natural Language Processing (NLP) tasks. The success of pre-training and self-supervision has recently been extended to other fields, such as imaging, human activity recognition, molecular data, and time-series clinical data.
In this work, we modify and apply NLP self-supervision techniques to the domain of medical health insurance claims (a subset of clinical data). The US health insurance process requires providers (physicians and hospitals) to submit detailed visit claim information for the purposes of health insurance payments. Typically, an insurance claim contains billing codes for various medical diagnoses, procedures, and medications relevant to the billing process. These billing codes constitute a subset of the patient’s electronic medical record (EMR), and exclude more comprehensive clinical information, such as vital signs and clinical notes. The claims history of a patient can be used for a variety of patient outcome predictions that can help guide and advise patient and provider behaviour for improved health outcomes and healthcare affordability.
We model patients’ anonymized health care claims history as a ‘free-text narrative’ and apply self-supervision to introduce prior knowledge, later utilized for patient outcome predictions. The health insurance claim ‘narrative’ consists of a sequence of diagnosis, procedure, and medication codes submitted for billing purposes, together with some basic demographic information, such as age and gender. An example of the information used from a set of anonymized health insurance claims is shown below:
Age: 65; Gender: Female; Procedure Code(s): G0299 - Direct skilled nursing services of a registered nurse (RN) in the home health or hospice setting, each 15 minutes; Diagnosis Code(s): E119 - Type 2 diabetes mellitus without complications; J4590 - Unspecified asthma; Z4881 - Encounter for surgical aftercare following surgery on specified body systems; Prescribed Medication Codes: 0093-5851 Escitalopram; 33342-054 Pioglitazone; 57664-506 Metoprolol Tartrate; 51248-150 Vesicare.
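The claim 'narrative' above can be pictured as a flat sequence of tokens. The following is a minimal sketch of one possible encoding, not the representation used in the study: the field names, the `AGE_`/`GENDER_`/`PX_`/`DX_`/`RX_` prefixes, and the `claim_to_tokens` helper are all assumptions made for illustration.

```python
def claim_to_tokens(claim):
    """Flatten one claim record into an ordered list of string tokens.

    The prefixes are hypothetical; they only keep the code types apart.
    """
    tokens = [f"AGE_{claim['age']}", f"GENDER_{claim['gender']}"]
    tokens += [f"PX_{code}" for code in claim.get("procedures", [])]
    tokens += [f"DX_{code}" for code in claim.get("diagnoses", [])]
    tokens += [f"RX_{code}" for code in claim.get("medications", [])]
    return tokens


# The example claim from the text, written as a plain dictionary.
example_claim = {
    "age": 65,
    "gender": "Female",
    "procedures": ["G0299"],
    "diagnoses": ["E119", "J4590", "Z4881"],
    "medications": ["0093-5851", "33342-054", "57664-506", "51248-150"],
}

# A space-separated 'sentence' that a language model tokenizer could consume.
narrative = " ".join(claim_to_tokens(example_claim))
```

Keeping a type prefix on each code is one way to stop a diagnosis code and a medication code with similar spellings from colliding in a shared vocabulary.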
In this study, we focus on utilizing medical health insurance claims pre-training for predicting hospitalization due to a Covid-19 infection, as efforts to reduce mortality due to Covid-19 include early identification of, and outreach to, patients who have the highest risk of developing severe complications from the disease. Predicting post-Covid-19 hospitalization given a patient’s pre-Covid-19 insurance claims history is an extremely challenging task due to both the clinical complexity of the disease and the inherently limited and noisy nature of insurance claims (containing only the subset of the patient’s EMR relevant to billing purposes). Note: The dataset and clinical outcomes used in this research do not represent the true population of Covid-19 infections due to bias inherent in the dataset gathering process. Results are not intended to provide clinical guidance or clinical analysis of Covid infections.
The work relevant to this study falls into two categories: machine learning models focusing on health insurance claims, and self-supervision on clinical data, in particular data present in insurance claims, such as diagnosis, procedure, and medication codes.
The majority of literature focusing on health insurance claims aims to predict fraud, anomalies, and errors in health insurance claims [16, 1, 15, 34, 36, 21] and typically uses traditional data mining and machine learning approaches. A few studies focus on predicting medical outcomes from claims. Hung et al. show that a deep neural net and Gradient Boosting Machines (GBM) outperform Support Vector Machines (SVM) and logistic regression on the task of stroke prediction from electronic medical claims. Vekeman et al. [37] analyze the conditions of myalgic encephalomyelitis and chronic fatigue syndrome in insurance claims, and conclude that the symptom information in claims is insufficient to identify diagnosed patients. Nagata et al. apply GBM and LSTM models to predict risk of type-2 diabetes using claims data.
, utilizing word2vec, GloVe, a continuous bag-of-words model with time-aware attention, as well as graph-based attention models utilizing medical ontologies. More recently, BEHRT applies BERT-like transformer pre-training on Electronic Health Records (EHR) using a masked language model, and outperforms previous deep EHR representations, such as one that combines word2vec embeddings with a CNN. G-BERT is a model that combines Graph Neural Networks and BERT to learn medical representations from MIMIC III. Med-BERT is another BERT-like model, pre-trained on data from 28 million patients, that outperforms BEHRT and G-BERT. We were unable to find studies that focus specifically on self-supervision for medical claims.
Materials and methods
In this study, a historical anonymized claims dataset is used to pre-train the model. This dataset contains information from 50 million claims submitted in 2019 and 2020 to a major US health insurance provider. A separate internal dataset that contains 471,971 anonymized Covid-19 positive patients (based on lab result or diagnosis) is used to build a model that detects patients who are at risk of being hospitalized due to Covid-19 complications. This dataset contains the prior 3 years of claim records (diagnosis, procedure, and medication codes) of each Covid-19 positive patient, together with their age and gender. To avoid data leakage, claims up to 7 days prior to a Covid-19 positive diagnosis date are dropped, as they may contain information relevant to current Covid-19 infection signs and symptoms. Further, age is discretized into clinically meaningful age ranges. Covid-19 related hospitalizations were identified based on the primary diagnosis associated with the hospitalization claim. Conversely, an individual was considered not hospitalized if the individual had non-hospitalization claims subsequent to the Covid-19 positive diagnosis date, or did not have any claims 30 days after the Covid-19 positive diagnosis date. The Covid-19 hospitalization rate for the dataset is 15%. This number is significantly higher than reported in the US, due to an inherent bias in the dataset, which contains patients whose Covid-19 positivity was determined solely by the primary diagnosis of hospitalization. The number is also overestimated because only Covid-19 tests submitted as insurance claims are captured (excluding Covid-19 tests without insurance claims and individuals with mild symptoms who were not tested).
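The leakage rule described above (keep the prior 3 years of claims, but drop anything in the 7 days before the positive diagnosis) can be sketched as a simple date filter. This is an illustration under assumptions: the `service_date` field name, the `usable_history` helper, and the approximation of 3 years as 3 × 365 days are not taken from the study.

```python
from datetime import date, timedelta

LEAKAGE_WINDOW_DAYS = 7   # from the text: claims up to 7 days prior are dropped
HISTORY_YEARS = 3         # from the text: prior 3 years of claim records


def usable_history(claims, diagnosis_date):
    """Return the claims that may be used as model input for one patient.

    A claim is kept only if it falls inside the 3-year look-back window and
    strictly before the 7-day window preceding the diagnosis date.
    """
    cutoff = diagnosis_date - timedelta(days=LEAKAGE_WINDOW_DAYS)
    earliest = diagnosis_date - timedelta(days=365 * HISTORY_YEARS)  # approximation
    return [c for c in claims if earliest <= c["service_date"] < cutoff]


# Hypothetical patient diagnosed on 2020-06-15.
diagnosis = date(2020, 6, 15)
claims = [
    {"service_date": date(2020, 6, 10)},  # inside the 7-day window: dropped
    {"service_date": date(2020, 6, 8)},   # exactly 7 days prior: dropped
    {"service_date": date(2020, 6, 7)},   # kept
    {"service_date": date(2019, 1, 1)},   # kept
    {"service_date": date(2016, 1, 1)},   # older than 3 years: dropped
]
kept = usable_history(claims, diagnosis)
```

Treating the boundary day (exactly 7 days prior) as dropped is the conservative reading of "up to 7 days prior"; the opposite choice would only require changing `<` to `<=`.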
We compare the performance of 4 prediction models on the task of identifying post-Covid-19 hospitalization, given the prior 3 years of medical claims history.
As a simple baseline method, we used mappings of diagnosis and procedure codes to the set of known Covid-19 risk factors, e.g. all neoplasm ICD-10 codes (C00-D49) were converted to the risk-factor variable ‘cancer’. A total of 25 risk factor variables, together with age and gender, were used to build a logistic regression model on the task. A second baseline method represents all available diagnosis, procedure, and drug codes as a bag-of-words encoding of the historical claims ‘narrative’ and feeds it to a Support Vector Machine (SVM) model. The baseline methods do not use pre-training and utilize only the dataset of 471,971 Covid-19-diagnosed patients.
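The risk-factor mapping used by the first baseline can be illustrated as a lookup from ICD-10 category ranges to binary variables. In the sketch below only the `cancer` range (C00-D49) comes from the text; the `type2_diabetes` entry, the variable names, and the `risk_factor_features` helper are hypothetical, and the study's full set of 25 variables is not reproduced.

```python
# Risk-factor name -> list of inclusive ICD-10 category ranges.
RISK_FACTOR_RANGES = {
    "cancer": [("C00", "D49")],          # neoplasms, as stated in the text
    "type2_diabetes": [("E11", "E11")],  # illustrative entry, an assumption
}


def risk_factor_features(diagnosis_codes):
    """Map a patient's ICD-10 codes to 0/1 risk-factor indicator variables."""
    features = {name: 0 for name in RISK_FACTOR_RANGES}
    for code in diagnosis_codes:
        # ICD-10 categories are the first three characters (letter + 2 digits),
        # so plain string comparison orders them correctly.
        category = code.replace(".", "").upper()[:3]
        for name, ranges in RISK_FACTOR_RANGES.items():
            if any(low <= category <= high for low, high in ranges):
                features[name] = 1
    return features


# Codes may arrive with or without the decimal point.
example = risk_factor_features(["C50.9", "J4590"])
```

The resulting indicator dictionary, together with age and gender, is the kind of low-dimensional input a logistic regression model would take.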
The third approach utilizes pre-training on diagnosis, procedure, and medication codes, analogous to word embeddings. Word2vec embeddings (continuous bag-of-words model, with window size 10) for diagnosis (ICD-10), procedure (Healthcare Common Procedure Coding System), and medication (National Drug Code Directory) codes, each of size 1000, were generated utilizing data from close to 50 million historical claims. The embeddings were then applied to the Covid-19 positive patients by averaging the embeddings for each type of code (diagnosis, procedure, and medication) over the prior 3 years of the patient’s claims history. The 3 types of averaged embeddings were concatenated together with the demographic information (age and gender) and used in a Gradient Boosting Machine (GBM) model to predict post-Covid-19 hospitalization status.
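The feature construction for the GBM model (average the embeddings per code type, then concatenate with demographics) can be sketched in a few lines. This is a toy illustration, not the study's code: the embedding tables here have dimension 2 rather than 1000, and the helper names, the zero-vector fallback for unseen codes, and the numeric encoding of age and gender are assumptions.

```python
def mean_vector(codes, embeddings, dim):
    """Average the embeddings of the codes that have one; zeros if none do."""
    vectors = [embeddings[c] for c in codes if c in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def patient_features(history, tables, dim, age_bucket, is_female):
    """Concatenate the three averaged embeddings with the demographics."""
    features = []
    for code_type in ("diagnosis", "procedure", "medication"):
        features += mean_vector(history.get(code_type, []), tables[code_type], dim)
    return features + [float(age_bucket), float(is_female)]


# Toy embedding tables (dimension 2) standing in for the pre-trained ones.
tables = {
    "diagnosis": {"E119": [1.0, 0.0], "J4590": [0.0, 1.0]},
    "procedure": {"G0299": [2.0, 2.0]},
    "medication": {},
}
history = {
    "diagnosis": ["E119", "J4590"],
    "procedure": ["G0299"],
    "medication": ["0093-5851"],  # no embedding available in this toy table
}
x = patient_features(history, tables, dim=2, age_bucket=3, is_female=1)
```

With the embedding size of 1000 reported in the text, the same construction would yield a 3 × 1000 + 2 dimensional input vector per patient under this encoding of the demographics.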
Lastly, in the fourth approach, the dataset of 50 million historical claims was utilized in a transformer-based masked language model: RoBERTa. Before pre-training, data from each of the 50 million claim records were randomly shuffled. RoBERTa was trained by masking 30% of the tokens, which include diagnosis/procedure/medication codes, age, and gender. The RoBERTa model was then fine-tuned on the Covid-19 dataset to predict post-Covid-19 hospitalization status.
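The masking step can be illustrated with a simplified sketch. It assumes that the shuffling refers to the order of tokens within a claim record, and it masks an exact 30% of positions with a single mask token; production masked-language-model training typically samples positions independently and sometimes substitutes random or unchanged tokens instead. The `shuffle_and_mask` helper and the token strings are hypothetical.

```python
import random

MASK_TOKEN = "<mask>"   # placeholder mask symbol
MASK_RATE = 0.30        # from the text: 30% of the tokens are masked


def shuffle_and_mask(tokens, rng, mask_rate=MASK_RATE):
    """Shuffle one record's tokens, then hide a fixed fraction of them.

    Returns (inputs, labels): labels hold the original token at masked
    positions and None elsewhere, which is what the model must recover.
    """
    if not tokens:
        return [], []
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    n_mask = max(1, round(mask_rate * len(shuffled)))
    masked = set(rng.sample(range(len(shuffled)), n_mask))
    inputs = [MASK_TOKEN if i in masked else t for i, t in enumerate(shuffled)]
    labels = [t if i in masked else None for i, t in enumerate(shuffled)]
    return inputs, labels


rng = random.Random(0)  # fixed seed so the sketch is reproducible
record = ["AGE_65", "GENDER_Female", "PX_G0299", "DX_E119", "DX_J4590",
          "DX_Z4881", "RX_0093-5851", "RX_33342-054", "RX_57664-506",
          "RX_51248-150"]
inputs, labels = shuffle_and_mask(record, rng)
```

Because age and gender are ordinary tokens in the sequence, they can be masked and predicted like any billing code, which matches the description above.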
Due to the sensitive, clinical nature of the dataset/task and the inherent bias of healthcare claims data, the models’ performance needed to be evaluated not only in terms of metrics, such as precision and recall, but also in terms of ‘clinical trustworthiness’ and model stability/reliability. In an attempt to generate explainable model predictions, we applied the LIME feature attribution model on a random sample of 100 positive and 100 negative predictions for the machine learning models described above. Two clinicians were invited to review the model predictions and determine which model is most clinically ‘trustworthy’ by reviewing explainability results. Unfortunately, due to the size, variability, and both limited and noisy nature of the claims data, the clinicians were not able to utilize the LIME explanations.
As a substitute for human evaluation, we instead measured model stability/reliability by introducing input feature perturbations. For each of our Covid-19 training samples, we substituted each diagnosis/procedure/medication code with the code closest in the corresponding embedding space. Table 1 shows an example of a feature input (medical claim), together with a perturbation automatically generated by substituting each code with the code closest in the pre-trained embedding spaces for diagnoses, procedures, and medications. We then measured the differences in prediction probability between the original input and the perturbed input, as well as the differences in the corresponding variable importance scores. The expectation is that such small variations in input should result in only minor differences in output and variable importance. The perturbations also try to mimic real-world coding discrepancies, as medical billing coders have some freedom as to how to code a claim, and the choice of a particular billing code from a set of similar codes is often subjective or circumstantial.
|Input Type|Original Input|Perturbed Input|
| |Amoxicillin 100mL|Amoxicillin 75mL|
| |administration set|used with nebulizer|
| |Review of all|Medication|
| |Office or oth|Office or oth|
| |outpatient visit|outpatient visit|
| |unsp organism|unsp organism|
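The perturbation procedure (replace each code with its nearest neighbour in the embedding space) can be sketched as follows. The text does not state which distance was used, so cosine similarity is an assumption here, as are the helper names and the toy two-dimensional embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_code(code, embeddings):
    """Return the most similar *other* code, or the code itself if none exists."""
    if code not in embeddings:
        return code
    target = embeddings[code]
    candidates = [c for c in embeddings if c != code]
    if not candidates:
        return code
    return max(candidates, key=lambda c: cosine(target, embeddings[c]))


def perturb(codes, embeddings):
    """Substitute every code with its nearest neighbour in embedding space."""
    return [nearest_code(c, embeddings) for c in codes]


# Toy embedding table with hypothetical code names.
toy = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0]}
perturbed = perturb(["A", "C", "Z"], toy)
```

In practice one such table would be used per code type, so that a diagnosis code is only ever replaced by another diagnosis code.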
The source code for all experiments will be made available at the time of publication. Note: Due to compliance regulation, we are unable to make publicly available the anonymized claims dataset or derived machine learning models.
Table 2 shows the performance of the four models. 70% of the 471,971 Covid-19 positive patients were used for training and cross-validation, and the remaining 30% were used for testing. The data used for pre-training consists of claims submitted prior to the first Covid-19 diagnosis in the dataset. As shown, the task proved to be a challenge for all algorithms, with modest precision and recall scores. Results are comparable to results reported in the literature utilizing much cleaner, EMR-based datasets on the same task. Clinicians concurred that the task is challenging for human experts, as it is extremely difficult to predict Covid-19 related hospitalizations based solely on the pre-Covid-19 medical history, lacking Covid-19-related signs, symptoms, and vital signs. The task is further complicated by the noisy and limited nature of medical claims history. Of the two pre-trained models, only the GBM model was able to surpass the SVM and logistic regression baselines.
Pre-training, however, seemed to have a more significant impact on the ‘stability’ and ‘trustworthiness’ of the model and on its explainability. Table 3 summarizes the prediction probability differences between the original input and the perturbed input, produced by substituting each diagnosis/procedure/medication code with the code closest in the corresponding embedding space. As individual codes are not used in the logistic regression model, the model was excluded from this evaluation. The differences are summarized in terms of the mean difference between the prediction probability values of the original and perturbed inputs (Predict Prob Diff Mean) and in terms of the prediction agreement between the original and perturbed inputs at a probability threshold of 0.5 (Predict Agreement). The table also shows the mean squared error computed by comparing the LIME variable importance scores of the original input vs. the perturbed input (Var Importance MSE). Statistics were produced based on 5,000 random samples from the test set. While the baseline bag-of-words SVM approach (without pre-training) exhibits the lowest probability output variability, the methods using pre-training exhibit higher prediction agreement on the original vs. perturbed inputs, as well as less variability in terms of input variable importance. This could suggest that the predictions of the pre-trained models are more ‘stable’ in terms of both binary prediction outcome and model explainability.
|Predict Prob Diff Mean|4.93|12.90|5.31|
|Var Importance MSE|14.81e-4|0.85e-4|0.98e-4|
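The three stability statistics described above can be computed as in the following sketch. It assumes that the probability difference is an absolute difference and that the variable importance vectors of the original and perturbed inputs are aligned position by position; the function name, argument names, and toy numbers are illustrative and are not the study's data.

```python
def stability_metrics(orig_probs, pert_probs, orig_imps, pert_imps, threshold=0.5):
    """Compare model outputs on original vs. perturbed inputs.

    orig_probs / pert_probs: predicted probabilities, one per sample.
    orig_imps / pert_imps: per-sample lists of variable importance scores.
    """
    n = len(orig_probs)
    pairs = list(zip(orig_probs, pert_probs))
    # Mean absolute change in predicted probability.
    prob_diff_mean = sum(abs(p - q) for p, q in pairs) / n
    # Fraction of samples whose thresholded prediction is unchanged.
    agreement = sum((p >= threshold) == (q >= threshold) for p, q in pairs) / n
    # Mean squared error over all aligned importance scores.
    squared = [(a - b) ** 2
               for original, perturbed in zip(orig_imps, pert_imps)
               for a, b in zip(original, perturbed)]
    importance_mse = sum(squared) / len(squared)
    return {
        "predict_prob_diff_mean": prob_diff_mean,
        "predict_agreement": agreement,
        "var_importance_mse": importance_mse,
    }


# Two toy samples: the second one flips across the 0.5 threshold.
metrics = stability_metrics(
    orig_probs=[0.9, 0.4],
    pert_probs=[0.8, 0.6],
    orig_imps=[[0.1, 0.2], [0.0, 0.0]],
    pert_imps=[[0.1, 0.0], [0.0, 0.0]],
)
```

The toy case shows why the two probability statistics can disagree: a model can have a small mean probability shift and still lose agreement when many of its predictions sit close to the threshold.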
Lastly, as a sanity check, we evaluated the predictions of the 3 models using as input variations of diagnosis and procedure codes for all conditions associated with high risk of Covid-19 hospitalization (as listed by the Centers for Disease Control and Prevention: https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/index.html), such as cancer, chronic kidney disease, COPD, etc. The logistic regression model was again excluded from this analysis, as the model is explicitly based on known Covid-19 risks. As expected, in all cases the models predicted Covid-19 related hospitalization. However, the SVM baseline model probability averaged 68%, while the probability of the pre-trained models was significantly higher, averaging 94% and 78% for GBM and RoBERTa respectively, indicating that the pre-trained models are more confident in predicting such ‘clear-cut’ hospitalization examples.
This work demonstrated the utility of self-supervision of medical insurance claims data, which can allow Health Insurance Providers to improve ML model performance on a variety of prediction outcome tasks, aiming to improve patient outcomes and health insurance affordability. Pre-training improved both model prediction performance and model stability on the challenging task of predicting Covid-19 hospitalizations.
- [1] (2016) Predicting medical provider specialties to detect anomalous insurance claims. In 2016 IEEE 28th international conference on tools with artificial intelligence (ICTAI), pp. 784–790.
- [2] (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- [3] (2020) Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691–1703.
- [4] (2016) Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1495–1504.
- [5] (2017) GRAM: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 787–795.
- [6] (2016) Learning low-dimensional representations of medical concepts. AMIA Summits on Translational Science Proceedings 2016, pp. 41.
- [7] (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [8] (2008) LIBLINEAR: a library for large linear classification. Journal of machine learning research 9 (Aug), pp. 1871–1874.
- [9] (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232.
- [10] (2020) Hospitalization rates and characteristics of patients hospitalized with laboratory-confirmed coronavirus disease 2019—covid-net, 14 states, march 1–30, 2020. MMWR. Morbidity and mortality weekly report 69.
- [11] (2013) Redefining meaningful age groups in the context of disease. Age 35 (6), pp. 2357–2366.
- [12] (2020) Masked reconstruction based self-supervision for human activity recognition. In Proceedings of the 2020 International Symposium on Wearable Computers, pp. 45–49.
- [13] (2017) Comparing deep neural network and other machine learning algorithms for stroke prediction in a large-scale population-based electronic medical claims database. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3110–3113.
- [14] (2017) Code2vec: embedding and clustering medical diagnosis data. In 2017 IEEE International Conference on Healthcare Informatics (ICHI), pp. 386–390.
- [15] (2012) A fraud detection approach with data mining in health insurance. Procedia-Social and Behavioral Sciences 62, pp. 989–994.
- [16] (2010) Data mining to predict and prevent errors in health insurance claims processing. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 65–74.
- [17] (2020) BeHRt: transformer for electronic health records. Scientific Reports 10 (1), pp. 1–12.
- [18] (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- [19] (2003) Regional variation in medical classification agreement: benchmarking the coding gap. Journal of medical systems 27 (5), pp. 435–443.
- [20] (2018) Kame: knowledge-based attention model for diagnosis prediction in healthcare. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 743–752.
- [21] (2019) Using massive health insurance claims data to predict very high-cost claimants: a machine learning approach. arXiv preprint arXiv:1912.13032.
- [22] (2020) A comprehensive evaluation of multi-task learning and multi-task pre-training on ehr time-series data. arXiv preprint arXiv:2007.10185.
- [23] (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- [24] (2018) Prediction models for risk of type-2 diabetes using health claims. In Proceedings of the BioNLP 2018 workshop, pp. 172–176.
- [25] (2016) Deepr: a convolutional net for medical records. IEEE journal of biomedical and health informatics 21 (1), pp. 22–30.
- [26] (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543.
- [27] (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- [28] (2018) Improving language understanding by generative pre-training.
- [29] (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
- [30] (2020) Med-bert: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction. arXiv preprint arXiv:2005.12833.
- [31] (2016) “Why should I trust you?”: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144.
- [32] (2020) Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems 33.
- [33] (2019) Pre-training of graph augmented transformers for medication recommendation. arXiv preprint arXiv:1906.00346.
- [34] (2012) A scoring model to detect abusive billing patterns in health insurance claims. Expert Systems with Applications 39 (8), pp. 7441–7450.
- [35] (2020) Distribution of patients at risk for complications related to covid-19 in the united states: model development study. JMIR public health and surveillance 6 (2), pp. e19606.
- [36] (2019) Decision support system (dss) for fraud detection in health insurance claims using genetic support vector machines (gsvms). Journal of Engineering 2019.
- [37] (2019) Estimating prevalence, demographics, and costs of me/cfs using large scale medical claims data and machine learning. Frontiers in pediatrics 6, pp. 412.
- [38] (2019) Development of a classifier to identify patients with probable lennox–gastaut syndrome in health insurance claims databases via random forest methodology. Current medical research and opinion 35 (8), pp. 1415–1420.
- [39] (2020) Personalized predictive models for symptomatic covid-19 patients using basic preconditions: hospitalizations, mortality, and the need for an icu or ventilator. medRxiv.