Type 2 diabetes mellitus (T2D) is a chronic metabolic disorder characterized by hyperglycemia and is considered one of the main threats to human health (Zimmet et al., 2001). In developed countries, T2D makes up about 85% of diabetes mellitus patients and occurs when either insufficient insulin is produced, the body becomes resistant to insulin or both (World Health Organization et al., 1994). Prediabetes and less severe cases of T2D are initially managed by lifestyle changes, specifically increasing physical exercise, dietary change and smoking cessation (Tuomilehto et al., 2001; Diabetes Prevention Program Research Group et al., 2002; American Diabetes Association et al., 2014). If this yields insufficient glycemic control, pharmacotherapy with glucose-lowering agents (GLAs) like metformin or insulin is started (Turner et al., 1999; American Diabetes Association et al., 2014).
Several studies have indicated that one third to one half of T2D patients are undiagnosed (Harris et al., 1998a; King et al., 1998; Rubin et al., 1994). Additionally, patients often remain undiagnosed for extended periods of time, with average diagnosis-free intervals ranging from 4 to 7 years (Harris et al., 1992). The prognosis of untreated patients can deteriorate rapidly as prolonged hyperglycemia can cause serious damage to many of the body’s systems. Timely diagnosis of T2D proves challenging in contemporary medicine, as many patients already present signs of complications of the disease at the time of clinical diagnosis of T2D (Harris et al., 1998b; Rajala et al., 1998; Kohner et al., 1998; Ballard et al., 1988; Harris and Eastman, 2000; Hu et al., 2002).
Earlier diagnosis and subsequent treatment are believed to prevent or delay complications and improve prognosis (Pauker, 1993; Engelgau et al., 2000). When impaired glucose tolerance is diagnosed early, initial treatment can often be limited to lifestyle changes (Pan et al., 1997; Tuomilehto et al., 2001; Diabetes Prevention Program Research Group et al., 2002). Compared to pharmacotherapy, lifestyle changes are simple, fully manageable by the patient and far less likely to cause serious treatment-induced complications like hypoglycemia (Seltzer, 1989; Zammitt and Frier, 2005). In addition to its health benefits, early diagnosis of T2D offers a health-economic advantage, as patients who do not require acute or intensive long-term treatment are far less demanding on the health care system.
Universal screening for T2D is cost-prohibitive (Wareham and Griffin, 2001; Engelgau et al., 2000), but many organizations advise opportunistic screening of high-risk subgroups (World Health Organization et al., 1994; Alberti et al., 1998; Engelgau et al., 2000; American Diabetes Association et al., 2014). Several risk profiling strategies have been developed to aid in the timely diagnosis of T2D (Baan et al., 1999; Stern et al., 2002; Lindström and Tuomilehto, 2003; McNeely et al., 2003; Charlone et al., 2004; Heikes et al., 2008; Schwarz et al., 2009). Risk profiling is typically done by assessing some of the key risk factors for T2D, which include obesity (Mokdad et al., 2003), genetic predisposition (Shai et al., 2006; Consortium et al., 2013), lifestyle (Reis et al., 2011) and various clinical parameters. Existing risk profiling approaches are implemented via questionnaires, potentially augmented with clinical information that is available to the patient’s general practitioner (Griffin et al., 2000; Spijkerman et al., 2004; Lindström and Tuomilehto, 2003; Glümer et al., 2004; Schulze et al., 2007; Heikes et al., 2008). Commonly required information includes BMI, family history, exercise and smoking habits and various clinical parameters.
In this work, we present an alternative approach for risk profiling which only requires data that is already available to Belgian mutual health insurers. This work was done in collaboration with the National Alliance of Christian Mutualities (NACM). NACM is the largest Belgian mutual health insurer with over four million members. Our approach does not require any questionnaires or additional clinical information and predicts whether a patient will start taking GLAs in the next few years. Interestingly, our approach works well despite the fact that Belgian health insurer data contains little direct information regarding key risk factors of T2D: weight, lifestyle and family history are all unavailable.
2 Existing type 2 diabetes risk profiling approaches
The Cambridge Risk Score (CRS) was developed to assess the probability of undiagnosed T2D based on data that is routinely available in primary care records, including age, sex, medication use, family history of diabetes, BMI and smoking status (Griffin et al., 2000). The CRS has been shown to be useful on multiple occasions (Griffin et al., 2000; Park et al., 2002; Spijkerman et al., 2004), though its AUC seems to depend heavily on the population in which it is used, ranging between 67% (Spijkerman et al., 2004) and 80% (Griffin et al., 2000). The information used in the CRS is comparable to another approach which obtained AUCs ranging between and (Baan et al., 1999).
The FINDRISC score is based on a 10-year follow-up using age, BMI, waist circumference, history of antihypertensive drugs and high blood glucose, physical activity and diet with reported AUCs of and in predicting drug-treated diabetes (Lindström and Tuomilehto, 2003). The strongest reported predictors in this study were BMI, waist circumference, history of high blood glucose and physical activity. Glümer et al. (2004) developed a risk score based on age, sex, BMI, known hypertension, physical activity and family history of diabetes with AUC ranging from to . The German diabetes risk score reached AUCs ranging from to on validation data and is based on age, waist circumference, height, history of hypertension, physical activity, smoking, and diet (Schulze et al., 2007).
Heikes et al. (2008) developed a decision tree for risk prediction achieving AUC in a cross-validation setting, based on weight, age, family history and various clinical parameters. Various other approaches based on routine clinical information have demonstrated similarly accurate predictions of type 2 diabetes (Stern et al., 2002; McNeely et al., 2003).
3 Health expenditure data
The Belgian health care insurance is a broad solidarity-based form of social insurance. Mutual health insurers such as NACM are the legally appointed bodies for managing and providing the Belgian compulsory health care and disability insurance, among other things. To implement their operations, Belgian mutual health insurers have large databases at their disposal containing health expenditure records of all their respective members.
These expenditure records hold all financial reimbursements of drugs, procedures and contacts with health care professionals. Each record comprises a timestamp, financial details and a description of the claim. The financial aspect is irrelevant from a medical point of view, but the type of resource-use as indicated by the description can contain medical information about the patient. These types belong to one of two main categories:
Drug purchases are recorded per package. The coding of packages contains information about the active substances in the drug along with the volume of the package.
Medical provisions are identified by a national encoding along with an identifier of the associated medical caregiver. Each provision has a distinct code number.
In addition to resource-use data, some biographical information is available about each patient including age, gender, place of residence and social parameters. In the remainder of this Section we will elaborate on expenditure records related to drugs and provisions. Subsequently we will briefly summarize the main strengths and limitations of using health expenditure data for predictive modeling.
3.1 Records related to drug purchases
Expenditure records concerning drug purchases contain information about the active substances in the drug and the purchased volume. We mapped all active substances onto the anatomical therapeutic chemical (ATC) classification system maintained by the WHO Collaborating Centre for Drug Statistics Methodology (2015). The ATC classification system divides active substances into different groups based on the organ or system on which they act and their therapeutic, pharmacological and chemical properties. Each drug is classified in groups at 5 levels in the ATC hierarchy: fourteen main groups (1st level), pharmacological/therapeutic subgroups (2nd level), chemical subgroups (3rd and 4th level) and the chemical substance (5th level).
After mapping records onto the ATC classification system, a patient’s medication history consists of specific ATC codes (5th level) along with the associated number of defined daily doses (DDD). In the period of interest, purchases of 4,580 distinct active substances were recorded in the NACM database. Table 1 shows an example of the classification of an active substance on all levels in the ATC system.
|ATC level||code||description|
|1||A||alimentary tract and metabolism|
|2||A10||drugs used in diabetes|
|3||A10B||blood glucose lowering drugs, excluding insulins|
3.2 Records related to medical provisions
Expenditure records concerning medical provisions can be considered tuples containing time-stamped identifiers of the patient, physician and medical provision. A single patient-physician interaction may yield multiple such records, one for each specific provision that occurred.
In the Belgian health care system, medical provisions are encoded via the Belgian nomenclature of medical provisions (Van den Oever and Volckaert, 2008), which is maintained by the National Institute for Health and Disability Insurance (NIHDI; website available at http://www.riziv.fgov.be). This nomenclature is an unstructured list of unique codes (numbers) for each provision that is being refunded. Nomenclature numbers are added when new provisions are defined or when revisions are made. A single provision may correspond to multiple numbers for various reasons.
3.3 Advantages of health expenditure data
The key benefit of expenditure databases is that they centralize structured medical information across all medical stakeholders to yield a comprehensive, longitudinal overview of each patient’s medical history. Other health data sources are fragmented, e.g. medical records maintained by the patient’s general practitioner or hospital often contain only a subset of the patient’s medical history. This fragmentation hampers the identification of patterns that may indicate elevated risk for diseases like type 2 diabetes. The NACM database comprises claims records of over four million Belgians, which enables complex modeling. Additionally, claims data have few omissions due to the financial incentive for patients and medical stakeholders (hospitals) to claim refunds. While other health data sources may contain more detailed information, the strength of NACM’s data is in its volume, both in terms of number of patients and the amount of information that is recorded per individual. Finally, as most people tend to stay affiliated with the same mutual health insurer, their expenditure records provide long-term information.
3.4 Limitations of health expenditure data
Belgian health expenditure data is strictly limited to what is required for mutual health insurers to implement their operations, which are mainly administrative in nature. Detailed health information such as diagnoses and test results is not directly available. In some other countries, health insurers have access to more detailed information, such as ICD-10 codes which include diagnoses and symptoms (World Health Organization et al., 2012). Including such information is out of scope of this work as we focus exclusively on data that is already available to Belgian mutual health insurers. Biographical information about patients does not contain direct information about some important risk factors such as lifestyle, family history and BMI, though this may be partially embedded indirectly in medical resource-use.
In this Section we define the prediction task and describe all its aspects: the overall setup (Section 4.1), the data and its representation (Section 4.2) and the learning algorithms (Section 4.3). Briefly, our aim is to predict which patients will start glucose-lowering pharmacotherapy within the next 4 years, based on expenditure records of the previous 4 years.
Our key hypothesis is that patients with increased risk for T2D or those that are already afflicted but not diagnosed have a different medical expenditure history than patients without impaired glycemic control. We essentially use the start of GLA therapy as a proxy for diagnosis of (advanced) type 2 diabetes. This is reasonable since most patients older than 40 who start GLA therapy have T2D (World Health Organization et al., 1994).
We posed this task as a binary classification problem. Our classifiers produce a numeric level of confidence that a given patient will start glucose-lowering pharmacotherapy. When predicting a population, the outputs can be used to rank patients according to decreasing confidence that the patients will start glucose-lowering therapy. Highly ranked patients represent a high-risk subgroup which can be targeted for clinical screening.
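The ranking step can be illustrated with a minimal Python sketch. The function name `top_risk_patients` and the dictionary input format are hypothetical choices made for this illustration; the text only states that classifier outputs are used to rank patients by decreasing confidence.

```python
def top_risk_patients(scores, k):
    """Return the k patient ids with the highest decision values.

    scores: dict mapping a patient id to the classifier's numeric confidence
            that the patient will start glucose-lowering pharmacotherapy.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # decreasing confidence
    return ranked[:k]
```

The cutoff `k` determines the size of the subgroup that would be invited for clinical screening.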
The full learning setup is described in Section 4.1, involving different learning methods and representations of patients’ expenditure data. Briefly, we used nested cross-validation to obtain unbiased estimates of the predictive performance of each vectorization and learning approach. Predictive performance of all models was quantified via (area under) receiver operating characteristic (ROC) curves.
Our work is based on a subset of the expenditure records of NACM. All data extractions and analyses were performed at the Medical Management Department of the NACM under supervision of the Chief Medical Officer. The other research partners received no personally identifiable information (including small cells) from NACM. The patient selection and vector representations are described in detail in Section 4.2.
The positive class was defined as patients that require GLAs for long-term glycemic control (GLAs are defined as any drug in ATC category A10, which includes metformin, sulfonylureas and insulin). The negative class is then defined as patients that do not need GLAs. Expenditure records related to GLAs were used to identify a set of known positives. However, the absence of such records in a patient’s resource use history is not proof that this patient has no need for GLAs. This subtle difference is crucial, because it is well known that patients with impaired glycemic control or T2D often remain undiagnosed and hence untreated for a very long time (Harris et al., 1998a; King et al., 1998; American Diabetes Association et al., 2014). As we cannot identify negatives, we had to build models from positive and unlabeled data.
Learning binary classifiers from positive and unlabeled data (PU learning) is a well-studied branch of semi-supervised learning (Lee and Liu, 2003; Elkan and Noto, 2008; Mordelet and Vert, 2014; Claesen et al., 2015b). PU learning is more challenging than fully supervised binary classification, since it requires special learning approaches and quality metrics for hyperparameter optimization that account for the lack of known negatives. We benchmarked three PU learning methods, which are discussed in more detail in Section 4.3.
The entire data analysis pipeline was implemented using open-source software. For general data transformations and preprocessing we used SciPy and NumPy (Jones et al., 2001; Van Der Walt et al., 2011). The learning algorithms we used are available in scikit-learn and EnsembleSVM (Pedregosa et al., 2011; Claesen et al., 2014b). Finally, we used Optunity for automated hyperparameter optimization (Claesen et al., 2014a).
4.1 Experimental setup
We gathered all expenditure records during the 4-year interval of 2008 up to 2012. The selection protocol and representations of patients’ medical resource-use are discussed in detail in Section 4.2. All vector representations of patients include age (in years), an indicator variable for gender and positive entries related to the patient’s medical resource-use. A patient vector can be written in the following general form, where $n_m$ and $n_p$ denote the number of features in the vectorization of medication and provision use, respectively:
$$\mathbf{x} = [\,\text{age},\ \text{gender},\ m_1, \ldots, m_{n_m},\ p_1, \ldots, p_{n_p}\,]$$
In Sections 4.2.1 and 4.2.2 we explain how records related to medication purchases and provisions were represented in vector form. All entries in the vector representations were consistently normalized to the unit interval $[0, 1]$ by dividing feature-wise by the percentile and subsequently clipping where necessary. These normalized vector representations are used as inputs for the learning algorithms described in Section 4.3.
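The normalization step (divide each feature by one of its percentiles, then clip to the unit interval) might look as follows. This is a sketch under stated assumptions: the percentile used in the study is not recoverable from the text, so the default `q=99.0` is purely hypothetical, as are the nearest-rank percentile definition, the function names and the fallback for features whose percentile is zero.

```python
import math

def percentile_nearest_rank(values, q):
    """Nearest-rank percentile: smallest value such that at least q% of the data is <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def normalize_feature(values, q=99.0):
    """Scale one nonnegative feature to [0, 1].

    Divides by the q-th percentile (q is a hypothetical default) and clips
    values above that percentile to 1.
    """
    scale = percentile_nearest_rank(values, q)
    if scale <= 0:
        # Assumption: for very sparse features whose percentile is 0,
        # fall back to clipping the raw values to [0, 1].
        return [min(max(v, 0.0), 1.0) for v in values]
    return [min(v / scale, 1.0) for v in values]
```

Clipping keeps a few extreme consumers of a drug or provision from compressing all other patients towards zero.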
Figure 1 summarizes the full machine learning pipeline, which starts from expenditure records and ends with models to predict whether a patient will start glucose-lowering pharmacotherapy along with an estimate of their generalization performance. We used nested cross-validation to estimate generalization performance of different learning configurations (Varma and Simon, 2006). The outer 3-fold cross-validation is used to estimate generalization performance of the full learning approach. Internally, twice iterated 10-fold cross-validation was used to find optimal hyperparameters for every learning method.
We used Optunity’s particle swarm optimizer to identify suitable hyperparameters for each approach based on the given training set as defined by the outer cross-validation procedure (Claesen et al., 2014a). Every tuple of hyperparameters was evaluated using twice iterated 10-fold cross-validation on the training set. Per technique, the hyperparameters that maximized cross-validated performance were selected and used to train a model on the full training set.
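The structure of the nested cross-validation can be sketched with plain index bookkeeping. The sketch below only generates the splits (3 outer folds, and 2 x 10 inner folds drawn from each outer training set); it does not stratify by class and does not call Optunity, and all function names are hypothetical.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nested_cv_splits(n, outer_k=3, inner_k=10, inner_repeats=2, seed=0):
    """Yield (train, test, inner_splits) for each outer fold.

    inner_splits is a list of (inner_train, inner_val) pairs drawn only from
    the outer training set, so the outer test fold never influences the
    hyperparameter search.
    """
    rng = random.Random(seed)
    outer = k_fold_indices(n, outer_k, rng)
    for i, test in enumerate(outer):
        train = [j for f, fold in enumerate(outer) if f != i for j in fold]
        inner_splits = []
        for _ in range(inner_repeats):  # "twice iterated" inner cross-validation
            for val_positions in k_fold_indices(len(train), inner_k, rng):
                val = [train[p] for p in val_positions]
                val_set = set(val)
                inner_train = [j for j in train if j not in val_set]
                inner_splits.append((inner_train, val))
        yield train, test, inner_splits
```

For every candidate hyperparameter tuple, the score averaged over `inner_splits` would drive the search; the selected tuple is then used to train on `train` and evaluated once on `test`.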
Models are compared based on area under the ROC curve. ROC curves visualize a classifier’s performance spectrum by depicting its true positive rate (TPR, the fraction of positives that are correctly identified by the classifier) as a function of its false positive rate (FPR, the fraction of negatives that are incorrectly identified as positives by the classifier) while varying the decision threshold to decide on positives. Area under the ROC curve (AUROC) is a useful summary statistic of a classifier’s performance. AUROC is equal to the probability that the classifier ranks a random positive higher than a random negative and is known to be equivalent to the Wilcoxon test of ranks (Hanley and McNeil, 1982).
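The probabilistic interpretation of AUROC translates directly into code. The following minimal sketch counts, over all positive-negative pairs, how often the positive receives the higher score (ties count one half). It assumes fully labeled data, which is exactly what is missing in our setting; the function name is hypothetical.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive is scored above a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute one half
    return wins / (len(pos_scores) * len(neg_scores))
```

This quadratic-time formulation is only meant to make the definition concrete; sorting-based implementations compute the same quantity far more efficiently.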
Computing ROC curves
Full label knowledge is required to compute ROC curves. In previous work, we introduced a method to compute bounds on ROC curves based on positive and unlabeled data (Claesen et al., 2015a). Briefly, it is based on the positions of known positives in a ranking produced by a given classifier and requires two things:
The rank distributions of labeled and latent positives must be comparable. This holds when known and latent positives follow the same distribution in input space (i.e., the vector representation of patients). This is a fair assumption in our application, since we specifically ignore records after the start of glucose-lowering pharmacotherapy while identifying the set of positives (see Section 4.2), so the medication regimen of known positives has not yet diverged from the regimen of untreated patients.
An estimate of the fraction of latent positives in the unlabeled set is needed, that is the fraction of members that have never used GLAs but are likely to start glucose-lowering pharmacotherapy. In the period – roughly of members of NACM aged 40 or higher started using GLAs. Underestimating results in an underestimated ROC curve and vice versa (Claesen et al., 2015a). We opted to be conservative and used to estimate lower bounds and for upper bounds.
We consistently used the lower bounds for hyperparameter search. All our performance reports contain lower and upper bounds, based on and , respectively.
In addition to measuring performance, we diagnosed overfitting via the concept of rank distributions as defined by Claesen et al. (2015a). The rank distribution of a subset of test instances is defined as the distribution of the positions of these test instances in a ranking of the full test set based on a model’s predicted decision values. We diagnose overfitting based on the rank distributions of known positive training instances and known positives in the independent test fold after predicting the full data set. If the model overfits, the rank distribution of the test positives is inconsistent with that of the training positives. Specifically, the test positives rank worse than the training positives when the model overfits. This can be quantified via the Mann-Whitney U test (Mann and Whitney, 1947) based on the ranks of both groups after predicting the full data set (that is, all outer folds). The Mann-Whitney U test is expected to yield a non-significant result when the two rank distributions are comparable. We report the average p-values of the test across outer cross-validation folds for each model (low p-values indicate overfitting).
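A possible implementation of this diagnostic is sketched below. It is a simplification under stated assumptions: we use a one-sided test with the normal approximation and without tie or continuity corrections, whereas the text does not specify which variant of the Mann-Whitney U test was used. Rank 1 denotes the top of the ranking, and the function names are hypothetical.

```python
import math

def mann_whitney_u(x, y):
    """Number of pairs (x_i, y_j) with x_i > y_j; ties count one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def overfit_p_value(train_ranks, test_ranks):
    """One-sided p-value for 'test positives rank worse than training positives'.

    Ranks are positions in the ranking of the full data set (1 = highest
    decision value). A small p-value suggests overfitting.
    """
    n1, n2 = len(test_ranks), len(train_ranks)
    u = mann_whitney_u(test_ranks, train_ranks)  # large when test ranks are worse
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal
```

When training and test positives are interleaved in the ranking, the statistic stays close to its null mean and the p-value is large.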
4.2 Data Set Construction
We constructed a data set containing records of patients born before 1973 (i.e., 40 or more years old in 2012). Patients with records of glucose-lowering agents (GLAs) spanning less than 30 days were discarded. Patients with records of glucose-lowering therapy prior to 2012 were discarded. Patients that joined NACM after 2005 were also discarded, as we cannot determine whether these patients used GLAs in the recent past.
All patients that started glucose-lowering pharmacotherapy in 2012 or later are included as known positives (), along with unlabeled patients that were sampled at random from the remaining NACM members (). Known positives have a minimum of 30 days between the first and last purchase of GLAs to avoid contaminating the data set with false positives, for instance due to insulin use in surgical and medical ICUs (Van den Berghe et al., 2001, 2006). It must be noted that some false positives remain, that is patients that use GLAs but not for glycemic control.
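The labeling rule for known positives can be written down compactly. The sketch below encodes only the two criteria stated above (first GLA purchase in 2012 or later, and at least 30 days between the first and last purchase); the function name, the argument format and the treatment of 1 January 2012 as the boundary are assumptions for illustration.

```python
from datetime import date

def is_known_positive(gla_purchase_dates,
                      first_allowed=date(2012, 1, 1),
                      min_span_days=30):
    """Decide whether a patient counts as a known positive.

    gla_purchase_dates: list of datetime.date objects, one per recorded
    purchase of a drug in ATC category A10.
    """
    if not gla_purchase_dates:
        return False  # no GLA records: the patient remains unlabeled
    first, last = min(gla_purchase_dates), max(gla_purchase_dates)
    started_in_window = first >= first_allowed
    long_enough = (last - first).days >= min_span_days
    return started_in_window and long_enough
```

Patients with GLA records that fail either criterion would be discarded rather than treated as unlabeled, in line with the selection protocol.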
4.2.1 Representation of medication records
The simplest way to represent medication purchases during a time interval is by having one input dimension per active substance (level 5 ATC codes) and counting the purchased volume in terms of DDDs. This representation is easy to construct but fails to capture any similarity between active substances, such as the system or organ on which they act.
We can directly use the hierarchical structure of the ATC system to define a measure of similarity between drugs. To impose structure between drugs we included input dimensions related to more generic levels of the ATC hierarchy (levels 1 to 4). On more generic levels we summed all DDD counts of active substances per category (level 5). This redundancy allowed us to express similarity between different active substances with a standard inner product. By normalizing every feature to the unit interval, we obtained the desired effect that patients with comparable drug use on ATC level 5 are more similar than patients that only share coefficients on more generic levels. Figure 2 illustrates this vector representation of trees and the effect of normalization.
All vectorizations related to drug purchases are described in Table 2.
|vectorization||description||number of features|
|atc 5||counts of DDDs per medication class in ATC level 5||4,580|
|atc 1–4||counts of DDDs per medication class in ATC levels 1–4||1,257|
|atc 1–5||counts of DDDs per medication class in ATC levels 1–5||5,837|
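To make the hierarchical representation concrete, the sketch below propagates DDD counts from level-5 codes to all their ancestors, using the fact that ATC levels 1 to 5 correspond to code prefixes of 1, 3, 4, 5 and 7 characters. The example codes (A10BA02 metformin, A10BB01 glibenclamide, C07AB02 metoprolol) and the DDD counts are ours and serve only as illustration; the function names are hypothetical and normalization is omitted.

```python
from collections import defaultdict

ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)  # character lengths of ATC levels 1..5

def atc_tree_features(ddd_by_substance):
    """Map {level-5 ATC code: DDDs} to {ATC group at any level: summed DDDs}."""
    features = defaultdict(float)
    for code, ddd in ddd_by_substance.items():
        for n in ATC_PREFIX_LENGTHS:
            features[code[:n]] += ddd  # add the DDDs to every ancestor group
    return dict(features)

def inner_product(u, v):
    """Standard inner product between two sparse feature dictionaries."""
    return sum(value * v[key] for key, value in u.items() if key in v)

patient_a = atc_tree_features({"A10BA02": 30.0})  # metformin
patient_b = atc_tree_features({"A10BB01": 30.0})  # glibenclamide
patient_c = atc_tree_features({"C07AB02": 30.0})  # metoprolol
```

Patients a and b share no level-5 feature, yet their inner product is positive because they overlap on levels 1 to 3 (A, A10, A10B), whereas patients a and c remain orthogonal.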
4.2.2 Representation of provision records
When considering a specific time period, we can describe records by a (sparse) three-dimensional tensor containing frequency counts as illustrated in Figure 3. We filtered out all provisions with a description containing diabetes, insulin or glucose, as well as provisions not recorded with a physician identifier. After filtering, 5,799 distinct provision codes remain.
Each patient is modelled by a histogram of their provisions in the period of interest. This essentially means we compute the sum over the physician component of the tensor representation to obtain a matrix, in which rows and columns represent patients and provisions, respectively. Unfortunately, the encoding of provisions has no medically relevant structure in contrast to the ATC hierarchy for drugs as discussed in Section 4.2.1.
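The two marginalizations of the tensor can be sketched with sparse counters. The record format (patient, physician, provision) follows Section 3.2; the function name and the choice of `Counter` objects keyed by index pairs are assumptions for this illustration.

```python
from collections import Counter

def provision_matrices(records):
    """Build sparse count matrices from (patient, physician, provision) tuples.

    Returns two Counters:
      patient_matrix[(patient, provision)]     : tensor summed over physicians
      physician_matrix[(provision, physician)] : tensor summed over patients
    """
    patient_matrix, physician_matrix = Counter(), Counter()
    for patient, physician, provision in records:
        patient_matrix[(patient, provision)] += 1
        physician_matrix[(provision, physician)] += 1
    return patient_matrix, physician_matrix
```

The patient matrix holds the per-patient histograms used for prediction, while the physician matrix describes which physicians perform each provision.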
In order to define a reasonable similarity measure between patients, we first had to impose a structure onto the nomenclature that captures similarity between provisions. To structure provisions, we should not use information originating from the patient matrix, as this may cause information leaks (since the patient matrix is used directly in our models for prediction). Instead, we used the complementary physician matrix as a basis to define similarity between provisions, which essentially serves as a proxy for the medical specializations to which each provision belongs. We started from cosine similarity between nomenclature codes based on the physician matrix. We used cosine similarity because it is known to work well for text mining with bag-of-words representations, which is comparable to our use case as it also features sparse, high-dimensional input spaces. The cosine similarity between two row vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as:
$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}\,\mathbf{b}^T}{\|\mathbf{a}\|\,\|\mathbf{b}\|}.$$
Using cosine similarity we can construct a pair-wise similarity matrix $\mathbf{S}$ between provisions based on the rows of the physician matrix $\mathbf{M}$:
$$\mathbf{S}_{ij} = \cos(\mathbf{M}_{i:}, \mathbf{M}_{j:}).$$
$\mathbf{S}$ expresses similarity between provision codes based on the physicians that provide them and can be regarded as a proxy for the medical subdomain each provision frequently occurs in. In our context, its entries range from 0 (completely orthogonal) to 1 (exact similarity). To impose sparsity we set all entries of $\mathbf{S}$ below a fixed threshold to 0. Its structure is visualized in Figure 4, which clearly indicates that our approach successfully identifies some coherent groups of provisions.
Finally, the structured representation of provisions is defined as the matrix product between the patient matrix $\mathbf{X}$ and the provision similarity matrix $\mathbf{S}$:
$$\mathbf{X}_S = \mathbf{X}\,\mathbf{S}.$$
$\mathbf{X}_S$ approximately captures which provisions occur in a patient’s history with redundancy based on medical specializations.
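The construction of the structured provision representation can be sketched in three small steps: cosine similarity between rows of the physician matrix, thresholding to impose sparsity, and multiplication of a patient's provision histogram with the resulting similarity matrix. The dictionary-based sparse format, the function names and the threshold value passed by the caller are assumptions; the threshold used in the study is not recoverable from the text.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm > 0 else 0.0

def provision_similarity(physician_rows, threshold):
    """physician_rows: {provision: {physician: count}}.

    Returns the sparse similarity matrix as {(i, j): similarity}; entries
    below the threshold are dropped, which is equivalent to setting them to 0.
    """
    codes = sorted(physician_rows)
    sim = {}
    for i in codes:
        for j in codes:
            s = cosine(physician_rows[i], physician_rows[j])
            if s >= threshold:
                sim[(i, j)] = s
    return sim

def structured_features(patient_row, sim):
    """One row of the product (patient matrix) x (similarity matrix)."""
    out = {}
    for (i, j), s in sim.items():
        if i in patient_row:
            out[j] = out.get(j, 0.0) + patient_row[i] * s
    return out
```

In this toy setting, a patient who received a provision also obtains a nonzero entry for every other provision that is typically performed by the same physicians. The all-pairs loop is quadratic in the number of provision codes, which is acceptable for a few thousand codes but would call for sparse matrix routines at larger scale.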
All vectorizations related to medical provisions are described in Table 3.
|vectorization||description|
|provs flat||entries taken from the patient matrix|
|provs struct||captures similarity between provisions|
|provs both||concatenation of flat & structured|
4.3 Modeling approaches
Having only positive and unlabeled data (PU learning) presents additional challenges for learning algorithms. Two broad classes of approaches exist to tackle these problems: (i) two-phase methods that first attempt to identify likely negatives from the unlabeled set and then train a supervised model on the positives and inferred negatives (Liu et al., 2002; Yu, 2005) and (ii) approaches that treat the unlabeled set as negatives with label noise (Elkan and Noto, 2008; Lee and Liu, 2003; Mordelet and Vert, 2014; Claesen et al., 2015b).
We have tested three approaches from the latter category in this work, namely class-weighted SVM (Liu et al., 2003), bagging SVM (Mordelet and Vert, 2014) and the robust ensemble of SVM models (Claesen et al., 2015b). All of these approaches are based on support vector machines. We used the linear kernel on vector representations of patients as described in Section 4.2, though it must be noted that the ensemble methods are always implicitly nonlinear. We will briefly introduce each method in the following subsections.
4.3.1 Class-weighted SVM
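Class-weighted SVM (CWSVM) uses a misclassification penalty per class. CWSVM was first applied in a PU learning context by Liu et al. (2003), by considering the unlabeled set to be negative with noise on its labels. A CWSVM is trained to distinguish positives ($\mathcal{P}$) from unlabeled instances ($\mathcal{U}$), leading to the following optimization problem:
$$\min_{\alpha,\,\xi,\,b}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\,\kappa(\mathbf{x}_i,\mathbf{x}_j) + C_{\mathcal{P}}\sum_{i\in\mathcal{P}}\xi_i + C_{\mathcal{U}}\sum_{i\in\mathcal{U}}\xi_i$$
$$\text{s.t.}\quad y_i\Big(\sum_{j=1}^{N}\alpha_j y_j\,\kappa(\mathbf{x}_i,\mathbf{x}_j) + b\Big) \ge 1-\xi_i,\qquad \xi_i \ge 0,\qquad i=1,\ldots,N,$$
where $\alpha \in \mathbb{R}^N$ are the support values, $y \in \{-1,+1\}^N$ is the label vector, $\kappa(\cdot,\cdot)$ is the kernel function, $b$ is the bias term and $\xi \in \mathbb{R}^N$ are the slack variables for soft-margin classification. The misclassification penalties $C_{\mathcal{P}}$ and $C_{\mathcal{U}}$ require tuning. We used the implementation available in scikit-learn (Pedregosa et al., 2011) based on LIBSVM (Chang and Lin, 2011).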
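The effect of per-class penalties is easiest to see in the primal objective of a linear model, where each slack variable equals the hinge loss of the corresponding instance. The sketch below merely evaluates that objective for a given weight vector; it is not the solver used in this work, and the function and argument names are hypothetical.

```python
def cwsvm_objective(w, b, X, y, c_pos, c_unl):
    """Primal objective of a linear class-weighted SVM.

    w: weight vector, b: bias term
    X: list of feature vectors, y: labels (+1 known positive, -1 unlabeled)
    c_pos, c_unl: misclassification penalties for positives and unlabeled
    """
    regularizer = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)  # hinge loss equals the optimal slack
        penalty += (c_pos if yi > 0 else c_unl) * slack
    return regularizer + penalty
```

Choosing `c_pos` larger than `c_unl` makes margin violations on known positives more costly than violations on unlabeled instances, which reflects that unlabeled instances may be latent positives. In scikit-learn, a comparable effect can be obtained through the `class_weight` parameter of `sklearn.svm.SVC`.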
4.3.2 Bagging SVM
In bagging SVM, random resamples are drawn from the unlabeled set and CWSVM classifiers are trained to discriminate all positives from each resample (Mordelet and Vert, 2014). Resampling the unlabeled set induces variability in the base models which is exploited via bagging. Base model predictions are aggregated via majority voting.
Bagging SVM with linear base models has two hyperparameters, namely the size of resamples of the unlabeled set $n_{\mathcal{U}}$ and the misclassification penalty on unlabeled instances $C_{\mathcal{U}}$. The misclassification penalty on positives $C_{\mathcal{P}}$ is fixed via the following rule:
$$C_{\mathcal{P}} = \frac{n_{\mathcal{U}}}{|\mathcal{P}|}\, C_{\mathcal{U}}, \tag{6}$$
where $|\mathcal{P}|$ denotes the number of known positives. The heuristic rule in Equation 6 is common in imbalanced settings (Cawley, 2006; Daemen et al., 2009). We implemented bagging SVM using the EnsembleSVM library (Claesen et al., 2014b).
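The ingredients of bagging SVM other than the base learner itself can be sketched as follows. We assume that the penalty rule takes the common class-balancing form in which both classes receive the same total penalty, and that resamples are drawn with replacement; neither detail is confirmed by the text, and all function names are hypothetical.

```python
import random

def positive_penalty(c_unl, n_unl, n_pos):
    """Class-balancing heuristic (assumed form): c_pos * n_pos == c_unl * n_unl."""
    return c_unl * n_unl / n_pos

def bagging_training_sets(positives, unlabeled, n_unl, n_models, seed=0):
    """Each base model is trained on all positives plus a random resample
    (here: with replacement) of n_unl unlabeled instances."""
    rng = random.Random(seed)
    return [(list(positives), [rng.choice(unlabeled) for _ in range(n_unl)])
            for _ in range(n_models)]

def majority_vote(base_predictions):
    """Aggregate +1/-1 base model predictions for a single instance."""
    return 1 if sum(base_predictions) > 0 else -1
```

For ranking purposes, the fraction of base models voting positive could serve as a decision value instead of the hard majority label.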
4.3.3 Robust ensemble of SVM models
The robust ensemble of SVM models (RESVM) is a modified version of bagging SVM in which both the positive and unlabeled sets are resampled when constructing base model training sets (Claesen et al., 2015b). The extra resampling induces additional variability between base models which improves performance when combined with a majority vote aggregation scheme. Claesen et al. (2015b) demonstrated that resampling the positive set provides robustness against false positives, which makes RESVM appealing for our application since our data set is known to contain a small fraction of false positives (as explained in Section 4.2).
When using linear base models, the RESVM approach has four hyperparameters that must be tuned, namely resample sizes and misclassification penalties per class. This approach was implemented based on EnsembleSVM (Claesen et al., 2014b).
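The difference with bagging SVM lies in how base model training sets are built: RESVM resamples the positives as well. A minimal sketch, assuming sampling with replacement and using hypothetical names for the function and the two resample sizes:

```python
import random

def resvm_training_sets(positives, unlabeled, n_pos, n_unl, n_models, seed=0):
    """Each base model is trained on a resample of the positives and a
    resample of the unlabeled set (both drawn with replacement here)."""
    rng = random.Random(seed)
    return [([rng.choice(positives) for _ in range(n_pos)],
             [rng.choice(unlabeled) for _ in range(n_unl)])
            for _ in range(n_models)]
```

Because every base model sees only part of the positive set, a false positive in the data affects only some of the base models, which is the intuition behind the robustness reported by Claesen et al. (2015b).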
5 Results and discussion
Section 5.1 shows the predictive performance per learning configuration and compares these performances to the current state-of-the-art in large-scale risk assessment for T2D. Section 5.2 shows performance curves of the best configuration, which enable us to determine suitable cutoffs to identify target groups in practice. Finally, Section 5.3 describes a simple approach to assess which features contribute most to risk according to our best models.
5.1 Benchmark of learning methods
Table 4 summarizes the performance of each learning configuration. The age, gender feature set provides a baseline for comparison; all other feature sets include these as well. As shown in the results, this two-dimensional representation already carries some information.
|RESVM||bagging SVM||class-weighted SVM|
|features||AUROC ()||AUROC ()||AUROC ()|
Based on Table 4 we can conclude that a patient’s medication history is highly informative to predict the start of GLA therapy. Using features based on ATC level 5, the RESVM model obtained an AUC between and . By adding redundancy as described in Section 4.2.1 the performance based on medication history alone was further increased to between and for the best learning approach (RESVM).
Predictive performance based on provisions alone turned out to be fairly poor, showing only a mild improvement compared to models based exclusively on age and gender for all learning algorithms. Interestingly, the best approach for representations based on provisions was class-weighted SVM, with RESVM being the worst of all three learning methods. It appears that for these representations, large training sets are better: class-weighted SVM uses the full training set, bagging SVM uses all positives and a subset of unlabeled instances per base model and RESVM uses (small) subsets of both positives and unlabeled instances per base model.
The best representation included age, gender, and structured information about drugs and provision history of the patient. The best learning method on this representation was RESVM, achieving an AUC between and . In Section 5.1.1 we compare the performance of our approach to competing screening methods.
Finally, RESVM appears most resistant to overfitting in the hyperparameter optimization stage as it consistently exhibits the highest average -values in our diagnostic test (higher is better, see Section 4.1). We believe this to be attributable to the use of small resamples of both positives and unlabeled instances when training base models in RESVM, since this makes it unlikely to obtain a structural overfit of the ensemble model on the full training set. In contrast, bagging SVM is far more prone to overfitting because every base model is trained on all positives.
5.1.1 Comparison to state-of-the-art
Our best approach obtained cross-validated AUC between and (exact numbers are unknown due to the lack of known negatives). This is comparable to many competing approaches based on questionnaires and some clinical information, such as the Cambridge Risk Score (AUC 67%–80%, (Spijkerman et al., 2004; Griffin et al., 2000)), the Danish risk score (AUC –, (Glümer et al., 2004)), the German diabetes risk score (AUC –, (Schulze et al., 2007)) and a Dutch approach (AUC 74%, (Baan et al., 1999)). Approaches using detailed clinical information generally perform better, but are more expensive to maintain (Stern et al., 2002; McNeely et al., 2003; Lindström and Tuomilehto, 2003; Heikes et al., 2008). The key advantage of our approach is that it is easy to implement on a population-wide scale at virtually no operational cost.
The target class we used in this work is stricter than in the risk prediction methods mentioned in Section 2, namely patients that require GLAs for glycemic control versus patients with impaired glycemic control, respectively (except for Lindström and Tuomilehto (2003), which also predicted drug-treated T2D). It is reasonable to assume that our models generally rank patients with impaired glycemic control but without a need for GLAs higher than patients without impaired glycemic control. In our performance assessment both of these patient groups are essentially treated as negatives, in contrast to the screening programmes mentioned previously which treat patients with impaired glycemic control as positives. Hence, we believe the performance of our models would appear higher when evaluated against a target class comprising all patients with impaired glycemic control, as is done in the evaluation of other screening approaches. Unfortunately, we are unable to accurately identify patients with impaired glycemic control but without need for GLAs.
All competing methods use either clinical information or direct knowledge of risk factors that is unavailable to us. Furthermore, the characteristics that are lacking in our data have been reported to be the most informative to assess risk for T2D (Lindström and Tuomilehto, 2003; Stern et al., 2002; McNeely et al., 2003). We obtained generalization performances that are comparable to existing approaches, despite these missing predictors. Finally, our approach is the only one that is based exclusively on existing data that is always available, without requiring additional patient contacts or clinical tests.
5.2 Receiver Operating Characteristic curves for RESVM
The RESVM model based on atc provs vectorization had the best overall performance. Figure 5 shows bounds on the ROC and PR curves for this model. These bounds were computed using the technique described by Claesen et al. (2015a). The true curves are unknown because we have no negative labels.
ROC curves enable us to determine a cutoff to use in practice, based on a suitable balance between true and false positive rate (sensitivity and 1-specificity, respectively). Determining a suitable balance requires a tradeoff between the importance of identifying undiagnosed patients (true positives) and the increased number of screening tests on patients that are in fact healthy (false positives).
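As a minimal sketch of how such a cutoff could be chosen, the function below picks the lowest decision threshold whose false positive rate stays within a given budget, which maximizes sensitivity under that constraint. It assumes a fully labeled validation set, which is not available in our setting (only bounds on the ROC curve are known), so the same reasoning would have to be applied to those bounds. The function name and the toy data are hypothetical.

```python
import numpy as np

def pick_cutoff(scores, labels, max_fpr):
    """Return (threshold, tpr, fpr) with the highest TPR such that FPR <= max_fpr.

    Returns None if no threshold satisfies the FPR budget.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = labels.sum()
    n_neg = (~labels).sum()
    best = None
    # Lowering the threshold can only increase both TPR and FPR,
    # so we scan thresholds from high to low and stop once the budget is exceeded.
    for t in np.unique(scores)[::-1]:
        predicted = scores >= t
        tpr = (predicted & labels).sum() / n_pos
        fpr = (predicted & ~labels).sum() / n_neg
        if fpr > max_fpr:
            break
        best = (float(t), float(tpr), float(fpr))
    return best

# Toy example: six patients with decision values and (hypothetical) true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(pick_cutoff(scores, labels, max_fpr=0.34))
```

A stricter budget (smaller `max_fpr`) reduces the number of unnecessary screening tests at the cost of missing more undiagnosed patients.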
It should be noted that precision depends on class balance. The PR curve shown in Figure 5(b) is therefore not representative of screening an overall population, since the overall population has a higher fraction of negatives than our custom data set (i.e. precision would be lower in practice). In contrast, the bounds in ROC space are representative because ROC curves are insensitive to changes in class distribution (Fawcett, 2006).
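The dependence of precision on class balance follows directly from its definition: at a fixed operating point (TPR, FPR) and a fraction of positives π, precision equals TPR·π / (TPR·π + FPR·(1−π)). The sketch below evaluates this expression for two class balances. The operating point and both prevalence values are illustrative numbers, not results from this study.

```python
def precision_at(tpr, fpr, prevalence):
    """Precision at a fixed ROC operating point for a given fraction of positives."""
    true_pos = tpr * prevalence            # expected fraction of true positives
    false_pos = fpr * (1.0 - prevalence)   # expected fraction of false positives
    return true_pos / (true_pos + false_pos)

# Same ROC operating point, two different class balances (illustrative values).
balanced = precision_at(tpr=0.8, fpr=0.1, prevalence=0.5)
population = precision_at(tpr=0.8, fpr=0.1, prevalence=0.05)
print(round(balanced, 3), round(population, 3))
```

The operating point in ROC space is identical in both calls, yet precision drops sharply when positives become rarer, which is why only the ROC bounds carry over to population-wide screening.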
5.3 Feature importance analysis for the RESVM model
The RESVM model is implicitly nonlinear due to its majority voting rule to aggregate base model decisions, which poses problems in assessing the importance of each predictor. However, our use of linear base models enables a simple approximation. The decision value $f_i(\mathbf{z})$ of base model $i$ for a test instance $\mathbf{z}$ can be written as follows:
$$f_i(\mathbf{z}) = \langle \mathbf{w}_i, \mathbf{z} \rangle + b_i,$$
where $\mathbf{w}_i$ is the separating hyperplane and $b_i$ is a bias term. A simple linear approximation of such ensemble models can be computed as the average of all $n$ base model hyperplanes:
$$g(\mathbf{z}) = \langle \bar{\mathbf{w}}, \mathbf{z} \rangle + \bar{b}, \qquad \bar{\mathbf{w}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{w}_i, \qquad \bar{b} = \frac{1}{n} \sum_{i=1}^{n} b_i.$$
Feature importance can then be determined based on the coefficients in $\bar{\mathbf{w}}$. Since we normalized all features to the unit interval, we can conclude that the features with the largest (positive) coefficients contribute most to risk as identified by our model.
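The averaging of base model hyperplanes described above can be sketched in a few lines. The array layout (one row of weights per base model), the function names and the toy feature names are assumptions for illustration only.

```python
import numpy as np

def average_hyperplane(W, b):
    """Average the weight vectors and bias terms of linear base models.

    W: array of shape (n_models, n_features), one hyperplane per row.
    b: array of shape (n_models,), one bias term per base model.
    """
    W = np.asarray(W, dtype=float)
    b = np.asarray(b, dtype=float)
    return W.mean(axis=0), float(b.mean())

def rank_features(w_bar, names):
    """Sort features by averaged coefficient, largest (highest risk) first."""
    order = np.argsort(w_bar)[::-1]
    return [(names[i], float(w_bar[i])) for i in order]

# Toy ensemble with two base models and three hypothetical features.
W = [[1.0, 0.0, -1.0],
     [3.0, 2.0, -1.0]]
b = [0.5, -0.5]
w_bar, b_bar = average_hyperplane(W, b)
print(rank_features(w_bar, ["feat_a", "feat_b", "feat_c"]))
```

Comparing coefficients in this way is only meaningful when all features share a common scale, as is the case here after normalization to the unit interval.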
Via this approach, the risk associated with the use of cardiovascular medication (ATC main category C) far outweighs that of all other ATC main categories. This is not surprising, as diabetes is known to be strongly related to cardiovascular problems (Kannel and McGee, 1979; Grundy et al., 1999; Hu et al., 2002). The relative importance of features will be discussed in detail in a subsequent medical paper.
In this work we have demonstrated the ability to predict clinical outcomes based solely on readily available health expenditure data. We successfully built proof-of-concept classifiers to predict the start of glucose-lowering pharmacotherapy in patients above 40. Our experiments show that accurate predictions can be made based on historical medication purchases. These predictions can be further improved by incorporating information about medical provisions and the use of appropriate vectorization schemes.
Since adult patients starting glucose-lowering pharmacotherapy are mainly afflicted with type 2 diabetes (T2D), our models can be used for T2D risk assessment. Our approach presents a novel method for case finding which can be easily incorporated in modern healthcare, since all required data is already available. The associated operational costs are very low as the entire workflow can be fully automated without any need for patient contacts or medical tests. As such, our work provides an efficient and cost-effective method to identify a high risk subgroup, which can then be screened using decisive clinical tests.
Interestingly, our approach works well even though health expenditure data contains very limited direct information on some important known risk factors. In that sense, our approach is fundamentally different from the current state-of-the-art which mainly focuses on quantifying known risk factors directly, either by asking the patient or through clinical tests. The performance of our approach is expected to improve further when additional information about these risk factors can be obtained, e.g. family history and lifestyle.
The authors wish to thank Bernard Debbaut, Frie Niesten, Koen Cornelis and Michiel Callens for their valuable input in various aspects of this study.
This study was supported by the Flemish Government (FWO: projects: G.0871.12N (Neural circuits); IWT: TBM-Logic Insulin(100793), TBM Rectal Cancer(100783), TBM IETA(130256); PhD grants; Industrial Research fund (IOF): IOF Fellowship 13-0260; iMinds Medical Information Technologies SBO 2015, ICON projects (MSIpad, MyHealthData); VLK Stichting E. van der Schueren: rectal cancer) and the Belgian Federal Government (FOD: Cancer Plan 2012-2015 KPC-29-023 (prostate); COST: Action: BM1104: Mass Spectrometry Imaging). M.C. is funded by a PhD grant awarded by IWT (#111065). P.G. is funded by a clinical research foundation of the University Hospitals Leuven-KUL.
- Alberti et al. (1998) KGMM Alberti, Mayer B Davidson, Ralph A DeFronzo, Allan Drash, Saul Genuth, Maureen I Harris, Richard Kahn, Harry Keen, William C Knowler, Harold Lebovitz, et al. Report of the expert committee on the diagnosis and classification of diabetes mellitus. Diabetes Care, 21:S5, 1998.
- American Diabetes Association et al. (2014) American Diabetes Association et al. Standards of medical care in diabetes–2014. Diabetes care, 37(Supplement 1):S14–S80, 2014.
- Baan et al. (1999) Caroline A Baan, Johannes B Ruige, Ronald P Stolk, JC Witteman, Jacqueline M Dekker, Robert J Heine, and EJ Feskens. Performance of a predictive model to identify undiagnosed diabetes in a health care setting. Diabetes Care, 22(2):213–219, 1999.
- Ballard et al. (1988) David J Ballard, Linda L Humphrey, L Joseph Melton, Peter P Frohnert, Chu-Pin Chu, W Michael O’Fallon, and Pasquale J Palumbo. Epidemiology of persistent proteinuria in type II diabetes mellitus: population-based study in Rochester, Minnesota. Diabetes, 37(4):405–412, 1988.
- Cawley (2006) Gavin C Cawley. Leave-one-out cross-validation based model selection criteria for weighted LS-SVMs. In International Joint Conference on Neural Networks (IJCNN’06), pages 1661–1668. IEEE, 2006.
- Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- Charlone et al. (2004) G Charlone, L Torsten, C Bendix, et al. A Danish diabetes risk score for targeted screening. Diabetes Care, 27:727–733, 2004.
- Claesen et al. (2014a) Marc Claesen, Jaak Simm, Dusan Popovic, Yves Moreau, and Bart De Moor. Easy hyperparameter search using Optunity. arXiv preprint arXiv:1412.1114, 2014a.
- Claesen et al. (2014b) Marc Claesen, Frank De Smet, Johan A.K. Suykens, and Bart De Moor. EnsembleSVM: A library for ensemble learning using support vector machines. Journal of Machine Learning Research, 15:141–145, 2014b. URL http://jmlr.org/papers/v15/claesen14a.html.
- Claesen et al. (2015a) Marc Claesen, Jesse Davis, Frank De Smet, and Bart De Moor. Assessing binary classifiers using only positive and unlabeled data. In arXiv, 2015a.
- Claesen et al. (2015b) Marc Claesen, Frank De Smet, Johan A.K. Suykens, and Bart De Moor. A robust ensemble approach to learn from positive and unlabeled data using SVM base models. Neurocomputing, 160(0):73 – 84, 2015b. ISSN 0925-2312. doi: http://dx.doi.org/10.1016/j.neucom.2014.10.081. URL http://www.sciencedirect.com/science/article/pii/S0925231215001174.
- Consortium et al. (2013) InterAct Consortium et al. The link between family history and risk of type 2 diabetes is not explained by anthropometric, lifestyle or genetic risk factors: the EPIC-InterAct study. Diabetologia, 56(1):60–69, 2013.
- Daemen et al. (2009) Anneleen Daemen, Olivier Gevaert, Fabian Ojeda, Annelies Debucquoy, Johan A.K. Suykens, Christine Sempoux, Jean-Pascal Machiels, Karin Haustermans, and Bart De Moor. A kernel-based integration of genome-wide data for clinical decision support. Genome Medicine, 1(4):39, 2009.
- Diabetes Prevention Program Research Group et al. (2002) Diabetes Prevention Program Research Group et al. Reduction in the incidence of type 2 diabetes with lifestyle intervention or metformin. The New England journal of medicine, 346(6):393, 2002.
- Elkan and Noto (2008) Charles Elkan and Keith Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’08, pages 213–220, New York, NY, USA, 2008. ACM.
- Engelgau et al. (2000) Michael M Engelgau, KM Narayan, and William H Herman. Screening for type 2 diabetes. Diabetes Care, 23(10):1563–1580, 2000.
- Fawcett (2006) Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006. ISSN 0167-8655. doi: 10.1016/j.patrec.2005.10.010. URL http://dx.doi.org/10.1016/j.patrec.2005.10.010.
- Glümer et al. (2004) Charlotte Glümer, Bendix Carstensen, Annelli Sandbæk, Torsten Lauritzen, Torben Jørgensen, and Knut Borch-Johnsen. A Danish diabetes risk score for targeted screening the Inter99 study. Diabetes Care, 27(3):727–733, 2004.
- Griffin et al. (2000) SJ Griffin, PS Little, CN Hales, AL Kinmonth, and NJ Wareham. Diabetes risk score: towards earlier detection of type 2 diabetes in general practice. Diabetes/metabolism research and reviews, 16(3):164–171, 2000.
- Grundy et al. (1999) Scott M Grundy, Ivor J Benjamin, Gregory L Burke, Alan Chait, Robert H Eckel, Barbara V Howard, William Mitch, Sidney C Smith, and James R Sowers. Diabetes and cardiovascular disease a statement for healthcare professionals from the American Heart Association. Circulation, 100(10):1134–1146, 1999.
- Hanley and McNeil (1982) James A Hanley and Barbara J McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, 1982.
- Harris and Eastman (2000) Maureen I Harris and Richard C Eastman. Early detection of undiagnosed diabetes mellitus: a US perspective. Diabetes/metabolism research and reviews, 16(4):230–236, 2000.
- Harris et al. (1992) Maureen I Harris, Ronald Klein, Tim A Welborn, and Matthew W Knuiman. Onset of NIDDM occurs at least 4–7 yr before clinical diagnosis. Diabetes Care, 15(7):815–819, 1992.
- Harris et al. (1998a) Maureen I Harris, Katherine M Flegal, Catherine C Cowie, Mark S Eberhardt, David E Goldstein, Randie R Little, Hsiao-Mei Wiedmeyer, and Danita D Byrd-Holt. Prevalence of diabetes, impaired fasting glucose, and impaired glucose tolerance in US adults: the Third National Health and Nutrition Examination Survey, 1988–1994. Diabetes Care, 21(4):518–524, 1998a.
- Harris et al. (1998b) Maureen I Harris, Ronald Klein, Catherine C Cowie, Michael Rowland, and Danita D Byrd-Holt. Is the risk of diabetic retinopathy greater in non-hispanic blacks and mexican americans than in non-hispanic whites with type 2 diabetes?: A US population study. Diabetes Care, 21(8):1230–1235, 1998b.
- Heikes et al. (2008) Kenneth E Heikes, David M Eddy, Bhakti Arondekar, and Leonard Schlessinger. Diabetes risk calculator a simple tool for detecting undiagnosed diabetes and pre-diabetes. Diabetes Care, 31(5):1040–1045, 2008.
- Hu et al. (2002) Frank B Hu, Meir J Stampfer, Steven M Haffner, Caren G Solomon, Walter C Willett, and JoAnn E Manson. Elevated risk of cardiovascular disease prior to clinical diagnosis of type 2 diabetes. Diabetes Care, 25(7):1129–1134, 2002.
- Jones et al. (2001) Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001. URL http://www.scipy.org/. [Online; accessed 2015-04-16].
- Kannel and McGee (1979) WB Kannel and DL McGee. Diabetes and cardiovascular disease: The Framingham study. JAMA, 241(19):2035–2038, 1979. doi: 10.1001/jama.1979.03290450033020.
- King et al. (1998) Hilary King, Ronald E Aubert, and William H Herman. Global burden of diabetes, 1995–2025: prevalence, numerical estimates, and projections. Diabetes Care, 21(9):1414–1431, 1998.
- Kohner et al. (1998) Eva M Kohner, Stephen J Aldington, Irene M Stratton, Susan E Manley, Rury R Holman, David R Matthews, and Robert C Turner. United Kingdom Prospective Diabetes Study, 30: diabetic retinopathy at diagnosis of non–insulin-dependent diabetes mellitus and associated risk factors. Archives of Ophthalmology, 116(3):297–303, 1998.
- Lee and Liu (2003) Wee Sun Lee and Bing Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 448–455, 2003.
- Lindström and Tuomilehto (2003) Jaana Lindström and Jaakko Tuomilehto. The diabetes risk score a practical tool to predict type 2 diabetes risk. Diabetes Care, 26(3):725–731, 2003.
- Liu et al. (2002) Bing Liu, Wee Sun Lee, Philip S. Yu, and Xiaoli Li. Partially supervised classification of text documents. In ICML ’02: Proceedings of the Nineteenth International Conference on Machine Learning, pages 387–394, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. ISBN 1-55860-873-7.
- Liu et al. (2003) Bing Liu, Yang Dai, Xiaoli Li, Wee Sun Lee, and Philip S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the Third IEEE International Conference on Data Mining, ICDM ’03, pages 179–186, Washington, DC, USA, 2003. IEEE Computer Society. ISBN 0-7695-1978-4.
- Mann and Whitney (1947) Henry B Mann and Donald R Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947. doi: 10.1214/aoms/1177730491.
- McNeely et al. (2003) Marguerite J McNeely, Edward J Boyko, Donna L Leonetti, Steven E Kahn, and Wilfred Y Fujimoto. Comparison of a clinical model, the oral glucose tolerance test, and fasting glucose for prediction of type 2 diabetes risk in Japanese Americans. Diabetes Care, 26(3):758–763, 2003.
- Mokdad et al. (2003) Ali H Mokdad, Earl S Ford, Barbara A Bowman, William H Dietz, Frank Vinicor, Virginia S Bales, and James S Marks. Prevalence of obesity, diabetes, and obesity-related health risk factors, 2001. JAMA, 289(1):76–79, 2003.
- Mordelet and Vert (2014) Fantine Mordelet and Jean-Philippe Vert. A bagging SVM to learn from positive and unlabeled examples. Pattern Recognition Letters, 37:201–209, 2014.
- Pan et al. (1997) Xiao-Ren Pan, Guang-wei Li, Ying-Hua Hu, Ji-Xing Wang, Wen-Ying Yang, Zuo-Xin An, Ze-Xi Hu, Jina-Zhong Xiao, Hui-Bi Cao, Ping-An Liu, et al. Effects of diet and exercise in preventing NIDDM in people with impaired glucose tolerance: the Da Qing IGT and Diabetes Study. Diabetes Care, 20(4):537–544, 1997.
- Park et al. (2002) PJ Park, SJ Griffin, L Sargeant, and NJ Wareham. The performance of a risk score in predicting undiagnosed hyperglycemia. Diabetes Care, 25(6):984–988, 2002.
- Pauker (1993) Stephen G Pauker. Deciding about screening. Annals of internal medicine, 118(11):901–902, 1993.
- Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Rajala et al. (1998) Ulla Rajala, Mauri Laakso, Qing Qiao, and Sirkka Keinänen-Kiukaanniemi. Prevalence of retinopathy in people with diabetes, impaired glucose tolerance, and normal glucose tolerance. Diabetes Care, 21(10):1664–1669, 1998.
- Reis et al. (2011) Jared P Reis, Catherine M Loria, Paul D Sorlie, Yikyung Park, Albert Hollenbeck, and Arthur Schatzkin. Lifestyle factors and risk for new-onset diabetes: a population-based cohort study. Annals of internal medicine, 155(5):292–299, 2011.
- Rubin et al. (1994) Robert J Rubin, William M Altman, and Daniel N Mendelson. Health care expenditures for people with diabetes mellitus, 1992. The Journal of Clinical Endocrinology & Metabolism, 78(4):809A–809F, 1994.
- Schulze et al. (2007) Matthias B Schulze, Kurt Hoffmann, Heiner Boeing, Jakob Linseisen, Sabine Rohrmann, Matthias Möhlig, Andreas FH Pfeiffer, Joachim Spranger, Claus Thamer, Hans-Ulrich Häring, et al. An accurate risk score based on anthropometric, dietary, and lifestyle factors to predict the development of type 2 diabetes. Diabetes Care, 30(3):510–515, 2007.
- Schwarz et al. (2009) Peter EH Schwarz, Jiang Li, Manja Reimann, Alta E Schutte, Antje Bergmann, Markolf Hanefeld, Stefan R Bornstein, Jan Schulze, Jaakko Tuomilehto, and Jaana Lindstrom. The Finnish Diabetes Risk Score is associated with insulin resistance and progression towards type 2 diabetes. The Journal of Clinical Endocrinology & Metabolism, 94(3):920–926, 2009.
- Seltzer (1989) Holbrooke S Seltzer. Drug-induced hypoglycemia. a review of 1418 cases. Endocrinology and metabolism clinics of North America, 18(1):163–183, 1989.
- Shai et al. (2006) Iris Shai, Rui Jiang, JoAnn E Manson, Meir J Stampfer, Walter C Willett, Graham A Colditz, and Frank B Hu. Ethnicity, obesity, and risk of type 2 diabetes in women a 20-year follow-up study. Diabetes care, 29(7):1585–1590, 2006.
- Spijkerman et al. (2004) Annemieke MW Spijkerman, Matthew F Yuyun, Simon J Griffin, Jacqueline M Dekker, Giel Nijpels, and Nicholas J Wareham. The performance of a risk score as a screening test for undiagnosed hyperglycemia in ethnic minority groups data from the 1999 health survey for England. Diabetes Care, 27(1):116–122, 2004.
- Stern et al. (2002) Michael P Stern, Ken Williams, and Steven M Haffner. Identification of persons at high risk for type 2 diabetes mellitus: do we need the oral glucose tolerance test? Annals of Internal Medicine, 136(8):575–581, 2002.
- Tuomilehto et al. (2001) Jaakko Tuomilehto, Jaana Lindström, Johan G Eriksson, Timo T Valle, Helena Hämäläinen, Pirjo Ilanne-Parikka, Sirkka Keinänen-Kiukaanniemi, Mauri Laakso, Anne Louheranta, Merja Rastas, et al. Prevention of type 2 diabetes mellitus by changes in lifestyle among subjects with impaired glucose tolerance. New England Journal of Medicine, 344(18):1343–1350, 2001.
- Turner et al. (1999) Robert C Turner, Carole A Cull, Valeria Frighi, Rury R Holman, UK Prospective Diabetes Study (UKPDS) Group, et al. Glycemic control with diet, sulfonylurea, metformin, or insulin in patients with type 2 diabetes mellitus: progressive requirement for multiple therapies (UKPDS 49). JAMA, 281(21):2005–2012, 1999.
- Van den Berghe et al. (2001) Greet Van den Berghe, Pieter Wouters, Frank Weekers, Charles Verwaest, Frans Bruyninckx, Miet Schetz, Dirk Vlasselaers, Patrick Ferdinande, Peter Lauwers, and Roger Bouillon. Intensive insulin therapy in critically ill patients. New England journal of medicine, 345(19):1359–1367, 2001.
- Van den Berghe et al. (2006) Greet Van den Berghe, Alexander Wilmer, Greet Hermans, Wouter Meersseman, Pieter J Wouters, Ilse Milants, Eric Van Wijngaerden, Herman Bobbaers, and Roger Bouillon. Intensive insulin therapy in the medical ICU. New England Journal of Medicine, 354(5):449, 2006.
- Van den Oever and Volckaert (2008) R Van den Oever and C Volckaert. Financing health care in Belgium. the nomenclature: from fee-for-service to budget-financing. Acta chirurgica Belgica, 108(2):157, 2008.
- Van Der Walt et al. (2011) Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
- Varma and Simon (2006) Sudhir Varma and Richard Simon. Bias in error estimation when using cross-validation for model selection. BMC bioinformatics, 7(1):91, 2006.
- Wareham and Griffin (2001) Nicholas J Wareham and Simon J Griffin. Should we screen for type 2 diabetes? evaluation against national screening committee criteria. BMJ: British Medical Journal, 322(7292):986, 2001.
- WHO Collaborating Centre for Drug Statistics Methodology (2015) WHO Collaborating Centre for Drug Statistics Methodology. Guidelines for ATC classification and DDD assignment. World Health Organization, 2015.
- World Health Organization et al. (1994) World Health Organization et al. Prevention of diabetes mellitus: report of a WHO study group [meeting held in geneva from 16 to 20 november 1992]. (WHO technical report number 844), 1994.
- World Health Organization et al. (2012) World Health Organization et al. International classification of diseases (ICD). 2012.
- Yu (2005) Hwanjo Yu. Single-class classification with mapping convergence. Machine Learning, 61(1-3):49–69, November 2005. ISSN 0885-6125.
- Zammitt and Frier (2005) Nicola N Zammitt and Brian M Frier. Hypoglycemia in type 2 diabetes pathophysiology, frequency, and effects of different treatment modalities. Diabetes Care, 28(12):2948–2961, 2005.
- Zimmet et al. (2001) Paul Zimmet, KGMM Alberti, and Jonathan Shaw. Global and societal implications of the diabetes epidemic. Nature, 414(6865):782–787, 2001.