In biomarker studies it is popular to perform an unsupervised clustering of high-dimensional variables like genome wide screens of SNPs, gene expressions, and protein data and regress for example treatment response, patient recorded outcome measures, time to disease progression, or overall survival on these potentially mislabelled clusters. It is well-known from the statistical literature that errors in continuous and categorical covariates can lead to loss of important information about effects on outcome (Carroll et al., 2006)
. However, to our surprise this is often ignored when regressing outcome on classes identified by unsupervised learning, which might lead to important clinical effect measures being overlooked(Alizadeh et al., 2000; Veer et al., 2002; Guinney et al., 2015; Zhan et al., 2006; Broyl et al., 2010). We suggest to cast the problem as a covariate misclassification problem. This leaves us with a concourse of possible modelling and analysis options, see for example the book by Carroll et al. (2006) or the recent review by Brakenhoff et al. (2018). A general approach, with good statistical properties, is to maximize the likelihood of a latent variable model joining the regression and classification models (Nevo et al., 2016; Skrondal and Rabe-Hesketh, 2004).
First, this process does not mimic the workflow of the biologists, for whom the basic question is to identify important biological processes and next to relate the clusters to clinical consequences. Secondly, upon parameter estimation cluster membership needs post hoc to be estimated by e.g. the maximum a posteriori probability, whereby direct connection to the regression parameter is lost. Thirdly, this approach requires a statistical model of the clustering process, leaving out the possibility to combine popular unsupervised clustering algorithms, such as hierarchical clustering, with popular parametric regression models like generalized linear models and Cox’s proportional hazards model.
Due to its generality, we chose to study the two-stage modelling process. A number of ad-hoc methods have been developed in various specific settings which could be adapted to this situation. Examples include matrix methods (Morrissey and Spiegelman, 1999), regression calibration (Rosner et al., 1989), pooled estimation (Spiegelman et al., 2001)
, multiple imputation(Cole et al., 2006), corrected score estimation (Nakamura, 1990), and simulation and extrapolation (simex) (Cook and Stefanski, 1994; Carroll et al., 1996, 1999). Among these methods, simex has become a useful tool for correcting effect estimates in the presence of additive measurement error. The simex idea has been extended to the misclassification simulation and extrapolation (mcsimex) method for correcting effect estimates in the presence of errors in categorical covariates (Küchenhoff et al., 2006). In this paper, we chose to focus on mcsimex because of its generality, simplicity, and direct applicability.
The article is organized as follows. In Section 2, we detail the underlying hierarchical model (Section 2.1), describe how unsupervised learning and statistical inference are performed (Section 2.2), and outline the mcsimex method (Section 2.3). Simulation studies in Section 3 show the performance of our proposal. An application to a study on unsupervised clustering of gene expression data from bone marrow samples of multiple myeloma patients is given in Section 4. The results of the paper is discussed in Section 5 followed by computational details of the the mcsimex method in Section 6.
2 Statistical setting
2.1 The hierarchical model
The -component Gaussian mixture model (GMM) of has the following distribution
where corresponds to the class and are the mixture proportions satisfying . Thus, the GMM is parameterized by
We denote the possibly misspecified variable for the corresponding correctly specified , i.e.
Thus, the misclassification error is characterized by the misclassification matrix , which is defined by its components
The outcome is here modelled as a generalized linear model
where , is encoded by treatment contrasts, is the link function, and is assumed to be generated from an exponential family distribution (McCullagh and Nelder, 1989).
2.2 Statistical inference
Assume we have i.i.d. realizations of and assume we want an estimate of the effect sizes of each of the unobserved components. First, we estimate the parameters of the Gaussian mixture model by maximizing the log-likelihood function to obtain
Now, we can estimate the class relationship for any covariate by the following hard clustering rule
where is the density of the -distribution and for the observed covariate we write
for short. The misclassification matrix can be consistently estimated in the following way
The next step is to optimize the log-likelihood of the generalized linear model based on the assumed i.i.d. data , to obtain the naive estimate (McCullagh and Nelder, 1989). The procedure outlined above will in the following be referred to as the naive method.
It is important to notice that maximum likelihood estimation under the naive method is estimation of a misspecified model. Convergence of estimates under misspecified models is ensured by the results of White (1982). In general we will denote the limit of a maximum likelihood estimate by for a model misspecified by the misclassification matrix .
2.3 The mcsimex method
It is well-known that estimating by the naive method leads to a biased estimate (Küchenhoff et al., 2006). In order to formulate the mcsimex method to redress this situation, we define the function by
where can be expressed as via the spectral decomposition, with
being the diagonal matrix of eigenvalues and
the corresponding matrix of eigenvectors. For the function (8) to be well-defined, we need to ensure the existence of and that it is a misclassification matrix for . Criteria for existence are given in Küchenhoff et al. (2006).
We notice that parameterizes the amount of misclassification, where corresponds to no misclassfication, corresponds to the present misclassification, and for , corresponds to increasing misclassification. The fundamental idea behind mcsimex is to simulate for increasing and then extrapolate back to .
In a few situations explicit forms of as a function of can be calculated, but they tend to be unstable to estimate, wherefore mcsimex relies on finite-dimensional parametric approximations , , of . One example from the R-package simex is the quadratic approximation (Lederer et al., 2017). It is custom in the mcsimex litterature to assume one either has the misclassification matrix at hand or it can be estimated from training data. However, in our case the misclassification matrix will be (consistently) estimated from data in the following way
The mcsimex method (or algorithm) can now be formulated as:
Algorithm (Küchenhoff et al., 2006)
For a fixed grid of values
, simulate i.i.d. random variables (condition upon
where , , , and is the ’th column of .
Estimate by the least squares method
where and is the naive estimator.
The mcsimex estimate is then given by the extrapolation to
3 Simulation studies
3.1 Logistic regression
In this section, we investigate empirically how different sample sizes, imbalances between classes, and clustering algorithms affect the effect estimates. For the simulations, we generate , , or
independent samples from a 2-class Gaussian mixture model, where the prior probabilities of class 1 isor to generate balanced or imbalanced classes, respectively, and set
. Class 1 and 2 observations have bivariate normal distributions with meansand
respectively, and a common identity covariance matrix. The outcome is modelled by a logistic regression with linear predictorwhere the intercept and class effect are given by .
Unsupervised clustering was performed using either Gaussian mixture models as implemented in the R-package mcclust (Scrucca et al., 2016), or -means from the base implementation in R. We estimated the parameter using either the true class labels, class labels inferred from unsupervised clustering, and the mcsimex method. The misclassification matrix for the mcsimex approach was inferred by drawing
bivariate normal samples using the means and variance-covariance matrix from each of the inferred clusters and counting how often they were misclassified by the fitted clustering model.
Each scenario was repeated 1000 times and the results are summarized by bias, mean-square error, and coverage of the estimated confidence intervals. Results from the balanced scenarioare shown in Table 1. We see that using the true class labels there is a small bias in both of the estimated parameters and the coverage is close to
. When using the estimated class labels inferred from either the GMM or K-means clustering without taking misclassification into account, i.e. the naïve model, both parameters are biased and coverage is far from the assumed
. The bias is not alleviated by increasing the number of samples, the coverage is however smaller, due to a smaller standard error. When taking misclassification into account, using the simex approach, both bias and coverage of the parameters is improved, and the improvement is similar across sample sizes. Results from the imbalanced scenario are shown in Table2. These also show smaller bias and better coverage for the simex model. Results are, however, better when using GMM than -means. The -means algorithm tends to estimate clusters of uniform size (Hui Xiong et al., 2009), leading to poor performance with imbalanced clusters, and since we used the fitted -means clusters to infer the misclassification matrix this also becomes misspecified and the simex approach cannot fully correct for the added noise. Some work has been done to alleviate this bias in the -means algorithm, e.g. using multicenters (Liang et al., 2012) or undersampling (Kumar et al., 2014). We did not pursue this further in the present paper, but just notice that one might instead use the GMM to infer clusters in the imbalanced case.
3.2 Cox Proportional Hazards
In survival analysis the Cox proportional hazards model is often used as the model of choice for inferring the impact of covariates on the rate of events. The R package simex, used for this paper, did not previously support this model, which poses a problem for using mcsimex in survival analysis. This can be circumvented by using the Poisson approximation as done in Bang et al. (2013), but we chose instead to augment the source code of the simex package to include the coxph model class from the survival package (Therneau, 2015; Therneau and Grambsch, 2000).
To test the performance of our implementation we performed a second round of simulations. Here we simulated , or samples, where class labels, , were drawn from a binomal distribution with parameter and added misclassification noise at a rate of , or
(off diagonal values in the misclassification matrix). Survival times were drawn from an exponential distribution with parameterand censoring times were drawn from and exponential distribution with parameter . The results from 1000 simulations of each scenario are shown in Table 3. Results confirm that the simex method also reduces biases in the estimated parameters from the Cox proportional hazards model. However, the improvement in bias as well as the coverage is better for lower rates of misclassification. The coverage indicates that standard errors for the estimates are too small, which is likely caused by estimating them with the jackknife variance estimator in the simex package. This has previously been shown to underestimate the variance, so using an asymptotic variance estimate is preferred (Küchenhoff et al., 2007). An asymptotic variance estimate for the Cox proportional hazards model, incorporating misclassificaion, has to the best of our knowledge not yet been derived, but the problem can be alleviated by using a bootstrap approach to estimate the variance, which also adds the possibility to include additional variance resulting from estimation of the misclassification matrix (Küchenhoff et al., 2007). However, the bootstrap approach drastically increases computational cost as the simex model has to be fitted for each bootstrap sample.
|mp = 0.1||mp = 0.2||mp = 0.3|
4 Cancer sub-classification
Since the invention of high dimensional gene expression profiling, just before this millennium, a popular task has been to perform unsupervised cluster analysis on such data to identify new subgroups and correlate these subgroups to biological information, clinical data, and outcome. In this paper, we consider an example from sub-classification of multiple myleoma (MM). MM is a malignancy of end stage B cells that expand in the bone marrow, resulting in anemia, bone destruction, and renal failure. The data set contains 414 gene expressions sets from multiple myeloma patient’s bone marrow(Zhan et al., 2006). The gene expressions were profiled on Affymetrix HGU133 Plus 2 arrays and exported to .CEL-files by the Affymetrix Genomics Console. To replicate the analysis of Zhan et al. (2006) we downloaded raw .CEL files from the GEO repository GS24080 and matched these by patient IDs to cases included in Zhan et al. (2006), and were able to match 407 out of the 414 cases. This was done since raw data were not available in the repository indicated in the original paper, so we resorted to a later study from the same group. Data were MAS5 normalized and filtered according to instructions from Zhan et al. (2006) resulting in a dataset with gene expressions for 2,169 genes. This is a higher number of genes than the 1,559 reported in the original paper, and is possibly caused by the different number of samples and/or slight differences in the MAS5 normalization procedure as implemented in the R package affy (Gautier et al., 2004) and the Affymetrix Microarray Suite GCOS 1.1.
After filtering, they performed hierarchical clustering on the remaining genes and chose to cut the tree, so seven clusters were formed. In order to mimic their analysis flow we identified seven sub-groups by estimating a 7-component GMM. The classes were labelled according to the most similar class from Zhan et al. (2006)
as shown in the confusion matrix in Table4. The classes estimated from the GMM had an accuracy of 0.9 compared to the original classes. Survival curves are shown for the original classes and GMM classes in respectively panels A and B of Figure 1. These data confirm, that a Gaussian mixture model reasonably well approximates the hierarchical clustering of Zhan et al. (2006).
The 7 classes in Zhan et al. (2006) were split into a low risk group consisting of the CD1, CD2, HY, and LB classes, and a high risk group with the MF, MS, and PR classes with survival differences as shown in panel C of Figure 1 for the original classes and panel D for the GMM classes. A Cox proportional hazards model was employed to estimate the hazard ratio of high vs low risk group, but this analysis did not take any possible misclassification of the inferred classes from the unsupervised clustering into account. To investigate the impact of correction for misclassification we applied the mcsimex method to the data at hand. As shown in the simulation results the variance estimates of the built-in jackknife estimate are underestimated so we chose to do the analysis using bootstraps as well. We performed 1000 bootstraps iterations where at each step a sample of size was drawn with replacement from the available data. A GMM was fitted to the sample and the out-of-bag samples was used to infer the misclassification matrix by comparing the predicted class from the in-bag GMM model to the class obtained from the full data. A Cox proportional hazards model was then fitted to the in-bag sample and the mcsimex model was applied using the estimated misclassification matrix. By estimating the misclassification matrix at each iteration of the bootstrap procedure we factor in the added variance from its estimation. Results from this procedure, compared to the naive estimate along with results from using mcsimex with an average misclassification matrix from 1000 bootstraps, but only fitting the mcsimex model once, are shown in Table 5. For the average misclassification matrix we observed a misclassification probability of for the low risk group, and for the high risk group. For the naïve models we see similar results for the original classes from Zhan et al. (2006) and the refitted classes from the GMM, and both approaches for correction for misclassification gives a higher point estimate for the hazard ratio of high vs low risk, confidence intervals are, however, wide and overlapping.
|HR||Lower 95||Upper 95||P-value|
|Zhan - Naïve|
|GMM - Naïve|
|GMM - Simex (Average MC)|
|GMM - Simex (Full bootstrap)|
In this paper, we documented a bias on effect estimates when regressing on misclassfied labels arising from unsupervised learning. We also suggested a workflow for adjusting the effect estimates based on the mcsimex method. We had to extend existing software to appropriately handle regression based on time to event outcome. The effectiveness of the workflow was documented on simulated data and we also shed new light on bias and variance of effect estimates in an existing cancer sub-classification study.
The strength of the suggested workflow is essentially its general applicability and seemingly robustness. However, these advantages come at the cost of computational intensiveness of the Monte Carlo approach in mcsimex.
As mentioned in Section 1
there exists a number of alternative methods to handle misclassifed labels in regression models. Latent variable models seem most interesting as they are built on parametric models and maximum likelihood estimation.Nevo et al. (2016) is the only paper we have come across dealing with the misclassification problem arising from unsupervised learning. In this paper, though, a slightly different set-up is studied, as they formulate a latent variable model for class risk given measured class features and additional covariates.
In the light of our results, we encourage researchers to adjust for bias when regressing on potentially misclassified labels. We also encourage biomarker researchers to re-visit previous studies, especially those, which led to negative results when regressing upon misclassified labels.
6 Computational details
All simulations and analyses were carried out by the statistical programming language R, using the mcsimex function from the simex package as the main vehicle (Lederer et al., 2017). This function contains functionality for mcsimex correction of regressions from the lm, glm, gam, nls, polr, lme, and nlme functions. For the current study we extended the package to accomodate Cox proportional hazards regression via the coxph function of the survival package (Therneau and Grambsch, 2000; Therneau, 2015). This extension was made by forking the source code of the simex package from https://github.com/cran/simex. The adapted code has been included in the simex package and is available at https://cran.r-project.org/package=simex
We utilized a number of other R and Bioconductor packages, notably the mclust package for clustering (Scrucca et al., 2016). For a complete list of packages see the Rmarkdown document available at https://github.com/HaemAalborg/misClass, which details all steps in the analyses carried out in this paper.
Part of the work was done while Martin Bøgsted visited University of Cambridge. He wants to thank Silvia Richardson, Rajen Shah, and Richard Samworth for their comments on the work and hospitality during the visit and the North Denmark Region’s Health Scientific Research Fund and the Lundbeck Foudation for financial support.
- Alizadeh et al. (2000) Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, J., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., and Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503–11.
- Bang et al. (2013) Bang, H., Chiu, Y.-L., Kaufman, J. S., Patel, M. D., Heiss, G., and Rose, K. M. (2013). Bias Correction Methods for Misclassified Covariates in the Cox Model: Comparison of Five Correction Methods by Simulation and Data Analysis. Journal of Statistical Theory and Practice, 7(2):381–400.
- Brakenhoff et al. (2018) Brakenhoff, T. B., Mitroiu, M., Keogh, R. H., Moons, K. G., Groenwold, R. H., and van Smeden, M. (2018). Measurement error is often neglected in medical literature: a systematic review. Journal of Clinical Epidemiology, 98:89–97.
- Broyl et al. (2010) Broyl, A., Hose, D., Lokhorst, H., de Knegt, Y., Peeters, J., Jauch, A., Bertsch, U., Buijs, A., Stevens-Kroef, M., Berna Beverloo, H., Vellenga, E., Zweegman, S., Kersten, M.-J., van der Holt, B., el Jarari, L., Mulligan, G., Goldschmidt, H., van Duin, M., and Sonneveld, P. (2010). Gene expression profiling for molecular classification of multiple myeloma in newly diagnosed patients. Blood, 116(14):2543–2553.
- Carroll et al. (2006) Carroll, R., Ruppert, D., Stefanski, L., and Crainiceanu, C. (2006). Measurement Error in Nonlinear Models. Chapman & Hall/CRC, Boca Raton, Florida, second edition.
- Carroll et al. (1996) Carroll, R. J., Küchenhoff, H., Lombard, F., and Stefanski, L. A. (1996). Asymptotics for the SIMEX estimator in nonlinear measurement error models. Journal of the American Statistical Association, 91(433):242–250.
- Carroll et al. (1999) Carroll, R. J., Maca, J. D., and Ruppert, D. (1999). Nonparametric regression in the presence of measurement error. Biometrika, 86(3):541–554.
- Cole et al. (2006) Cole, S. R., Chu, H., and Greenland, S. (2006). Multiple-imputation for measurement-error correction. International Journal of Epidemiology, 35(4):1074–1081.
- Cook and Stefanski (1994) Cook, J. R. and Stefanski, L. A. (1994). Simulation-extrapolation estimation in parametric measurement error models. Journal of the American Statistical Association, 89(428):1314–1328.
- Gautier et al. (2004) Gautier, L., Cope, L., Bolstad, B. M., and Irizarry, R. A. (2004). affy—analysis of affymetrix genechip data at the probe level. Bioinformatics, 20(3):307–315.
- Guinney et al. (2015) Guinney, J., Dienstmann, R., Wang, X., de Reyniès, A., Schlicker, A., Soneson, C., Marisa, L., Roepman, P., Nyamundanda, G., Angelino, P., Bot, B. M., Morris, J. S., Simon, I. M., Gerster, S., Fessler, E., De Sousa E Melo, F., Missiaglia, E., Ramay, H., Barras, D., Homicsko, K., Maru, D., Manyam, G. C., Broom, B., Boige, V., Perez-Villamil, B., Laderas, T., Salazar, R., Gray, J. W., Hanahan, D., Tabernero, J., Bernards, R., Friend, S. H., Laurent-Puig, P., Medema, J. P., Sadanandam, A., Wessels, L., Delorenzi, M., Kopetz, S., Vermeulen, L., and Tejpar, S. (2015). The consensus molecular subtypes of colorectal cancer. Nature Medicine, 21(11):1350–156.
- Hui Xiong et al. (2009) Hui Xiong, Junjie Wu, and Jian Chen (2009). K-Means Clustering Versus Validation Measures: A Data-Distribution Perspective. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):318–331.
- Küchenhoff et al. (2007) Küchenhoff, H., Lederer, W., and Lesaffre, E. (2007). Asymptotic variance estimation for the misclassification SIMEX. Computational Statistics & Data Analysis, 51(12):6197–6211.
- Küchenhoff et al. (2006) Küchenhoff, H., Mwalili, S. M., and Lesaffre, E. (2006). A general method for dealing with misclassification in regression: the misclassification SIMEX. Biometrics, 62(1):85–96.
Kumar et al. (2014)
Kumar, N. S., Rao, K. N., Govardhan, A., Reddy, K. S., and Mahmood, A. M.
Undersampled $$K$$ K -means approach for handling
imbalanced distributed data.
Progress in Artificial Intelligence, 3(1):29–38.
- Lederer et al. (2017) Lederer, W., Seibold, H., and Küchenhoff, H. (2017). simex: SIMEX- and MCSIMEX-algorithm for measurement error models.
- Liang et al. (2012) Liang, J., Bai, L., Dang, C., and Cao, F. (2012). The -Means-Type Algorithms Versus Imbalanced Data Distributions. IEEE Transactions on Fuzzy Systems, 20(4):728–745.
- McCullagh and Nelder (1989) McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models. Chapman & Hall/CRC, Boca Raton, Florida, second edition.
Morrissey and Spiegelman (1999)
Morrissey, M. J. and Spiegelman, D. (1999).
Matrix methods for estimating odds ratios with misclassified exposure data: extensions and comparisons.Biometrics, 55(2):338–344.
- Nakamura (1990) Nakamura, T. (1990). Corrected score function for errors-in-variables models: methodology and application to generalized linear models. Biometrika, 77(1):127–137.
- Nevo et al. (2016) Nevo, D., Zucker, D. M., Tamimi, R. M., and Wang, M. (2016). Accounting for measurement error in biomarker data and misclassification of subtypes in the analysis of tumor data. Statistics in Medicine, 35(30):5686–5700.
- Rosner et al. (1989) Rosner, B., Willett, W. C., and Spiegelman, D. (1989). Correction of logistic regression relative risk estimates and confidence intervals for systematic within-person measurement error. Statistics in Medicine, 8(9):1051–1069.
- Scrucca et al. (2016) Scrucca, L., Fop, M., Murphy, T. B., and Raftery, A. E. (2016). mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. The R Journal, 8(1):205–233.
- Skrondal and Rabe-Hesketh (2004) Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized latent variable modeling : multilevel, longitudinal, and structural equation models. Chapman & Hall/CRC, Boca Raton, Florida.
- Spiegelman et al. (2001) Spiegelman, D., Carroll, R. J., and Kipnis, V. (2001). Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument. Statistics in Medicine, 20(1):139–160.
- Therneau (2015) Therneau, T. M. (2015). A package for survival analysis in S.
- Therneau and Grambsch (2000) Therneau, T. M. and Grambsch, P. M. (2000). Modeling Survival Data: Extending the Cox Model. Springer, New York.
- Veer et al. (2002) Veer, L. J. V., Dai, H., Vijver, M. J. V. D., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530–536.
- White (1982) White, H. (1982). Maximum likelihood of misspecified models. Econometrica, 50(1):1–25.
- Zhan et al. (2006) Zhan, F., Huang, Y., Colla, S., Stewart, J. P., Hanamura, I., Gupta, S., Epstein, J., Yaccoby, S., Sawyer, J., Burington, B., Anaissie, E., Hollmig, K., Pineda-Roman, M., Tricot, G., van Rhee, F., Walker, R., Zangari, M., Crowley, J., Barlogie, B., and Shaughnessy, J. D. (2006). The molecular classification of multiple myeloma. Blood, 108(6):2020–8.