Efficient selection of predictive biomarkers for individual treatment selection

05/05/2019 ∙ by Shonosuke Sugasawa, et al. ∙ The University of Tokyo 0

The development of molecular diagnostic tools to achieve individualized medicine requires identifying predictive biomarkers associated with subgroups of individuals who might receive beneficial or harmful effects from different available treatments. However, due to the large number of candidate biomarkers in the large-scale genetic and molecular studies, and complex relationships among clinical outcome, biomarkers and treatments, the ordinary statistical tests for the interactions between treatments and covariates have difficulties from their limited statistical powers. In this paper, we propose an efficient method for detecting predictive biomarkers. We employ weighted loss functions of Chen et al. (2017) to directly estimate individual treatment scores and propose synthetic posterior inference for effect sizes of biomarkers. We develop an empirical Bayes approach, namely, we estimate unknown hyperparameters in the prior distribution based on data. We then provide the efficient ranking and selection method of the candidate biomarkers based on this framework with adequate control of false discovery rate. The proposed model is demonstrated in simulation studies and an application to a breast cancer clinical study in which the proposed method was shown to detect the much larger numbers of significant biomarkers than the current standard methods.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Due to the advances of disease biology, it has been revealed that there is substantial molecular heterogeneity among individual patients in many diseases. This implies that the benefits and harms of many treatments might also be heterogeneous, and accurate molecular diagnostic methods could maximize the treatment benefits for individual patients (Gabriel and Normand, 2012). To develop the molecular diagnostic tools, a key task is the identification of predictive biomarkers associated with subgroups of individuals who might receive beneficial or harmful effects from different available treatments (Matsui et al., 2015; Lipkovich et al., 2017). It is typically explained by interactions between the treatment and candidate biomarkers, but the conventional interaction tests have substantial difficulties for these analyses because of their serious limitations of statistical powers (Matsui et al., 2018).

On the other hand, efficient estimating methods of individual treatment effects (ITE) have been also developed for large-scale genetic and molecular studies. Specifically, there is a growing number of literatures regarding the efficient estimation of ITE (e.g. Kehl and Ulm, 2006; Tian et al., 2014; Chen et al., 2017; Lu et al., 2018; Wager and Athey, 2018; Zhang et al., 2017) among many others. Although these methods can effectively estimate ITE, the estimated model is typically too complicated, and it would be difficult to understand which biomarkers are actually associate with ITE. Besides, the regularization methods often provide us not only the estimates of model parameters but also selection results of biomarkers included in the estimated model (Lu et al., 2013; Zhang and Zhang, 2018), but the regularization method do not necessarily guarantee the control of the degree of false discoveries, i.e., false discovery rate (Benjamini and Hochberg, 1995). Since the primary purpose of large-scale genetic and molecular studies is screening relevant candidate markers, the prioritization and assessment of the accuracy of selections are relevant tasks.

In this article, we propose an efficient method for screening predictive biomarkers associated with ITE under control of FDR. We employ the weighted loss function proposed by Chen et al. (2017) designed to directly estimate ITE score without explicitly specifying the main effect, and construct synthetic likelihood for effect sizes of candidate biomarkers. The synthetic likelihood is then combined with a latent semiparametric distribution of true effect sizes, which enables us derive the synthetic posterior distribution for effect sizes. The underlying semiparametric distribution of effect sizes can be estimated by the EM algorithm (Dempster et al., 1977), and we propose the optimal discovery procedure (ODP) to detect predictive biomarkers with adequate control of FDR. Through simulation studies, we numerically show that the proposed method can detect much more true predictive biomarkers than standard methods. Moreover, we apply the proposed method to an observational study of breast cancer, and we found that the proposed methods were able to detect a large number of predictive biomarkers whereas the standard methods detected less than 2 even when FDR is allowed to be . This paper is organized as follows. In Section 2, we describe the proposed method and demonstrate the optimal discovery procedure. In Section 3, we conducted simulation studies to confirm the effectiveness of the proposed method compared with conventional testing approaches. We then apply the proposed method to a dataset of a breast cancer study in Section 4. Finally, we give some discussions in Section 5.

2 Proposed Method

2.1 Notations and assumptions

We adopt the notation based on the potential outcome framework in causal inference (Rubin, 2005). Let be the treatment indicator with indicating a patient being treated and indicating the opposite. Let denote the potential outcome indicating the response of a patient (e.g., survival time, binary disease status) if the patient receives treatment . In practice, only one of the potential outcomes and can be observed for each patient, that is, , where is the indicator function. We employ strongly ignorable assumption (Rosenbaum and Rubin, 1983; Rubin, 2005), that is, is independent of given the -dimensional covariates

of potential predictive biomarkers of the treatment (e.g., genotypes, gene expressions). For the treatment assignment, we assume that probability of treatment assignment is a function of

, that is, , where is known as propensity score, and under randomized clinical trial or needs to be estimated (e.g. via regression modeling) in observational studies. We first assume that is known, and provide discussions on estimating in Section 4. The observed data consists of independent identically distributed copies of .

2.2 Weighted loss function and synthetic posterior

Our goal is to detect biomarkers among that are associated with individual treatment effect (ITE) denoted by , e.g. . Traditional approaches for estimating

use parametric models including main effect (function only of

) and interaction effect (function of both and ), thereby the estimation of main effect (nuisance part) may lead to inefficient estimation of the interaction effect. To avoid the problem, we here employ the weighted loss function (Chen et al., 2017) given by


where is the propensity score, and is a loss function. As noted in Chen et al. (2017), there is a one-to-one correspondence between the choice of and measure of treatment effect. For example, under , the minimizer of the true risk is , which is difference of expectations between treatment and opposite groups.

Now we consider a linear regression model for the

th biomarker: , where is the biomarker-specific intercept and is the effect size of the th biomarker. Then, the weighted loss function (1) is reduced to


To define synthetic posterior distribution of effect sizes ’s, we define synthetic likelihood based on the loss function (2) given by


where is a scaling constant such that , and the maximizer of is the same as the minimizer of (2). Note that can be interpreted as the weight for the th observation such that . From (3), can be interpreted as interaction effect of and , so that means that the th biomarker is irrelevant to ITE; otherwise, the th biomarker is a predictive biomarker. Hence, we consider statistical testing whether is zero or not for each . To this end, we first note that is a nuisance parameter in (2). For constructing (profile) synthetic likelihood of from (2), we consider the following two methods:

  • (Plug-in method).   We first compute the point estimator of obtained by maximizing the synthetic likelihood function (2), and replace in (2) with .

  • (Normal-approximation method).   We compute the mode and the inverse value of Hessian at the mode, , of the synthetic likelihood function (2) with respect to , and define the synthetic likelihood function of as .

The synthetic likelihood of will be denoted by . Since information from individuals is summarized in and , the normal approximation method have computational advantage compared to the plug-in method which needs to compute summation over individuals. However, the normal approximation might be poor when is not large, which may lead to less power than the plug-in method.

We consider the multiple hypothesis tests, H vs H. In order to express the null and non-null biomarkers, we introduce the following latent structure for the effect sizes:



is the prior probability of being null, that is,

, and the functions and represent the distributions of the null and non-null biomarkers, respectively. For null biomarkers, we use the one-point distribution on , that is, , where represent the one-point distribution on . The form of is not specified, so that the latent structure of can be seen as a semiparametric model where a nonparametric distribution is assumed for non-null biomarkers. Combined with the profile likelihood and the prior (4), we define the following synthetic posterior distribution of :


which are independent for .

The synthetic posterior distribution (5) have hyperparameters and . We consider an empirical Bayes approach, that is, we estimate these parameters from the synthetic marginal likelihood:


We employ the smoothing-by-roughening approach (Shen and Louis, 1999) in which the nonparametric estimate of is supported by fixed discrete mass points, that is, we approximate as with mixing probabilities ’s such that , and fixed knots . Then, the above synthetic likelihood can be efficiently maximized by an EM algorithm (Dempster et al., 1977) whose details are provided in Appendix.

2.3 Biomarker-specific indices

Some biomarker-specific indices are useful for screening biomarkers. Let be the indicator variable for null/non-null status for the th biomarkers, such that if the th biomarker is non-null and otherwise. The value of is unknown and has the prior probability . From the synthetic posterior (5

), we can compute the posterior probability of being non-null as

where the integral can be expressed as summation over the discrete mass points used to estimate . By plugging in the hyperparameter estimates of and , the estimated posterior probability of being non-null can be obtained. Other biomarker-specific indices (e.g. posterior mean of ) can also be estimated based on the synthetic posterior distribution (5).

2.4 Screening biomarkers based on the optimal discovering procedure

For ranking biomarkers, we apply the optimal discovery procedure (Storey, 2007; Storey et al., 2007), and use the following statistic:

which is the model-based version of the optimal discovery statistic (Cao et al., 2009; Noma and Matsui, 2012, 2013). Since the optimal discovery procedure is known to provide maximum expected number of significant biomarkers under given a certain false discovery rate (Noma and Matsui, 2012), we may efficiently detect predictive biomarkers by using the statistic . We select a set of of biomarkers whose ODS values are equal to or greater than . To estimate the number of false positives in the selected set, we oblate a false discovery rate (FDR) based on the fitted model (e.g. McLachlan et al., 2006).

where represents the size of , the number of selected biomarkers. Note that the summation of (posterior probability of being null) represents the expected number of null (false-positive) biomarkers in . For specified value of , we may compute FDR, so that we may compute the reasonable value of under user-specified FDR (e.g. 5%).

3 Simulation study

We here assessed the performance the proposed methods together with some alternative methods. We considered two types of responses: binary and censored survival responses. We set (the number of samples) and (the number of candidate biomarkers) throughout this study. We generated the covariates

from a mean zero multivariate normal distribution with covariance matrix whose

-element is . The treatment indicator

was indecently generated from a Bernoulli distribution with probability

, where .

We first consider the case with binary response. We generated independent samples from the following model:

where . For the true parameter values of the main effect, we set , , , , and the other values of ’s and ’s were set to . For effect sizes in the interaction term, we consider the following generating distribution

so that the distribution of consists of three components, one-point distribution on representing null effect, two normal distributions with positive and negative means representing biomarkers that have positive and negative values on the individual treatment effects, respectively. We considered two cases of null probability of and . For the simulated dataset, we applied the proposed ODP methods with the use of negative log-binomial likelihood function for the loss function. As explained in Section 2, we considered two approaches for constructing the synthetic profile likelihood: the plug-in method and the normal-approximation method, which are denoted by ODP-P and ODP-N, respectively. For estimating unknown non-null distribution, we adopted knots

. As conventional screening methods, we used the standardized test statistics for the

th biomarkers:


where and

are estimate and standard error based on the single logistic regression on

th biomarker in treatment and control groups. The FDR for these tests were estimated using the method by Storey and Tibshirani (2003).

We next consider the case with survival responses. We generated independent survival times from the following regression model:


and all the model parameters were the same as those in the previous parts. The censoring time was generated from the uniform distribution

, which induced censoring survival time . The censoring rate was around in each simulated dataset. For the simulated dataset, we applied the proposed two ODP methods (ODP-P and ODP-N) with the use of negative log Cox partial likelihood function for the loss function. As competitors, we adopted the same form of test statistics as (7), where and are obtained from Cox regression with only th biomarker in treatment and control groups.

Based on 200 simulations, we calculated the average number of significant biomarkers and true positives at FDR5, 10, 15, or 20% for three types of responses, which are summarized in Table 1. Overall, the proposed methods detected the large numbers of biomarkers than the standard methods, which clearly shows the efficiency of the propose methods. Moreover, the numbers of true positives of the proposed methods are much larger than the standard methods when we allow larger FDR. Also the efficiency gain of the proposed method emerged as (null probability) gets smaller. Comparing the two proposed methods, ODP-P tends slightly more efficient than ODP-N possibly because the approximation error of the normal approximation used in ODP-N may not be negligible and it may reduce the power of ODP-N.

# significant biomarkers # true positive
FDR levels 5% 10% 15% 20% 5% 10% 15% 20%
Binary ()
ODP-P 1.29 5.74 13.38 24.49 1.14 4.70 10.44 17.86
ODP-N 0.85 4.21 10.74 20.68 0.76 3.56 8.47 15.45
0.52 1.41 2.97 5.04 0.47 1.13 2.32 3.90
0.68 1.63 3.42 5.65 0.60 1.34 2.71 4.33
Binary ()
ODP-P 1.00 6.51 18.45 38.67 0.90 5.54 15.04 30.66
ODP-N 0.62 4.84 15.05 33.38 0.58 4.18 12.32 26.63
0.37 1.05 2.15 3.71 0.33 0.93 1.86 3.12
0.52 1.10 2.33 4.11 0.45 0.99 1.97 3.37
Survival ()
ODP-P 3.89 13.37 26.74 44.13 3.45 11.20 21.15 33.12
ODP-N 3.46 12.19 24.77 41.37 3.15 10.24 19.59 31.23
1.51 3.79 7.38 12.67 1.34 3.28 5.99 9.80
1.35 3.32 6.49 11.69 1.23 2.88 5.35 9.11
Survival ()
ODP-P 1.49 10.48 29.73 61.36 1.33 9.06 24.84 49.46
ODP-N 1.07 8.78 25.96 54.67 0.96 7.61 21.71 44.25
0.52 1.40 3.02 5.74 0.46 1.24 2.58 4.70
0.48 1.07 2.61 5.00 0.42 0.94 2.17 4.13
Table 1: Average numbers of significant biomarkers and true positives for the proposed and alternative methods based on 200 simulations.

4 Application to a breast cancer clinical study

In this section, we illustrate the practical application of the proposed method in a breast cancer clinical study (Loi et al., 2007). The dataset consisted of 414 estrogen receptor (ER)-positive breast cancer patients with microarray gene expression profiling data. We applied the proposed ODP method to detect the significant gene expression associated with the individual treatment effects. The data are available from the NCBI GEO database (GSE 6532). Here we adopted the time to relapse-free survival (RFS) as the outcome. After excluding patients with incomplete information, there were 268 and 125 patients receiving tamoxifen and alternative treatments, respectively. For each patient, 44928 gene expression measurements were available. For estimating the propensity score , we employed the following logistic linear regression:

We estimated the regression coefficients penalized likelihood with lasso (Tibshirani, 1996), where the tuning parameter was selected by 10-fold cross validation. With use of the estimated propensity score, we applied the two types of proposed ODP methods, ODP-P and ODP-N, with negative log-binomial likelihood. The estimated null probabilities were and for ODP-N and ODP-P, respectively, and estimated underlying distributions of effect sizes are shown in the right panel of Figure 1 with histogram of -values of each biomarker. From Figure 1

, both methods produced similar estimates of the underlying distribution, and the true underlying distribution of effect sizes would not have a simple form and seem right skewed. The numbers of detected microarrays with different FDR values based on the propose methods and the two standard methods describe in Section

3 are reported in Table 2. The proposed methods were able to detected much larger numbers of microarrays than the direct use of -values with statistics and . Also, ODP-P found more microarrays than ODP-N possibly because the normal approximation is not accurate in this situation where is around 400.

Figure 1: Histogram of -values of genes (left) and estimated distribution of underlying effect sizes (right) in breast cancer data.
FDR levels 5% 10% 15% 20%
ODP-P 23 65 125 218
ODP-N 10 26 47 75
1 2 2 2
0 0 0 0
Table 2: The numbers of detected genes in breast cancer data.

5 Discussions

We developed an efficient method for detecting predictive biomarkers associated with individual treatment effects based on the optimal discovery procedure with the synthetic posterior for effect sizes. The key feature of our approach is to employ the idea of direct estimation of individual treatment effect via the weighted loss function (Chen et al., 2017)

and combine the semi-parametric prior for interaction effects, which enables flexible and efficient estimation of the underlying signal components. Employing the empirical Bayes method, the signal components can be accurately estimated from the data. Then, the estimated model is combined with the optimal discovery procedure to make up a effective screening procedure under which the treatment effects are heterogenous, i.e. there exist predictive biomarkers. The advantage of the proposed method compared with existing approaches was confirmed through simulation study and the application to breast cancer data.

Although we employed the weighted loss function Chen et al. (2017) for constructing synthetic posterior distribution (5), other types of loss functions can be readily used in the proposed method by simply changing the formula of synthetic likelihood (2). For instance, A-learning method (Murphy, 2003; Lu et al., 2013; Ciarleglio et al., 2015), which also can directly estimate individual treatment effect, would be useful alternatives. Since the detailed investigation and comparison among several types of loss functions would extend the scope of this paper, it is left to a future study.


This work was supported by JST CREST Grant Number JPMJCR1412, JST AIP-PRISM Grant Number JPMJCR18Z9 and JSPS KAKENHI Grant Numbers JP17K19808, JP15K15954, JP16H07406, Japan.

Appendix (EM algorithm)


be the vector of unknown hyperparameters. We introduce two types of latent variables,

and , such that and . Then, logarithm of the synthetic likelihood (6) can be augmented as

With current parameter values and , we have the following objective function in the M-step:


The updating steps are given by

Hence, each iteration in the EM algorithm requires computing and as the E-step and updating and ’s as the M-step, which is continued until numerical convergence.


  • Benjamini and Hochberg (1995) Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B 57, 289-300.
  • Cao et al. (2009) Cao, J., Xie, X. J., Zhang, S., Whitehurst, A., and White, M. A. (2009). Bayesian optimal discovery procedure for simultaneous significance testing. BMC Bioinformatics 10, 5.
  • Chen et al. (2017) Chen, S., Tian, L., Cai, T., and Yu, M. (2017). A general statistical framework for subgroup identification and comparative treatment scoring. Biometrics 73, 1199-1209.
  • Ciarleglio et al. (2015) Ciarleglio, A., Petkova, E., Ogden, R. T. and Tarpey, T. (2015). Treatment decisions based on scalar and functional baseline covariates. Biometrics 71, 884-894.
  • Dempster et al. (1977) Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM Algorithm. Journal of the Royal Statistical Society: Series B 39, 1-38.
  • Gabriel and Normand (2012) Gabriel, S. E. and Normand, S. T. (2012). Getting the methods right–the foundation of patient-centered outcomes research. The New England Journal of Medicine 367, 787-790.
  • Kehl and Ulm (2006) Kehl, V., Ulm, K. (2006). Responder identification in clinical trials with censored data. Computational Statistics & Data Analysis 50, 1338-1355.
  • Lipkovich et al. (2017) Lipkovich, I., Dmitrienko, A., D’Agostino, R. B. (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine 36, 136-196.
  • Loi et al. (2007) Loi, S. and Haibe-Kains, B, Desmedt, C. et al. (2007). Definition of clinically distinct molecular subtypes in estrogen receptor-positive breast carcinomas through genomic grade. Journal of Clinical Oncology 25, 1239-1246.
  • Lu et al. (2018)

    Lu, M., Sadiq, S., Feaster, D. J. and Ishwaran, H. (2018). Estimating individual treatment effect in observational data sing random forest methods.

    Journal of Computational and Graphical Statistics 27, 209-219.
  • Lu et al. (2013) Lu, W., Zhang, H. H. and Zeng, D. (2013). Variable selection for optimal treatment decision. Statistical Methods in Medical Research 22, 493-504.
  • Matsui et al. (2012) Matsui, S., Simon, R., Qu, P., Shaughnessy, J. D., Barlogie, B., and Crowley, J. (2012). Developing and validating continuous genomic signatures in randomized clinical trials for predictive medicine. Clinical Cancer Research 18, 6065-6073.
  • Matsui et al. (2015) Matsui, S., Buyse, M. and Simon, R. (2015). Design and Analysis of Clinical Trials for Predictive Medicine. Chapman and Hall/CRC.
  • Matsui et al. (2018) Matsui, S., Noma, H., Qu, P., Sakai, Y., Matsui, K., Heuck, C., and Crowley, J. (2018). Multi-subgroup gene screening using semi-parametric hierarchical mixture models and the optimal discovery procedure: Application to a randomized clinical trial in multiple myeloma. Biometrics 74, 313-320.
  • McLachlan et al. (2006) McLachlan, G. J., Bean, R. W., and Jones, L. B. (2006). A simple implementation of a normal mixture approach to differential gene expression in multiclass microarrays. Bioinformatics 22, 1608-1615.
  • Murphy (2003) Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B 65, 331-355.
  • Noma and Matsui (2012) Noma, H. and Matsui, S. (2012). The optimal discovery procedure in multiple significance testing: an empirical Bayes approach. Statistics in Medicine 31, 165-176.
  • Noma and Matsui (2013) Noma, H. and Matsui, S. (2013). Empirical Bayes ranking and selection methods via semiparametric hierarchical mixture models in microarray studies. Statistics in Medicine 32, 1904-1916.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70, 41-55.
  • Rubin (2005) Rubin, D. B. (2005). Causal inference using potential outcomes: design, modeling, decisions. Journal of the American Statistical Association 100, 322-331.
  • Shen and Louis (1999) Shen, W. and Louis, T.A. (1999). Empirical Bayes estimation via the smoothing by roughening approach. Journal of Computational and Graphical Statistics 8, 800-823.
  • Storey (2007) Storey, J. D. (2007). The optimal discovery procedure: a new approach to simultaneous significance testing. Journal of the Royal Statistical Society: Series B 69, 347-368.
  • Storey et al. (2007) Storey, J. D., Dai, J. Y. and Leek, J. T. (2007). The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics 8, 414-432.
  • Storey and Tibshirani (2003) Storey, J. D. and Tibshirani, R. (2003). Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100, 9440-9445.
  • Tian et al. (2014) Tian, L., Alizadeh, A. A., Gentles, A. J. and Tibshirani, R. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association 109, 1517-1532.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression selection and shrinkage via the Lasso. Journal of Royal Statistical Society: Series B 58, 267-288.
  • Wager and Athey (2018) Wager, S and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113, 1228-1242.
  • Zhang et al. (2017) Zhang, W., Le, T. D., Liu, L., Zhou, Z. H. and Li, J. (2017). Mining heterogeneous causal effects for personalized cancer treatment. Bioinformatics 33, 2372-2378.
  • Zhang and Zhang (2018) Zhang, B. And Zhang, M. (2018). Variable selection for estimating the optimal treatment regimes in the presence of a large number of covariates. The Annals of Applied Statistics 12, 2335-2358.