Multiple-Instance Logistic Regression with LASSO Penalty

07/13/2016 · Ray-Bing Chen, et al.

In this work, we consider a manufacturing process that can be described by a multiple-instance logistic regression model. To compute the maximum likelihood estimate of the unknown coefficients, an expectation-maximization algorithm is proposed, and the modeling approach can be extended to identify important covariates by adding a coefficient penalty term to the likelihood function. In addition to the essential technical details, we demonstrate the usefulness of the proposed method through simulations and real examples.




I Introduction

We consider data generated from a stable manufacturing process. A total of n subjects are obtained, and each subject consists of a number of components. Along with each component, p predictors are observed. The anticipated response is the status of the component, defective or not. However, it is impractical to check the status of every component within each subject; instead, the status of the subject is observed. A subject is defective if one or more of its components are defective, and is non-defective otherwise. The goal of this work is to predict whether a subject is defective and to identify covariates that plausibly affect the defect rate, especially when the pool of covariates is very large and only a few of them truly affect the defect rate.

For the purpose of defect prediction, multiple-instance (MI) learning [2] is a solution. The difference between traditional supervised learning and MI learning is as follows. In the traditional supervised setting, the label of each instance (component) is given, while in a typical MI setting, instances are grouped into bags (subjects) and only the label of each bag is known; that is, the labels of the instances are missing, so we do not have complete data for model fitting. To analyze MI data, the relationship between the instances and the bags must be explicitly posited. Most research on MI learning is based on the standard MI assumption [4], which states that a positive bag contains at least one positive instance, while a negative bag contains no positive instances and all of its instances must be negative. This assumption holds throughout this article. Many methods have been proposed for MI learning; most are extensions of the support vector machine or logistic regression. Other methods, such as Diverse Density [7] and EM-DD [12], are also feasible.

The first goal of this study focuses on using logistic regression to model MI data. This method is named multiple-instance logistic regression (MILR) in [11] and [9]. We first fix notation. Consider an experiment with n subjects (bags). Suppose that, for the i-th subject, m_i independent components (instances) are obtained. For the j-th component of the i-th subject, the data consist of a binary response y_ij and the corresponding covariates x_ij, a p-dimensional vector. We model the response-predictor relationship by logistic regression; that is, P(y_ij = 1 | x_ij) = p(x_ij), where p(x) = exp(β0 + x'β) / (1 + exp(β0 + x'β)), β0 is a constant term and β is an unknown coefficient vector. However, in this experiment, the labels of the instances, the y_ij's, are not observable. Instead, the labels of the bags, z_i = max_j y_ij, are observed. The logistic regression for bags is therefore


P(z_i = 1 | x_i1, …, x_im_i) = 1 − ∏_{j=1}^{m_i} (1 − p(x_ij)),

with likelihood

L(β0, β) = ∏_{i=1}^{n} [P(z_i = 1)]^{z_i} [1 − P(z_i = 1)]^{1 − z_i}.   (1)

Directly maximizing (1) with respect to (β0, β) can be initial-value sensitive or unstable as the number of missing variables (the number of components per subject) increases.
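As a concrete sketch, the bag-level probability under the standard MI assumption and the corresponding negative log-likelihood can be computed as follows. This is a minimal Python illustration, not the authors' implementation; `bag_prob` and `neg_log_likelihood` are hypothetical helper names, and each bag is assumed to be stored as an (m_i × p) matrix of covariates.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bag_prob(X_bag, beta0, beta):
    """P(z=1) = 1 - prod_j (1 - p(x_ij)): a bag is positive iff any instance is."""
    p = sigmoid(beta0 + X_bag @ beta)   # instance-level probabilities
    return 1.0 - np.prod(1.0 - p)

def neg_log_likelihood(bags, z, beta0, beta):
    """Negative log of the bag-level likelihood."""
    nll = 0.0
    for X_bag, z_i in zip(bags, z):
        q = bag_prob(X_bag, beta0, beta)
        nll -= z_i * np.log(q) + (1 - z_i) * np.log(1 - q)
    return nll
```

For instance, a bag of two instances whose instance probabilities are both 0.5 has bag probability 1 − 0.5² = 0.75.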

In the literature, instead of maximizing the likelihood function directly, alternative likelihood functions were applied. In particular, several functions of the instance probabilities p(x_ij) were proposed to model P(z_i = 1). For example, the arithmetic mean and the geometric mean of the p(x_ij)'s were used in [11], whereas the softmax function

S_α(p_i1, …, p_im_i) = Σ_j p_ij exp(α p_ij) / Σ_j exp(α p_ij)

was used in [9], where α is a pre-specified nonnegative value and p_ij = p(x_ij). According to the relationship between a bag and its associated instances, the three combining functions satisfy

geometric mean ≤ arithmetic mean ≤ S_α for all α ≥ 0.

Consequently, when using the same data, the resulting maximum likelihood estimates under these link functions will differ, although the estimates of the P(z_i = 1)'s may be similar. We conclude that directly tackling the likelihood function (1), if possible, is more relevant than the alternatives when parameter estimation is also an important goal of the experiment. To obtain the maximum likelihood estimates, an expectation-maximization algorithm [1] is proposed, treating the labels of the components as missing variables.
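The ordering of the three combining functions can be checked numerically. The sketch below assumes the softmax form of [9]; `softmax_link` is a hypothetical helper name, and α = 0 recovers the arithmetic mean.

```python
import numpy as np

def geometric_mean(p):
    return np.exp(np.mean(np.log(p)))

def arithmetic_mean(p):
    return np.mean(p)

def softmax_link(p, alpha):
    """Softmax combining function; weights increase with p when alpha > 0."""
    w = np.exp(alpha * p)
    return np.sum(p * w) / np.sum(w)

p = np.array([0.1, 0.4, 0.8])  # instance-level probabilities of one bag
g, a, s = geometric_mean(p), arithmetic_mean(p), softmax_link(p, 3.0)
```

On this bag, g ≈ 0.317 ≤ a ≈ 0.433 ≤ s ≈ 0.655, illustrating the stated inequality.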

Another goal of this work is to identify important covariates affecting the defect rate at both the instance and the bag level and to predict the rate change when a covariate is changed. This goal supports the use of (1), because the regression coefficient estimate is essential for predicting the rate change. When the number of covariates is large, traditional variable selection tools such as the Wald test are not efficient. Alternatively, the maximum likelihood approach with the LASSO penalty (Tibshirani, 1996) [10] is promising. In this work, we incorporate the LASSO approach into the proposed MILR and provide an efficient algorithm for variable selection and estimation. Finally, the important variables are identified as those whose coefficient estimates are nonzero.

The rest of this article is organized as follows. In Section II, we introduce the expectation-maximization (EM; [1]) algorithm to find the maximum likelihood estimator of MILR. In Section III, we discuss the technical details of integrating the LASSO approach into MILR. In Section IV, we use simulations to demonstrate the benefit of MILR from the standpoint of variable selection and parameter estimation, in contrast to the naive method and other MILR methods. Finally, in Section V, we use various datasets to evaluate the proposed method.

II Multiple-Instance Logistic Regression with EM Algorithm

Here, we follow the notation defined in the previous section. When the labels at the instance level, the y_ij's, are observed, the complete data likelihood function is

∏_{i=1}^{n} ∏_{j=1}^{m_i} p(x_ij)^{y_ij} (1 − p(x_ij))^{1 − y_ij},

where p(x) = exp(β0 + x'β) / (1 + exp(β0 + x'β)). However, in MI experiments, the y_ij's are not observable; instead, the labels at the bag level, the z_i's, are observed. Under this circumstance, the naive approach uses the complete data likelihood with y_ij set to z_i for all j. The resulting testing and estimation for β0 and β are questionable, since the probability model does not fit the underlying data generating process. The idea of the naive approach is that, because the instance labels are missing, the bag label is used to guess the instance labels. A better approach to treating missing data is the EM algorithm.

To deliver the E-step, the complete data likelihood and the conditional distribution of the missing data given the observed data are required. The complete data log-likelihood is straightforward,

Σ_i Σ_j [ y_ij log p(x_ij) + (1 − y_ij) log(1 − p(x_ij)) ].

The conditional distribution is discussed under two conditions. First, when z_i = 0, all instances in the bag are negative, so

P(y_ij = 0 | z_i = 0) = 1 for all j;

and second, when z_i = 1,

P(y_ij = 1 | z_i = 1) = p(x_ij) / [ 1 − ∏_k (1 − p(x_ik)) ].

Thus, the required conditional expectations are

γ_ij = E(y_ij | z_i) = z_i p(x_ij) / [ 1 − ∏_k (1 − p(x_ik)) ].

Because γ_ij is a function of (β0, β), denote by γ_ij^(t) its value at the step-t estimates. Consequently, for the i-th subject, the function in the E-step is

Q_i(β0, β | β0^(t), β^(t)) = Σ_j [ γ_ij^(t) log p(x_ij) + (1 − γ_ij^(t)) log(1 − p(x_ij)) ],

where β0^(t) and β^(t) are the estimates obtained in step t. Let Q = Σ_i Q_i.
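The E-step therefore reduces to computing the responsibilities γ_ij. A minimal sketch follows; `e_step` is a hypothetical helper name, and each bag is assumed to be an (m_i × p) matrix of covariates.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def e_step(X_bag, z_i, beta0, beta):
    """Responsibilities gamma_ij = E[y_ij | z_i] at the current parameters."""
    p = sigmoid(beta0 + X_bag @ beta)
    if z_i == 0:
        return np.zeros_like(p)              # a negative bag has no positive instances
    return p / (1.0 - np.prod(1.0 - p))      # P(y_ij = 1) / P(z_i = 1)
```

For a positive bag of two instances with instance probabilities 0.5 each, the responsibilities are 0.5 / 0.75 = 2/3.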

Next, we move to the M-step, i.e., maximizing Q with respect to (β0, β). However, Q is a nonlinear function of (β0, β), so the maximization is computationally expensive. Following [6], we apply a quadratic approximation to Q. Taking a Taylor expansion about β0^(t) and β^(t), we have

Q ≈ Q_A = C − (1/2) Σ_i Σ_j w_ij (u_ij − β0 − x_ij'β)² + R,

where C is a constant independent of β0 and β; R is the remainder term; w_ij = p(x_ij)(1 − p(x_ij)) evaluated at the step-t estimates; and u_ij = β0^(t) + x_ij'β^(t) + (γ_ij^(t) − p(x_ij)) / w_ij is the working response. Using this quadratic approximation Q_A, computation can be many times faster than without the approximation. Hereafter, we work on Q_A rather than Q.

In the M-step, we have to solve the following maximization problem,

max_{β0, β} Q_A.

Since Q_A is a quadratic function of β0 and β, the maximization problem is equivalent to finding the root of

∂Q_A/∂β_k = Σ_i Σ_j w_ij x_ijk (u_ij − β0 − x_ij'β) = 0   (2)

for all k = 0, 1, …, p, where x_ij0 = 1 for all i, j, and x_ijk is the k-th element of x_ij. Here we adopt the coordinate descent algorithm (updating one coordinate at a time) proposed in [6]. Since (2) is linear in the β_k's, the updating formula for β_k is straightforward. At step t+1, let

r_ij^(k) = u_ij − β0 − x_ij'β + x_ijk β_k,

the partial residual with the k-th coordinate excluded. The updating formula is

β_k ← Σ_i Σ_j w_ij x_ijk r_ij^(k) / Σ_i Σ_j w_ij x_ijk².
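One sweep of the coordinate-descent update on the quadratic surrogate can be sketched as follows, with all instances stacked row-wise in a single design matrix. `coordinate_descent_step` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def coordinate_descent_step(X, u, w, beta0, beta):
    """One full sweep of coordinate descent on the weighted least-squares
    surrogate: intercept first, then each coefficient in turn."""
    r = u - beta0 - X @ beta                    # current residuals
    beta0 = beta0 + np.sum(w * r) / np.sum(w)   # unpenalized intercept update
    r = u - beta0 - X @ beta
    for k in range(X.shape[1]):
        r_k = r + X[:, k] * beta[k]             # partial residual excluding coordinate k
        new_bk = np.sum(w * X[:, k] * r_k) / np.sum(w * X[:, k] ** 2)
        r = r_k - X[:, k] * new_bk              # refresh residuals with the new value
        beta[k] = new_bk
    return beta0, beta
```

On a problem with a single covariate, one sweep already lands on the weighted least-squares solution.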

III Penalized Multiple-Instance Logistic Regression

In the manufacturing process, one important issue is identifying the active factors within the process, especially for large p. Traditionally, a stepwise procedure is used to search for the active covariates, and after these important covariates are identified, their coefficients are estimated based on the selected model. Here we want to integrate the maximum likelihood coefficient estimation and the variable selection into one single procedure. The idea is to shrink small coefficient values to zero; therefore, a LASSO-type method [10] is adopted.

In order to perform estimation and variable selection at the same time, we include the LASSO penalty in our model to shrink the unimportant coefficients to zero. In this work, the intercept is always kept in the model. The resulting optimization problem is therefore

max_{β0, β} Q_A − λ Σ_{k=1}^{p} |β_k|,

where λ ≥ 0 is a tuning parameter. The shooting algorithm [5] solves this optimization problem efficiently. The resulting updating formula is

β_k ← S( Σ_i Σ_j w_ij x_ijk r_ij^(k), λ ) / Σ_i Σ_j w_ij x_ijk²

for k = 1, …, p, where S(a, λ) = sign(a) max(|a| − λ, 0) is the soft-thresholding operator; the intercept β0 is updated without penalty.
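The penalized coordinate update can be sketched as follows. The helper names are hypothetical, and the scaling convention of the penalty is an assumption that may differ from the authors' implementation.

```python
import numpy as np

def soft_threshold(a, lam):
    """S(a, lam) = sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_coordinate_update(x_k, r_k, w, lam):
    """Penalized update for one coordinate: the weighted least-squares
    numerator is shrunk by the soft-threshold before dividing by the curvature."""
    num = soft_threshold(np.sum(w * x_k * r_k), lam)
    den = np.sum(w * x_k ** 2)
    return num / den
```

A sufficiently large λ drives the numerator, and hence the coefficient, to exactly zero, which is what makes the procedure perform variable selection.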

To choose the optimal tuning parameter λ, we first determine an upper bound of λ, say λ_max, which forces the estimate of β to be 0. From the soft-thresholding update, β̂ = 0 whenever λ is at least the absolute value of the unpenalized numerator for every coordinate; hence we may take

λ_max = max_{1 ≤ k ≤ p} | Σ_i Σ_j w_ij x_ijk (u_ij − β0) |,

where the covariates x_ijk are normalized prior to data analysis.

Several technical details are crucial for automatic parameter tuning, and we follow the suggestions of [6] to adjust our computer code. First, we choose a descending sequence of λ values ranging from λ_max down to a small fraction of λ_max; the optimal λ is chosen among these values, and following [6], the ratio and the length of the sequence are set according to whether n exceeds p. Second, when λ is too small, the computed β̂ may deviate greatly from the true minimizer. Finally, we choose the best tuning parameter by 10-fold cross-validation. The procedure for choosing the tuning parameter applied in this note is

  1. FOR λ in the sequence of λ values

    1. Randomly split the data into 10 subsets for 10-fold cross-validation

    2. FOR k = 1 to 10

      1. Estimate the parameters using λ and the whole data except the k-th subset

      2. Compute the deviance (−2 log-likelihood) using the estimated parameters and the k-th subset

    3. END FOR

    4. Compute the mean and standard error of the 10 deviances

  2. END FOR

  3. Choose the optimal tuning parameter λ* as the λ with the smallest mean deviance
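The tuning loop above can be sketched generically as follows; `fit` and `deviance` stand in for the model-fitting and deviance routines and are assumptions here, not the authors' API.

```python
import numpy as np

def choose_lambda(lams, folds, fit, deviance):
    """Grid search over a descending lambda sequence by K-fold CV.
    `fit(train, lam)` returns fitted parameters; `deviance(params, test)`
    scores them on held-out data."""
    means, ses = [], []
    for lam in lams:
        devs = [deviance(fit(train, lam), test) for train, test in folds]
        means.append(np.mean(devs))
        ses.append(np.std(devs, ddof=1) / np.sqrt(len(devs)))
    best = int(np.argmin(means))
    return lams[best], means, ses
```

The mean and standard error per λ are exactly what Figure 1 plots; the returned λ is the minimizer of the mean deviance.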

For demonstration, we simulate a dataset in which only 5 of the covariates are active. The results are shown in Figures 1 and 2. The optimal λ selected via deviance is 2.31.

Fig. 1: Ten-fold cross-validation on the simulated data. The red line is the mean deviance and the blue lines are bounds within one standard error.
Fig. 2: The paths of the estimated parameters with respect to the tuning parameter. The red lines stand for active covariates and the blue lines stand for inactive covariates.

IV Simulation Studies

IV-A Naive vs. MILR

To demonstrate the effectiveness of the proposed model, we consider a simulation with the data generating process described above. We generated 100 datasets with three covariates, of which the third is inactive. Simulation results are summarized in Table I. As shown in Table I, the MLEs of MILR are empirically unbiased and yield more powerful (Wald) tests than the naive method. In particular, the naive estimates of the regression coefficients are severely attenuated, which may result in relatively high prediction errors. Hence, if the goal of the data analysis is merely to identify important covariates, the naive and the MILR approaches may not yield drastically different results; however, if the goal is to predict whether changing one particular covariate can reduce the chance of a defect, the naive approach may be misleading.

Method   Intercept            Covariate 1          Covariate 2           Covariate 3 (inactive)
Naive    (-0.19, 0.02, 0.39)  (0.34, 0.01, 0.80)   (-0.33, 0.01, 0.77)   (0.01, 0.01, 0.03)
MILR     (-2.29, 0.05, 0.93)  (1.28, 0.06, 0.86)   (-1.02, 0.04, 0.87)   (0.04, 0.03, 0.06)

TABLE I: (Average Estimate, Standard Error, Power) of Regression Coefficient Estimation / Testing

IV-B MILR-LASSO

In this example, we demonstrate the performance of the proposed method for large p, small n cases. The data generating process is designed as follows. We generated independent datasets, each consisting of n subjects, with each subject consisting of m components. The response of the j-th component nested in the i-th subject, y_ij, follows a Bernoulli distribution with success probability p(x_ij), where x_ij is a p-dimensional vector randomly and independently sampled from the standard normal distribution. The response of the i-th subject is defined as z_i = max_j y_ij. We simulated data under all configurations of the factors n and m. The number of predictors is p = 100, excluding the intercept. For each dataset, we randomly assigned nonzero values to 5 of the regression coefficients and set the remaining 95 to zero. Last, λ is the value minimizing the deviance under 10-fold cross-validation. The following three variable selection schemes are considered:

  1. (A) MILR model with LASSO penalty;

  2. (B) MILR model with forward selection using the Wald test at a fixed significance level; and

  3. (C) Naive model with forward selection using the Wald test at the same significance level.
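The data generating process described above can be sketched as follows. This is a simplified illustration: the nonzero coefficient value and the random seed are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mi_data(n, m, p, n_active=5, coef=1.0):
    """Generate n bags of m instances with p standard-normal covariates;
    only the first n_active coefficients are nonzero (value is an assumption)."""
    beta = np.zeros(p)
    beta[:n_active] = coef
    bags, labels = [], []
    for _ in range(n):
        X = rng.standard_normal((m, p))
        prob = 1.0 / (1.0 + np.exp(-(X @ beta)))   # instance-level probabilities
        y = rng.binomial(1, prob)                  # latent instance labels
        bags.append(X)
        labels.append(int(y.max()))                # bag label: max over instances
    return bags, np.array(labels)
```

The latent instance labels y are then discarded, so only the bag labels are available to the fitting procedures, mimicking the MI setting.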

Model True Positive False Positive True Negative False Negative
(A) 0.78 0.15 0.85 0.22
(B) 0.72 0.06 0.94 0.28
(C) 0.58 0.07 0.93 0.42

TABLE II: Variable Selection Results (n = 100, m = 3); models (A)–(C) correspond to schemes 1–3 above

IV-C Comparison with Other Methods

To compare our MILR-LASSO with other MILR methods (MILR-s(3) from [9] and MILR-s(0) from [11]), we designed three simulation schemes. The first scheme considers a fixed bag size; the second considers varying bag sizes with mean 5; and the third considers varying bag sizes with mean 65. The latter two schemes mimic the real datasets MUSK1 and MUSK2, which will be introduced in Section V; some summary statistics for these datasets are listed in Table IV. The regression coefficients used to generate the simulated datasets are the estimated coefficients obtained by fitting the MILR-LASSO model to the MUSK datasets; hence, most of the coefficients are zero. Besides using 10-fold cross-validation to select the optimal λ, we also use BIC to select the optimal LASSO model, which is more efficient since it requires only a single fit [13]. The three simulation schemes are:

  • (D) a fixed bag size m_i = m for all i

  • (E) varying bag sizes m_i with mean 5 (similar to the MUSK1 dataset shown in Case Studies)

  • (F) varying bag sizes m_i with mean 65 (similar to the MUSK2 dataset shown in Case Studies)

The bag sizes are generated so that every bag contains at least one instance, avoiding the degenerate case m_i = 0. We then used 10-fold stratified cross-validation to test these algorithms. To generate the subject-level prediction from the estimated coefficients of each algorithm, we use 0.5 as the threshold; that is,

ẑ_i = I( P̂(z_i = 1 | x_i1, …, x_im_i) > 0.5 ),

where P̂ is the estimated bag-level probability. To evaluate the three algorithms, two summary statistics are reported: accuracy (ACC) and the area under the ROC curve (AUC). Each algorithm is repeated over independently generated datasets.
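The thresholding rule can be written as a small helper (hypothetical name), assuming the bag-level probability model of Section I.

```python
import numpy as np

def predict_bag(X_bag, beta0, beta, threshold=0.5):
    """Label a bag positive when the estimated P(z=1) exceeds the threshold."""
    p = 1.0 / (1.0 + np.exp(-(beta0 + X_bag @ beta)))  # instance probabilities
    return int(1.0 - np.prod(1.0 - p) > threshold)     # bag-level probability vs. cutoff
```

Note that because the bag probability aggregates over instances, even modest instance probabilities can push a large bag over the threshold, which is relevant for MUSK2-sized bags.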

Scheme Method ACC AUC
(D) MILR-LASSO(BIC) 0.61(0.008) 0.62(0.012)
MILR-LASSO(10-fold CV) 0.64(0.008) 0.61(0.014)
MILR-s(3) 0.58(0.004) 0.57(0.009)
MILR-s(0) 0.58(0.005) 0.57(0.009)
(E) MILR-LASSO(BIC) 0.70(0.007) 0.75(0.009)
MILR-LASSO(10-fold CV) 0.70(0.007) 0.75(0.009)
MILR-s(3) 0.58(0.004) 0.59(0.008)
MILR-s(0) 0.58(0.004) 0.58(0.009)
(F) MILR-LASSO(BIC) 0.82(0.003) 0.53(0.006)
MILR-LASSO(10-fold CV) 0.82(0.003) 0.53(0.007)
MILR-s(3) 0.18(0.003) 0.46(0.009)
MILR-s(0) 0.18(0.003) 0.45(0.011)

TABLE III: Predicted results of different methods on simulated data sets

From Table III, it is clear that MILR-LASSO outperforms the other two methods, since the vast majority of the predictors are nuisance variables. Furthermore, BIC and 10-fold CV provide similar prediction results; thus, for the sake of efficiency, we prefer using BIC to find the optimal model.

V Case Studies

The MUSK datasets are the most widely used benchmark datasets for comparing MI learning methods. Each bag (molecule) in the MUSK datasets consists of multiple conformations (instances), and each conformation is represented by 166 features. In MUSK1, the average number of conformations per bag is about 5, whereas in MUSK2 it is about 65. More descriptive statistics about these datasets are shown in Table IV. For more detailed descriptions, please refer to [2].

Data set  # Features  Avg. Bag Size  # Bags  # Instances  Prop. of Positive Subj.
MUSK1     166         5.17           92      476          51.08%
MUSK2     166         64.69          102     6598         38.24%

TABLE IV: Information about the two MUSK datasets

To apply MILR-LASSO to these datasets, we first choose the tuning parameter λ via 10-fold cross-validation and then use the selected λ to fit the datasets. To reduce the chance that the likelihood is only locally maximized, we replicate each algorithm 10 times and average the performance. The right halves of Tables V and VI report the predicted results (10-fold cross-validation) of the three algorithms, and the left halves report the fitted results.

            Fitted        Predicted
            ACC    AUC    ACC    AUC
MILR-LASSO  1.00   1.00   0.79   0.83
MILR-s(3)   0.85   0.96   0.72   0.76
MILR-s(0)   0.87   0.93   0.74   0.79

TABLE V: Fitted and predicted results of different methods on the MUSK1 dataset
            Fitted        Predicted
            ACC    AUC    ACC    AUC
MILR-LASSO  0.87   0.96   0.69   0.76
MILR-s(3)   0.99   1.00   0.74   0.83
MILR-s(0)   0.95   1.00   0.79   0.85

TABLE VI: Fitted and predicted results of different methods on the MUSK2 dataset

From Tables V and VI, we see that no algorithm is consistently better than the others. However, MILR-LASSO has the advantage of selecting important features while estimating the coefficients.

VI Conclusion

In this work, multiple-instance learning is treated as a classical missing data problem and solved by the EM algorithm. In addition, the LASSO approach is applied to identify important covariates. This treatment allows us to find influential covariates, to predict the defect rate, and, most importantly, to suggest ways to potentially reduce the defect rate by adjusting covariates. The limitations of the proposed method are as follows. First, we ignore the potential dependency among observations within a subject; random effects can be incorporated into the proposed logistic regression to represent such dependency. Second, in a preliminary simulation study not shown in this paper, we discovered that the maximum likelihood estimator is biased under scheme (F). Bias reduction methods such as [8] and [3] will be applied in future work.


  • [1] A. P. Dempster, N. M. Laird, and D. B. Rubin. (1977) “Maximum Likelihood from Incomplete Data via the EM Algorithm”. Journal of the Royal Statistical Society, Series B, 39, 1–38.
  • [2] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. (1997) “Solving the multiple instance problem with axis-parallel rectangles”. Artificial Intelligence, 89, 31–71.
  • [3] D. Firth. (1993) “Bias reduction of maximum likelihood estimates”. Biometrika, 80, 27–38.
  • [4] J. Foulds and E. Frank. (2010) “A review of multi-instance learning assumptions”. The Knowledge Engineering Review, 25, 1–25.
  • [5] W. Fu. (1998) “Penalized regressions: the bridge versus the lasso”. Journal of Computational and Graphical Statistics, 7, 397–416.
  • [6] J. Friedman, T. Hastie, and R. Tibshirani. (2010) “Regularization paths for generalized linear models via coordinate descent”. Journal of Statistical Software, 33.
  • [7] O. Maron (1998) “Learning from ambiguity”. Ph. D. Thesis, Massachusetts Institute of Technology.
  • [8] M. H. Quenouille. (1956) “Notes on Bias in Estimation”. Biometrika, 43, 353–360.
  • [9] S. Ray and M. Craven. (2005) “Supervised versus multiple instance learning: an empirical comparison”. In Proceedings of the 22nd International Conference on Machine Learning, ACM, 697–704.
  • [10] R. Tibshirani. (1996) “Regression shrinkage and selection via the lasso”. Journal of the Royal Statistical Society, Series B, 58, 267–288.
  • [11] X. Xu, and E. Frank. (2004) “Logistic regression and boosting for labeled bags of instances”. In Advances in knowledge discovery and data mining, Springer, 272–281.
  • [12] Q. Zhang and S. A. Goldman. (2001) “EM-DD: an improved multiple-instance learning technique”. In Advances in Neural Information Processing Systems, 1073–1080.
  • [13] H. Zou, T. Hastie, and R. Tibshirani. (2007) “On the ‘degrees of freedom’ of the lasso”. The Annals of Statistics, 35, 2173–2192.