Multiple-instance logistic regression with LASSO penalty
In this work, we consider a manufacturing process that can be described by a multiple-instance logistic regression model. To compute the maximum likelihood estimate of the unknown coefficients, an expectation-maximization algorithm is proposed. The proposed modeling approach can be extended to identify the important covariates by adding a coefficient penalty term to the likelihood function. In addition to the essential technical details, we demonstrate the usefulness of the proposed method through simulations and real examples.
We consider data generated from a stable manufacturing process. A total of $n$ subjects are obtained, and each subject consists of a number of components. Along with each component, $p$ predictors are observed. The anticipated response is the status of the component, defective or not. However, it is impractical to check the status of all components within each subject. The status of the subject is observed instead. A subject is defective if one or more of its components are defective; otherwise the subject is not defective. The goal of this work is to predict whether a subject is defective and to identify covariates that plausibly affect the defect rate, especially when the pool of covariates is very large and only a few of them truly affect the defect rate.
For the purpose of defect prediction, multiple-instance (MI) learning is a solution. The difference between traditional supervised learning and MI learning is as follows. In the traditional supervised learning setting, the label of each instance (component) is given, while in a typical MI setting, instances are grouped into bags (subjects) and only the label of each bag is known; that is, the instance labels are missing, and we do not have the complete data for model fitting. To analyze MI data, the relationship between the instances and bags must be explicitly posited. Most of the research on MI learning is based on the standard MI assumption, which states that a positive bag contains at least one positive instance, while a negative bag contains only negative instances. This assumption holds throughout this article. Many methods have been proposed for MI learning. Most of these methods are extensions of the support vector machine and logistic regression. Other methods such as Diverse Density and EM-DD are also feasible.
The first goal of this study is to model MI data using logistic regression. This method is named multiple-instance logistic regression (MILR) in  and . We first fix notation. Consider an experiment with $n$ subjects (bags). Suppose that, for the $i$th subject, $m_i$ independent components (instances) are obtained. For the $j$th component of the $i$th subject, the data consist of a binary response $y_{ij}$ and the corresponding covariates $x_{ij}$, a $p$-dimensional vector. We model the response-predictor relationship by logistic regression; that is, $y_{ij} \sim \mathrm{Bernoulli}(p_{ij})$, where $p_{ij} = p(\beta_0 + x_{ij}^T\beta)$ with $p(x) = 1/(1+e^{-x})$, $\beta_0$ is a constant term, and $\beta$ is a $p \times 1$ unknown coefficient vector. However, in this experiment, the labels of the instances, the $y_{ij}$'s, are not observable. Instead, the labels of the bags, the $Z_i$'s, are observed. The logistic regression for bags is therefore
$$Z_i \sim \mathrm{Bernoulli}(\pi_i), \qquad \pi_i = 1 - \prod_{j=1}^{m_i} (1 - p_{ij}),$$
with likelihood
$$L(\beta_0, \beta) = \prod_{i=1}^{n} \pi_i^{Z_i} (1 - \pi_i)^{1 - Z_i}. \tag{1}$$
Directly maximizing $L(\beta_0, \beta)$ with respect to $(\beta_0, \beta)$ can be sensitive to initial values or unstable as the number of missing variables (the number of components per subject) increases.
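As a minimal sketch of the bag-level model, assuming instance probabilities given by a logistic function of the covariates and a bag that is positive whenever at least one instance is positive, the bag probability and the bag-level log-likelihood can be computed as follows. The function names and the list-of-lists data layout are illustrative choices, not taken from the paper.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function p(t) = 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def instance_probs(bag, beta0, beta):
    """Instance-level probabilities for one bag (a list of covariate vectors)."""
    return [sigmoid(beta0 + sum(b * x for b, x in zip(beta, xs))) for xs in bag]


def bag_prob(bag, beta0, beta):
    """Probability that the bag is positive: 1 - prod_j (1 - p_ij)."""
    prod_q = 1.0
    for p in instance_probs(bag, beta0, beta):
        prod_q *= 1.0 - p
    return 1.0 - prod_q


def bag_log_likelihood(bags, labels, beta0, beta):
    """Bag-level log-likelihood: sum_i Z_i log(pi_i) + (1 - Z_i) log(1 - pi_i)."""
    total = 0.0
    for bag, z in zip(bags, labels):
        pi = bag_prob(bag, beta0, beta)
        total += math.log(pi) if z == 1 else math.log(1.0 - pi)
    return total
```

With two instances that each have probability 0.5, the bag probability is 1 - 0.5 * 0.5 = 0.75, which illustrates how the bag-level defect rate grows with the number of components.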
In the literature, instead of maximizing the likelihood function directly, alternative likelihood functions have been applied. In particular, several functions of the $p_{ij}$'s have been proposed to model $\pi_i$. For example, the arithmetic mean and the geometric mean of $p_{i1}, \ldots, p_{im_i}$ were used to model $\pi_i$ in , whereas the softmax function
$$S_i(\alpha) = \frac{\sum_{j=1}^{m_i} p_{ij} \exp(\alpha p_{ij})}{\sum_{j=1}^{m_i} \exp(\alpha p_{ij})}$$
was used to model $\pi_i$ in , where $\alpha$ is a pre-specified nonnegative value. According to the relationship between the bag and the associated instances, the geometric mean, the arithmetic mean, and the softmax function satisfy
$$\Big(\prod_{j=1}^{m_i} p_{ij}\Big)^{1/m_i} \le \frac{1}{m_i}\sum_{j=1}^{m_i} p_{ij} = S_i(0) \le S_i(\alpha) \le \max_{j} p_{ij} \le 1 - \prod_{j=1}^{m_i}(1 - p_{ij})$$
for all $\alpha \ge 0$. Consequently, when using the same data, the resulting maximum likelihood estimates for these link functions should be different, although the estimates of the $\pi_i$'s may be similar. We conclude that directly tackling the likelihood function (1), if possible, is more relevant than the alternatives when parameter estimation is also an important goal of the experiment. To obtain the maximum likelihood estimates, an expectation-maximization algorithm is proposed, because we treat the labels of the components as missing variables.
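The ordering among these bag-level link functions can be checked numerically. The sketch below assumes the softmax form sum_j p_j exp(alpha p_j) / sum_j exp(alpha p_j) with nonnegative alpha; the function names and the example probabilities are illustrative only.

```python
import math


def geometric_mean(ps):
    """Geometric mean of strictly positive probabilities."""
    return math.exp(sum(math.log(p) for p in ps) / len(ps))


def arithmetic_mean(ps):
    """Arithmetic mean of the instance probabilities."""
    return sum(ps) / len(ps)


def softmax_combine(ps, alpha):
    """Softmax-weighted mean: sum_j p_j exp(alpha p_j) / sum_j exp(alpha p_j)."""
    weights = [math.exp(alpha * p) for p in ps]
    return sum(p * w for p, w in zip(ps, weights)) / sum(weights)


def at_least_one(ps):
    """Bag probability under the standard MI assumption: 1 - prod_j (1 - p_j)."""
    prod_q = 1.0
    for p in ps:
        prod_q *= 1.0 - p
    return 1.0 - prod_q


# Hypothetical instance probabilities for one bag.
ps = [0.05, 0.20, 0.60]
values = [
    geometric_mean(ps),
    arithmetic_mean(ps),
    softmax_combine(ps, 3.0),
    max(ps),
    at_least_one(ps),
]
print(values)
```

For these inputs the five values come out in increasing order, so a model fitted with one of the mean-type links targets a smaller bag probability than the one implied by the standard MI assumption.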
Another goal of this work is to identify important covariates affecting the defect rate at both the instance and the bag levels and to predict the rate change when a covariate is changed. This goal supports the use of (1), because the regression coefficient estimate is essential for predicting the rate change. When the number of covariates is large, traditional variable selection tools such as the Wald test are not efficient. Alternatively, the maximum likelihood approach with a LASSO penalty (Tibshirani, 1996) is promising. In this work, we incorporate the LASSO approach into the proposed MILR and provide an efficient computer algorithm for variable selection and estimation. Finally, a variable is identified as important if its coefficient estimate is nonzero.
The rest of this article is organized as follows. In Section 2, we introduce the expectation-maximization (EM) algorithm to find the maximum likelihood estimator of MILR. In Section 3, we discuss the technical details of integrating the LASSO approach into MILR. In Section 4, we use simulation to demonstrate the benefit of using MILR from the standpoint of variable selection and parameter estimation, in contrast to the naive method and other MILR methods. Finally, in Section 5, we use various datasets to evaluate the proposed method.
Here, we follow the notation defined in the previous section. When the labels at the instance level, the $y_{ij}$'s, are observed, the complete data likelihood function is
$$\prod_{i=1}^{n}\prod_{j=1}^{m_i} p_{ij}^{y_{ij}} q_{ij}^{1-y_{ij}},$$
where $q_{ij} = 1 - p_{ij}$. However, in MI experiments, the $y_{ij}$'s are not observable; instead, the labels at the bag level, the $Z_i$'s, are observed. Under this circumstance, the naive approach uses the likelihood
$$L_N(\beta_0, \beta) = \prod_{i=1}^{n}\prod_{j=1}^{m_i} p_{ij}^{Z_i} q_{ij}^{1-Z_i},$$
obtained by setting $y_{ij} = Z_i$ for all $j$. The resulting testing and estimation for $\beta_0$ and $\beta$ are questionable, since the probability model does not fit the underlying data generating process. The idea of the naive approach is that, since the instance labels are missing, the bag label is used to guess the instance labels. A better approach to treating missing data is the EM algorithm.
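To carry out the E-step, the complete data likelihood and the conditional distribution of the missing data given the observed data are required. The complete data log-likelihood is straightforward,
$$\ell_C(\beta_0, \beta) = \sum_{i=1}^{n}\sum_{j=1}^{m_i} \left[ y_{ij}\log p_{ij} + (1 - y_{ij})\log q_{ij} \right].$$
The conditional distribution is discussed under two conditions. First, when $Z_i = 0$,
$$\Pr(y_{i1} = 0, \ldots, y_{im_i} = 0 \mid Z_i = 0) = 1,$$
and second, when $Z_i = 1$,
$$\Pr(y_{ij} = 1 \mid Z_i = 1) = \frac{p_{ij}}{1 - \prod_{l=1}^{m_i} q_{il}}.$$
Thus, the required conditional expectations are
$$E(y_{ij} \mid Z_i = 0) = 0 \quad \text{and} \quad E(y_{ij} \mid Z_i = 1) = \frac{p_{ij}}{1 - \prod_{l=1}^{m_i} q_{il}} \equiv \gamma_{ij}.$$
Because $\gamma_{ij}$ is a function of $(\beta_0, \beta)$, denote $\gamma_{ij} = \gamma_{ij}(\beta_0, \beta)$. Consequently, for the $i$th subject, the $Q$ function in the E-step is
$$Q_i(\beta_0, \beta \mid \beta_0^t, \beta^t) = \sum_{j=1}^{m_i} \left[ Z_i\gamma_{ij}^t \log p_{ij} + (1 - Z_i\gamma_{ij}^t)\log q_{ij} \right],$$
where $\gamma_{ij}^t = \gamma_{ij}(\beta_0^t, \beta^t)$, and $\beta_0^t$ and $\beta^t$ are the estimates obtained in step $t$. Let $Q = \sum_{i=1}^{n} Q_i$.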
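The E-step described above can be sketched for a single bag as follows, assuming the current instance probabilities are available and that a negative bag forces every instance label to zero. The function names are hypothetical and the code is an illustration rather than the authors' implementation.

```python
import math


def e_step_expectations(probs, z):
    """Conditional expectations E[y_ij | Z_i] for one bag.

    probs: current instance probabilities p_ij for the bag
    z:     observed bag label (0 or 1)
    """
    if z == 0:
        # A negative bag contains only negative instances.
        return [0.0] * len(probs)
    prod_q = 1.0
    for p in probs:
        prod_q *= 1.0 - p
    bag_prob = 1.0 - prod_q  # Pr(Z_i = 1)
    return [p / bag_prob for p in probs]


def q_contribution(probs, expectations):
    """Expected complete-data log-likelihood contributed by one bag."""
    return sum(
        g * math.log(p) + (1.0 - g) * math.log(1.0 - p)
        for p, g in zip(probs, expectations)
    )
```

For a positive bag with two instances of probability 0.5, each expected label is 0.5 / 0.75 = 2/3, so the expected number of positive instances is 4/3, consistent with the requirement that a positive bag holds at least one positive instance.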
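Next, we move to the M-step, i.e. maximizing $Q$ with respect to $(\beta_0, \beta)$. However, $Q$ is a nonlinear function of $(\beta_0, \beta)$ and, consequently, the maximization is computationally expensive. Following , we apply a quadratic approximation to the $Q$ function. By taking the Taylor expansion about $\beta_0^t$ and $\beta^t$, we have
$$Q(\beta_0, \beta \mid \beta_0^t, \beta^t) = Q_Q(\beta_0, \beta \mid \beta_0^t, \beta^t) + C + R_2(\beta_0, \beta \mid \beta_0^t, \beta^t), \qquad Q_Q(\beta_0, \beta \mid \beta_0^t, \beta^t) = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{m_i} w_{ij}^t \left( u_{ij}^t - \beta_0 - x_{ij}^T\beta \right)^2,$$
where $C$ is a constant which is independent of $\beta_0$ and $\beta$; $R_2$ is the remainder term;
$$u_{ij}^t = \beta_0^t + x_{ij}^T\beta^t + \frac{Z_i\gamma_{ij}^t - p_{ij}^t}{p_{ij}^t q_{ij}^t},$$
$w_{ij}^t = p_{ij}^t q_{ij}^t$, and $p_{ij}^t = p(\beta_0^t + x_{ij}^T\beta^t)$ with $q_{ij}^t = 1 - p_{ij}^t$. Using this quadratic approximation $Q_Q$, the computation can be up to times faster than the program without the approximation. Hereafter, we work on $Q_Q$ rather than $Q$.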
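For readers who want the intermediate step, the quadratic approximation can be derived one instance at a time. The sketch below assumes the linear predictor is written as $\eta = \beta_0 + x^T\beta$ and the expected instance label as $a$ (the E-step expectation times the bag label); these shorthand symbols are introduced here for illustration only.

```latex
\begin{align*}
f(\eta) &= a \log p(\eta) + (1 - a)\log\{1 - p(\eta)\},
  \qquad p(\eta) = \frac{1}{1 + e^{-\eta}}, \\
f'(\eta) &= a - p(\eta), \qquad f''(\eta) = -p(\eta)\{1 - p(\eta)\}, \\
f(\eta) &\approx f(\eta^t) + (a - p^t)(\eta - \eta^t)
  - \tfrac{1}{2} w^t (\eta - \eta^t)^2,
  \qquad p^t = p(\eta^t),\; w^t = p^t (1 - p^t), \\
&= -\tfrac{1}{2} w^t \left( u^t - \eta \right)^2 + \text{const},
  \qquad u^t = \eta^t + \frac{a - p^t}{w^t}.
\end{align*}
```

The last line follows by completing the square, so each instance contributes a weighted squared error between a working response $u^t$ and the linear predictor, which is what makes the M-step a weighted least-squares problem.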
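In the M-step, we have to solve the following maximization problem,
$$\max_{\beta_0, \beta}\; Q_Q(\beta_0, \beta \mid \beta_0^t, \beta^t).$$
Since $Q_Q$ is a quadratic function of $\beta_0$ and $\beta$, the maximization problem is equivalent to finding the root of
$$\frac{\partial Q_Q}{\partial \beta_k} = \sum_{i=1}^{n}\sum_{j=1}^{m_i} w_{ij}^t x_{ijk}\left( u_{ij}^t - \beta_0 - x_{ij}^T\beta \right) = 0 \tag{2}$$
for all $k = 0, 1, \ldots, p$, where $x_{ij0} = 1$ for all $i, j$, and $x_{ijk}$ is the $k$th element of $x_{ij}$. Here we adopt the coordinate descent algorithm (updating one coordinate at a time) proposed in . Since (2) is linear in the $\beta_k$'s, the updating formula for $\beta_k$ is straightforward. At step $t+1$, let $\tilde\beta = (\beta_0, \beta^T)^T$ and
$$S_k = \sum_{i=1}^{n}\sum_{j=1}^{m_i} w_{ij}^t x_{ijk}\left( u_{ij}^t - \tilde{x}_{ij}^T\tilde\beta_{(k)} \right),$$
where $\tilde{x}_{ij} = (1, x_{ij}^T)^T$, and $\tilde\beta_{(k)}$ is $\tilde\beta$ with its $k$th element replaced by 0. The updating formula is
$$\beta_k^{t+1} = \frac{S_k}{\sum_{i=1}^{n}\sum_{j=1}^{m_i} w_{ij}^t x_{ijk}^2}.$$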
In the manufacturing process, one important issue is to identify the active factors within the process, especially for large $p$. Traditionally, a stepwise procedure is used to search for the active covariates, and after these important covariates are identified, their coefficients are estimated based on the current model. Here we want to integrate maximum likelihood coefficient estimation and variable selection into one single procedure. The idea is to shrink the small coefficient values to zero. Therefore, a LASSO-type method  is adopted.
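To perform estimation and variable selection at the same time, we include the LASSO penalty in our model to shrink the unimportant coefficients to zero. In this work, the intercept is always kept in the model. The resulting optimization problem is therefore
$$\min_{\beta_0, \beta}\; \Big\{ -Q_Q(\beta_0, \beta \mid \beta_0^t, \beta^t) + \lambda\sum_{k=1}^{p} |\beta_k| \Big\}.$$
The shooting algorithm  is efficient for solving this optimization problem. Writing $S_k = \sum_{i}\sum_{j} w_{ij}^t x_{ijk}( u_{ij}^t - \tilde{x}_{ij}^T\tilde\beta_{(k)} )$ for the weighted partial-residual sum of the $k$th coordinate, the resulting updating formula is
$$\beta_0^{t+1} = \frac{S_0}{\sum_{i}\sum_{j} w_{ij}^t}, \qquad \beta_k^{t+1} = \frac{\mathrm{sign}(S_k)\,(|S_k| - \lambda)_+}{\sum_{i}\sum_{j} w_{ij}^t x_{ijk}^2}, \quad k = 1, \ldots, p.$$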
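A coordinate-wise (shooting-style) update for a LASSO-penalized weighted least-squares problem can be sketched as below. The sketch assumes the objective 0.5 * sum_r w_r (u_r - b0 - x_r . b)^2 + lam * sum_k |b_k| with an unpenalized intercept, positive weights, and no all-zero covariate column; the function names, the fixed number of sweeps, and the scaling of the penalty are illustrative assumptions rather than the authors' exact implementation.

```python
def soft_threshold(a, lam):
    """Soft-thresholding operator: sign(a) * max(|a| - lam, 0)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0


def lasso_coordinate_descent(X, u, w, lam, n_sweeps=100):
    """Coordinate descent for a weighted least-squares LASSO problem.

    X: list of covariate rows, u: working responses, w: positive weights.
    Returns the intercept and the list of penalized coefficients.
    """
    n_rows, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(n_sweeps):
        # Intercept update (not penalized): weighted mean of the residuals.
        resid = [u[r] - sum(b[k] * X[r][k] for k in range(p)) for r in range(n_rows)]
        b0 = sum(w[r] * resid[r] for r in range(n_rows)) / sum(w)
        for k in range(p):
            # Weighted partial-residual sum with coordinate k left out.
            s_k = 0.0
            for r in range(n_rows):
                fit_without_k = b0 + sum(b[l] * X[r][l] for l in range(p) if l != k)
                s_k += w[r] * X[r][k] * (u[r] - fit_without_k)
            denom = sum(w[r] * X[r][k] ** 2 for r in range(n_rows))
            b[k] = soft_threshold(s_k, lam) / denom
    return b0, b
```

A fixed number of sweeps keeps the sketch short; a practical implementation would stop when the coefficients change by less than a tolerance. Setting `lam=0` reduces the update to ordinary weighted least squares, which is a convenient sanity check.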
To choose the optimal tuning parameter $\lambda$, we first determine the upper bound of $\lambda$, say $\lambda_{\max}$, which forces $\beta$ to be $0$. We notice that and
So when , we have, for any ,
where the first inequality is due to the Cauchy-Schwarz inequality and (3), and the equality immediately following it holds because the $x_{ij}$'s are normalized prior to data analysis.
Several technical details are crucial for automatic parameter tuning. We follow the suggestion of  to adjust our computer code. First, we choose a sequence of $\lambda$ values, ranging from to in descending order, say . Set and the length of the sequence . The optimal $\lambda$ is chosen among these values. Second, when is too small, the value of stored in the computer may deviate greatly from the true . Accordingly, when we set and and when we set and . Finally, we choose the best tuning parameter by 10-fold cross-validation. The procedure for choosing the tuning parameter applied in this note is
FOR each $\lambda$ in the sequence of $\lambda$'s
    Randomly split the data into 10 subsets used for 10-fold cross-validation
    FOR $k = 1, \ldots, 10$
        Estimate the parameters using $\lambda$ and the whole data except for the $k$th subset
        Compute the deviance using the estimated parameters and the $k$th subset
    Compute the mean and standard error of the 10 deviances
Choose the optimal tuning parameter as the $\lambda$ with the smallest mean deviance
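The tuning procedure above can be sketched generically. The sketch assumes the caller supplies two hypothetical callables, `fit(train_bags, lam)` returning a fitted model and `deviance(model, test_bags)` returning a number; neither is an API from the paper. As in the listed procedure, the data are re-split for every candidate value.

```python
import random
import statistics


def select_lambda(bags, lambdas, fit, deviance, n_folds=10, seed=0):
    """Pick the tuning parameter with the smallest mean cross-validated deviance.

    Returns the chosen value and a dict mapping each candidate to
    (mean deviance, standard error of the mean).
    """
    rng = random.Random(seed)
    summary = {}
    for lam in lambdas:
        # Randomly split the bag indices into n_folds subsets.
        idx = list(range(len(bags)))
        rng.shuffle(idx)
        folds = [idx[k::n_folds] for k in range(n_folds)]
        devs = []
        for held_out in folds:
            held = set(held_out)
            train = [bags[i] for i in idx if i not in held]
            test = [bags[i] for i in held_out]
            model = fit(train, lam)
            devs.append(deviance(model, test))
        mean_dev = statistics.mean(devs)
        se_dev = statistics.stdev(devs) / len(devs) ** 0.5
        summary[lam] = (mean_dev, se_dev)
    best = min(summary, key=lambda lam: summary[lam][0])
    return best, summary
```

Splitting is done at the bag level so that instances of one subject never appear in both the training and the held-out part. A common variant, not described in the text, reuses the same folds for every candidate so that the deviances are directly comparable.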
For demonstration, we set , , with only 5 of them active, and . The results are shown in Figures 1 and 2. The optimal $\lambda$ selected via deviance is 2.31.
To demonstrate the power of the proposed model, we consider a simulation with the data generating process described in Section 4. We generated 100 datasets with , , , and . That is, we generate only 3 covariates, and the third covariate is inactive. Simulation results are summarized in Table I. The MLEs of MILR are empirically unbiased, and the corresponding (Wald) tests are more powerful than those of the naive method. In particular, the naive estimates of the regression coefficients are severely attenuated, which may result in relatively high prediction errors. This suggests that if the goal of data analysis is to identify important covariates, then the naive and the MILR approaches may not yield drastically different results. However, if the goal is to predict whether changing one particular covariate can reduce the chance of a defect, then the naive approach may be misleading.
|Naive||(-0.19, 0.02,||(0.34, 0.01,||(-0.33, 0.01,||(0.01, 0.01,|
|MILR||(-2.29, 0.05,||(1.28, 0.06,||(-1.02, 0.04,||(0.04, 0.03,|
In this example, we demonstrate the performance of the proposed method for large $p$, small $n$ cases. The data generating process is designed as follows. We generated independent datasets. Each dataset consists of subjects and each subject consists of components. The response of the $j$th component nested in the $i$th subject, $y_{ij}$, follows $\mathrm{Bernoulli}(p(\beta_0 + x_{ij}^T\beta))$, where $x_{ij}$ is a $p$-dimensional vector randomly and independently sampled from the standard normal distribution. The response of the $i$th subject is defined as $Z_i = I\big(\sum_{j} y_{ij} > 0\big)$. We simulated data with all possible configurations of factor and . The number of predictors is excluding the intercept. For each data set, we randomly assigned and 95 multiples of 0 to the regression coefficient of predictors. Last, $\lambda$ is the value that minimizes the deviance using 10-fold cross-validation. The following three variable selection schemes are considered:
MILR model with LASSO penalty;
MILR model with forward selection using Wald test with ; and
Naive model with forward selection using Wald test with .
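The data generating process described above can be sketched as follows, assuming standard normal covariates, Bernoulli instance labels with logistic probabilities, and a bag label equal to one whenever any instance label is one. The function name, the fixed bag size, and the arguments are illustrative; the specific sample sizes and coefficient values used in the study are not reproduced here.

```python
import math
import random


def simulate_mi_data(n_bags, bag_size, beta0, beta, seed=0):
    """Simulate multiple-instance data.

    Returns (bags, labels): bags is a list of bags, each a list of covariate
    vectors; labels holds the bag-level responses. Instance labels are
    generated internally and then discarded, as they are unobserved.
    """
    rng = random.Random(seed)
    bags, labels = [], []
    for _ in range(n_bags):
        bag, bag_label = [], 0
        for _ in range(bag_size):
            x = [rng.gauss(0.0, 1.0) for _ in beta]
            eta = beta0 + sum(b * v for b, v in zip(beta, x))
            # Numerically stable logistic function.
            if eta >= 0:
                prob = 1.0 / (1.0 + math.exp(-eta))
            else:
                prob = math.exp(eta) / (1.0 + math.exp(eta))
            y = 1 if rng.random() < prob else 0
            bag_label = max(bag_label, y)  # positive if any instance is positive
            bag.append(x)
        bags.append(bag)
        labels.append(bag_label)
    return bags, labels
```

Passing a coefficient vector that is mostly zeros mimics the sparse setting of this example, where only a few predictors are active.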
|Model||True Positive||False Positive||True Negative||False Negative|
To compare our MILR-LASSO with other MILR methods (MILR-s(3) from  and MILR-s(0) from ), we designed three different simulation schemes. The first scheme considers a fixed $m_i$; the second considers varying $m_i$ with mean 5; and the third considers varying $m_i$ with mean 65. These schemes mimic the real datasets MUSK1 and MUSK2, which will be introduced in Section V. Some summary statistics about these datasets are listed in Table IV. The regression coefficients used to generate these simulated data sets are the estimated coefficients of the MUSK data sets under the MILR-LASSO model. Hence, most of the coefficients are zero (only about of the coefficients are non-zero). Besides using 10-fold cross-validation to select the optimal $\lambda$, we also use BIC to select the LASSO model, which is more efficient because it needs only a single fit . The three simulation schemes are as follows:
, , for all
, , (similar to the MUSK1 dataset shown in Case Studies)
, , (similar to the MUSK2 dataset shown in Case Studies)
We set instead of to avoid the case of . Then we used 10-fold stratified cross-validation to test these algorithms. To generate the subject-level prediction from the estimated coefficients of these algorithms, we use as a threshold. Thus,
where . To evaluate these three algorithms, two summary statistics are reported: accuracy (ACC), and the area under the ROC curve (AUC). Also, each algorithm is repeated times.
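Subject-level prediction and the two summary statistics can be sketched as below. The sketch assumes the bag probability is 1 minus the product of (1 - p_ij) over the estimated instance probabilities, and that a bag is predicted positive when this probability is at least the threshold; the threshold is left as a required argument because its value is not restated here, and the function names are hypothetical.

```python
def predict_bag(instance_probs, threshold):
    """Return (predicted bag label, estimated bag probability)."""
    prod_q = 1.0
    for p in instance_probs:
        prod_q *= 1.0 - p
    bag_prob = 1.0 - prod_q
    return (1 if bag_prob >= threshold else 0), bag_prob


def accuracy(y_true, y_pred):
    """Proportion of bags whose predicted label matches the observed label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half).

    Requires at least one positive and one negative bag.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise form of the AUC is quadratic in the number of bags, which is acceptable for data sets of the size considered here; a rank-based formula would be preferable for much larger samples.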
From Table III, it is obvious that MILR-LASSO outperforms the other two methods, since over of the predictors are nuisance variables. Furthermore, both BIC and 10-fold CV provide similar prediction results. Thus, for the sake of efficiency, we prefer using BIC to find the optimal model.
The MUSK data sets are the most widely used benchmark data sets for comparing different MI learning methods. The MUSK datasets consist of conformations. Each conformation is represented by 166 features. In MUSK1, the average number of conformations per bag is 5, whereas in MUSK2 it is 65. More descriptive statistics about these datasets are shown in Table IV. For more detailed descriptions, please refer to .
|Data set||Sample Size||Prop. of Positive Subj|
: the average bag size.
To apply MILR-LASSO to these data sets, we first choose the tuning parameter ($\lambda$) via 10-fold cross-validation. Then, we use the selected $\lambda$ to fit the data sets. To reduce the risk of reporting a solution that is only locally optimal for the likelihood, we replicate each algorithm 10 times and average the performance. The right halves of Table V and Table VI are the predicted results (10-fold cross-validation) of the three different algorithms, and the left halves of Table V and Table VI are the fitted results.
In this work, multiple-instance learning is treated as a classical missing value problem and solved by the EM algorithm. In addition, the LASSO approach is applied to identify important covariates. This treatment allows us to identify influential covariates, to predict the defect rate, and, most importantly, to suggest ways to potentially reduce the defect rate by adjusting covariates. The limitations of the proposed method are as follows. First, we ignore the potential dependency among observations within a subject. Random effects can be incorporated into the proposed logistic regression to represent this dependency. Second, in a preliminary simulation study, not shown in this paper, we discovered that the maximum likelihood estimator is biased under the model (F). Bias reduction methods such as  and  will be applied in our future work.
The Knowledge Engineering Review, 25, 1–25.
Proceedings of the 22nd International Conference on Machine Learning, ACM, 697–704.
H. Zou, T. Hastie, and R. Tibshirani. (2007) "On the 'degrees of freedom' of the lasso". In: The Annals of Statistics, 2173–2192.