Sparse Variable Selection on High Dimensional Heterogeneous Data with Tree Structured Responses
We consider the problem of sparse variable selection on high-dimensional heterogeneous data sets, which has attracted renewed interest recently due to the growth of biological and medical data sets with complex, non-i.i.d. structures and large numbers of response variables. The heterogeneity is likely to confound the association between explanatory variables and responses, resulting in many false discoveries when Lasso or its variants are naïvely applied. Interest in developing effective confounder correction methods is therefore growing. However, applying recent confounder correction methods directly yields poor performance, because they ignore the complex interdependencies among the many response variables. To improve on current variable selection methods, we introduce a model that uses the dependency information from multiple responses to select the active variables from heterogeneous data. Through extensive experiments on synthetic and real data sets, we show that our proposed model outperforms existing methods.