Feature Selection for Data Integration with Mixed Multi-view Data

03/27/2019
by Yulia Baker, et al.
Rice University

Data integration methods that analyze multiple sources of data simultaneously can often provide more holistic insights than can separate inquiries of each data source. Motivated by the advantages of data integration in the era of "big data", we investigate feature selection for high-dimensional multi-view data with mixed data types (e.g. continuous, binary, count-valued). This heterogeneity of multi-view data poses numerous challenges for existing feature selection methods. However, after critically examining these issues through empirical and theoretically-guided lenses, we develop a practical solution, the Block Randomized Adaptive Iterative Lasso (B-RAIL), which combines the strengths of the randomized Lasso, adaptive weighting schemes, and stability selection. B-RAIL serves as a versatile data integration method for sparse regression and graph selection, and we demonstrate the effectiveness of B-RAIL through extensive simulations and a case study to infer the ovarian cancer gene regulatory network. In this case study, B-RAIL successfully identifies well-known biomarkers associated with ovarian cancer and hints at novel candidates for future ovarian cancer research.


1 Introduction

As the amount of data grows in volume and variety, data integration, or the analysis of multiple sources of data simultaneously, is becoming increasingly necessary in numerous disciplines. For example, in genomics, scientists can gather data from many related, yet distinct sources including gene expression, miRNA expression, point mutations, and DNA methylation. Since all of these genomic sources interact within the same biological system, it can be advantageous to analyze them together via data integration. Ultimately, the abundance and diversity of information captured by integrated data offers an invaluable opportunity to gain a better and more holistic understanding of the phenomena at hand.

In this work, we aim to perform feature selection for a common family of integrated data sets called high-dimensional multi-view data. Multi-view data refers to data collected on the same set of samples, but with features from multiple sources of potentially mixed types (e.g. categorical, binary, count, proportion, continuous, and skewed continuous values). More formally, suppose we observe multi-view data with $M$ high-dimensional views (or sources), $X^{(1)}, \ldots, X^{(M)}$, which are measured on the same $n$ samples but with features of mixed types. We seek to recover a sparse set of features from each $X^{(m)}$ associated with the response $y$ by considering:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \; -\ell\Big(y; \sum_{m=1}^{M} X^{(m)}\beta^{(m)}\Big) + \lambda \sum_{m=1}^{M} \|\beta^{(m)}\|_1 \qquad (1.1)$$

Here, $\beta^{(m)}$ are the coefficients associated with view $X^{(m)}$, $\lambda$ is a tuning parameter which regulates the sparsity level, and $\ell$ is the generalized linear model (GLM) log-likelihood associated with $y$. Note we not only consider continuous (Gaussian) responses, but also the broader class of GLMs including the Poisson (log-linear) and Bernoulli (logistic) families.
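To make the baseline behind (1.1) concrete, the following is a minimal sketch (not the authors' implementation) that column-stacks three synthetic mixed-type views and fits the single-penalty, Gaussian-response case with an ordinary Lasso; the block sizes, penalty level, and data are placeholders.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 100

# Three synthetic views of mixed types (sizes are illustrative placeholders).
X_gauss = rng.normal(size=(n, 50))                  # continuous view
X_binary = rng.binomial(1, 0.3, size=(n, 50))       # binary view
X_count = rng.poisson(3.0, size=(n, 50))            # count-valued view

# Naive data integration: stack the views column-wise and use one global l1 penalty.
X = np.hstack([X_gauss, X_binary, X_count]).astype(float)
beta_true = np.zeros(X.shape[1])
beta_true[:5] = 2.0                                 # a few true features in the first view
y = X @ beta_true + rng.normal(size=n)

# alpha plays the role of the single tuning parameter lambda in (1.1).
model = Lasso(alpha=0.1).fit(X, y)
print("selected features:", np.flatnonzero(model.coef_))
```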

While there are many applications for multi-view feature selection in genomics, imaging, national security, economics, and other fields, major difficulties, stemming from the heterogeneity of features and how to appropriately integrate such differences, have prevented the successful use of multi-view feature selection in practice. To our knowledge, no one has proposed an effective practical solution to perform feature selection with multi-view data. A plethora of works have studied feature selection in the high-dimensional setting via the Lasso or GLM Lasso (Tibshirani, 1996; Yuan and Lin, 2007; Tibshirani et al., 2013; Zhao and Yu, 2006), and others have studied various data integration problems (Hall and Llinas, 1997; Shen, Olshen and Ladanyi, 2009; Acar, Kolda and Dunlavy, 2011). However, there is limited research at the intersection of the two fields.

The one area that touches on multi-view feature selection is in the context of mixed graphical models, which estimate sparse graphs between features in multi-view data (Yang et al., 2014a, b; Lee and Hastie, 2013; Cheng et al., 2013; Haslbeck and Waldorp, 2015). Using the node-wise neighborhood estimation approach of Meinshausen and Bühlmann (2006), mixed graphical models estimate the neighborhood of each node (i.e. feature) separately via a penalized regression model (typically based on the Lasso or GLM Lasso) and combine neighborhoods using an “AND” or “OR” rule. However, though mixed graphical models perform well in idealized settings for which theoretical guarantees have been proven, we will demonstrate in Section 2 that there are severe limitations with these approaches in realistic settings with correlated, heterogeneous features, commonly found in multi-view data.

To facilitate more effective integrative analyses in practice, we investigate the under-studied problem of high-dimensional multi-view feature selection, and we propose a practical solution. Our work is the first to identify and critically examine the fundamental challenges of multi-view feature selection, and we leverage this deep understanding of the challenges to develop a new high-dimensional multi-view selection method, the Block Randomized Adaptive Iterative Lasso (B-RAIL). B-RAIL is a practical tool for multi-view feature selection with its roots grounded in theory, and it builds upon adaptive penalties, the randomized Lasso, and stability selection (Meinshausen and Bühlmann, 2010) to overcome the issues incurred by existing methods. Our method can be used for both regression and mixed graphical selection, thus lending itself to a host of important applications.

We organize this paper as follows. In Section 2, we investigate the major challenges of multi-view feature selection and highlight the literature gaps relating to these issues. We also show that the culmination of these challenges leads to poor feature recovery in existing Lasso-type methods and mixed graphical models. In Section 3, we introduce our proposed method, B-RAIL, which takes steps to address the challenges from Section 2. In Section 4, we showcase the strong empirical performance of B-RAIL through simulations and contrast it to existing methods. In Section 5, we further demonstrate the effectiveness of B-RAIL in a novel integrative genomics case study for ovarian cancer, and we provide concluding remarks in Section 6.

2 Challenges

Before introducing our proposed method, it is instructive to understand the challenges posed by feature selection in the multi-view setting. These challenges have been overlooked in previous methods and thus contribute to many of their shortcomings. In this section, we focus on the challenges faced by linear models with Lasso-type penalties due to their overwhelming popularity and desirable statistical properties (Tibshirani, 1996; Yuan and Lin, 2007; Meinshausen and Yu, 2009; Tibshirani et al., 2013; Zhao and Yu, 2006; Zhang and Huang, 2008). Given data $X$ and response $y$, recall that the (GLM) Lasso solves

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta} \; -\ell(y; X\beta) + \lambda \|\beta\|_1 \qquad (2.1)$$

where $\lambda > 0$ is a regularization parameter and $\ell$ is the GLM log-likelihood associated with the response. For clarity, we use the term “Lasso” to refer to the $\ell_1$-penalized model with continuous (Gaussian) responses, “GLM Lasso” to mean the $\ell_1$-penalized model with non-Gaussian GLM responses (e.g. binary, Poisson), and “Lasso-type” methods to mean either the Lasso or GLM Lasso with some form of $\ell_1$ penalty (e.g. a global penalty, separate penalties, adaptive penalties).

Our focus here is not on deriving new theoretical guarantees for the Lasso in multi-view settings. Rather, we highlight deep practical concerns, which are rooted in theory and commonly arise in feature selection for data integration. By identifying these practical challenges, we open up numerous avenues for future theoretical research and set the stage for the construction of a new method, which overcomes the identified issues.

2.1 Motivating Example

To first illustrate the current challenges and motivate the need for a solution, we present in Figure 3 the estimated graphs from common Lasso-type methods and our proposed method when applied to real ovarian cancer genomics data. Here, the samples are measured on features from three views: count-valued RNASeq data, continuous miRNA data, and proportion-valued methylation data (refer to Section 5 for data collection and pre-processing details). As in several previous graphical models and mixed graphical models (Meinshausen and Bühlmann, 2006; Ravikumar et al., 2010; Jalali et al., 2011), we estimated the graphs using node-wise neighborhood selection. We then combined neighborhoods using the “AND” rule and applied stability selection (Meinshausen and Bühlmann, 2010; Liu, Roeder and Wasserman, 2010) with a fixed threshold to select stable edges.

Figure 3 specifically compares three types of estimation schemas at each node: (a) GLM Lasso with one global penalty, (b) GLM Lasso with separate penalties for each view, and (c) our proposed B-RAIL algorithm (introduced in Section 3). The first two methods have been proposed in several mixed graphical models (Yang et al., 2014a; Chen, Witten and Shojaie, 2014; Haslbeck and Waldorp, 2015) and satisfy strong theoretical guarantees in idealized settings. In the real data example however, the Lasso-type methods are unstable (illustrated by the fewer edges), favor feature selection within one view, and select only a few edges between views. This overall instability indicates that the Lasso-type methods are not robust to small perturbations of the data and raises serious concerns about the reproducibility and reliability of the results (Yu, 2013). Our proposed B-RAIL algorithm, in contrast, avoids these issues and exhibits greater stability as well as balance, selecting a larger number of within and between block edges under the same thresholding value. We will later see through extensive simulations in Section 4 that the issues with existing Lasso-type methods observed here are recurring problems in very general multi-view scenarios.

Figure 0: Lasso with a single global penalty
Figure 1: Lasso with separate penalties
Figure 2: The B-RAIL algorithm
Figure 3: We compare three different graph selection methods when applied to real ovarian cancer genomics data. The data is comprised of three blocks: RNASeq (red), miRNA (blue), and methylation (green). For all three methods, we use stability selection with the same threshold to select stable edges, and hence, we can directly compare the number of selected edges across the various methods. The Lasso with a single global penalty and the Lasso with separate penalties select few edges within blocks and almost no edges between blocks, indicating that these methods are highly unstable to small perturbations of the data.
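For readers unfamiliar with node-wise neighborhood selection, the following is a minimal sketch of the idea for a single (Gaussian) view: each node is regressed on all other nodes with the Lasso, and an edge is kept under the “AND” rule. The penalty value and data are synthetic placeholders, and the full mixed-type, stability-selected procedure behind Figure 3 is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_select_and(X, alpha=0.1):
    """Estimate an undirected graph via node-wise Lasso regressions and the AND rule."""
    n, p = X.shape
    selected = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        coef = Lasso(alpha=alpha).fit(X[:, others], X[:, j]).coef_
        selected[j, others] = coef != 0
    # AND rule: keep edge (j, k) only if node j selects k and node k selects j.
    return selected & selected.T

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 20))
adjacency = neighborhood_select_and(X)
print("number of edges:", int(adjacency.sum()) // 2)
```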

To begin understanding why existing Lasso-type methods struggle in practice, we identify and study four major challenges of feature selection for high-dimensional multi-view data: 1) scaling, 2) ultra-high-dimensionality, 3) signal interference, and 4) domain-specific beta-min. These issues stem from a combination of domain differences, signal differences, and the high-dimensionality of each view, and together, these challenges can have a significant adverse effect on feature recovery for data integration. We next examine each of these challenges in detail.

2.2 Scaling

The first and most obvious challenge with integrative analyses revolves around scaling. That is, each view in a multi-view data set is often measured on a different scale, and it is unclear how to most effectively integrate such differences. Many believe that normalizing all features to mean 0 and variance 1 remedies the scaling differences, but this is not always the case. Even after centering and scaling, data views remain distinct if they differ in ways beyond the first and second moments. This issue is especially problematic with binary and count-valued data blocks, two common types in multi-view data, since they are defined by much higher moments. We thus highly discourage using the ordinary (GLM) Lasso with a single penalty (2.1) on normalized multi-view data.
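This point is easy to verify numerically: after standardizing to mean 0 and variance 1, binary and count-valued columns still differ sharply from Gaussian columns in their higher moments. A small self-contained check on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10000

columns = {
    "gaussian": rng.normal(size=n),
    "binary":   rng.binomial(1, 0.1, size=n).astype(float),
    "count":    rng.poisson(1.0, size=n).astype(float),
}

for name, x in columns.items():
    z = (x - x.mean()) / x.std()  # center and scale to mean 0, variance 1
    print(f"{name:8s} mean={z.mean():+.2f} var={z.var():.2f} "
          f"skew={stats.skew(z):+.2f} kurtosis={stats.kurtosis(z):+.2f}")
```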

Now while one can use different regularization parameters for each view to help alleviate the scaling differences, this generates another set of issues that are complicated by the following challenges. We will revisit the scaling issue in light of these complications later in this section.

2.3 Ultra-High-Dimensionality

In addition to the scaling issue, performing exact feature selection with the Lasso is already difficult in the ordinary high-dimensional setting. For exact feature selection, the number of samples must be above a theoretical minimum known as the sample complexity. In the highly idealized scenario of an iid standard Gaussian design and a Gaussian response, Wainwright (2009) showed that the sample complexity scales at approximately $2s\log(p - s)$, where $p$ is the number of features and $s$ is the number of non-zero features. This idealized lower bound can be difficult to attain in many applications including genomics, where typical values of $p$ and $s$ demand more patients than even a large and highly expensive study can provide. We informally refer to the regime where $p > n$ as “high-dimensional” and the regime where $n$ falls below the sample complexity as “ultra-high-dimensional.” Roughly, the Lasso can never perform exact feature selection in the ultra-high-dimensional regime.

For non-Gaussian responses and more realistic designs such as correlated, heterogeneous views in multi-view data, the sample complexity is significantly higher than the idealized Gaussian bound (Chen, Witten and Shojaie, 2014; Ravikumar et al., 2010). As an example, the Poisson GLM's sample complexity grows considerably faster with $s$ and $p$ (Yang et al., 2015), so even moderate problem sizes demand far more samples than are typically available. This problem is further exacerbated in multi-view settings since combining multiple high-dimensional views for data integration almost always results in an ultra-high-dimensional problem.
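To get a rough feel for these sample-size requirements, one can evaluate the idealized Gaussian threshold of Wainwright (2009) at representative problem sizes; the sizes below are hypothetical examples, not values from the paper.

```python
import math

def gaussian_lasso_sample_complexity(p, s):
    """Approximate sample-size threshold 2*s*log(p - s) for exact support recovery
    under an iid standard Gaussian design (Wainwright, 2009)."""
    return 2 * s * math.log(p - s)

# Illustrative genomics-scale sizes (hypothetical, not from the paper).
for p, s in [(1000, 10), (10000, 30), (30000, 50)]:
    print(f"p={p:6d}, s={s:3d} -> n >= {gaussian_lasso_sample_complexity(p, s):,.0f}")
```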

2.4 Signal Interference

The third challenge we identify stems from a problem with the Lasso known as shrinkage noise. Su, Bogdan and Candes (2015) showed that with high probability, no matter how strong the effect sizes, false discoveries appear early on the Lasso path due to pseudo-noise introduced by shrinkage in the high- and ultra-high-dimensional regimes. When the Lasso selects its first few features using large regularization parameters, the residuals still contain much of the signal associated with the selected features, and it is this extra noise which Su, Bogdan and Candes (2015) call shrinkage noise.

In the multi-view context, shrinkage noise becomes a very complex and serious issue due to the different signals across blocks. Since the Lasso naturally selects features from the block with the highest signal first, the resulting shrinkage noise will mask the weaker signals from other blocks and compromise our ability to select from these weaker blocks. We refer to this adverse consequence of shrinkage noise as signal interference.

Figure 3: Gaussian response, Gaussian and Binary predictors
Figure 4: Binary response, Gaussian and Binary predictors
Figure 5: We illustrate signal interference for both Gaussian and binary responses given iid Gaussian and iid binary predictors, with 10 true features in each block. We fix the SNR for the Gaussian block and let the SNR of the binary block vary. The vertical red line highlights the point at which the two SNRs are equal. As the SNR of the binary block increases, it interferes with the ability to recover the true Gaussian features. This signal interference is even more severe for binary responses.

The problem of shrinkage noise has not been widely studied beyond the iid Gaussian design in Su, Bogdan and Candes (2015), but we provide strong empirical evidence in Figure 5 that confirms the existence of shrinkage noise and signal interference in non-Gaussian multi-view settings. In the case of an iid Gaussian and an iid binary block, shown in Figure 5, we achieve perfect recovery in the Gaussian block when the signal-to-noise ratio (SNR) in the binary block is small, but as the SNR of the binary block increases, it interferes with our ability to recover the Gaussian features in the small sample scenario. This signal interference is especially disastrous in the GLM Lasso with binary responses, where support recovery in the Gaussian block tends to zero. When we increase the sample size, however, there is no decline in the recovery of the Gaussian block in Figure 5(a). This agrees with the known result from Su, Bogdan and Candes (2015) that shrinkage noise occurs when the Lasso's theoretical conditions are violated and, in particular, when $n$ is not sufficiently large.
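The flavor of this experiment can be reproduced with a short simulation: both a Gaussian and a binary block carry signal, and as the binary coefficients grow, fewer true Gaussian features appear among the first features entering the Lasso path. The sizes and signal levels below are synthetic and this is not the paper's exact design.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n, p_block, s = 80, 100, 10

X_gauss = rng.normal(size=(n, p_block))
X_bin = rng.binomial(1, 0.5, size=(n, p_block)).astype(float)
X = np.hstack([X_gauss, X_bin])

for binary_scale in [0.0, 1.0, 3.0, 6.0]:
    beta = np.zeros(2 * p_block)
    beta[:s] = 1.0                                # true Gaussian features (fixed signal)
    beta[p_block:p_block + s] = binary_scale      # true binary features (growing signal)
    y = X @ beta + rng.normal(size=n)

    # Take the point on the Lasso path that selects about 2*s features and
    # count how many true Gaussian features it has recovered.
    _, coefs, _ = lasso_path(X, y, n_alphas=100)
    n_selected = (coefs != 0).sum(axis=0)
    idx = np.argmin(np.abs(n_selected - 2 * s))
    recovered = np.count_nonzero(coefs[:s, idx])
    print(f"binary signal scale {binary_scale:3.1f}: "
          f"{recovered}/{s} true Gaussian features recovered")
```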

2.5 Domain-Specific Beta-min Condition

Figure 6: We simulate Gaussian responses given four types of predictors (Gaussian, binary, uniform, Poisson) and compare our ability to recover the true features under the four designs. The red and blue solid vertical lines indicate the minimum SNR required to achieve recovery under the two sample sizes considered. In the case of non-Gaussian predictors, dashed vertical lines are overlayed to compare the minimum SNR requirements to those of Gaussian predictors. These results show that different data types can tolerate different minimum SNRs.

Finally, analogous to how signal differences can exacerbate the Lasso’s shrinkage noise, domain differences in multi-view problems can complicate the Lasso’s beta-min condition, which establishes a lower bound for the minimum amount of signal (i.e. SNR) required for feature recovery (Zhao and Yu, 2006; Meinshausen and Bühlmann, 2006; Bühlmann et al., 2013).

In Figure 6, we report our ability to recover the true features for a simple simulation with iid features from four data types (Gaussian, binary, uniform, and Poisson). In each subplot, the red and blue solid lines indicate the minimum SNR needed to recover the true features under the two sample sizes considered. We observe that the minimum SNR requirement varies based upon the domain of the features and that the Gaussian predictors can tolerate the lowest SNR. These empirical results reveal that if two blocks have the same amount of signal but are from different domains, we can only recover the features that pass the minimum signal threshold dictated by the domains. Put concretely, if we were to perform feature selection on our simulated multi-view data at a fixed sample size and SNR, we would recover a much larger fraction of the true features in the Gaussian block than in the binary block. This observed phenomenon agrees with previous work, which has shown that an increase in the sparsity level effectively reduces the SNR in the high-dimensional setting (Wang, Wainwright and Ramchandran, 2010). Beyond this however, the beta-min condition has been relatively unexplored for the GLM Lasso and domain differences and remains a ripe area for future theoretical work.
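A rough numerical probe of the domain-specific beta-min effect is to sweep the coefficient magnitude for different predictor types and record the resulting recovery. The sketch below uses synthetic sizes and cross-validated Lasso fits purely for illustration; it is not the simulation behind Figure 6.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, s = 100, 300, 10

designs = {
    "gaussian": rng.normal(size=(n, p)),
    "binary":   rng.binomial(1, 0.5, size=(n, p)).astype(float),
    "poisson":  rng.poisson(2.0, size=(n, p)).astype(float),
}

for name, X in designs.items():
    Xs = (X - X.mean(0)) / X.std(0)          # center and scale each column
    for magnitude in [0.2, 0.5, 1.0]:
        beta = np.zeros(p)
        beta[:s] = magnitude                 # "beta-min" level of the true features
        y = Xs @ beta + rng.normal(size=n)
        coef = LassoCV(cv=5).fit(Xs, y).coef_
        tpr = np.count_nonzero(coef[:s]) / s
        print(f"{name:8s} beta-min={magnitude:.1f}  TPR={tpr:.2f}")
```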

2.6 Additional Challenges

It is important to note that the four challenges above do not act independently from one another. In fact, the main source of difficulty with multi-view feature selection is arguably the interactions between challenges. For instance, consider the problem of selecting features from high-dimensional discrete blocks with weak signals. The ultra-high-dimensionality issue can exacerbate the already existing problem of signal interference, which can then worsen scaling issues, increase minimum SNR requirements, and amplify the overall difficulty of the problem.

In conjunction with these complex interactions, the need to select an appropriate amount of regularization through model selection methods can also increase the difficulty of multi-view feature selection. We will compare three common selection methods, namely, stability selection (Meinshausen and Bühlmann, 2010; Liu, Roeder and Wasserman, 2010), cross validation (Stone, 1974; Allen, 1974; Shao, 1993), and extended BIC (Chen and Chen, 2012), and discuss their additional challenges in Section 4.

We lastly note that the majority of our discussion has been focused on the Lasso. Feature selection is even more challenging for the GLM Lasso. Chen, Witten and Shojaie (2014) investigated this for mixed graphical models and concluded that the predictors associated with Gaussian responses are easier to recover than those with responses from other exponential families. Specifically, Gaussian responses require fewer samples, allow for a wider range of tuning parameters, and generally have a higher probability of success.

2.7 Challenges with Existing Methods

Having identified a host of challenges, we return to address why common Lasso-type methods are not well-suited for multi-view feature selection.

Beginning with its simplest form, the Lasso with a single global penalty (2.1) uses the same penalty for all views and does not alleviate the scaling or signal interference issues in multi-view data. The consequences of these problems are evident, especially in the case of non-Gaussian blocks with weak signals, in Figure 3(a), where fewer edges are selected within the proportion-valued methylation block.

By employing the Lasso with separate penalties for each data view, we can mitigate the issue of scaling. Nevertheless, model selection becomes more challenging with multiple penalties, and signal interference remains a driver of poor recovery. In fact, having a separate penalty for each view exacerbates signal interference and encourages selection from the block with the strongest signal and no selection from the blocks with weak signals. This signal interference is exemplified in Figure 3 by the extreme selection imbalance among views, with almost no selection in the miRNA block and heavy selection in the RNASeq block.

In the Adaptive Lasso (Zou, 2006), the amount of $\ell_1$-regularization associated with feature $j$ is typically proportional to $1 / |\hat{\beta}_j^{\mathrm{init}}|^{\gamma}$ for some constant $\gamma > 0$, where $\hat{\beta}^{\mathrm{init}}$ is an initial estimate of the coefficients. This adaptive penalty mitigates the scaling issue by adjusting for signal differences through $\hat{\beta}^{\mathrm{init}}$, but it also encourages selection of features with higher signals and penalizes features with weaker signals. The Adaptive Lasso hence complicates signal interference by treating weaker signals as noise and results in little to no selection in the blocks with weak signals.
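For concreteness, a ridge-weighted Adaptive Lasso can be implemented by rescaling the columns of X by the initial estimates and running an ordinary Lasso, as sketched below; the weight exponent and penalty levels are illustrative choices rather than the paper's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def adaptive_lasso(X, y, gamma=1.0, ridge_alpha=1.0, lasso_alpha=0.1, eps=1e-6):
    """Adaptive Lasso: weights w_j = 1 / |beta_ridge_j|^gamma, implemented by
    rescaling the columns of X and refitting an ordinary Lasso."""
    beta_init = Ridge(alpha=ridge_alpha).fit(X, y).coef_
    weights = 1.0 / (np.abs(beta_init) ** gamma + eps)
    X_scaled = X / weights            # column j divided by its weight w_j
    lasso = Lasso(alpha=lasso_alpha).fit(X_scaled, y)
    return lasso.coef_ / weights      # map coefficients back to the original scale

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
beta = np.zeros(200)
beta[:5] = 2.0
y = X @ beta + rng.normal(size=100)
print("selected:", np.flatnonzero(adaptive_lasso(X, y)))
```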

While the previous methods all struggle with signal interference, one simple way to reduce the signal interference between blocks is to perform separate Lassos for each data view. Since independently-estimated blocks cannot possibly interfere with one another, this method addresses both scaling and signal interference issues. It also avoids the problem of ultra-high-dimensionality. However, each view by itself usually does not contain sufficient information to explain much of the variability in the response, and we lose the advantages of data integration.

Beyond the Lasso-type methods, there are selection methods with non-convex penalties such as SCAD and MCP (Fan and Li, 2001; Zhang et al., 2010). These non-convex penalties tend to scale better than the Lasso-type penalties but are still not variable selection consistent in the ultra-high-dimensional regime, especially for non-Gaussian responses and highly correlated data. We investigate MCP/SCAD feature selection in Table 6 in the Appendix, but our primary focus in this paper is on the more commonly used Lasso-type penalties.

3 Block Randomized Adaptive Iterative Lasso

Driven by the many challenges and the lack of effective tools, we propose a new method for multi-view feature selection, the Block Randomized Adaptive Iterative Lasso (B-RAIL). For the sake of notation, suppose we observe the response vector $y$ and multi-view data $X = [X^{(1)}, \ldots, X^{(K)}]$ with $K$ views of potentially mixed types, $n$ samples, and $p$ total features. We will assume $p > n$ and typically $p_k > n$ for each view $X^{(k)}$. Let $S$ denote the indices of the support, and let $X_S$ denote the columns of $X$ indexed by $S$. We will introduce B-RAIL in the context of regression and later discuss its extension to graph selection.

Under the regression framework, the goal of B-RAIL can be viewed as two-fold: 1) to select features from each view that are associated with the response , and 2) to do so while avoiding the challenges discussed in Section 2. With this goal in mind, we briefly outline the B-RAIL algorithm in Algorithm 1 and summarize the key steps taken to overcome the current challenges.

Initialize $\hat{S}_k$ and $\hat{\beta}^{(k)}$ to have a fixed proportion of sparsity for $k = 1, \ldots, K$.

Do until $\hat{\beta}$ stops changing:

  • Set $t = t + 1$.

  • For $k = 1, \ldots, K$, estimate $\hat{\beta}^{(k)}$ blockwise, holding $\hat{\beta}^{(1)}, \ldots, \hat{\beta}^{(k-1)}$ and $\hat{\beta}^{(k+1)}, \ldots, \hat{\beta}^{(K)}$ fixed:

    1. Estimate $\hat{S}_k$, the support of block $k$:

      • Use stability selection with the randomized Lasso and adaptive penalties

    2. Given $\hat{S}_k$, estimate $\hat{\beta}^{(k)}_{\hat{S}_k}$, the estimated non-zero coefficient values of block $k$

Output $\hat{\beta} = (\hat{\beta}^{(1)}, \ldots, \hat{\beta}^{(K)})$.

Algorithm 1: Outline of the Block Randomized Adaptive Iterative Lasso (B-RAIL)

At a high-level, B-RAIL iterates across the data blocks and estimates each data block separately while holding all other blocks fixed. This iterative procedure is motivated by the advantages of performing separate Lassos - namely, that it mitigates the ultra-high-dimensionality and signal interference issues. Then, within each of the individual block estimations, B-RAIL first estimates the block’s support and subsequently, the coefficient values given the support. Here, B-RAIL leverages ideas from adaptive weighting schemes, stability selection, and the randomized Lasso in an attempt to reduce the scaling discrepancies and domain-specific beta-min issues.
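The block-by-block structure can be sketched in a few lines. The code below is a schematic outline for a Gaussian response only: the `select_support` placeholder stands in for B-RAIL's randomized Lasso with adaptive weights and stability selection, so this is an illustration of the iteration, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def select_support(X_block, partial_residual, alpha=0.1):
    """Placeholder for B-RAIL's support step (randomized Lasso + stability
    selection with adaptive weights); here it is just a plain Lasso."""
    coef = Lasso(alpha=alpha).fit(X_block, partial_residual).coef_
    return np.flatnonzero(coef)

def brail_outline(blocks, y, max_iter=20):
    """Schematic B-RAIL outer loop for a Gaussian response."""
    coefs = [np.zeros(X.shape[1]) for X in blocks]
    for _ in range(max_iter):
        old_supports = [np.flatnonzero(b) for b in coefs]
        for k, Xk in enumerate(blocks):
            # Hold all other blocks fixed and work with the partial residual.
            other = sum(blocks[l] @ coefs[l] for l in range(len(blocks)) if l != k)
            r = y - other
            support = select_support(Xk, r)
            coefs[k] = np.zeros(Xk.shape[1])
            if support.size:
                # Refit the selected coefficients with a small ridge penalty.
                coefs[k][support] = Ridge(alpha=1e-3).fit(Xk[:, support], r).coef_
        # Stop when the (unsigned) supports no longer change.
        if all(np.array_equal(np.flatnonzero(b), s) for b, s in zip(coefs, old_supports)):
            break
    return coefs

rng = np.random.default_rng(0)
blocks = [rng.normal(size=(100, 50)), rng.binomial(1, 0.3, size=(100, 50)).astype(float)]
y = blocks[0][:, :3].sum(axis=1) + rng.normal(size=100)
fits = brail_outline(blocks, y)
```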

We next provide the full B-RAIL algorithm in Algorithm 2 and proceed to discuss each step of the B-RAIL algorithm in greater detail.

  1. Initialization:

    • Set $t = 0$.

    • Initialize $\hat{\beta} = (\hat{\beta}^{(1)}, \ldots, \hat{\beta}^{(K)})$, where each $\hat{\beta}^{(k)}$ has a fixed proportion of non-zero entries for $k = 1, \ldots, K$.

    • Re-order blocks in the data $X = [X^{(1)}, \ldots, X^{(K)}]$, if necessary.

  2. Do:

    • Set $t = t + 1$.

    • For $k = 1, \ldots, K$, estimate $\hat{\beta}^{(k)}$ blockwise, holding $\hat{\beta}^{(1)}, \ldots, \hat{\beta}^{(k-1)}$ and $\hat{\beta}^{(k+1)}, \ldots, \hat{\beta}^{(K)}$ fixed:

      1. Update $\hat{S}_k$, the estimated support for block $k$:

        1. Set the adaptive regularization weight $\lambda_j$ for each feature $j$ in block $k$ via the multiplicative update in (3.1), using the block-specific factor in (3.2), which is computed from the estimated Fisher information matrix (see Section 3.3).

        2. Perform stability selection:

          1. Take $B$ bootstrap samples of $(y, X)$.

          2. Solve the randomized Lasso: for each bootstrap sample $b = 1, \ldots, B$,

            $$\hat{\beta}^{(k), b} = \operatorname*{arg\,min}_{\beta} \; -\ell\Big(y^{b}; X^{(k), b}\beta + \sum_{l \neq k} X^{(l), b}\hat{\beta}^{(l)}\Big) + \sum_{j} \lambda_j W_j |\beta_j| \qquad (3.3)$$

            where the $W_j$ are independent random penalty perturbations, as in the randomized Lasso.

          3. Select features at stability level $\pi_{\mathrm{thr}}$:

            $$\hat{S}_k = \Big\{ j : \frac{1}{B}\sum_{b=1}^{B} \mathbb{1}\big(\hat{\beta}^{(k), b}_j \neq 0\big) \geq \pi_{\mathrm{thr}} \Big\} \qquad (3.4)$$

      2. Update $\hat{\beta}^{(k)}_{\hat{S}_k}$, the estimated non-zero coefficients for block $k$, by refitting over $\hat{S}_k$ with a small ridge penalty $\epsilon > 0$:

        $$\hat{\beta}^{(k)}_{\hat{S}_k} = \operatorname*{arg\,min}_{\beta} \; -\ell\Big(y; X^{(k)}_{\hat{S}_k}\beta + \sum_{l \neq k} X^{(l)}\hat{\beta}^{(l)}\Big) + \epsilon \|\beta\|_2^2 \qquad (3.5)$$

  3. Until: $S_{\pm}(\hat{\beta}^{(t)}) = S_{\pm}(\hat{\beta}^{(t-1)})$, where $S_{\pm}(\cdot)$ denotes the signed support of a vector.

  4. Output: $\hat{\beta} = (\hat{\beta}^{(1)}, \ldots, \hat{\beta}^{(K)})$.

Algorithm 2: Block Randomized Adaptive Iterative Lasso (B-RAIL)

3.1 Initialization

In our proposed B-RAIL algorithm, the coefficients are first initialized to a pre-specified sparsity level (i.e. a fixed number of non-zero features per block) by fitting separate Lasso path regressions for each block. We then must specify the order of the blocks to iterate over. This ordering can affect the performance of B-RAIL as accurate estimation of the first block improves subsequent estimations of other blocks. Since previous Lasso results guarantee a high probability of support recovery when $n$ is sufficiently large compared to the block dimension, we strongly advise estimating the blocks with the smallest dimensions first, especially if those dimensions are small relative to $n$. If the dimensions of all the blocks are of similar sizes or much larger than $n$, we recommend starting with Gaussian blocks, which tend to have better support recovery than non-Gaussian blocks.

3.2 Estimating the Support $\hat{S}_k$

After initialization, we repeatedly iterate across the data blocks and estimate the support of each block, holding the estimates of all other blocks fixed. This block-wise support estimation is performed using stability selection with the randomized Lasso (Meinshausen and Bühlmann, 2010). As given by step 1(b) in Algorithm 2, we solve the randomized Lasso on $B$ bootstrap samples with randomized penalty terms, and we threshold the stability score in (3.4) to select the most stable features. Since the randomized Lasso is known to be feature selection consistent even when the Lasso's irrepresentable condition is violated (Meinshausen and Bühlmann, 2010), B-RAIL leverages the randomized penalties and stability in order to adequately handle correlated features.
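A minimal sketch of this support-estimation step for a Gaussian response is given below: each bootstrap run perturbs the per-feature penalties with random weights, and features are kept when their selection frequency exceeds a threshold. The weakness parameter, penalty level, and threshold are illustrative defaults, not B-RAIL's adaptive choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def randomized_stability_select(X, y, alpha=0.1, weakness=0.5, n_boot=100,
                                threshold=0.6, seed=0):
    """Stability selection with a randomized Lasso penalty (Gaussian response)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.choice(n, size=n, replace=True)        # bootstrap sample
        w = rng.uniform(weakness, 1.0, size=p)           # random penalty weights in [weakness, 1]
        Xw = X[idx] / w                                  # column rescaling gives per-feature penalties alpha * w_j
        coef = Lasso(alpha=alpha).fit(Xw, y[idx]).coef_
        freq += (coef != 0)
    freq /= n_boot
    return np.flatnonzero(freq >= threshold), freq

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))
beta = np.zeros(200)
beta[:5] = 2.0
y = X @ beta + rng.normal(size=100)
support, frequencies = randomized_stability_select(X, y)
print("stable features:", support)
```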

3.3 Adaptive Regularization $\lambda_j$

Like the usual randomized Lasso, our penalty term in (3.3) includes a random weight $W_j$. However, in order to account for the scaling discrepancies, signal variability, and domain differences between blocks, we introduce a block-specific adaptive penalty $\lambda_j$ in (3.3) as well. For feature $j$ in block $k$, we define the adaptive weight $\lambda_j$ through the multiplicative update in (3.1), where the block-specific factor $\lambda_k$ is given in (3.2). Here, $\hat{I}$ is the Fisher information matrix corresponding to the GLM of the response $y$, and $\lambda_{\max}(\cdot)$ denotes the maximum eigenvalue.

In this definition of $\lambda_j$, there are two moving parts. First, the multiplicative scheme in (3.1) encourages previously selected features to remain selected while still allowing all features to freely enter or exit the model. Secondly, the block-specific factor $\lambda_k$ in (3.2) accounts for the heterogeneity of multi-view data and helps to mitigate the challenges of Section 2.

Though the exact form of $\lambda_k$ was derived experimentally, $\lambda_k$ can be interpreted as the product of three factors, each of which is rooted in solid theoretical foundations. For instance, part (c) of (3.2) is closely related to the theoretical bound on the regularization parameter needed for selection consistency of the Lasso (Zhao and Yu, 2006; Meinshausen and Bühlmann, 2006). The ratio of eigenvalues in part (a) (i.e. the domain correction term) is motivated by the theoretical conditions imposed on the Fisher information matrix for exponential family distributions (Yang et al., 2015), and part (b) of (3.2) (i.e. the signal correction term) can be viewed as the average signal in block $k$, since it is derived from the theoretical sparsity level within each block (Bunea et al., 2007). While in theory this specific combination of weights should correct for domain differences and signal differences across views, we reinforce our choice of $\lambda_k$ through strong empirical results in Section 4.

3.4 Coefficient Estimations

After estimating the block-wise support using the randomized Lasso with adaptive weights, we seek to estimate the coefficients of the support as accurately as possible since these values are used in future block estimations and iterations of B-RAIL. We thus fit a penalized regression model with a small ridge penalty, as in (3.5). The small penalty introduces only a little bias and ensures that we can still estimate the coefficients even when the selected support is larger than $n$.
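A toy illustration of why the small ridge penalty matters: the refit remains well-defined even when the number of selected features exceeds the number of samples. The sizes below are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, n_selected = 50, 80                      # more selected features than samples
X_support = rng.normal(size=(n, n_selected))
y = rng.normal(size=n)

# Ordinary least squares would be ill-posed here, but a small ridge penalty
# keeps the refit of the selected coefficients well-defined.
coef = Ridge(alpha=1e-3).fit(X_support, y).coef_
print(coef.shape)
```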

3.5 Convergence

We finally declare convergence of B-RAIL’s iterative block-by-block estimation procedure when the estimated support remains unchanged. Our empirical analysis indicates that B-RAIL has quick support convergence, and we provide one example of this fast convergence in Figure 17 in the Appendix. Using the ovarian cancer simulation (see Section 5) for three different responses (Gaussian, binary, and Poisson), we report that the average number of iterations until convergence is between 4 and 5 with the maximum number of iterations reaching 15 (over 100 runs). These ranges are similar for all designs, confirming B-RAIL’s fast convergence.

3.6 B-RAIL Summary

While we have introduced B-RAIL under the regression framework, B-RAIL can be naturally extended to estimate mixed graphical models via a penalized node-wise regression approach (Meinshausen and Bühlmann, 2006). As in the motivating example in Section 2, we can use B-RAIL to estimate the neighborhood of each node separately via penalized regressions and then combine the neighborhoods using an “AND” or “OR” rule to obtain the graph.

In either the regression or graph selection setting, our B-RAIL algorithm deliberately takes steps to exploit the practical advantages of existing Lasso-type methods while avoiding the drawbacks described in Section 2. For instance, by performing iterative block-by-block estimations, B-RAIL inherits the advantages of performing separate Lassos and avoids the issues of ultra-high-dimensionality and signal interference. We also mitigate the scaling and beta-min problems by engineering adaptive penalties in B-RAIL to correct for domain and signal differences between blocks. Still, selecting an appropriate amount of regularization can be very challenging in practice due to highly correlated data. B-RAIL thus incorporates randomized stability selection, which is known to be feature selection consistent under stronger and more complex dependencies than can be handled by the Lasso. This boosts the support estimation of correlated features, and together, with the previous components, B-RAIL effectively overcomes the many practical challenges of multi-view feature selection and lends itself to a plethora of data integration applications.

4 Numerical Studies

We next reinforce the theoretically-guided choices in our B-RAIL construction and demonstrate its effectiveness through extensive simulations. In these simulations, we evaluate B-RAIL against four common Lasso-based parametric methods: (i) the Lasso with a global penalty for all blocks, (ii) the Lasso with separate penalties for each block, (iii) separate Lassos for each block, and (iv) the Adaptive Lasso. For the Adaptive Lasso, we use ridge weights as they are better adapted to handle correlated features. Moreover, to avoid biases from penalty selection methods, we use oracle information to select features in the Lasso-based models. That is, if $s$ is the number of true features in the simulation, we fit the full path of the Lasso and select the first $s$ features. In the case of the Lasso with separate penalties, we select the set of features with the largest number of true positives. We do not, however, use oracle information for B-RAIL. Instead, B-RAIL internally selects the number of features using stability selection with a fixed threshold, as outlined in Algorithm 2.

To systematically compare these methods, we simulate from various designs of $X$ with three blocks - namely, a Gaussian, a Bernoulli, and a Poisson block - and various types of GLM responses $y$. Due to the popular use of the Gaussian, Bernoulli, and Poisson GLMs, we run simulations with responses from each of these families. For the Gaussian response, we fit the linear model $y = X\beta^* + \epsilon$ with Gaussian noise $\epsilon$. For the binary and Poisson responses, we use copula transformations (Nelsen, 1999) to simulate $y$.
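One common copula-style construction, shown only as a rough illustration of how mixed-type responses can be generated (the paper's exact transformation may differ): push a latent Gaussian linear predictor through the Gaussian CDF and then through the target quantile function.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0

# Latent Gaussian response, mapped to uniforms via the normal CDF and then to
# binary / count responses via the Bernoulli and Poisson quantile functions.
latent = X @ beta + rng.normal(size=n)
u = stats.norm.cdf((latent - latent.mean()) / latent.std())
y_binary = stats.bernoulli.ppf(u, 0.5)    # Bernoulli(0.5) marginal
y_poisson = stats.poisson.ppf(u, 3.0)     # Poisson(3) marginal
```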

In addition to these response models for $y$, we consider four simulation designs for $X$ to understand model behavior under different assumptions. The four simulation designs are: (i) iid features, (ii) independent features with non-constant variance, (iii) correlated features with covariance structure from a Block Directed Markov Random Field, and (iv) a real data inspired simulation with features from The Cancer Genome Atlas (TCGA) ovarian cancer study. We elaborate on each of these designs below.

Note in all of the simulations, we set the number of true features in each covariate block to 10, and the magnitudes of the true features are drawn at random with random sign assignment. Unless stated otherwise, we simulate a fixed number of samples, with the total number of features specified in each design. We also center and scale the design matrix before estimation.

iid Design.   For each of the three covariate blocks, we simulate samples from iid features. Here, the total number of features is p = 900 for the high-dimensional design and p = 300 for the low-dimensional design.

Heteroscedasticity Design.   In this design, we assume that the features are independent but have non-constant variance. For the Gaussian block, the entries in each column are simulated from a normal distribution whose variance varies across columns. In the Bernoulli block, each column is simulated independently with entries drawn from a Bernoulli distribution whose success probability varies across columns. Similarly, in the Poisson block, the mean of each column is drawn from a Gamma distribution (using the shape/scale parameterization).

Block Directed Graph Design.   We next drop the independence assumption and use a Block Directed Markov Random Field (BDMRF) (Yang et al., 2014a) graph to simulate correlated features. In this case, $X$ is simulated via Gibbs sampling according to a partial ordering of the underlying mixed graph in which the Gaussian block follows a pairwise Gaussian conditional random field (CRF), the binary block follows a pairwise Ising CRF (Ravikumar et al., 2010), and the count block follows a pairwise Poisson Markov Random Field (MRF) (Yang et al., 2013, 2012). We set high correlations for the Gaussian and Poisson blocks and low correlations for the binary block and the between-block structure.

Ovarian Cancer Inspired Simulation Design.   In an attempt to simulate data closest to real world scenarios, we take the continuous-valued miRNA data, proportion-valued methylation data, and the count-valued RNASeq data from The Cancer Genome Atlas (TCGA) ovarian cancer database (The Cancer Genome Atlas Research Network, 2011) to be our covariates. After merging and preprocessing the TCGA ovarian cancer data (refer to Section 5 for details), we arrive at a common set of samples with three feature blocks of RNASeq, methylation, and miRNA features.

iid Case, p=300 [each cell: TPR (SE), FDP (SE); columns: Total | Continuous | Binary | Counts]
  B-RAIL                             1.00 (1.1e-3), 0.00 (0.0e-0) | 1.00 (1.7e-3), 0.00 (0.0e-0) | 1.00 (1.4e-3), 0.00 (0.0e-0) | 1.00 (1.4e-3), 0.00 (0.0e-0)
  Lasso, global penalty (oracle)     0.87 (1.2e-3), 0.12 (1.9e-3) | 0.90 (1.4e-3), 0.12 (5.3e-3) | 0.81 (2.9e-3), 0.20 (5.2e-4) | 0.90 (0.0e-0), 0.03 (4.5e-3)
  Lasso, separate penalties (oracle) 0.92 (1.6e-3), 0.08 (1.6e-3) | 0.96 (4.8e-3), 0.00 (0.0e-0) | 0.90 (0.0e-0), 0.15 (4.5e-3) | 0.90 (0.0e-0), 0.08 (4.4e-3)
  Separate Lasso (oracle)            0.75 (1.6e-3), 0.24 (4.9e-4) | 0.80 (0.0e-0), 0.20 (0.0e-0) | 0.70 (1.7e-3), 0.30 (5.7e-4) | 0.77 (4.8e-3), 0.21 (1.4e-3)
  Adaptive Lasso (oracle)            1.00 (0.0e-0), 0.00 (0.0e-0) | 1.00 (0.0e-0), 0.00 (0.0e-0) | 1.00 (0.0e-0), 0.00 (0.0e-0) | 1.00 (0.0e-0), 0.00 (0.0e-0)

iid Case, p=900
  B-RAIL                             0.94 (7.3e-3), 0.12 (1.4e-2) | 0.98 (5.0e-3), 0.14 (1.7e-2) | 0.95 (9.9e-3), 0.08 (1.1e-2) | 0.90 (9.2e-3), 0.12 (1.5e-2)
  Lasso, global penalty (oracle)     0.63 (3.3e-4), 0.37 (5.0e-4) | 0.80 (0.0e-0), 0.20 (0.0e-0) | 0.50 (1.0e-3), 0.50 (1.4e-3) | 0.60 (0.0e-0), 0.40 (0.0e-0)
  Lasso, separate penalties (oracle) 0.74 (1.7e-3), 0.26 (1.7e-3) | 0.98 (4.4e-3), 0.19 (3.3e-3) | 0.63 (7.4e-3), 0.36 (6.5e-3) | 0.62 (3.9e-3), 0.20 (7.4e-3)
  Separate Lasso (oracle)            0.53 (9.6e-4), 0.46 (1.3e-3) | 0.60 (0.0e-0), 0.38 (4.4e-3) | 0.49 (2.9e-3), 0.51 (1.6e-3) | 0.50 (0.0e-0), 0.50 (0.0e-0)
  Adaptive Lasso (oracle)            0.66 (9.6e-4), 0.34 (6.9e-4) | 0.79 (3.1e-3), 0.28 (1.3e-3) | 0.50 (3.5e-3), 0.44 (1.7e-3) | 0.70 (0.0e-0), 0.30 (9.0e-4)

Non-constant Variance (Heteroscedasticity)
  B-RAIL                             0.97 (3.3e-4), 0.01 (1.3e-3) | 0.90 (1.0e-3), 0.00 (0.0e-0) | 1.00 (0.0e-0), 0.00 (0.0e-0) | 1.00 (0.0e-0), 0.02 (3.7e-3)
  Lasso, global penalty (oracle)     0.77 (2.2e-3), 0.21 (2.2e-3) | 0.88 (3.9e-3), 0.19 (2.9e-3) | 0.83 (4.7e-3), 0.06 (5.4e-3) | 0.60 (0.0e-0), 0.37 (3.6e-3)
  Lasso, separate penalties (oracle) 0.83 (0.0e-0), 0.17 (0.0e-0) | 0.90 (0.0e-0), 0.11 (2.8e-3) | 1.00 (0.0e-0), 0.18 (2.4e-3) | 0.60 (0.0e-0), 0.22 (4.9e-3)
  Separate Lasso (oracle)            0.56 (1.4e-3), 0.43 (1.1e-3) | 0.58 (4.3e-3), 0.41 (1.9e-3) | 0.80 (0.0e-0), 0.19 (2.3e-3) | 0.30 (0.0e-0), 0.70 (1.4e-3)
  Adaptive Lasso (oracle)            0.86 (1.3e-3), 0.13 (1.2e-3) | 0.90 (1.0e-3), 0.10 (1.8e-3) | 1.00 (0.0e-0), 0.11 (3.4e-3) | 0.69 (3.4e-3), 0.20 (5.6e-3)

Block Directed Graph Structure
  B-RAIL                             0.88 (4.8e-3), 0.16 (1.1e-2) | 0.79 (4.8e-3), 0.12 (1.4e-2) | 0.96 (6.8e-3), 0.12 (7.0e-3) | 0.89 (5.2e-3), 0.22 (1.4e-2)
  Lasso, global penalty (oracle)     0.72 (1.9e-3), 0.27 (1.9e-3) | 0.88 (5.8e-3), 0.17 (3.2e-3) | 1.00 (0.0e-0), 0.28 (3.5e-3) | 0.30 (0.0e-0), 0.42 (4.2e-3)
  Lasso, separate penalties (oracle) 0.77 (0.0e-0), 0.23 (0.0e-0) | 1.00 (0.0e-0), 0.20 (4.0e-3) | 1.00 (0.0e-0), 0.24 (5.4e-3) | 0.30 (0.0e-0), 0.26 (1.5e-2)
  Separate Lasso (oracle)            0.51 (2.0e-3), 0.46 (1.9e-3) | 0.79 (3.1e-3), 0.18 (4.6e-3) | 0.55 (5.0e-3), 0.42 (2.8e-3) | 0.20 (0.0e-0), 0.79 (1.3e-3)
  Adaptive Lasso (oracle)            0.70 (0.0e-0), 0.30 (3.4e-4) | 0.80 (0.0e-0), 0.20 (0.0e-0) | 1.00 (0.0e-0), 0.29 (1.3e-3) | 0.30 (0.0e-0), 0.49 (3.0e-3)

OV Data [columns: Total | Continuous | Proportion | Counts]
  B-RAIL                             0.95 (2.9e-3), 0.11 (4.7e-3) | 0.85 (8.7e-3), 0.23 (4.9e-3) | 1.00 (0.0e-0), 0.09 (8.2e-3) | 1.00 (1.0e-3), 0.03 (4.4e-3)
  Lasso, global penalty (oracle)     0.57 (8.5e-4), 0.43 (1.2e-3) | 0.80 (0.0e-0), 0.38 (0.0e-0) | 0.41 (2.6e-3), 0.31 (6.2e-3) | 0.50 (0.0e-0), 0.54 (1.5e-3)
  Lasso, separate penalties (oracle) 0.65 (1.6e-3), 0.35 (1.6e-3) | 0.80 (1.0e-3), 0.37 (2.3e-3) | 0.51 (3.6e-3), 0.00 (1.7e-3) | 0.63 (4.6e-3), 0.48 (4.6e-3)
  Separate Lasso (oracle)            0.40 (1.6e-3), 0.60 (1.3e-3) | 0.58 (4.1e-3), 0.41 (1.9e-3) | 0.30 (0.0e-0), 0.70 (0.0e-0) | 0.30 (2.0e-3), 0.69 (2.3e-3)
  Adaptive Lasso (oracle)            0.93 (1.2e-3), 0.09 (1.1e-3) | 0.80 (0.0e-0), 0.12 (2.8e-3) | 0.98 (3.6e-3), 0.00 (1.0e-3) | 1.00 (0.0e-0), 0.09 (1.7e-3)

Table 1: We compare various selection methods under 5 different simulation scenarios. For each scenario, we simulate $X$ with three blocks (continuous, binary, counts) and a Gaussian response $y$. We report the true positive rate (TPR) and false discovery proportion (FDP) for overall feature recovery and individual block recoveries, averaged across 200 runs with standard errors in parentheses. We bold the best overall values for each simulation scenario. Note that we used oracle information for the Lasso-type methods.

Binary Response, Block Directed Graph Structure [each cell: TPR (SE), FDP (SE); columns: Total | Continuous | Binary | Counts]
  B-RAIL                             0.81 (8.6e-3), 0.07 (5.2e-3) | 0.80 (0.0e-0), 0.04 (5.4e-3) | 0.97 (5.7e-3), 0.09 (6.2e-3) | 0.68 (2.1e-2), 0.07 (1.2e-2)
  Lasso, global penalty (oracle)     0.70 (0.0e-0), 0.29 (1.3e-3) | 0.90 (0.0e-0), 0.23 (4.1e-3) | 0.90 (0.0e-0), 0.37 (2.0e-3) | 0.30 (0.0e-0), 0.18 (1.3e-2)
  Lasso, separate penalties (oracle) 0.70 (8.5e-4), 0.30 (8.5e-4) | 0.89 (2.6e-3), 0.21 (4.3e-3) | 0.90 (0.0e-0), 0.27 (3.6e-3) | 0.30 (0.0e-0), 0.52 (7.4e-3)
  Separate Lasso (oracle)            0.50 (1.3e-3), 0.46 (2.3e-3) | 0.80 (0.0e-0), 0.20 (0.0e-0) | 0.51 (3.9e-3), 0.41 (7.0e-3) | 0.20 (0.0e-0), 0.79 (1.7e-3)
  Adaptive Lasso (oracle)            0.70 (9.6e-4), 0.29 (1.4e-3) | 0.81 (2.9e-3), 0.20 (1.3e-3) | 1.00 (0.0e-0), 0.36 (2.5e-3) | 0.30 (0.0e-0), 0.27 (5.7e-3)

Poisson Response, Block Directed Graph Structure
  B-RAIL                             0.81 (8.6e-3), 0.07 (5.2e-3) | 0.80 (0.0e-0), 0.04 (5.4e-3) | 0.97 (5.7e-3), 0.09 (6.2e-3) | 0.68 (2.1e-2), 0.07 (1.2e-2)
  Lasso, global penalty (oracle)     0.70 (0.0e-0), 0.29 (1.3e-3) | 0.90 (0.0e-0), 0.23 (4.1e-3) | 0.90 (0.0e-0), 0.37 (2.0e-3) | 0.30 (0.0e-0), 0.18 (1.3e-2)
  Lasso, separate penalties (oracle) 0.70 (8.5e-4), 0.30 (8.5e-4) | 0.89 (2.6e-3), 0.21 (4.3e-3) | 0.90 (0.0e-0), 0.27 (3.6e-3) | 0.30 (0.0e-0), 0.52 (7.4e-3)
  Separate Lasso (oracle)            0.50 (1.3e-3), 0.46 (2.3e-3) | 0.80 (0.0e-0), 0.20 (0.0e-0) | 0.51 (3.9e-3), 0.41 (7.0e-3) | 0.20 (0.0e-0), 0.79 (1.7e-3)
  Adaptive Lasso (oracle)            0.70 (9.6e-4), 0.29 (1.4e-3) | 0.81 (2.9e-3), 0.20 (1.3e-3) | 1.00 (0.0e-0), 0.36 (2.5e-3) | 0.30 (0.0e-0), 0.27 (5.7e-3)

Table 2: We compare various selection methods under the block directed graph simulation design with binary responses and with Poisson responses. We report the TPR and FDP for feature recovery, averaged across 200 runs with standard errors in parentheses. We bold the best overall values for each simulation scenario.

Under each of these simulation scenarios, we evaluate the performance of B-RAIL and the oracle Lasso-type methods by reporting the true positive rate (TPR) and false discovery proportion (FDP) for overall feature recovery and individual block recoveries. Due to the large number of features, we use the FDP, defined as the number of false positives divided by the total number of recovered non-zero features, instead of the false discovery rate.
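For completeness, the two metrics are straightforward to compute from an estimated and a true support; a small helper with a hypothetical example is given below.

```python
import numpy as np

def tpr_fdp(selected, true_support):
    """True positive rate and false discovery proportion of a selected feature set."""
    selected = np.asarray(selected)
    true_support = np.asarray(true_support)
    tp = np.intersect1d(selected, true_support).size
    tpr = tp / true_support.size
    fdp = 0.0 if selected.size == 0 else (selected.size - tp) / selected.size
    return tpr, fdp

# Example: 10 true features, 12 selected of which 9 are correct.
print(tpr_fdp(selected=np.r_[0:9, 200:203], true_support=np.arange(10)))
```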

We summarize the results of our simulations with Gaussian responses in Table 1 and those with binary and Poisson responses in Table 2. Note that for the binary and Poisson responses, we show the block directed graph results here and provide the other simulation results in the Appendix. We also highlight in bold the TPR/FDP combination with the best TPR*(1-FDP) value for overall recovery. In almost all scenarios, the results in Table 1 and Table 2 indicate that B-RAIL (with no oracle information) is able to achieve a higher TPR and lower FDP than the competing Lasso-type methods with oracle information.

When oracle information is unavailable, model selection techniques can introduce additional errors and further complicate feature selection. Table 3 shows one such case and compares the block directed graph simulation performance of B-RAIL against the Lasso-type methods using 5-fold cross-validation, extended BIC, and stability selection to select the penalty parameters. We also include the oracle estimators for the same set of simulations to emphasize the large decrease in performance when the Lasso-type methods do not have oracle information. These simulations indicate that cross-validation tends to over-select the number of features in the model while extended BIC under-selects; stability selection performs the best of the three but still pales in comparison to oracle selection. B-RAIL, in contrast, outperforms the Lasso-type methods even when oracle selection is used for these competing methods. Additional simulations confirming the strong empirical performance of B-RAIL are provided in the Appendix.

Block Directed Graph Structure, oracle selection for the Lasso-type methods [each cell: TPR (SE), FDP (SE); columns: Total | Continuous | Binary | Counts]
  B-RAIL                             0.86 (1.2e-2), 0.20 (2.8e-2) | 0.78 (9.2e-3), 0.15 (3.6e-2) | 0.93 (1.9e-2), 0.16 (1.8e-2) | 0.88 (1.2e-2), 0.29 (4.3e-2)
  Lasso, global penalty (oracle)     0.72 (3.5e-3), 0.27 (4.8e-3) | 0.87 (1.1e-2), 0.20 (5.0e-3) | 1.00 (0.0e-0), 0.40 (1.6e-2) | 0.30 (0.0e-0), 0.21 (5.0e-3)
  Lasso, separate penalties (oracle) 0.77 (0.0e-0), 0.23 (0.0e-0) | 1.00 (0.0e-0), 0.24 (1.4e-2) | 1.00 (0.0e-0), 0.32 (2.7e-2) | 0.30 (0.0e-0), 0.14 (2.2e-2)
  Separate Lasso (oracle)            0.51 (5.0e-3), 0.44 (5.6e-3) | 0.79 (8.2e-3), 0.18 (1.2e-2) | 0.55 (1.1e-2), 0.41 (5.0e-3) | 0.20 (0.0e-0), 0.73 (1.1e-2)
  Adaptive Lasso (oracle)            0.70 (0.0e-0), 0.30 (2.3e-3) | 0.80 (0.0e-0), 0.20 (0.0e-0) | 1.00 (0.0e-0), 0.42 (9.2e-3) | 0.30 (0.0e-0), 0.27 (1.1e-2)

5-Fold Cross Validation
  Lasso, global penalty              0.98 (3.0e-3), 0.62 (4.1e-3) | 1.00 (0.0e-0), 0.58 (6.3e-3) | 1.00 (0.0e-0), 0.63 (4.0e-3) | 0.95 (8.9e-3), 0.63 (3.9e-3)
  Lasso, separate penalties          0.60 (2.3e-3), 0.17 (2.2e-3) | 0.80 (0.0e-0), 0.13 (3.6e-3) | 0.69 (6.8e-3), 0.07 (6.3e-3) | 0.30 (0.0e-0), 0.40 (1.0e-3)
  Separate Lasso                     0.37 (8.1e-3), 0.28 (1.5e-2) | 0.58 (9.9e-3), 0.05 (9.9e-3) | 0.54 (1.8e-2), 0.38 (2.0e-2) | 0.00 (0.0e-0), 0.47 (5.0e-2)
  Adaptive Lasso                     0.73 (3.0e-3), 0.37 (7.8e-3) | 0.87 (6.8e-3), 0.23 (6.4e-3) | 1.00 (2.0e-3), 0.39 (8.1e-3) | 0.31 (2.9e-3), 0.55 (9.5e-3)

Extended BIC
  Lasso, global penalty              0.03 (0.0e-0), 0.00 (0.0e-0) | 0.10 (0.0e-0), 0.00 (0.0e-0) | 0.00 (0.0e-0), 0.00 (0.0e-0) | 0.00 (0.0e-0), 0.00 (0.0e-0)
  Lasso, separate penalties          0.58 (2.0e-3), 0.18 (3.1e-3) | 0.80 (0.0e-0), 0.15 (5.3e-3) | 0.64 (5.9e-3), 0.04 (6.1e-3) | 0.30 (0.0e-0), 0.42 (4.0e-3)
  Separate Lasso                     0.18 (1.7e-3), 0.07 (7.9e-3) | 0.50 (1.0e-3), 0.00 (0.0e-0) | 0.04 (4.9e-3), 0.00 (0.0e-0) | 0.00 (0.0e-0), 0.47 (5.0e-2)
  Adaptive Lasso                     0.30 (0.0e-0), 0.00 (0.0e-0) | 0.50 (0.0e-0), 0.00 (0.0e-0) | 0.40 (0.0e-0), 0.00 (0.0e-0) | 0.00 (0.0e-0), 0.00 (0.0e-0)

Stability Selection
  Lasso, global penalty              0.67 (2.4e-3), 0.01 (1.8e-3) | 0.80 (0.0e-0), 0.02 (4.2e-3) | 0.92 (6.9e-3), 0.00 (9.1e-4) | 0.29 (2.4e-3), 0.00 (0.0e-0)
  Lasso, separate penalties          0.58 (2.7e-3), 0.04 (2.3e-3) | 0.80 (0.0e-0), 0.09 (4.7e-3) | 0.66 (6.5e-3), 0.00 (0.0e-0) | 0.28 (4.3e-3), 0.00 (2.5e-3)
  Separate Lasso                     0.42 (3.9e-3), 0.28 (6.6e-3) | 0.69 (2.7e-3), 0.13 (7.2e-3) | 0.55 (1.1e-2), 0.39 (8.8e-3) | 0.00 (0.0e-0), 0.29 (4.6e-2)
  Adaptive Lasso                     0.65 (1.7e-3), 0.08 (2.6e-3) | 0.80 (0.0e-0), 0.11 (0.0e-0) | 0.84 (5.1e-3), 0.07 (6.0e-3) | 0.30 (0.0e-0), 0.01 (4.3e-3)

Table 3: We compare feature recovery for B-RAIL and Lasso-type methods with various model selection methods. Here, we simulate from the block directed graph simulation design with Gaussian responses and report the TPR and FDP, averaged across 200 runs with standard errors in parentheses. We highlight the best overall values in bold. Note for stability selection, we initialize using the regularization parameter selected by CV.

5 Case Study: Integrative Genomics of Ovarian Cancer

One promising practical application for our research on multi-view feature selection lies in integrative cancer genomics. Here, scientists seek to integrate data from multiple sources of high-throughput genomic data to more holistically model the genomic systems in cancer cells, leading to a better understanding of disease mechanisms and possible therapies.

In this case study, we seek to integrate three different types of genomic data to study how epigenetics and short RNAs influence the gene regulatory system in ovarian cancer. Specifically, we are interested in discovering miRNAs and CpG sites, which affect the gene expression of well-known oncogenes in ovarian cancer and hence can serve as potential drug targets for blocking or decreasing the expression of these oncogenes. Driven by this goal of discovering potential drug targets, we use our proposed B-RAIL method to estimate the integrative ovarian cancer gene regulatory network with the specific intention of identifying miRNAs and CpG sites that are directly linked to known oncogenes of ovarian cancer.

In this investigation, we integrate the following three data sets: (1) count-valued gene expression measured via RNASeq, (2) continuous (Gaussian) miRNA expression, and (3) proportion-valued DNA methylation data from The Cancer Genome Atlas (TCGA) ovarian cancer study, which is publicly available (The Cancer Genome Atlas Research Network, 2011). The TCGA data originally contained a very large number of genes, CpG sites, and miRNAs but only a limited number of common patients across all three data sets of interest. We hence reduced the number of features to manageable sizes by first filtering features according to their association with several important clinical outcomes - survival via a univariate Cox model, chemo-resistance via a univariate logistic model, and recurrence via a univariate logistic model. In addition, we transformed the RNASeq data using the Kolmogorov-Smirnov test to alleviate the problem of very large counts (up to 20,000). This preprocessing yielded substantially reduced sets of genes, CpG sites, and miRNAs in the RNASeq, methylation, and miRNA data sets, respectively. Lastly, per the recommendation of scientists, we included 20 additional highly mutated genes, experimentally identified as important in ovarian cancer, in the RNASeq data set.

To estimate the integrated ovarian cancer network, we fit a Block Directed Markov Random Field (BDMRF) model (Yang et al., 2014a) using B-RAIL to estimate the neighborhood of each node in the graph. Note that since miRNAs and methylation are both gene regulatory mechanisms, miRNAs and methylation can affect expression levels (measured via RNASeq), but the converse is not possible. To agree with this known physical mechanism, we set the partial ordering of the mixed graph underlying the BDMRF so that the methylation and miRNA blocks precede the RNASeq block, using a pairwise Ising MRF for the proportion-valued methylation data, a pairwise Gaussian MRF for the continuous miRNA data, and a pairwise Poisson CRF for the count-valued RNASeq data. However, we recall that only negative conditional dependencies are permitted in the Poisson MRF and CRF models. Since this constraint is unrealistic for genomics data, we fit a Sub-Linear Poisson CRF, in lieu of the usual Poisson CRF, to allow for both positive and negative conditional dependencies (Yang et al., 2013). Under this specified BDMRF model, we employ node-wise neighborhood selection (Meinshausen and Bühlmann, 2006; Yang et al., 2015) using B-RAIL to learn the edge structure of the integrated network.

Figure 7: We present the integrated ovarian cancer genetic network estimated by the B-RAIL algorithm. The blue nodes denote miRNAs, green nodes denote CpG sites, red nodes denote gene expression via RNASeq, and the size of each node is proportional to the number of connected first neighbors.
Figure 8: Sub-network for BRCA1 gene
Figure 9: Sub-network for miR23b
Figure 10: We zoom in on the sub-networks for two known biomarkers, which have been previously implicated in ovarian cancer studies. Key mutated cancer biomarkers such as miR23b and BRCA1 are found to have many inter-connections to biomarkers, circled in red, that are consistent with the cancer literature. (Buchholtz et al., 2014; Obermayr et al., 2010; Giannakakis et al., 2008; Freier, 2016; Gao et al., 2009; Toyama et al., 1999; Tong et al., 2017)

Our overall BRAIL-estimated network is presented in Figure 7, and in Figure 10, we more closely examine the relationships between the oncogenes, miRNAs, and CpG sites by zooming in on the sub-networks for the well-known oncogene, BRCA1, and its direct neighbor, miRNA23b. Both BRCA1 and miRNA23b are well-known biomarkers and have been implicated in several ovarian cancer studies (Antoniou et al., 2003; King et al., 2003; BRCA, 1994; Li et al., 2014; Yan et al., 2016; Geng et al., 2012). Moreover, miRNA23b is known to play a key role in p53 signaling (via TP53) (Boren et al., 2009), agreeing with the estimated edge between the TP53 oncogene and miRNA23b in Figure 10(b).

Aside from this link however, the estimated edges between genes, CpG sites, and miRNAs in Figure 7 are largely unexplored and unknown by researchers since B-RAIL is one of the first practical approaches for multi-view feature selection. Nevertheless, we can partially validate our B-RAIL-estimated network by highlighting the many genes with verified connections in the ovarian cancer and cancer proliferation/suppression literatures. In Figure 10, we circle in red this collection of implicated genes, which includes LDOC1, SGCB, and miRNA210 (Buchholtz et al., 2014; Obermayr et al., 2010; Giannakakis et al., 2008).

As we have noted, there is substantial evidence in the scientific literature suggesting that our proposed B-RAIL algorithm successfully identified promising candidates as well as known biomarkers involved in ovarian cancer. By taking into account the relationships between genes, miRNAs, and CpG sites, our integrative analysis via B-RAIL leads to valuable insights beyond a single biomarker type and to novel discoveries of direct connections between miRNAs, CpG sites, and known oncogenes, which may aid the development of targeted drug therapies for ovarian cancer. This is the first integrative analysis of its kind, and future experiments studying the connections between known ovarian cancer oncogenes and candidate miRNAs and CpG sites would be of great value to validate our findings.

6 Discussion

Though we have primarily focused on applications to integrative genomics in this work, B-RAIL is not limited to this context. B-RAIL can be applied to any field that yields high-dimensional multi-view data, and with the rapid advances in technologies, we expect B-RAIL to have a growing and far-reaching impact in fields such as imaging genetics, national security, climate studies, spatial statistics, Internet data, marketing, and economics. B-RAIL is also a versatile tool that can be used for any sparse regression or graph selection problem in this multi-view context.

In addition to developing an effective data integration tool for multi-view feature selection, our work addresses the many difficulties of performing multi-view feature selection in practice. These practical challenges were severely under-studied prior to this work, but we partially resolve this gap, identifying four root challenges which interact with one another to impede recovery. Throughout our investigation of these practical challenges, we provide strong empirical evidence of the existence as well as the adverse consequences of such challenges. However, the theoretical underpinnings of these issues are still unknown. Understanding exactly how challenges such as shrinkage noise and the beta-min condition are influenced by varying domains and signals would be of great benefit to the field of data integration as a whole. We also highlight that while the Lasso has been well-studied under Gaussianity and idealized assumptions, the increasing abundance of correlated non-Gaussian data in multi-view settings requires a greater push for theoretical studies on feature selection with heterogeneous data and the GLM Lasso.

Overall, we have demonstrated the many challenges posed by multi-view feature selection, and in investigating these challenges, we have opened new avenues for future theoretical work to rigorously understand how the heterogeneity of multi-view data complicates feature selection. Driven by these challenges and the ineffectiveness of existing methods, we developed a practical solution, B-RAIL, which is one of the first practical tools for multi-view feature selection and is grounded in deep theoretical foundations. With its versatility and strong empirical performance, B-RAIL facilitates impactful integrative analyses across a broad spectrum of fields.

Acknowledgments

Y.B. acknowledges support from the NIH/NCI T32 CA096520 training program in Biostatistics for Cancer Research, grant number 5T32CA09652011. T.T. acknowledges support from ARO, grant number W911NF1710005. G.A. acknowledges support from NSF DMS-1554821, NSF NeuroNex-1707400, and NSF DMS-1264058. We also thank Zhandong Liu, Ying-Wooi Wan, and Matthew L. Anderson at Baylor College of Medicine for thoughtful discussions related to this work.

References

  • Acar, E., Kolda, T. G. and Dunlavy, D. M. (2011). All-at-once optimization for coupled matrix and tensor factorizations. arXiv preprint arXiv:1105.3422.
  • Allen, D. M. (1974). The relationship between variable selection and data agumentation and a method for prediction. Technometrics 16 125–127.
  • Antoniou, A., Pharoah, P. D. P., Narod, S., Risch, H. A., Eyfjord, J. E., Hopper, J. L., Loman, N., Olsson, H., Johannsson, O., Borg, Å. et al. (2003). Average risks of breast and ovarian cancer associated with BRCA1 or BRCA2 mutations detected in case series unselected for family history: a combined analysis of 22 studies. The American Journal of Human Genetics 72 1117–1130.
  • Boren, T., Xiong, Y., Hakam, A., Wenham, R., Apte, S., Chan, G., Kamath, S. G., Chen, D.-T., Dressman, H. and Lancaster, J. M. (2009). MicroRNAs and their target messenger RNAs associated with ovarian cancer response to chemotherapy. Gynecologic Oncology 113 249–255.
  • BRCA Susceptibility Gene (1994). A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science 266 7.
  • Buchholtz, M.-L., Brüning, A., Mylonas, I. and Jückstock, J. (2014). Epigenetic silencing of the LDOC1 tumor suppressor gene in ovarian cancer cells. Archives of Gynecology and Obstetrics 290 149–154.
  • Bühlmann, P. et al. (2013). Statistical significance in high-dimensional linear models. Bernoulli 19 1212–1242.
  • Bunea, F., Tsybakov, A. and Wegkamp, M. (2007). Sparsity oracle inequalities for the Lasso. Electronic Journal of Statistics 1 169–194.
  • Chen, J. and Chen, Z. (2012). Extended BIC for small-n-large-P sparse GLM. Statistica Sinica 555–574.
  • Chen, S., Witten, D. M. and Shojaie, A. (2014). Selection and estimation for mixed graphical models. Biometrika 102 47–64.
  • Cheng, J., Li, T., Levina, E. and Zhu, J. (2013). High-dimensional mixed graphical models. arXiv preprint arXiv:1304.2810.
  • Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 1348–1360.
  • Freier, C. (2016). Role of regulatory T cells and associated chemokines in human gynecological tumors. PhD thesis, LMU.
  • Gao, J.-Q., Tsuda, Y., Han, M., Xu, D.-H., Kanagawa, N., Hatanaka, Y., Tani, Y., Mizuguchi, H., Tsutsumi, Y., Mayumi, T. et al. (2009). NK cells are migrated and indispensable in the anti-tumor activity induced by CCL27 gene therapy. Cancer Immunology, Immunotherapy 58 291.
  • Geng, J., Luo, H., Pu, Y., Zhou, Z., Wu, X., Xu, W. and Yang, Z. (2012). Methylation mediated silencing of miR-23b expression and its role in glioma stem cells. Neuroscience Letters 528 185–189.
  • Giannakakis, A., Sandaltzopoulos, R., Greshock, J., Liang, S., Huang, J., Hasegawa, K., Li, C., O'Brien-Jenkins, A., Katsaros, D., Weber, B. L. et al. (2008). miR-210 links hypoxia with cell cycle regulation and is deleted in human epithelial ovarian cancer. Cancer Biology & Therapy 7 255–264.
  • Hall, D. L. and Llinas, J. (1997). An introduction to multisensor data fusion. Proceedings of the IEEE 85 6–23.
  • Haslbeck, J. and Waldorp, L. J. (2015). mgm: Structure estimation for time-varying mixed graphical models in high-dimensional data. arXiv preprint arXiv:1510.06871.
  • Jalali, A., Ravikumar, P., Vasuki, V. and Sanghavi, S. (2011). On learning discrete graphical models using group-sparse regularization. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics 378–387.
  • King, M.-C., Marks, J. H., Mandell, J. B. et al. (2003). Breast and ovarian cancer risks due to inherited mutations in BRCA1 and BRCA2. Science 302 643–646.
  • Lee, J. and Hastie, T. (2013). Structure learning of mixed graphical models. In Artificial Intelligence and Statistics 388–396.
  • Li, W., Liu, Z., Chen, L., Zhou, L. and Yao, Y. (2014). MicroRNA-23b is an independent prognostic marker and suppresses ovarian cancer progression by targeting runt-related transcription factor-2. FEBS Letters 588 1608–1615.
  • Liu, H., Roeder, K. and Wasserman, L. (2010). Stability approach to regularization selection (StARS) for high dimensional graphical models. In Advances in Neural Information Processing Systems 1432–1440.
  • Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 1436–1462.
  • Meinshausen, N. and Bühlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 417–473.
  • Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics 246–270.
  • Nelsen, R. B. (1999). Introduction. In An Introduction to Copulas 1–4. Springer.
  • The Cancer Genome Atlas Research Network (2011). Integrated genomic analyses of ovarian carcinoma. Nature 474 609.
  • Obermayr, E., Sanchez-Cabo, F., Tea, M.-K. M., Singer, C. F., Krainer, M., Fischer, M. B., Sehouli, J., Reinthaller, A., Horvat, R., Heinze, G. et al. (2010). Assessment of a six gene panel for the molecular detection of circulating tumor cells in the blood of female cancer patients. BMC Cancer 10 666.
  • Ravikumar, P., Wainwright, M. J., Lafferty, J. D. et al. (2010). High-dimensional Ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics 38 1287–1319.
  • Shao, J. (1993). Linear model selection by cross-validation. Journal of the American Statistical Association 88 486–494.
  • Shen, R., Olshen, A. B. and Ladanyi, M. (2009). Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 25 2906–2912.
  • Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological) 36 111–133.
  • Su, W., Bogdan, M. and Candes, E. (2015). False discoveries occur early on the lasso path. arXiv preprint arXiv:1511.01957.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 267–288.
  • Tibshirani, R. J. et al. (2013). The lasso problem and uniqueness. Electronic Journal of Statistics 7 1456–1490.
  • Tong, M., Wong, T. L., Luk, S. T.-C., Che, N., Wong, K. Y., Fung, T. M., Guan, X.-Y., Lee, N. P., Yuan, Y.-F., Lee, T. K. et al. (2017). Down-regulation of 4-hydroxyphenylpyruvate dioxygenase (HPD) contributes to the pathogenesis of hepatocellular carcinoma (HCC) through ERK/BCL-2 signalling activation.
  • Toyama, T., Iwase, H., Watson, P., Muzik, H., Saettler, E., Magliocco, A., DiFrancesco, L., Forsyth, P., Garkavtsev, I., Kobayashi, S. et al. (1999). Suppression of ING1 expression in sporadic breast cancer. Oncogene 18.
  • Wainwright, M. J. (2009). Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory 55 2183–2202.
  • Wang, W., Wainwright, M. J. and Ramchandran, K. (2010). Information-theoretic limits on sparse signal recovery: Dense versus sparse measurement matrices. IEEE Transactions on Information Theory 56 2967–2979.
  • Yan, J., Jiang, J.-Y., Meng, X.-N., Xiu, Y.-L. and Zong, Z.-H. (2016). MiR-23b targets cyclin G1 and suppresses ovarian cancer tumorigenesis and progression. Journal of Experimental & Clinical Cancer Research 35 31.
  • Yang, E., Allen, G., Liu, Z. and Ravikumar, P. K. (2012). Graphical models via generalized linear models. In Advances in Neural Information Processing Systems 1358–1366.
  • Yang, E., Ravikumar, P. K., Allen, G. I. and Liu, Z. (2013). On Poisson graphical models. In Advances in Neural Information Processing Systems 1718–1726.
  • Yang, E., Ravikumar, P., Allen, G. I., Baker, Y., Wan, Y.-W. and Liu, Z. (2014a). A general framework for mixed graphical models. arXiv preprint arXiv:1411.0288.
  • Yang, E., Baker, Y., Ravikumar, P., Allen, G. and Liu, Z. (2014b). Mixed graphical models via exponential families. In Artificial Intelligence and Statistics 1042–1050.
  • Yang, E., Ravikumar, P., Allen, G. I. and Liu, Z. (2015). Graphical models via univariate exponential family distributions. Journal of Machine Learning Research 16 3813–3847.
  • Yu, B. (2013). Stability. Bernoulli 19 1484–1500. doi:10.3150/13-BEJSP14.
  • Yuan, M. and Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35.
  • Zhang, C.-H. et al. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics 38 894–942.
  • Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics 1567–1594.
  • Zhao, P. and Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research 7 2541–2563.
  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101 1418–1429.

Appendix A Additional Simulations

We provide the following figures and tables to supplement our simulations and to further support the strong empirical performance of B-RAIL.

Figure 10: MGM with EBIC
Figure 11: MGM with CV
Figure 12: The B-RAIL algorithm
Figure 13: We compare three different graph selection methods (B-RAIL and two methods from the mgm R package) when applied to real ovarian cancer genomics data. The data consists of three blocks: RNASeq (red), miRNA (blue), and methylation (green).

To augment the motivating example in Figure 3, we provide comparisons of B-RAIL to two other mixed graphical model selection methods from the mgm R package (Haslbeck and Waldorp, 2015) in Figure 13. mgm takes a node-wise neighborhood estimation approach: for each node, it selects the Lasso regularization parameter using either the extended BIC or CV, fits a penalized GLM, and applies additional thresholding to the estimated coefficients to remove noise. Here, we used the "AND" rule to combine the estimated neighborhoods for all three graphs. (Note that we had to convert the proportion-valued methylation data into 0-1 binary values to comply with mgm package restrictions.)
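To make the node-wise strategy concrete, the following is a minimal Python sketch of neighborhood selection with an "AND" rule on a toy continuous example. It is not the mgm implementation: LassoCV stands in for the per-node penalized GLM fits, and mgm additionally uses type-specific GLMs, EBIC/CV tuning, and coefficient thresholding.

```python
# Hypothetical sketch of node-wise neighborhood selection with the "AND" rule.
# LassoCV is a stand-in for mgm's per-node penalized GLM fits.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
X[:, 1] += 0.8 * X[:, 0]          # induce one true edge between nodes 0 and 1

adj = np.zeros((p, p), dtype=bool)
for j in range(p):                 # regress each node on all other nodes
    others = np.delete(np.arange(p), j)
    fit = LassoCV(cv=5).fit(X[:, others], X[:, j])
    adj[j, others] = np.abs(fit.coef_) > 1e-8

# "AND" rule: keep an edge only if both node-wise neighborhoods select it
edges_and = np.logical_and(adj, adj.T)
print(np.argwhere(np.triu(edges_and, k=1)))
```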

From Figure 13, we see that mgm with the extended BIC selection criterion tends to under-select features, while mgm with CV often over-selects. This agrees with our simulations and discussion of the Lasso-type model selection biases in Section 4. We also observe that, like the Lasso-type methods in Figure 3, mgm with CV and with the extended BIC can result in imbalanced selection between the blocks.

Figure 14: Gaussian response, OV data
Figure 15: Binary response, OV data
Figure 16: Poisson response, OV data
Figure 17: For each block, we report signal recovery across iterations of the B-RAIL algorithm. The solid lines indicate the average feature recovery over 100 runs, and the faint dashed lines in the background represent individual runs. We represent the total number of selected features in green and the number of true positives in blue. The red horizontal line indicates the number of true features in each block, and the vertical black line represents the average number of iterations until B-RAIL converges over the 100 runs.

We next investigate the convergence of the B-RAIL algorithm. Our empirical analysis indicates that B-RAIL exhibits quick support convergence in all simulation scenarios. We demonstrate this convergence for one type of simulation in Figure 17. Here, we simulate data using predictors from the TCGA ovarian cancer data (see Section 5) for three types of responses: (a) continuous, (b) binary, and (c) counts. We report the number of iterations over 100 runs of the B-RAIL algorithm and denote the average number of iterations until convergence by the vertical black line. We see that the average number of iterations is between 4 and 5, with the maximum number of iterations reaching 15. These ranges were similar for all simulation designs, confirming the relatively fast convergence of B-RAIL. Furthermore, we also show the true positive rates and the total number of selected features in Figure 17 to highlight B-RAIL's convergence to a relatively accurate solution.
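As a rough illustration of the stopping behavior behind this plot, the sketch below (hypothetical, not the B-RAIL implementation) iterates a generic block-wise selection step until the selected support stops changing or an iteration cap is reached; `select_block` is a placeholder for the per-block randomized, adaptively weighted Lasso step.

```python
# Hypothetical sketch of a support-convergence stopping rule for an
# iterative block-wise selector. `select_block(block, current_supports)`
# is a placeholder returning the new support (feature indices) of one block.
def run_until_support_converges(block_names, select_block, max_iter=15):
    support = {b: frozenset() for b in block_names}
    for it in range(1, max_iter + 1):
        new_support = {b: frozenset(select_block(b, support)) for b in block_names}
        if new_support == support:   # selected features stopped changing
            return support, it
        support = new_support
    return support, max_iter

# Toy usage: a "selector" that immediately settles on a fixed support.
result = run_until_support_converges(
    ["continuous", "binary", "counts"],
    lambda block, current: {0, 1, 2},
)
print(result)   # converges on the second pass
```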

Figure 18: iid case
Figure 19: iid case
Figure 20: Non-constant variance
Figure 21: Block directed graph structure
Figure 22: OV data
Figure 23: We compare various selection methods under 5 different simulation scenarios. For each scenario, we simulate with three blocks (continuous, binary, counts) and a Gaussian response. We report the TPR and FDP for overall feature recovery and individual block recoveries across 200 runs. Note that we used oracle information for the Lasso-type methods.

Figure 23 duplicates the information in Table 1 but using boxplots for better visualization. We recall that these simulations compared B-RAIL and various Lasso-type methods (using oracle information) under four simulation designs with Gaussian responses (see Section 4 for further details). In almost all of these simulations, B-RAIL is able to achieve a higher TPR while maintaining a low FDP.
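Throughout these figures and tables, the TPR and FDP are computed from the selected and true supports in the usual way, overall and within each block. A minimal sketch of this computation (with made-up index sets) is:

```python
# Minimal sketch of the TPR/FDP computation used in the recovery summaries;
# the block index ranges and supports below are made up for illustration.
def tpr_fdp(selected, truth):
    selected, truth = set(selected), set(truth)
    tp = len(selected & truth)
    tpr = tp / len(truth) if truth else 0.0
    fdp = (len(selected) - tp) / len(selected) if selected else 0.0
    return tpr, fdp

blocks = {"continuous": range(0, 100), "binary": range(100, 200)}
selected = {3, 7, 150, 199}
truth = {3, 7, 8, 150}
print(tpr_fdp(selected, truth))                       # overall recovery
for name, idx in blocks.items():                      # per-block recovery
    print(name, tpr_fdp(selected & set(idx), truth & set(idx)))
```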

Figure 24: First non-uniform setting of block sizes and sparsity levels
Figure 25: Second non-uniform setting of block sizes and sparsity levels
Figure 26: We compare various selection methods for non-uniform choices of the block sizes and sparsity levels under the iid simulation design. Here, we simulated with three blocks (continuous, binary, counts) and a Gaussian response. Note that there were 200 runs and that we used oracle information for the Lasso-type methods.

We next verify that the above simulation results are not heavily dependent on our choices of the block sizes and sparsity levels. In Figure 26, we ran the iid simulation design with Gaussian responses for several non-uniform configurations of block sizes and sparsity levels. These results show that B-RAIL can successfully recover features from unequally sized blocks with different amounts of sparsity, while other methods may struggle to account for the biases introduced by the differing block sizes and sparsity levels.

iid Case, Poisson Response (each entry: TPR / FDP, standard errors in parentheses)
  B-RAIL:                  Total 0.86 (6.3e-3) / 0.10 (4.7e-3);  Continuous 0.92 (3.8e-3) / 0.15 (8.2e-3);  Binary 0.90 (6.2e-3) / 0.00 (0.0e-0);  Counts 0.77 (1.1e-2) / 0.13 (6.4e-3)
  Lasso - (oracle):        Total 0.67 (3.3e-4) / 0.33 (5.4e-4);  Continuous 0.90 (1.0e-3) / 0.18 (1.8e-4);  Binary 0.50 (0.0e-0) / 0.28 (1.2e-3);  Counts 0.60 (0.0e-0) / 0.50 (9.0e-4)
  Lasso - (oracle):        Total 0.74 (1.6e-3) / 0.26 (1.6e-3);  Continuous 0.93 (4.6e-3) / 0.21 (3.5e-3);  Binary 0.70 (4.9e-3) / 0.29 (5.9e-3);  Counts 0.60 (1.4e-3) / 0.26 (3.1e-3)
  Separate Lasso (oracle): Total 0.55 (1.6e-3) / 0.43 (2.2e-3);  Continuous 0.70 (0.0e-0) / 0.30 (7.8e-4);  Binary 0.44 (4.8e-3) / 0.51 (5.9e-3);  Counts 0.50 (0.0e-0) / 0.48 (2.7e-3)
  Adaptive Lasso (oracle): Total 0.65 (2.2e-3) / 0.34 (1.8e-3);  Continuous 0.77 (4.5e-3) / 0.29 (2.5e-3);  Binary 0.47 (4.6e-3) / 0.29 (6.1e-3);  Counts 0.70 (0.0e-0) / 0.41 (3.6e-3)

Block Directed Graph Structure, Poisson Response
  B-RAIL:                  Total 0.81 (8.6e-3) / 0.07 (5.2e-3);  Continuous 0.80 (0.0e-0) / 0.04 (5.4e-3);  Binary 0.97 (5.7e-3) / 0.09 (6.2e-3);  Counts 0.68 (2.1e-2) / 0.07 (1.2e-2)
  Lasso - (oracle):        Total 0.70 (0.0e-0) / 0.29 (1.3e-3);  Continuous 0.90 (0.0e-0) / 0.23 (4.1e-3);  Binary 0.90 (0.0e-0) / 0.37 (2.0e-3);  Counts 0.30 (0.0e-0) / 0.18 (1.3e-2)
  Lasso - (oracle):        Total 0.70 (8.5e-4) / 0.30 (8.5e-4);  Continuous 0.89 (2.6e-3) / 0.21 (4.3e-3);  Binary 0.90 (0.0e-0) / 0.27 (3.6e-3);  Counts 0.30 (0.0e-0) / 0.52 (7.4e-3)
  Separate Lasso (oracle): Total 0.50 (1.3e-3) / 0.46 (2.3e-3);  Continuous 0.80 (0.0e-0) / 0.20 (0.0e-0);  Binary 0.51 (3.9e-3) / 0.41 (7.0e-3);  Counts 0.20 (0.0e-0) / 0.79 (1.7e-3)
  Adaptive Lasso (oracle): Total 0.70 (9.6e-4) / 0.29 (1.4e-3);  Continuous 0.81 (2.9e-3) / 0.20 (1.3e-3);  Binary 1.00 (0.0e-0) / 0.36 (2.5e-3);  Counts 0.30 (0.0e-0) / 0.27 (5.7e-3)

OV Data, Poisson Response (blocks: Continuous, Proportion, Counts)
  B-RAIL:                  Total 0.99 (2.1e-3) / 0.05 (3.3e-3);  Continuous 0.98 (6.4e-3) / 0.11 (5.3e-3);  Proportion 1.00 (0.0e-0) / 0.01 (2.2e-3);  Counts 1.00 (0.0e-0) / 0.03 (4.3e-3)
  Lasso - (oracle):        Total 0.51 (1.9e-3) / 0.46 (1.9e-3);  Continuous 0.54 (5.6e-3) / 0.45 (3.9e-3);  Proportion 0.60 (0.0e-0) / 0.31 (3.8e-3);  Counts 0.40 (0.0e-0) / 0.61 (2.0e-3)
  Lasso - (oracle):        Total 0.60 (1.1e-3) / 0.40 (1.1e-3);  Continuous 0.83 (4.8e-3) / 0.46 (2.0e-3);  Proportion 0.61 (4.9e-3) / 0.21 (8.0e-3);  Counts 0.36 (7.9e-3) / 0.43 (1.5e-2)
  Separate Lasso (oracle): Total 0.38 (2.2e-3) / 0.60 (1.9e-3);  Continuous 0.43 (4.6e-3) / 0.53 (4.4e-3);  Proportion 0.32 (3.8e-3) / 0.65 (3.2e-3);  Counts 0.40 (1.0e-3) / 0.60 (6.7e-4)
  Adaptive Lasso (oracle): Total 0.93 (1.5e-3) / 0.07 (1.2e-3);  Continuous 0.80 (0.0e-0) / 0.12 (3.2e-3);  Proportion 1.00 (1.0e-3) / 0.00 (0.0e-0);  Counts 0.98 (4.2e-3) / 0.09 (3.8e-4)

Table 4: We compare various selection methods under 3 different simulation scenarios. For each scenario, we simulate with three blocks (continuous, binary, counts) and a Poisson response. We report the TPR and FDP for overall feature recovery and individual block recoveries, averaged across 200 runs with standard errors in parentheses. We bold the best overall values for each simulation scenario.

iid Case, Binary Response (each entry: TPR / FDP, standard errors in parentheses)
  B-RAIL:                  Total 0.83 (9.4e-3) / 0.11 (1.0e-2);  Continuous 0.92 (6.9e-3) / 0.09 (1.0e-2);  Binary 0.73 (1.5e-2) / 0.12 (1.3e-2);  Counts 0.82 (1.0e-2) / 0.11 (1.2e-2)
  Lasso - (oracle):        Total 0.65 (2.9e-3) / 0.34 (3.0e-3);  Continuous 0.80 (2.0e-3) / 0.15 (4.9e-3);  Binary 0.54 (8.3e-3) / 0.42 (6.9e-3);  Counts 0.60 (0.0e-0) / 0.43 (3.9e-3)
  Lasso - (oracle):        Total 0.72 (3.7e-3) / 0.31 (2.8e-3);  Continuous 0.87 (5.6e-3) / 0.18 (8.0e-3);  Binary 0.69 (9.8e-3) / 0.37 (7.6e-3);  Counts 0.61 (2.6e-3) / 0.37 (6.0e-3)
  Separate Lasso (oracle): Total 0.51 (1.5e-3) / 0.47 (2.3e-3);  Continuous 0.60 (0.0e-0) / 0.39 (2.8e-3);  Binary 0.43 (4.4e-3) / 0.55 (5.0e-3);  Counts 0.50 (0.0e-0) / 0.47 (4.3e-3)
  Adaptive Lasso (oracle): Total 0.60 (3.0e-3) / 0.39 (2.7e-3);  Continuous 0.91 (3.0e-3) / 0.50 (3.7e-3);  Binary 0.39 (7.8e-3) / 0.05 (9.6e-3);  Counts 0.49 (9.9e-3) / 0.31 (6.2e-3)

Block Directed Graph Structure, Binary Response
  B-RAIL:                  Total 0.75 (3.5e-3) / 0.15 (5.0e-3);  Continuous 0.90 (0.0e-0) / 0.01 (3.6e-3);  Binary 0.83 (8.9e-3) / 0.23 (7.2e-3);  Counts 0.51 (3.8e-3) / 0.19 (1.1e-2)
  Lasso - (oracle):        Total 0.68 (1.9e-3) / 0.30 (1.8e-3);  Continuous 0.80 (0.0e-0) / 0.20 (5.7e-3);  Binary 0.95 (5.6e-3) / 0.30 (3.9e-3);  Counts 0.30 (0.0e-0) / 0.48 (5.7e-3)
  Lasso - (oracle):        Total 0.69 (1.8e-3) / 0.31 (1.9e-3);  Continuous 0.80 (1.4e-3) / 0.12 (3.8e-3);  Binary 0.96 (5.2e-3) / 0.22 (6.1e-3);  Counts 0.30 (0.0e-0) / 0.64 (6.6e-3)
  Separate Lasso (oracle): Total 0.49 (2.9e-3) / 0.49 (2.6e-3);  Continuous 0.74 (5.9e-3) / 0.20 (5.7e-3);  Binary 0.51 (6.8e-3) / 0.45 (6.3e-3);  Counts 0.21 (3.3e-3) / 0.78 (3.2e-3)
  Adaptive Lasso (oracle): Total 0.68 (2.3e-3) / 0.31 (2.7e-3);  Continuous 0.96 (5.2e-3) / 0.38 (4.5e-3);  Binary 0.98 (3.7e-3) / 0.21 (6.4e-3);  Counts 0.09 (2.9e-3) / 0.24 (2.9e-2)

OV Data, Binary Response (blocks: Continuous, Proportion, Counts)
  B-RAIL:                  Total 0.81 (3.5e-3) / 0.11 (4.9e-3);  Continuous 0.64 (7.8e-3) / 0.21 (8.5e-3);  Proportion 1.00 (0.0e-0) / 0.05 (5.9e-3);  Counts 0.80 (6.1e-3) / 0.09 (8.5e-3)
  Lasso - (oracle):        Total 0.59 (2.9e-3) / 0.39 (2.8e-3);  Continuous 0.80 (0.0e-0) / 0.35 (2.8e-3);  Proportion 0.49 (7.2e-3) / 0.33 (7.6e-3);  Counts 0.49 (5.7e-3) / 0.49 (5.0e-3)
  Lasso - (oracle):        Total 0.63 (2.2e-3) / 0.37 (2.2e-3);  Continuous 0.80 (0.0e-0) / 0.39 (3.8e-3);  Proportion 0.71 (7.1e-3) / 0.40 (5.3e-3);  Counts 0.36 (5.6e-3) / 0.19 (1.6e-2)
  Separate Lasso (oracle): Total 0.44 (2.1e-3) / 0.54 (1.9e-3);  Continuous 0.59 (3.1e-3) / 0.40 (3.4e-3);  Proportion 0.34 (5.0e-3) / 0.63 (4.3e-3);  Counts 0.40 (0.0e-0) / 0.60 (1.3e-3)
  Adaptive Lasso (oracle): Total 0.83 (2.9e-3) / 0.16 (2.6e-3);  Continuous 0.79 (3.8e-3) / 0.22 (3.8e-3);  Proportion 0.99 (3.0e-3) / 0.01 (2.6e-3);  Counts 0.72 (6.3e-3) / 0.24 (6.9e-3)

Table 5: We compare various selection methods under 3 different simulation scenarios. For each scenario, we simulate with three blocks (continuous, binary, counts) and a binary response. We report the TPR and FDP for overall feature recovery and individual block recoveries, averaged across 200 runs with standard errors in parentheses. We bold the best overall values for each simulation scenario.

For Poisson and binary responses, Tables 4 and 5 provide the results from additional simulation designs to supplement Table 2. Here, the response and predictors were simulated according to the description in Section 4, with a fixed sample size, block dimension, and sparsity level for each block.
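For readers who wish to reproduce a design of this flavor, the following is a hedged sketch of one simple way to generate a three-block mixed design with Gaussian, binary, or Poisson responses through standard GLM links; the exact correlation structure, scaling, and sparsity pattern of Section 4 may differ, and the sizes below are illustrative.

```python
# Hedged sketch of a mixed three-block design with GLM responses.
# Block sizes, sparsity, and signal strength are illustrative only.
import numpy as np

rng = np.random.default_rng(1)
n, p_block, s_block = 200, 50, 5

X_cont  = rng.standard_normal((n, p_block))                    # continuous block
X_bin   = rng.binomial(1, 0.5, size=(n, p_block)).astype(float) # binary block
X_count = rng.poisson(2.0, size=(n, p_block)).astype(float)     # count block
X = np.hstack([X_cont, X_bin, X_count])

beta = np.zeros(3 * p_block)
for b in range(3):                      # s_block true features in each block
    beta[b * p_block: b * p_block + s_block] = 0.5

eta = X @ beta
y_gauss   = eta + rng.standard_normal(n)                 # Gaussian response
y_binary  = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # logistic link
y_poisson = rng.poisson(np.exp(np.clip(eta, None, 5)))   # log link (clipped)
```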

For easier visualization, we duplicate the results of Tables 4 and 5 using boxplots in Figure 33. As shown in the plots, B-RAIL achieves a higher TPR while maintaining a low FDP across the various simulations for both binary and Poisson responses.

Figure 27: Binary response, iid case
Figure 28: Poisson response, iid case
Figure 29: Binary response, Block directed graph structure
Figure 30: Poisson response, Block directed graph structure
Figure 31: Binary response, OV data
Figure 32: Poisson response, OV data
Figure 33: We compare various selection methods under 3 different simulation scenarios. For each scenario, we simulate with three blocks (continuous, binary, counts) and either a binary response (left column of subplots) or Poisson response (right column of subplots). We report the TPR and FDP for overall feature recovery and individual block recoveries across 200 runs. Note that we used oracle information for the Lasso-type methods.

Gaussian response (blocks: Continuous, Binary, Counts; each entry: TPR / FDP, standard errors in parentheses)
  MCP (oracle):  Total 1.00 (5.2e-4) / 0.00 (5.2e-4);  Continuous 1.00 (5.0e-4) / 0.00 (0.0e-0);  Binary 1.00 (0.0e-0) / 0.00 (4.5e-4);  Counts 1.00 (1.3e-3) / 0.00 (1.3e-3)
  SCAD (oracle): Total 0.77 (5.2e-3) / 0.23 (5.2e-3);  Continuous 0.92 (2.8e-3) / 0.18 (4.7e-3);  Binary 1.00 (0.0e-0) / 0.17 (5.1e-3);  Counts 0.38 (1.4e-2) / 0.43 (1.1e-2)
  B-RAIL:        Total 0.88 (4.8e-3) / 0.16 (1.1e-2);  Continuous 0.79 (4.8e-3) / 0.12 (1.4e-2);  Binary 0.96 (6.8e-3) / 0.12 (7.0e-3);  Counts 0.89 (5.2e-3) / 0.22 (1.4e-2)

Binary response
  MCP (oracle):  Total 0.49 (2.7e-3) / 0.47 (2.7e-3);  Continuous 0.65 (4.1e-3) / 0.27 (9.7e-3);  Binary 0.56 (5.2e-3) / 0.55 (5.6e-3);  Counts 0.26 (3.7e-3) / 0.52 (1.6e-2)
  SCAD (oracle): Total 0.50 (6.7e-3) / 0.50 (6.8e-3);  Continuous 0.58 (8.1e-3) / 0.49 (4.4e-3);  Binary 0.61 (1.3e-2) / 0.53 (9.5e-3);  Counts 0.30 (2.7e-3) / 0.40 (1.4e-2)
  B-RAIL:        Total 0.75 (3.5e-3) / 0.15 (5.0e-3);  Continuous 0.90 (0.0e-0) / 0.01 (3.6e-3);  Binary 0.83 (8.9e-3) / 0.23 (7.2e-3);  Counts 0.51 (3.8e-3) / 0.19 (1.1e-2)

Poisson response
  MCP (oracle):  Total 0.86 (1.3e-3) / 0.13 (1.5e-3);  Continuous 0.90 (1.6e-3) / 0.08 (5.3e-3);  Binary 1.00 (0.0e-0) / 0.18 (4.1e-3);  Counts 0.70 (3.6e-3) / 0.07 (6.1e-3)
  SCAD (oracle): Total 0.65 (1.9e-3) / 0.34 (2.0e-3);  Continuous 0.81 (1.8e-3) / 0.28 (3.7e-3);  Binary 0.84 (4.5e-3) / 0.25 (4.1e-3);  Counts 0.30 (1.5e-3) / 0.56 (5.0e-3)
  B-RAIL:        Total 0.81 (8.6e-3) / 0.07 (5.2e-3);  Continuous 0.80 (0.0e-0) / 0.04 (5.4e-3);  Binary 0.97 (5.7e-3) / 0.09 (6.2e-3);  Counts 0.68 (2.1e-2) / 0.07 (1.2e-2)

Table 6: We compare B-RAIL to MCP and SCAD for three types of responses and three types of covariates (continuous, counts, and binary) under the block directed graph simulation design, with a fixed sample size, block dimension, and sparsity level for each block. Oracle model selection was used for the MCP and SCAD penalties; for B-RAIL, we used stability selection as outlined in Section 3. We report the TPR and FDP for overall recovery and for each block separately across 200 runs. For the binary and Poisson responses, MCP and SCAD do not perform as well as B-RAIL.

In Table 6, we compare B-RAIL to the non-convex penalties, MCP and SCAD, under the block directed graph simulation design with three different types of responses. We see that MCP performs well with Gaussian responses, and in fact, the MCP penalty can be used instead of a Lasso penalty in the B-RAIL algorithm for Gaussian responses. However, the B-RAIL algorithm outperforms MCP and SCAD for non-Gaussian responses. We thus chose to use a Lasso penalty when introducing the B-RAIL algorithm for consistency.
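For reference, the MCP penalty of Zhang et al. (2010) that could be substituted for the ℓ1 penalty in this Gaussian case is, in its standard form with regularization level λ and concavity parameter γ > 1, P_{λ,γ}(t) = λ|t| − t²/(2γ) for |t| ≤ γλ and P_{λ,γ}(t) = γλ²/2 for |t| > γλ; as γ → ∞ it recovers the Lasso penalty, while smaller γ reduces the shrinkage bias on large coefficients.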

Figure 34: 5-fold cross-validation
Figure 35: Extended BIC
Figure 36: Stability selection
Figure 37: Oracle selection
Figure 38: We compare feature recovery for B-RAIL and Lasso-type selection methods using oracle information, 5-fold CV, extended BIC, and stability selection to select the regularization parameters. Here, we simulate from the block directed graph simulation design with Gaussian responses. We report the TPR and FDP for overall feature recovery and individual block recoveries across 200 runs.

Figure 38 provides the same information as Table 3 but uses boxplots for easier visualization. While the data-driven model selection techniques (i.e. CV, extended BIC, and stability selection) yield lower true positive rates than their oracle selection counterparts, B-RAIL outperforms even the oracle selection methods and achieves the highest TPR.
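As a reminder of how the stability selection baseline operates, the sketch below is a generic implementation in the spirit of Meinshausen and Bühlmann (2010): the data are repeatedly subsampled, a Lasso is fit on each subsample, and features whose selection frequency exceeds a threshold are retained. The regularization level, subsample fraction, and threshold shown here are illustrative defaults, and the actual scheme used inside B-RAIL (Section 3) additionally involves per-block GLM fits and adaptive weights.

```python
# Generic stability selection sketch (subsampling + selection-frequency
# thresholding); alpha, the subsample fraction, and tau are illustrative.
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alpha=0.1, n_subsamples=100, frac=0.5,
                        tau=0.6, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        freq += np.abs(coef) > 1e-8          # count selections per feature
    freq /= n_subsamples
    return np.where(freq >= tau)[0], freq    # stable features and frequencies
```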