Context-dependent feature analysis with random forests

05/12/2016 ∙ by Antonio Sutera, et al.

Feature selection is often more complicated than identifying a single subset of input variables that would together explain the output. There may be interactions that depend on contextual information, i.e., variables that turn out to be relevant only in some specific circumstances. In this setting, the contribution of this paper is to extend the random forest variable importances framework in order (i) to identify variables whose relevance is context-dependent and (ii) to characterize as precisely as possible the effect of contextual information on these variables. The usage and the relevance of our framework for highlighting context-dependent variables are illustrated on both artificial and real datasets.

1 Motivation

Supervised learning finds applications in many domains such as medicine, economics, computer vision, or bioinformatics. Given a sample of observations of several inputs and one output variable, the goal of supervised learning is to learn a model for predicting the value of the output variable given any values of the input variables. Another common side objective of supervised learning is to bring as much insight as possible about the relationship between the inputs and the output variable. One of the simplest ways to gain such insight is through the use of feature selection or ranking methods that identify the input variables that are the most decisive or relevant for predicting the output, either alone or in combination with other variables. Among feature selection/ranking methods, variable importance scores derived from random forest models stand out in the literature, mainly because of their multivariate and nonparametric nature and their reasonable computational cost. Although very useful, feature selection/ranking methods provide only limited information about the often very complex input-output relationships that can be modeled by supervised learning methods. There is thus a high interest in designing new techniques to extract more complete information about input-output relationships than a single global feature subset or feature ranking.

In this paper, we specifically address the problem of the identification of the input variables whose relevance or irrelevance for predicting the output only holds in specific circumstances, where these circumstances are assumed to be encoded by a specific context variable. This context variable can be for example a standard input variable, in which case, the goal of contextual analyses is to better understand how this variable interacts with the other inputs for predicting the output. The context can also be an external variable that does not belong to the original inputs but that may nevertheless affect their relevance with respect to the output. Practical applications of such contextual analyses are numerous. E.g., one may be interested in finding variables that are both relevant and independent of the context, as in medical studies (see, e.g., Geissler et al., 2000), where one is often interested in finding risk factors that are as independent as possible of external factors, such as the sex of the patients, their origins or their data cohort. By contrast, in some other cases, one may be interested in finding variables that are relevant but dependent in some way on the context. For example, in systems biology, differential analysis (Ideker and Krogan, 2012) aims at discovering genes or factors that are relevant only in some specific conditions, tissues, species or environments.

Our contribution in this paper is two-fold. First, starting from common definitions of feature relevance, we propose a formal definition of context-dependent variables and provide a complete characterization of these variables depending on how their relevance is affected by the context variable. Second, we extend the random forest variable importances framework in order to identify and characterize variables whose relevance is context-dependent or context-independent. Building on existing theoretical results for standard importance scores, we propose asymptotic guarantees for the resulting new measures.

The paper is structured as follows. In Section 2, we first lay out our formal framework defining context-dependent variables and describing how the context may change their relevance. We describe in Section 3 how random forest variable importances can be used for identifying context-dependent variables and how the effect of contextual information on these variables can be highlighted. Our results are then illustrated in Section 4 on representative problems. Finally, conclusions and directions for future work are discussed in Section 5.

2 Context-dependent feature selection and characterization

Context-dependence.

Let us consider a set $V = \{X_1, \dots, X_p\}$ of $p$ input variables and an output $Y$, and let us denote by $V^{-m}$ the set $V \setminus \{X_m\}$. All input and output variables are assumed to be categorical, not necessarily binary (non-categorical outputs are discussed in Section 3.5). The standard definitions of relevant, irrelevant, and marginally relevant variables based on their mutual information $I$ are as follows (Kohavi and John, 1997; Guyon and Elisseeff, 2003):

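  • A variable $X_m$ is relevant to $Y$ with respect to $V$ iff there exists a subset $B \subseteq V^{-m}$ (possibly empty) such that $I(Y; X_m \mid B) > 0$.

  • A variable $X_m$ is irrelevant to $Y$ with respect to $V$ iff, for all $B \subseteq V^{-m}$, $I(Y; X_m \mid B) = 0$.

  • A variable $X_m$ is marginally relevant to $Y$ iff $I(Y; X_m) > 0$.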
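The three definitions above all reduce to evaluating (conditional) mutual informations. As a minimal sketch, assuming an equiprobable table of samples and using hypothetical helper names (`entropy`, `cond_mutual_info`) that are not part of the paper, the quantities can be computed from joint entropies. The toy output `Y = X1 XOR X2` is an illustrative assumption: it shows a variable that is relevant without being marginally relevant.

```python
from collections import Counter
from math import log2

def entropy(rows, cols):
    """Shannon entropy (bits) of the joint empirical distribution of `cols`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def cond_mutual_info(rows, y, x, given=()):
    """I(Y; X | B) = H(Y,B) + H(X,B) - H(Y,X,B) - H(B), from row frequencies."""
    b = list(given)
    return (entropy(rows, [y] + b) + entropy(rows, [x] + b)
            - entropy(rows, [y, x] + b) - entropy(rows, b))

# Toy data (an assumption for illustration): columns are (X1, X2, X3, Y)
# with Y = X1 XOR X2, and X3 carrying no information about Y.
rows = [(x1, x2, x3, x1 ^ x2) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
Y = 3  # column index of the output

print("I(Y; X1)          =", cond_mutual_info(rows, Y, 0))
print("I(Y; X1 | X2)     =", cond_mutual_info(rows, Y, 0, [1]))
print("I(Y; X3 | X1, X2) =", cond_mutual_info(rows, Y, 2, [0, 1]))
```

In this toy problem, X1 is not marginally relevant but becomes relevant once X2 is in the conditioning set, which is why the definition of relevance quantifies over all subsets B.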

Let us now assume the existence of an additional (observed) context variable $X_c$, also assumed to be categorical. Inspired by the notion of relevant and irrelevant variables, we propose to define context-dependent and context-independent variables as follows:

Definition 1.

A variable $X_m \in V$ is context-dependent to $Y$ with respect to $X_c$ iff there exists a subset $B \subseteq V^{-m}$ and some values $x_c$ and $b$ such that:

$$I(Y; X_m \mid B = b, X_c = x_c) \neq I(Y; X_m \mid B = b). \tag{1}$$

In this definition and all definitions that follow, we assume that the events on which we are conditioning have a non-zero probability and that if no such event exists then the condition of the definition is not satisfied.

Definition 2.

A variable $X_m \in V$ is context-independent to $Y$ with respect to $X_c$ iff for all subsets $B \subseteq V^{-m}$ and for all values $x_c$ and $b$, we have:

$$I(Y; X_m \mid B = b, X_c = x_c) = I(Y; X_m \mid B = b). \tag{2}$$

Context-dependent variables are thus the variables for which there exists a conditioning set $B$ in which the information they bring about the output is modified by the context variable. Context-independent variables are the variables that, in all conditionings $B = b$, bring the same amount of information about the output whether the value of the context is known or not. This definition is meant to be as general as possible. Other more specific definitions of context-dependence are as follows:

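$$\exists B, b, x_c^1, x_c^2 : \; I(Y; X_m \mid X_c = x_c^1, B = b) \neq I(Y; X_m \mid X_c = x_c^2, B = b) \tag{3}$$
$$\exists B, x_c : \; I(Y; X_m \mid X_c = x_c, B) \neq I(Y; X_m \mid B) \tag{4}$$
$$\exists B, b : \; I(Y; X_m \mid X_c, B = b) \neq I(Y; X_m \mid B = b) \tag{5}$$
$$\exists B : \; I(Y; X_m \mid X_c, B) \neq I(Y; X_m \mid B) \tag{6}$$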

These definitions all imply context-dependence as defined in Definition 1 but the converse is in general not true. For example, Definition (3) misses problems where the context makes some otherwise irrelevant variable relevant but where the information brought by this variable about the output is exactly the same for all values of the context. A variable that satisfies Definition (1) but not Definition (4) is given in Example 1. This example can be easily adapted to show that both Definitions (5) and (6) are more specific than Definition (1) (by swapping the roles of $X_c$ and $X_2$).

Example 1.

This artificial problem is defined by two input variables $X_1$ and $X_2$, an output $Y$, and a context $X_c$. $X_1$, $X_2$, and $X_c$ are binary variables taking their values in $\{0, 1\}$, while $Y$ is a quaternary variable taking its values in $\{0, 1, 2, 3\}$. All combinations of values for $X_1$, $X_2$, and $X_c$ have the same probability of occurrence $0.125$ and the conditional probability $P(Y \mid X_1, X_2, X_c)$ is defined by the two following rules:

  • If $X_2 = X_c$ then $Y = X_1$ with probability 1.

  • If $X_2 \neq X_c$ then $Y = 2$ with probability $0.5$ and $Y = 3$ with probability $0.5$.

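The corresponding data table is given in Appendix A. For this problem, it is easy to show that $I(Y; X_1 \mid X_2 = 0, X_c = 0) = 1$ and that $I(Y; X_1 \mid X_2 = 0) = 0.5$, which means condition (1) is satisfied and $X_1$ is thus context-dependent to $Y$ with respect to $X_c$ according to our definition. On the other hand, we can show that:

$$I(Y; X_1 \mid X_c = x_c) = I(Y; X_1) = 0.5 \quad \text{and} \quad I(Y; X_1 \mid X_2, X_c = x_c) = I(Y; X_1 \mid X_2) = 0.5$$

for any $x_c \in \{0, 1\}$, which means that condition (4) cannot be satisfied for $X_1$.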

To simplify the notations, the context variable was assumed to be a separate variable not belonging to the set of inputs $V$. It can however be considered as an input variable, whose own relevance to $Y$ (with respect to $V \cup \{X_c\}$) can be assessed as for any other input. Let us examine the impact of the nature of this variable on context-dependence. First, it is interesting to note that the definition of context-dependence is not symmetric. A variable $X_m$ being context-dependent to $Y$ with respect to $X_c$ does not imply that the variable $X_c$ is context-dependent to $Y$ with respect to $X_m$ (although this would be the case if we had adopted definition (6)). Second, the context variable does not need to be marginally relevant for some variable to be context-dependent, but it needs however to be relevant to $Y$ with respect to $V$. Indeed, we have the following theorem (proven in Appendix B):

Theorem 1.

$X_c$ is irrelevant to $Y$ with respect to $V$ iff all variables in $V$ are context-independent to $Y$ with respect to $X_c$ (and $V$) and $I(Y; X_c) = 0$.

As a consequence of this theorem, there is no interest in looking for context-dependent variables when the context itself is not relevant.

Characterizing context-dependent variables.

Contextual analyses need to focus only on context-dependent variables since, by definition, context-independent variables are unaffected by the context: their relevance status (relevant or irrelevant), as well as the information they contain about the output, remain indeed unchanged whatever the context.

Context-dependent variables may be affected in several directions by the context, depending both on the conditioning subset $B$ and on the value $x_c$ of the context. Given a context-dependent variable $X_m$, a subset $B$ and some values $b$ and $x_c$ such that $I(Y; X_m \mid B = b, X_c = x_c) \neq I(Y; X_m \mid B = b)$, the effect of the context can either be an increase of the information brought by $X_m$ ($I(Y; X_m \mid B = b, X_c = x_c) > I(Y; X_m \mid B = b)$) or a decrease of this information ($I(Y; X_m \mid B = b, X_c = x_c) < I(Y; X_m \mid B = b)$). Furthermore, for a given variable $X_m$, the direction of the change can differ from one context value $x_c$ to another (at fixed $B$ and $b$) but also from one conditioning $B = b$ to another (for a fixed context $x_c$). Example 2 below illustrates this latter case. This observation makes a global characterization of the effect of the context on a given context-dependent variable difficult. Let us nevertheless mention two situations where such global characterization is possible:

Definition 3.

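A context-dependent variable $X_m \in V$ is context-complementary (in a context $x_c$) iff for all $B \subseteq V^{-m}$ and $b$, we have $I(Y; X_m \mid B = b, X_c = x_c) \geq I(Y; X_m \mid B = b)$.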

Definition 4.

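A context-dependent variable $X_m \in V$ is context-redundant (in a context $x_c$) iff for all $B \subseteq V^{-m}$ and $b$, we have $I(Y; X_m \mid B = b, X_c = x_c) \leq I(Y; X_m \mid B = b)$.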

Context-complementary and redundant variables are variables that always react in the same direction to the context and thus can be characterized globally without loss of information. Context-complementary variables are variables that bring complementary information about the output with respect to the context, while context-redundant variables are variables that are redundant with the context. Note that context-dependent variables that are also irrelevant to $Y$ are always context-complementary, since the context can only increase the information they bring about the output. Context-dependent variables that are relevant to $Y$ however can be either context-complementary, context-redundant, or uncharacterized. A context-redundant variable can furthermore become irrelevant to $Y$ as soon as $I(Y; X_m \mid B = b, X_c = x_c) = 0$ for all $B \subseteq V^{-m}$, $b$, and $x_c$.

Example 2.

As an illustration, in the problem of Example 1, and are both relevant and context-dependent variables. can not be characterized globally since we have simultaneously:

for both and . is however context-complementary as the knowledge of always increases the information it contains about .

Related works.

Several authors have studied interactions between variables in the context of supervised learning. They have come up with various interaction definitions and measures, e.g., based on multivariate mutual information (McGill, 1954; Jakulin and Bratko, 2003), conditional mutual information (Jakulin, 2005; Van de Cruys, 2011), or variants thereof (Brown, 2009; Brown et al., 2012). There are several differences between these definitions and ours. In our case, the context variable has a special status and as a consequence, our definition is inherently asymmetric, while most existing variable interaction measures are symmetric. In addition, we are interested in detecting any information difference occurring in a given context (i.e., for a specific value of $X_c$) and for any conditioning subset $B$, while most interaction analyses are interested in average and/or unconditional effects. For example, Jakulin and Bratko (2003) propose as a measure of the interaction between two variables $X_1$ and $X_2$ with respect to an output $Y$ the multivariate mutual information, which is defined as $I(Y; X_1; X_2) = I(Y; X_1) - I(Y; X_1 \mid X_2)$. Unlike our definition, this measure can be shown to be symmetric with respect to its arguments. Adopting this measure to define context-dependence would actually amount to using condition (6) instead of condition (1), which would lead to a more specific definition as discussed earlier in this section.

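The closest work to ours in this literature is due to Turney (1996), who proposes a definition of context-sensitivity that is very similar to our definition of context-dependence. Using our notations, Turney (1996) defines a variable $X_m$ as weakly context-sensitive to the variable $X_c$ if there exist some subset $B \subseteq V^{-m}$ and some values $y$, $x_m$, $b$, and $x_c$ such that these two conditions hold:

$$p(Y = y \mid X_m = x_m, X_c = x_c, B = b) \neq p(Y = y \mid X_m = x_m, B = b),$$
$$p(Y = y \mid X_m = x_m, X_c = x_c, B = b) \neq p(Y = y \mid X_c = x_c, B = b).$$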

$X_m$ is furthermore defined as strongly context-sensitive to $X_c$ if $X_m$ is weakly sensitive to $X_c$, $X_m$ is marginally relevant, and $X_c$ is not marginally relevant. These two definitions do not exactly coincide with ours and they have two drawbacks in our opinion. First, they do not consider that a perfect copy of the context is context-sensitive, which we think is counter-intuitive. Second, while strong context-sensitivity is asymmetric, the constraints about the marginal relevance of $X_m$ and $X_c$ also seem unnatural.

Our work is also somewhat related to several works in the graphical model literature that are concerned with context-specific independences between random variables (see e.g. Boutilier et al., 1996; Zhang and Poole, 1999). Boutilier et al. (1996) define two variables $Y$ and $X_m$ as contextually independent given some $B \subseteq V^{-m}$ and a context value $x_c$ as soon as $I(Y; X_m \mid B, X_c = x_c) = 0$. When $B \cup \{X_m, X_c\}$ are the parents of node $Y$ in a Bayesian network, then such context-specific independences can be exploited to simplify the conditional probability tables of node $Y$ and to speed up inferences. Boutilier et al. (1996)'s context-specific independences will be captured by our definition of context-dependence as soon as $I(Y; X_m \mid B) > 0$. However, our framework is more general as we want to detect any context dependencies, not only those that lead to perfect independences in some context.

3 Context analysis with random forests

In this section, we show how to use variable importances derived from Random Forests first to identify context-dependent variables (Section 3.2) and then to characterize the effect of the context on the relevance of these variables (Section 3.3). Derivations in this section are based on the theoretical characterization of variable importances provided in Louppe et al. (2013), which is briefly recalled in Section 3.1. Section 3.4 discusses practical considerations and Section 3.5 shows how to generalize our results to other impurity measures.

3.1 Variable importances

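Within the random forest framework, Breiman (2001) proposed to evaluate the importance of a variable $X_m$ for predicting $Y$ by adding up the weighted impurity decreases for all nodes $t$ where $X_m$ is used, averaged over all trees $T$ in the forest:

$$\mathrm{Imp}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t) \, I(Y; X_m \mid t) \tag{7}$$

where $N_T$ is the number of trees, $v(s_t)$ is the variable used in the split $s_t$ at node $t$, $p(t)$ is the proportion of samples reaching $t$ and $I(Y; X_m \mid t)$ is the mutual information between $Y$ and $X_m$ estimated from the samples reaching $t$.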

According to Louppe et al. (2013), for an infinite ensemble of fully developed totally randomized trees in asymptotic learning sample size conditions, the Mean Decrease Impurity (MDI) importance (7) can be shown to be equivalent to

$$\mathrm{Imp}(X_m) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} I(Y; X_m \mid B) \tag{8}$$

where $\mathcal{P}_k(V^{-m})$ denotes the set of subsets of $V^{-m}$ of size $k$. Most notably, it can be shown (Louppe et al., 2013) that this measure is zero for a variable $X_m$ iff $X_m$ is irrelevant to $Y$ with respect to $V$. It is therefore well suited for identifying relevant features.
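The asymptotic MDI importance is a weighted sum of conditional mutual informations over all conditioning subsets, so for a small problem it can be evaluated by brute-force enumeration. The following is a minimal sketch under stated assumptions: samples are equiprobable rows of a table, the function names (`entropy`, `cmi`, `asymptotic_mdi`) are hypothetical, and the XOR data set is an illustrative choice rather than one of the paper's problems.

```python
from collections import Counter
from itertools import combinations
from math import comb, log2

def entropy(rows, cols):
    """Joint Shannon entropy (bits) of the columns `cols`."""
    n = len(rows)
    counts = Counter(tuple(r[c] for c in cols) for r in rows)
    return -sum(k / n * log2(k / n) for k in counts.values())

def cmi(rows, y, x, given):
    """Conditional mutual information I(Y; X | B), averaged over values of B."""
    b = list(given)
    return (entropy(rows, [y] + b) + entropy(rows, [x] + b)
            - entropy(rows, [y, x] + b) - entropy(rows, b))

def asymptotic_mdi(rows, y, x, inputs):
    """Weighted sum over subset sizes k and subsets B of the other inputs."""
    others = [v for v in inputs if v != x]
    p = len(inputs)
    total = 0.0
    for k in range(p):
        weight = 1.0 / (comb(p, k) * (p - k))
        total += weight * sum(cmi(rows, y, x, B) for B in combinations(others, k))
    return total

# Illustrative data (an assumption): columns (X1, X2, X3, Y), Y = X1 XOR X2.
rows = [(x1, x2, x3, x1 ^ x2) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
INPUTS, Y = (0, 1, 2), 3
for x in INPUTS:
    print(f"Imp(X{x + 1}) = {asymptotic_mdi(rows, Y, x, INPUTS):.4f}")
```

In this toy case the irrelevant variable X3 receives a zero importance, and the importances of X1 and X2 add up to the entropy of the output, in line with the properties recalled above.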

3.2 Identifying context-dependent variables

Theorem 1 shows that if the context variable is irrelevant, then it cannot interact with the input variables and thus modify their importances. This observation suggests performing, as a preliminary test, a standard random forest variable importance analysis using all input variables and the context in order to check the relevance of the latter. If the context variable does not turn out to be relevant, then there is no hope of finding context-dependent variables.

Intuitively, identifying context-dependent variables seems similar to identifying the variables whose importance is globally modified when the context is known. Therefore, one first straightforward approach to identify context-dependent variables is to build one forest per value $x_c$ of the context variable, i.e., using only the data samples for which $X_c = x_c$, and also one global forest, i.e., using all samples and not including the context among the inputs. One then derives from these models an importance score for each value of the context, as well as a global importance score. Context-dependent variables are then the variables whose global importance score differs from the contextual importance scores for at least one value of the context.

More precisely, let us denote by $\mathrm{Imp}(X_m)$ the global score of a variable $X_m$ computed using (7) from all samples and by $\mathrm{Imp}(X_m \mid X_c = x_c)$ its importance score as computed according to (7) using only those samples such that $X_c = x_c$. With this approach, a variable would be declared as context-dependent as soon as there exists a value $x_c$ such that $\mathrm{Imp}(X_m) \neq \mathrm{Imp}(X_m \mid X_c = x_c)$.

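Although straightforward, this approach has several drawbacks. First, in the asymptotic setting of Section 3.1, it is not guaranteed to find all context-dependent variables. Indeed, asymptotically, it is easy to show from (8) that the difference between $\mathrm{Imp}(X_m)$ and $\mathrm{Imp}(X_m \mid X_c = x_c)$ can be written as:

$$\mathrm{Imp}^{x_c}(X_m) \triangleq \mathrm{Imp}(X_m) - \mathrm{Imp}(X_m \mid X_c = x_c) \tag{9}$$
$$= \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} \left( I(Y; X_m \mid B) - I(Y; X_m \mid B, X_c = x_c) \right) \tag{10}$$

Example 1 shows that $\mathrm{Imp}^{x_c}(X_m)$ can be equal to $0$ for a context-dependent variable. Therefore we have the property that if there exists an $x_c$ such that $\mathrm{Imp}^{x_c}(X_m) \neq 0$, then the variable is context-dependent but the converse is unfortunately not true. Another drawback of this approach is that in the finite case, we do not have the guarantee that the different forests will have explored the same conditioning sets $B$ and therefore, even assuming that the learning sample is infinite (and therefore that all mutual informations are perfectly estimated), we lose the guarantee that $\mathrm{Imp}^{x_c}(X_m) \neq 0$ for a given $x_c$ implies context-dependence.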

To overcome these issues, we propose the following new importance score to identify context-dependent variables:

$$|\mathrm{Imp}|^{x_c}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t) \, \left| I(Y; X_m \mid t) - I(Y; X_m \mid t, X_c = x_c) \right| \tag{11}$$

This score is meant to be computed from a forest of totally randomized trees built from all samples, not including the context variable among the inputs. At each node $t$ where the variable $X_m$ is used to split, one needs to compute the absolute value of the difference between the mutual information between $Y$ and $X_m$ estimated from all samples reaching that node and the mutual information between $Y$ and $X_m$ estimated only from the samples for which $X_c = x_c$. The same forest can then be used to compute $|\mathrm{Imp}|^{x_c}(X_m)$ for all $x_c$. A variable $X_m$ is then declared context-dependent as soon as there exists an $x_c$ such that $|\mathrm{Imp}|^{x_c}(X_m) > 0$.

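Let us show that this measure is sound. In asymptotic conditions, i.e., with an infinite number of trees, one can show from (11) that $|\mathrm{Imp}|^{x_c}(X_m)$ becomes:

$$|\mathrm{Imp}|^{x_c}(X_m) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} \sum_{b} P(B = b) \, \left| I(Y; X_m \mid B = b) - I(Y; X_m \mid B = b, X_c = x_c) \right|$$

Asymptotically, this measure has the very desirable property of not missing any context-dependent variable, as formalized in the next theorem (the proof is in Appendix C).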

Theorem 2.

A variable $X_m \in V$ is context-independent to $Y$ with respect to $X_c$ iff $|\mathrm{Imp}|^{x_c}(X_m) = 0$ for all $x_c$.

Given that the absolute differences are computed at each tree node, this measure also continues to imply context-dependence in the case of finite forests and infinite learning sample size. The only difference with infinite forests is that only some conditionings $B$ and values $b$ will be tested and therefore one might miss the conditionings that are needed to detect some context-dependent variables.
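The node-wise computation described above can be sketched with a small forest of totally randomized trees on categorical data. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, mutual informations are plain plug-in estimates, nodes in which the context value is never observed are skipped (an assumption in the spirit of the non-zero probability convention of Definition 1), and the data are the 16 equiprobable samples of Table 1 with the column order ($X_c$, $X_1$, $X_2$, $X_3$, $Y$) as read from that table.

```python
import random
from collections import Counter, defaultdict
from math import log2

def mutual_info(rows, y, x):
    """Plug-in estimate of I(Y; X) in bits over `rows` (0.0 if empty)."""
    n = len(rows)
    if n == 0:
        return 0.0
    def h(cols):
        counts = Counter(tuple(r[c] for c in cols) for r in rows)
        return -sum(k / n * log2(k / n) for k in counts.values())
    return h([y]) + h([x]) - h([y, x])

def grow_tree(rows, n_total, inputs, y, c, x_c, rng, scores):
    """Grow one totally randomized tree, accumulating p(t) * |difference|."""
    if not rows or not inputs:
        return
    x = rng.choice(inputs)  # split variable drawn uniformly at random
    in_context = [r for r in rows if r[c] == x_c]
    if in_context:  # skip nodes where X_c = x_c is never observed
        diff = mutual_info(rows, y, x) - mutual_info(in_context, y, x)
        scores[x] += len(rows) / n_total * abs(diff)
    children = defaultdict(list)
    for r in rows:  # one child per observed value of the split variable
        children[r[x]].append(r)
    rest = [v for v in inputs if v != x]
    for child in children.values():
        grow_tree(child, n_total, rest, y, c, x_c, rng, scores)

def abs_context_importance(rows, inputs, y, c, x_c, n_trees=500, seed=0):
    """Average the accumulated absolute differences over the forest."""
    rng = random.Random(seed)
    scores = defaultdict(float)
    for _ in range(n_trees):
        grow_tree(rows, len(rows), list(inputs), y, c, x_c, rng, scores)
    return {x: scores[x] / n_trees for x in inputs}

# Table 1 data, columns (X_c, X_1, X_2, X_3, Y).
rows = [(c, x1, x2, x3, 2 if x1 == 0 else (x2 if c == 0 else x3))
        for c in (0, 1) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
result = abs_context_importance(rows, inputs=(1, 2, 3), y=4, c=0, x_c=0)
for x, score in result.items():
    print(f"|Imp|^0(X_{x}) ~ {score:.3f}")
```

With enough trees the estimates should approach the analytical values reported in Table 2 for the context value 0 (0, 0.375 and 0.125); the context itself is never used as a split variable, only to restrict the samples within each node.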

3.3 Characterizing context-dependent variables

Besides identifying context-dependent variables, one would want to characterize their dependence with the context as precisely as possible. As discussed earlier, irrelevant variables (i.e., such that $\mathrm{Imp}(X_m) = 0$) that are detected as context-dependent do not need much effort to be characterized since the context can only increase their importance. All these variables are therefore context-complementary.

Identifying the context-complementary and context-redundant variables among the relevant variables that are also context-dependent can in principle be done by simply comparing the absolute value of $\mathrm{Imp}^{x_c}(X_m)$ with $|\mathrm{Imp}|^{x_c}(X_m)$, as formalized in the following theorem (proven in Appendix D).

Theorem 3.

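If $|\mathrm{Imp}^{x_c}(X_m)| = |\mathrm{Imp}|^{x_c}(X_m)$ for a context-dependent variable $X_m$, then $X_m$ is context-complementary if $\mathrm{Imp}^{x_c}(X_m) < 0$ and context-redundant if $\mathrm{Imp}^{x_c}(X_m) > 0$.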

This result makes it easy to identify the context-complementary and context-redundant variables. In addition, if, for a context-redundant variable $X_m$, we have $\mathrm{Imp}^{x_c}(X_m) = \mathrm{Imp}(X_m)$, then this variable is irrelevant in the context $x_c$.

It then remains to characterize the context-dependent variables that are neither context-complementary nor context-redundant. It would be interesting to be able to also characterize them according to some sort of average effect of the context on these variables. Similarly to the common use of the importance $\mathrm{Imp}(X_m)$ to rank variables from the most to the least important, we propose to use the importance $\mathrm{Imp}^{x_c}(X_m)$ to characterize the average global effect of context $x_c$ on the variable $X_m$. Given the asymptotic formulation of this importance in Equation (10), a negative value of $\mathrm{Imp}^{x_c}(X_m)$ means that $X_m$ is essentially complementary with the context: on average over all conditionings, it brings more information about $Y$ in context $x_c$ than when ignoring the context. Conversely, a positive value of $\mathrm{Imp}^{x_c}(X_m)$ means that the variable is essentially redundant with the context: on average over all conditionings, it brings less information about $Y$ than when ignoring the context. Ranking the context-dependent variables according to $\mathrm{Imp}^{x_c}(X_m)$ would then give at the top the variables that are the most complementary with the context and at the bottom the variables that are the most redundant.

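Note that, like $|\mathrm{Imp}|^{x_c}(X_m)$, it is preferable to estimate $\mathrm{Imp}^{x_c}(X_m)$ by using the following formula rather than to estimate it from two forests by subtracting $\mathrm{Imp}(X_m \mid X_c = x_c)$ from $\mathrm{Imp}(X_m)$:

$$\mathrm{Imp}^{x_c}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t) \, \left( I(Y; X_m \mid t) - I(Y; X_m \mid t, X_c = x_c) \right) \tag{12}$$

This estimation method has the same asymptotic form as given in Equation (10) but, in the finite case, it ensures that the same conditionings are used for both mutual information measures. Note that in some applications, it is interesting also to have a global measure of the effect of the context. A natural adaptation of (12) to obtain such global measure is as follows:

$$\mathrm{Imp}^{X_c}(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t) \, \left( I(Y; X_m \mid t) - I(Y; X_m \mid t, X_c) \right)$$

which, in asymptotic sample and ensemble of trees size conditions, gives the following formula:

$$\mathrm{Imp}^{X_c}(X_m) = \sum_{k=0}^{p-1} \frac{1}{C_p^k} \frac{1}{p-k} \sum_{B \in \mathcal{P}_k(V^{-m})} \left( I(Y; X_m \mid B) - I(Y; X_m \mid B, X_c) \right)$$

If $\mathrm{Imp}^{X_c}(X_m)$ is negative then the context variable $X_c$ makes variable $X_m$ globally more informative ($X_c$ and $X_m$ are complementary with respect to $Y$ and $V$). If $\mathrm{Imp}^{X_c}(X_m)$ is positive, then the context variable $X_c$ makes variable $X_m$ globally less informative ($X_c$ and $X_m$ are redundant with respect to $Y$ and $V$).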

3.4 In practice

As a recipe when starting a context analysis, we suggest first to build a single forest using all input variables $X_m$ (but not the context $X_c$) and then to compute from this forest all importances defined in the previous section: the global importances $\mathrm{Imp}(X_m)$ and the different contextual importances, $|\mathrm{Imp}|^{x_c}(X_m)$, $\mathrm{Imp}^{x_c}(X_m)$, and $\mathrm{Imp}^{X_c}(X_m)$, for all variables $X_m$ and context values $x_c$.

Second, variables satisfying the context-dependence criterion, i.e., such that $|\mathrm{Imp}|^{x_c}(X_m) > 0$ for at least one $x_c$, can be separated from the other variables. Among context-dependent variables, an equality between $|\mathrm{Imp}|^{x_c}(X_m)$ and $|\mathrm{Imp}^{x_c}(X_m)|$ highlights that the context-dependent variable $X_m$ is either context-complementary or context-redundant (in $x_c$) depending on the sign of $\mathrm{Imp}^{x_c}(X_m)$. Finally, the remaining context-dependent variables can be ranked according to $\mathrm{Imp}^{x_c}(X_m)$ (or $\mathrm{Imp}^{X_c}(X_m)$ for a more global analysis).

Note that, because mutual informations will be estimated from finite training sets, they will generally be non-zero even for independent variables, leading to false positives in the identification of context-dependent variables. In practice, one could instead identify context-dependent variables by using a test $|\mathrm{Imp}|^{x_c}(X_m) > \epsilon$ where $\epsilon$ is some cut-off value greater than 0. However, the determination of this cut-off can be very difficult. In our experiments, we propose to turn the importances into $p$-values by using random permutations. More precisely, 1000 scores $|\mathrm{Imp}|^{x_c}(X_m)$ will be estimated by randomly permuting the values of the context variable in the original data (so as to simulate the null hypothesis corresponding to a context variable fully independent of all other variables). A $p$-value will then be estimated by the proportion of these permutations leading to a score greater than the score obtained on the original dataset.
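The permutation procedure can be sketched independently of the forest itself. In the sketch below, `permutation_pvalue` is a hypothetical helper that accepts any score function; to keep the example short, the score is a stand-in evaluated on the whole sample only (a single root node) rather than the forest-based importance, and the synthetic data are an assumption made for illustration.

```python
import random
from collections import Counter
from math import log2

def mutual_info(rows, y, x):
    """Plug-in estimate of I(Y; X) in bits over `rows` (0.0 if empty)."""
    n = len(rows)
    if n == 0:
        return 0.0
    def h(cols):
        counts = Counter(tuple(r[c] for c in cols) for r in rows)
        return -sum(k / n * log2(k / n) for k in counts.values())
    return h([y]) + h([x]) - h([y, x])

def root_score(rows, c=0, x=1, y=2, x_c=0):
    """Stand-in score: |I(Y; X) - I(Y; X | X_c = x_c)| on the full sample."""
    in_context = [r for r in rows if r[c] == x_c]
    return abs(mutual_info(rows, y, x) - mutual_info(in_context, y, x))

def permutation_pvalue(score_fn, rows, c=0, n_perm=1000, seed=0):
    """Share of context permutations scoring strictly above the observed score."""
    rng = random.Random(seed)
    observed = score_fn(rows)
    context = [r[c] for r in rows]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(context)  # breaks any link between context and the rest
        permuted = [r[:c] + (v,) + r[c + 1:] for r, v in zip(rows, context)]
        exceed += score_fn(permuted) > observed
    return exceed / n_perm

# Synthetic data, columns (X_c, X, Y): Y copies X in context 0 only.
rows = [(0, i % 2, i % 2) for i in range(100)]
rows += [(1, x, y) for x in (0, 1) for y in (0, 1) for _ in range(25)]
p_value = permutation_pvalue(root_score, rows)
print("estimated p-value:", p_value)
```

Only the context column is shuffled, so the joint distribution of inputs and output is preserved under the null hypothesis. Replacing `root_score` with a forest-based score would follow the procedure described above more closely, at a higher computational cost.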

X_c  X_1  X_2  X_3  Y
0    0    0    0    2
0    0    0    1    2
0    0    1    0    2
0    0    1    1    2
0    1    0    0    0
0    1    0    1    0
0    1    1    0    1
0    1    1    1    1
1    0    0    0    2
1    0    0    1    2
1    0    1    0    2
1    0    1    1    2
1    1    0    0    0
1    1    0    1    1
1    1    1    0    0
1    1    1    1    1
Table 1: Problem 1: Values of $X_c$, $X_1$, $X_2$, $X_3$, $Y$.
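Measure                              X_1     X_2      X_3
$\mathrm{Imp}(X_m)$                  1.0     0.125    0.125
$\mathrm{Imp}(X_m \mid X_c = 0)$     1.0     0.5      0.0
$\mathrm{Imp}(X_m \mid X_c = 1)$     1.0     0.0      0.5
$|\mathrm{Imp}|^{x_c=0}(X_m)$        0.0     0.375    0.125
$\mathrm{Imp}^{x_c=0}(X_m)$          0.0     -0.375   0.125
$|\mathrm{Imp}|^{x_c=1}(X_m)$        0.0     0.125    0.375
$\mathrm{Imp}^{x_c=1}(X_m)$          0.0     0.125    -0.375
$\mathrm{Imp}^{X_c}(X_m)$            0.0     -0.125   -0.125
Table 2: Problem 1: Variable importances as computed analytically using asymptotic formulas. Note that $X_1$ is context-independent and $X_2$ and $X_3$ are context-dependent.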
Measure                              X_1      X_2      X_3      X_4      X_5      X_6      X_7      X_8
$\mathrm{Imp}(X_m)$                  0.5727   0.7514   0.5528   0.687    0.1746   0.0753   0.1073   0.0
$\mathrm{Imp}(X_m \mid X_c = 0)$     0.4127   0.5815   0.5312   0.5421   0.6566   0.2258   0.372    0.0
$\mathrm{Imp}(X_m \mid X_c = 1)$     0.6243   0.8057   0.5577   0.7343   0.0      0.0      0.0      0.0
$|\mathrm{Imp}|^{x_c=0}(X_m)$        0.2263   0.2431   0.1181   0.2241   0.4139   0.1961   0.2861   0.0
$|\mathrm{Imp}|^{x_c=1}(X_m)$        0.0987   0.0611   0.021    0.0736   0.1746   0.0753   0.1073   0.0
$\mathrm{Imp}^{x_c=0}(X_m)$          0.2179   0.2422   0.1111   0.2190   -0.3839  -0.1389  -0.2346  0.0
$\mathrm{Imp}^{x_c=1}(X_m)$          -0.0516  -0.0543  -0.0049  -0.0473  0.1746   0.0753   0.1073   0.0
Table 3: Problem 2: Variable importances as computed analytically using the asymptotic formulas for the different importance measures.
m   Variable          Imp     Imp|0   Imp|1   |Imp|^0  pval    |Imp|^1  pval    Imp^0    pval    Imp^1    pval
0   age               0.2974  0.2942  0.2900  0.1505   0.899   0.1717   0.417   0.0032   0.938   0.0074   0.846
1   histologic-type   0.3513  0.1354  0.4005  0.2265   0.000*  0.1183   0.121   0.2159   0.000*  -0.0492  0.331
2   degree-of-diffe   0.4415  0.3725  0.4070  0.1827   0.680   0.1724   0.689   0.0690   0.102   0.0345   0.398
3   bone              0.2452  0.2342  0.2220  0.1088   0.396   0.0845   0.904   0.0110   0.717   0.0232   0.410
4   bone-marrow       0.0188  0.0190  0.0131  0.0128   0.892   0.0105   0.980   -0.0001  0.994   0.0057   0.682
5   lung              0.1677  0.1837  0.1420  0.1134   0.448   0.1079   0.397   -0.0160  0.605   0.0257   0.373
6   pleura            0.1474  0.1132  0.1127  0.0613   1.000   0.1026   0.097   0.0342   0.179   0.0348   0.165
7   peritoneum        0.3171  0.2954  0.2084  0.0939   0.968   0.1516   0.000*  0.0216   0.710   0.1087   0.000*
8   liver             0.2300  0.1844  0.2784  0.0888   0.966   0.1382   0.053   0.0456   0.134   -0.0483  0.100
9   brain             0.0466  0.0334  0.0566  0.0403   0.173   0.0279   0.814   0.0131   0.693   -0.0101  0.751
10  skin              0.0679  0.0310  0.0786  0.0426   0.922   0.0420   0.841   0.0369   0.107   -0.0107  0.663
11  neck              0.2183  0.0774  0.2255  0.1562   0.000*  0.0710   0.575   0.1409   0.000*  -0.0071  0.764
12  supraclavicular   0.1701  0.1807  0.1344  0.0942   0.379   0.0738   0.884   -0.0106  0.695   0.0357   0.136
13  axillar           0.1339  0.1236  0.0846  0.0748   0.214   0.0663   0.388   0.0103   0.795   0.0493   0.194
14  mediastinum       0.1826  0.1752  0.1613  0.1129   0.266   0.0867   0.853   0.0074   0.767   0.0213   0.404
15  abdominal         0.2558  0.2883  0.1512  0.1419   0.139   0.1526   0.028*  -0.0325  0.368   0.1046   0.003*
Table 4: Problem 3: Importances as computed with a forest of 1000 totally randomized trees. Columns Imp, Imp|0 and Imp|1 stand for $\mathrm{Imp}(X_m)$, $\mathrm{Imp}(X_m \mid X_c = 0)$ and $\mathrm{Imp}(X_m \mid X_c = 1)$; columns |Imp|^0, |Imp|^1, Imp^0 and Imp^1 stand for $|\mathrm{Imp}|^{x_c}(X_m)$ and $\mathrm{Imp}^{x_c}(X_m)$ with $x_c = 0$ and $x_c = 1$. The context is defined by the binary context feature Sex ($X_c = 0$ denotes female and $X_c = 1$ denotes male). P-values were estimated using 1000 permutations of the context variable. Asterisks mark p-values under the 0.05 threshold.

3.5 Generalization to other impurity measures

All our developments so far have assumed a categorical output and the use of Shannon’s entropy as the impurity measure. Our framework however can be carried over to other impurity measures and thus in particular also to a numerical output $Y$. Let us define a generic impurity measure $i(Y \mid t) \geq 0$ that assesses the impurity of the output at a tree node $t$. The corresponding impurity decrease at a tree node is defined as:

$$G(Y; X_m \mid t) = i(Y \mid t) - \sum_{x_m} p(t_{x_m}) \, i(Y \mid t_{x_m}) \tag{13}$$

with $t_{x_m}$ denoting the successor node of $t$ corresponding to value $x_m$ of $X_m$. By analogy with conditional entropy and mutual information, let us define the population based measures $i(Y \mid B)$ and $G(Y; X_m \mid B)$ for any subset of variables $B \subseteq V$ as follows:

$$i(Y \mid B) = \sum_{b} P(B = b) \, i(Y \mid B = b), \qquad G(Y; X_m \mid B) = i(Y \mid B) - i(Y \mid B, X_m),$$

where the sum is over all possible combinations $b$ of values for variables in $B$. Now, replacing mutual information $I$ with the corresponding impurity decrease measure $G$, all our results above remain valid, including Theorems 1, 2, and 3 (proofs are omitted for the sake of space). It is important however to note that this substitution changes the notions of both variable relevance and context-dependence. Definition 1 indeed becomes:

Definition 5.

A variable $X_m \in V$ is context-dependent to $Y$ with respect to $X_c$ iff there exists a subset $B \subseteq V^{-m}$ and some values $x_c$ and $b$ such that

$$G(Y; X_m \mid B = b, X_c = x_c) \neq G(Y; X_m \mid B = b).$$

When $Y$ is numerical, a common impurity measure is variance, which defines $i(Y \mid t)$ as the empirical variance $\mathrm{var}[Y \mid t]$ computed at node $t$. The corresponding $G(Y; X_m \mid B = b, X_c = x_c)$ and $G(Y; X_m \mid B = b)$ in Definition 5 are thus defined respectively as

$$\mathrm{var}[Y \mid B = b, X_c = x_c] - E_{X_m \mid B = b, X_c = x_c}\{\mathrm{var}[Y \mid X_m, B = b, X_c = x_c]\}$$

and

$$\mathrm{var}[Y \mid B = b] - E_{X_m \mid B = b}\{\mathrm{var}[Y \mid X_m, B = b]\}.$$

We will illustrate the use of our framework in a regression setting with this measure in the next section.
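As a rough illustration of the variance-based criterion, the sketch below computes the variance decrease brought by a categorical input, once on all samples and once within each context. The helper name `variance_decrease`, the empty conditioning set, and the four-sample data set are assumptions made for the example; population variances from the standard library stand in for the empirical variance.

```python
from collections import defaultdict
from statistics import pvariance

def variance_decrease(rows, y, x):
    """var[Y] minus the expected within-group variance of Y given X."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[x]].append(r[y])
    within = sum(len(g) / len(rows) * pvariance(g) for g in groups.values())
    return pvariance([r[y] for r in rows]) - within

# Hypothetical data, columns (X_c, X, Y): Y depends on X only in context 0.
rows = [(0, 0, 0.0), (0, 1, 2.0), (1, 0, 1.0), (1, 1, 1.0)]
C, X, Y = 0, 1, 2

g_all = variance_decrease(rows, Y, X)
g_ctx = {xc: variance_decrease([r for r in rows if r[C] == xc], Y, X)
         for xc in (0, 1)}
print("G(Y; X)           =", g_all)
print("G(Y; X | X_c = 0) =", g_ctx[0])
print("G(Y; X | X_c = 1) =", g_ctx[1])
```

Since the contextual decreases differ from the global one, the input of this toy problem would be context-dependent in the sense of the variance-based definition, here with an empty conditioning set.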

4 Experiments

Problem 1.

The purpose of this first problem is to illustrate the different measures introduced earlier. This artificial problem is defined by three binary input variables $X_1$, $X_2$, and $X_3$, a ternary output $Y$, and a binary context $X_c$. All samples are enumerated in Table 1 and are supposed to be equiprobable. By construction, the output $Y$ is defined as $Y = 2$ if $X_1 = 0$, $Y = X_2$ if $X_c = 0$ and $X_1 = 1$, and $Y = X_3$ if $X_c = 1$ and $X_1 = 1$.

Table 2 reports all importance scores for the three inputs. These scores were computed analytically using the asymptotic formulas, not from actual experiments. Considering the global importances $\mathrm{Imp}(X_m)$, it turns out that all variables are relevant, with $X_1$ clearly the most important variable and $X_2$ and $X_3$ of smaller and equal importances. According to $|\mathrm{Imp}|^{x_c=0}$ and $|\mathrm{Imp}|^{x_c=1}$, $X_1$ is a context-independent variable, while $X_2$ and $X_3$ are two context-dependent variables. This result is as expected given the way the output is defined. For $X_2$ and $X_3$, we have furthermore $|\mathrm{Imp}^{x_c}(X_m)| = |\mathrm{Imp}|^{x_c}(X_m)$ for both values of $x_c$. $X_2$ is therefore context-complementary when $X_c = 0$ and context-redundant when $X_c = 1$. Conversely, $X_3$ is context-redundant when $X_c = 0$ and context-complementary when $X_c = 1$. $X_2$ is furthermore irrelevant when $X_c = 1$ (since $\mathrm{Imp}(X_2 \mid X_c = 1) = 0$) and $X_3$ is irrelevant when $X_c = 0$ (since $\mathrm{Imp}(X_3 \mid X_c = 0) = 0$). The values of $\mathrm{Imp}^{X_c}(X_2)$ and $\mathrm{Imp}^{X_c}(X_3)$ suggest that these two variables are on average complementary.
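The analytical scores of Table 2 can be cross-checked by brute-force enumeration of the 16 equiprobable samples of Table 1. The sketch below is one possible way to do so, not the authors' code: the helper names are hypothetical, the column order ($X_c$, $X_1$, $X_2$, $X_3$, $Y$) is the one read from Table 1, and every measure is written as a weighted average over conditioning sets $B$ and cells $B = b$ of a node-level term.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import comb, log2

C, Y, INPUTS = 0, 4, (1, 2, 3)  # column indices: X_c, X_1, X_2, X_3, Y
rows = [(c, x1, x2, x3, 2 if x1 == 0 else (x2 if c == 0 else x3))
        for c in (0, 1) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]

def mi(sample, x):
    """Plug-in I(Y; X) in bits over `sample` (0.0 if empty)."""
    n = len(sample)
    if n == 0:
        return 0.0
    def h(cols):
        counts = Counter(tuple(r[j] for j in cols) for r in sample)
        return -sum(k / n * log2(k / n) for k in counts.values())
    return h([Y]) + h([x]) - h([Y, x])

def in_ctx(sample, xc):
    return [r for r in sample if r[C] == xc]

def averaged(sample, x, term):
    """Sum over k and B of weight(k) * sum_b P(B=b) * term(rows with B=b)."""
    others = [v for v in INPUTS if v != x]
    p, n, total = len(INPUTS), len(sample), 0.0
    for k in range(p):
        weight = 1.0 / (comb(p, k) * (p - k))
        for B in combinations(others, k):
            cells = defaultdict(list)
            for r in sample:
                cells[tuple(r[v] for v in B)].append(r)
            total += weight * sum(len(t) / n * term(t) for t in cells.values())
    return total

def measures(x):
    out = {"Imp": averaged(rows, x, lambda t: mi(t, x))}
    for xc in (0, 1):
        diff = lambda t, xc=xc: mi(t, x) - mi(in_ctx(t, xc), x)
        out[f"Imp|Xc={xc}"] = averaged(in_ctx(rows, xc), x, lambda t: mi(t, x))
        out[f"|Imp|^{xc}"] = averaged(rows, x, lambda t: abs(diff(t)))
        out[f"Imp^{xc}"] = averaged(rows, x, diff)
    out["Imp^Xc"] = averaged(rows, x, lambda t: mi(t, x) - sum(
        len(in_ctx(t, xc)) / len(t) * mi(in_ctx(t, xc), x) for xc in (0, 1)))
    return out

table = {x: measures(x) for x in INPUTS}
for x, m in table.items():
    print(f"X_{x}:", {name: round(value, 4) for name, value in m.items()})
```

Because the context is independent of the inputs in this problem, the signed contextual score coincides with the difference between the global and the contextual importances, which is why the rows of Table 2 are mutually consistent.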

Problem 2.

This second experiment is based on an adaptation of the digit recognition problem initially proposed in Breiman et al. (1984) and reused in Louppe et al. (2013). The original problem contains 7 binary variables ($X_1$,…,$X_7$) and the output $Y$ takes its values in $\{0, 1, \dots, 9\}$. Each input represents the on-off status of one lighting segment of a seven-segment indicator and is determined univocally from $Y$. To create an artificial (binary) context, we created two copies of this dataset, the first one corresponding to $X_c = 0$ and the second one to $X_c = 1$. The first dataset was unchanged, while in the second one variables $X_5$, $X_6$, and $X_7$ were turned into irrelevant variables. In addition, we included a new variable $X_8$, irrelevant by construction in both contexts. The final dataset contains 320 samples, 160 in each context.

Table 3 reports the importance scores for all the inputs. Again, these scores were computed analytically using the asymptotic formulas. As expected, variable $X_8$ has zero importance in all cases. Also as expected, variables $X_5$, $X_6$, and $X_7$ are all context-dependent ($|\mathrm{Imp}|^{x_c}(X_m) > 0$ for all of them). They are context-redundant (and even irrelevant) when $X_c = 1$ and complementary when $X_c = 0$. More surprisingly, variables $X_1$, $X_2$, $X_3$, and $X_4$ are also context-dependent, even if their distribution is independent from the context. This is due to the fact that these variables are complementary with variables $X_5$, $X_6$, and $X_7$ for predicting the output. Their context-dependence is thus a consequence of the context-dependence of $X_5$, $X_6$, $X_7$. $X_1$, $X_2$, $X_3$, and $X_4$ are all almost redundant when $X_c = 0$ and complementary when $X_c = 1$, which expresses the fact that they provide more information about the output when $X_5$, $X_6$ and $X_7$ are irrelevant ($X_c = 1$) and less when $X_5$, $X_6$, and $X_7$ are relevant ($X_c = 0$). Nevertheless, $X_8$ remains irrelevant in every situation.

Problem 3.

We now consider bio-medical data from the Primary tumor dataset. The objective of the corresponding supervised learning problem is to predict the location of a primary tumor in patients with metastases. It was downloaded from the UCI repository (Lichman, 2013) and was collected by the University Medical Center in Ljubljana, Slovenia. We restrict our analysis to 132 samples without missing values. Patients are described by 17 discrete clinical variables (listed in the first column of Table 4) and the output is chosen among 22 possible locations. For this analysis, we use the patient gender as the context variable.

Table 4 reports variable importances computed with 1000 totally randomized trees and their corresponding p-values. According to the p-values of , two variables are clearly emphasized for each context: importances of histologic-type and neck both significantly decrease in the first context () and importances of peritoneum and abdominal both significantly decrease in the second context (). While the biological relevance of these findings needs to be verified, such dependences could not have been highlighted from standard random forest importances.

Note that the same importances computed using the asymptotic formulas are provided in Appendix E. Importance values are very similar, highlighting that finite forests provide good enough estimates for this problem.
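For readers who want to see how importances from a finite ensemble of totally randomized trees might be obtained on discrete data, here is a minimal sketch. It assumes the construction of Louppe et al. (2013): each node splits on a variable drawn uniformly at random among those not yet used on the branch, with one child per observed value, and the importance of a variable is the sample-weighted decrease of Shannon entropy, averaged over trees. The function names and the XOR toy data are hypothetical, and the contextual scores of Section 3 are not reproduced.

```python
import random
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def tree_importances(X, y, rng):
    """Impurity decreases accumulated over one fully developed totally randomized tree."""
    n_total, p = len(y), len(X[0])
    imp = [0.0] * p

    def grow(idx, remaining):
        labels = [y[i] for i in idx]
        if not remaining or len(set(labels)) == 1:
            return  # nothing left to split on, or node already pure
        var = rng.choice(remaining)  # split variable drawn uniformly at random
        children = {}
        for i in idx:
            children.setdefault(X[i][var], []).append(i)
        after = sum(len(c) / len(idx) * entropy([y[i] for i in c])
                    for c in children.values())
        imp[var] += len(idx) / n_total * (entropy(labels) - after)
        rest = [v for v in remaining if v != var]
        for c in children.values():
            grow(c, rest)

    grow(list(range(n_total)), list(range(p)))
    return imp

def forest_importances(X, y, n_trees=1000, seed=0):
    """Average the per-tree importances over n_trees randomized trees."""
    rng = random.Random(seed)
    total = [0.0] * len(X[0])
    for _ in range(n_trees):
        for v, s in enumerate(tree_importances(X, y, rng)):
            total[v] += s
    return [t / n_trees for t in total]

# Hypothetical toy problem: y = x0 XOR x1, x2 is irrelevant.
X = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
y = [a ^ b for a, b, _ in X]
print(forest_importances(X, y, n_trees=200))
```

On this toy problem the irrelevant variable receives zero importance in every tree, and the importances of the two relevant variables sum to the output entropy (1 bit). Restricting `X` and `y` to the rows of one context would give an in-context counterpart of the same quantity.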

Problem 4.

Figure 1: Results for Problem 4. Each matrix represents significant context-dependent gene-gene interactions as found using in (a)(b) and in (c)(d), in GBM sub-type Mesenchymal in (a)(c) and Proneural in (b)(d). In (a) and (b), cells are colored according to . In (c) and (d), cells are colored according to . Positive (resp. negative) values are in blue (resp. red) and highlight context-redundant (resp. context-complementary) interactions. Higher absolute values are darker.

As a last experiment, we consider a publicly available brain cancer gene expression dataset (Verhaak et al., 2010). This dataset collects measurements of mRNA expression levels of 11861 genes in 220 tissue samples from patients suffering from glioblastoma multiforme (GBM), the most common form of malignant brain cancer in adults. Samples are classified into four GBM sub-types: Classical, Mesenchymal, Neural and Proneural. The interest of this dataset is to identify the genes that play a central role in the development and progression of the cancer and thus improve our understanding of this disease. In our experiment, our aim is to exploit importance scores to identify interactions between genes that are significantly affected by the cancer sub-type considered as our context variable. This dataset was previously exploited by Mohan et al. (2014), who used it to test a method based on Gaussian graphical models for detecting genes whose global interaction patterns with all the other genes vary significantly between the subtypes. This latter method can be considered as gene-based, while our approach is link-based.

Following Mohan et al. (2014), we normalized the raw data using Robust Multi-array Average (RMA) normalization. Then, the data was corrected for batch effects using the software ComBat (Johnson et al., 2007) and then transformed. Also following Mohan et al. (2014), we focused our analysis on only two GBM sub-types, Proneural (57 tissue samples) and Mesenchymal (56 tissue samples), and on a particular set of 32 genes, which are all genes involved in the TCR signaling pathway as defined in the Reactome database (Matthews et al., 2009). The final dataset used in the experiments below thus contains 113 samples (57 and 56 for the two context values, respectively) and 32 variables.

To identify gene-gene interactions affected by the context, we performed a contextual analysis as described in Section 3 for each gene in turn, considering each time a particular gene as the target variable and all other genes as the set of input variables . This procedure is similar to the one adopted in the random forest-based gene network inference method called GENIE3 (Huynh-Thu et al., 2010), which was the best performer in the DREAM5 network inference challenge (Marbach et al., 2012). Since gene expressions are numerical targets, we used variance as the impurity measure (see Section 3.5) and we built ensembles of 1000 totally randomized trees in all experiments.
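The one-target-at-a-time loop and the variance impurity can be sketched as follows. This is a simplified illustration, not the procedure used in the experiments: the score of an input gene is the variance reduction of a single split at its median, a stand-in (assumed here for brevity) for importances accumulated over ensembles of totally randomized trees, and the gene names and expression values are hypothetical.

```python
from statistics import median, pvariance

def variance_reduction(x, y, threshold):
    """Decrease of the variance of target y when splitting on x <= threshold."""
    left = [t for v, t in zip(x, y) if v <= threshold]
    right = [t for v, t in zip(x, y) if v > threshold]
    if not left or not right:
        return 0.0  # degenerate split: no impurity decrease
    n = len(y)
    return (pvariance(y)
            - len(left) / n * pvariance(left)
            - len(right) / n * pvariance(right))

def median_split_score(x, y):
    """Stand-in importance: variance reduction of one split at the median of x."""
    return variance_reduction(x, y, median(x))

def all_pairs_scores(expr, score=median_split_score):
    """Treat each gene in turn as the target; score every other gene as an input.

    expr maps a gene name to its list of expression values.
    Returns scores[target][input_gene].
    """
    genes = list(expr)
    return {t: {g: score(expr[g], expr[t]) for g in genes if g != t} for t in genes}

# Hypothetical expression values for three genes over four samples.
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [5.0, 1.0, 5.0, 1.0],
}
scores = all_pairs_scores(expr)
print(scores["g2"])
```

Running the same loop separately on the samples of each sub-type, or with a contextual score in place of `median_split_score`, would give one matrix per context in the spirit of Figure 1.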

The matrices in Figure 1 highlight context-dependent interactions found using different importance measures (detailed below). A cell of these matrices corresponds to the importance of gene when gene is the output (the diagonal is irrelevant). White cells correspond to non-significant context dependencies as determined by random permutations of the context variable, using a significance level of 0.05. Significant context-dependent interactions in Figures 1(a) and (b) were determined using the importance defined in (11), which is the measure we advocate in this paper. As a baseline for comparison, Figures 1(c) and (d) show significant interactions as found using the more straightforward score defined in (10). In Figures 1(a) and (b), significant cells are colored according to the value of defined in (12). In Figures 1(c) and (d), they are colored according to the value of in (10) instead. Blue (resp. red) cells correspond to positive (resp. negative) values of or and thus highlight context-redundant (resp. context-complementary) interactions. The darker the color, the higher the absolute value of or .
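The significance test by random permutations of the context variable can be sketched generically. The helper below takes any statistic computed from the data and the context labels, shuffles the labels, and reports the fraction of permutations whose statistic is at least as large as the observed one. The statistic used in the example (a gap in agreement rates between the two contexts) and the toy data are hypothetical stand-ins for the importance measures of Section 3, and the add-one correction in the p-value is a common convention assumed here rather than taken from the paper.

```python
import random

def permutation_p_value(statistic, data, context, n_perm=1000, seed=0):
    """P-value of statistic(data, context) under random shuffles of the context labels."""
    rng = random.Random(seed)
    observed = statistic(data, context)
    shuffled = list(context)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any link between context and (x, y)
        if statistic(data, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def agreement_gap(data, context):
    """Stand-in statistic: |P(x == y | context 0) - P(x == y | context 1)|."""
    rates = []
    for c in (0, 1):
        agree = [x == y for (x, y), ctx in zip(data, context) if ctx == c]
        rates.append(sum(agree) / len(agree))
    return abs(rates[0] - rates[1])

# Hypothetical toy data: y copies x in context 0, y is unrelated to x in context 1.
data = [(0, 0), (1, 1)] * 10 + [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
context = [0] * 20 + [1] * 20

p = permutation_p_value(agreement_gap, data, context)
print(p, p < 0.05)
```

An interaction would be kept (a colored cell) when its p-value falls below the chosen level, 0.05 in the experiments above.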

Figures 1(a) and (b) contain 49 and 26 context-dependent interactions, respectively. In comparison, only 3 and 4 interactions are found in Figures 1(c) and (d), respectively, using the more straightforward score . Only 1 interaction is common to Figures 1(a) and (c), while 3 interactions are common to Figures 1(b) and (d). The much lower sensitivity of with respect to was expected given the discussions in Section 3.2. Although more straightforward, the score , defined as the difference , indeed suffers from the fact that and are estimated from different ensembles and thus do not explore the same conditionings in a finite setting. also does not have the same guarantee as to find all context-dependent variables.

5 Conclusions

In this work, our first contribution is a formal framework defining and characterizing the dependence on a context variable of the relationship between the input variables and the output (Section 2). As a second contribution, we have proposed several novel adaptations of random forests-based variable importance scores that implement these definitions and characterizations and we have derived performance guarantees for these scores in asymptotic settings (Section 3). The relevance of these measures was illustrated on several artificial and real datasets (Section 4).

There remain several limitations to our framework that we would like to address in future work. All theoretical derivations in Sections 2 and 3 concern categorical input variables. It would be interesting to adapt our framework to continuous input variables, and also, probably with more difficulty, to continuous context variables. Finally, all theoretical derivations are based on forests of totally randomized trees (for which we have an asymptotic characterization). It would be interesting to also investigate non-totally randomized tree algorithms (e.g., Breiman (2001)’s standard Random Forests method) that could provide better trade-offs in finite settings.

Acknowledgements. Antonio Sutera is a recipient of a FRIA grant from the FNRS (Belgium) and acknowledges its financial support. This work is supported by PASCAL2 and the IUAP DYSCO, initiated by the Belgian State, Science Policy Office. The primary tumor data was obtained from the University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. Soklic.

References

  • Boutilier et al. (1996) Boutilier, C., Friedman, N., Goldszmidt, M., and Koller, D. (1996). Context-specific independence in bayesian networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, UAI’96, pages 115–123, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
  • Breiman et al. (1984) Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and regression trees. CRC press.
  • Brown (2009) Brown, G. (2009). A new perspective for information theoretic feature selection. In International conference on artificial intelligence and statistics, pages 49–56.
  • Brown et al. (2012) Brown, G., Pocock, A., Zhao, M.-J., and Luján, M. (2012). Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. The Journal of Machine Learning Research, 13(1):27–66.
  • Geissler et al. (2000) Geissler, H. J., Hölzl, P., Marohl, S., Kuhn-Régnier, F., Mehlhorn, U., Südkamp, M., and de Vivie, E. R. (2000). Risk stratification in heart surgery: comparison of six score systems. European Journal of Cardio-thoracic surgery, 17(4):400–406.
  • Guyon and Elisseeff (2003) Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182.
  • Huynh-Thu et al. (2010) Huynh-Thu, V. A., Irrthum, A., Wehenkel, L., and Geurts, P. (2010). Inferring regulatory networks from expression data using tree-based methods. PLoS ONE, 5(9).
  • Ideker and Krogan (2012) Ideker, T. and Krogan, N. J. (2012). Differential network biology. Molecular systems biology, 8(1).
  • Jakulin (2005) Jakulin, A. (2005). Machine learning based on attribute interactions. PhD thesis, Univerza v Ljubljani.
  • Jakulin and Bratko (2003) Jakulin, A. and Bratko, I. (2003). Analyzing attribute dependencies. Springer.
  • Johnson et al. (2007) Johnson, W. E., Li, C., and Rabinovic, A. (2007). Adjusting batch effects in microarray expression data using empirical bayes methods. Biostatistics, 8(1):118–127.
  • Kohavi and John (1997) Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelligence, 97(1):273–324.
  • Lichman (2013) Lichman, M. (2013). UCI machine learning repository.
  • Louppe et al. (2013) Louppe, G., Wehenkel, L., Sutera, A., and Geurts, P. (2013). Understanding variable importances in forests of randomized trees. In Advances in Neural Information Processing Systems, pages 431–439.
  • Marbach et al. (2012) Marbach, D., Costello, J. C., Küffner, R., Vega, N. M., Prill, R. J., Camacho, D. M., Allison, K. R., Kellis, M., Collins, J. J., Stolovitzky, G., et al. (2012). Wisdom of crowds for robust gene network inference. Nature methods, 9(8):796–804.
  • Matthews et al. (2009) Matthews, L., Gopinath, G., Gillespie, M., Caudy, M., Croft, D., de Bono, B., Garapati, P., Hemish, J., Hermjakob, H., Jassal, B., et al. (2009). Reactome knowledgebase of human biological pathways and processes. Nucleic acids research, 37(suppl 1):D619–D622.
  • McGill (1954) McGill, W. J. (1954). Multivariate information transmission. Psychometrika, 19(2):97–116.
  • Mohan et al. (2014) Mohan, K., London, P., Fazel, M., Witten, D., and Lee, S.-I. (2014). Node-based learning of multiple gaussian graphical models. The Journal of Machine Learning Research, 15(1):445–488.
  • Turney (1996) Turney, P. (1996). The identification of context-sensitive features: A formal definition of context for concept learning. In 13th International Conference on Machine Learning (ICML96), Workshop on Learning in Context-Sensitive Domains, pages 60–66.
  • Van de Cruys (2011) Van de Cruys, T. (2011). Two multivariate generalizations of pointwise mutual information. In Proceedings of the Workshop on Distributional Semantics and Compositionality, pages 16–20. Association for Computational Linguistics.
  • Verhaak et al. (2010) Verhaak, R. G., Hoadley, K. A., Purdom, E., Wang, V., Qi, Y., Wilkerson, M. D., Miller, C. R., Ding, L., Golub, T., Mesirov, J. P., et al. (2010). Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and nf1. Cancer cell, 17(1):98–110.
  • Zhang and Poole (1999) Zhang, N. L. and Poole, D. L. (1999). On the role of context-specific independence in probabilistic inference. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, IJCAI 99, Stockholm, Sweden, July 31 - August 6, 1999, pages 1288–1293.