Relief-Based Feature Selection: Introduction and Review

11/22/2017 ∙ by Ryan J. Urbanowicz, et al. ∙ University of Pennsylvania ∙ Ursinus College

Feature selection plays a critical role in data mining, driven by increasing feature dimensionality in target problems and growing interest in advanced but computationally expensive methodologies able to model complex associations. Specifically, there is a need for feature selection methods that are computationally efficient, yet sensitive to complex patterns of association, e.g. interactions, so that informative features are not mistakenly eliminated prior to downstream modeling. This paper focuses on Relief-based algorithms (RBAs), a unique family of filter-style feature selection algorithms that strike an effective balance between these objectives while flexibly adapting to various data characteristics, e.g. classification vs. regression. First, this work broadly examines types of feature selection and defines RBAs within that context. Next, we introduce the original Relief algorithm and associated concepts, emphasizing the intuition behind how it works, how feature weights generated by the algorithm can be interpreted, and why it is sensitive to feature interactions without evaluating combinations of features. Lastly, we include an expansive review of RBA methodological research beyond Relief and its popular descendant, ReliefF. In particular, we characterize branches of RBA research, and provide comparative summaries of RBA algorithms including contributions, strategies, functionality, time complexity, adaptation to key data characteristics, and software availability.


1 Background

The fundamental challenge of almost any data mining or modeling task is to identify and characterize relationships between one or more features in the data (also known as predictors or attributes) and some endpoint (also known as the dependent variable, class, outcome, phenotype, or concept). In most datasets, only a subset of available features are relevant features, i.e. informative in determining the endpoint value. The remaining irrelevant features, which are rarely distinguishable a priori in real world problems, are not informative yet contribute to the overall dimensionality of the problem space. This increases the difficulty and computational burden placed on modeling methods. Feature selection could generically be defined as the process of identifying relevant features and discarding irrelevant ones.

Figure 1 illustrates the typical stages of a data mining analysis pipeline. Specifically, raw data is preprocessed in preparation for analysis. This typically includes some type of cross validation, where the data is split into training, validation, and testing subsets to avoid overfitting and assess the generalizability of the final model. Next, different feature processing approaches can be employed to remove irrelevant features or construct new, more informative ones. Modeling then takes place on this preprocessed data. Model performance could then feed back into another round of feature processing (dotted line). This is the case for wrapper feature selection methods, reviewed below. The final model is ultimately assessed and interpreted in a post-analysis stage that ideally leads to the discovery of useful knowledge. Feature selection is an important part of a successful data mining pipeline, particularly in problems with very large feature spaces. Poorly performed feature selection can have significant downstream consequences on data mining, particularly when relevant features have been mistaken as irrelevant and removed from consideration.

Figure 1: Typical stages of a data mining analysis pipeline. Feature selection is starred as it is the focus of this review. The dotted line indicates how model performance can be fed back into feature processing, iteratively removing irrelevant features or seeking to construct relevant ones.

1.1 Types of Feature Selection

A large variety of feature selection methodologies have been proposed and research continues to support the claim that there is no universal “best” method for all tasks (Bolón-Canedo et al., 2013). In order to navigate methodological options and assist in selecting a suitable method for a given task it is useful to start by characterizing and categorizing different feature selection methods (Dash and Liu, 1997; Ni, 2012; Bolón-Canedo et al., 2013; Jović et al., 2015). One such characterization is with regards to the feature selection objective.

  1. Idealized: find the minimally sized feature subset that is necessary and sufficient to describe the target concept (Kira and Rendell, 1992b).

  2. Target Feature Count: select a subset of d features from the total set of features, such that the value of a criterion function is optimized over all subsets of size d (Narendra and Fukunaga, 1977).

  3. Prediction Accuracy Improvement: choose a subset of features that best increases prediction accuracy or decreases model complexity without significantly decreasing the prediction accuracy (Inza et al., 2000).

  4. Approximate Original Class Prediction Probability Distribution: for classification problems, select a feature subset that yields a class prediction probability distribution that is as close as possible to the class prediction probability distribution given all features. In contrast with prediction accuracy, this perspective seeks to preserve additional information regarding the probabilities of class predictions (Koller and Sahami, 1996).

  5. Rank and Define Cutoff: first rank all features using some surrogate measure of feature ‘value’, then define the feature subset by applying an ad-hoc cutoff. This cutoff may be determined by statistical or subjective likelihood of relevance or simply a desired number of features in the subset (Kira and Rendell, 1992a).

This list is an updated version of one regularly used in the literature (Dash and Liu, 1997; Mlambo et al., 2016; Tang et al., 2014; Gore and Govindaraju, 2016). Alternatively, feature selection methods can be distinguished based on their relationship with the construction of the model (i.e. induction) (Saeys et al., 2007; Bolón-Canedo et al., 2013; Chandrashekar and Sahin, 2014; Tang et al., 2014; Jović et al., 2015; Mlambo et al., 2016).

  1. Filter Methods: use a ‘proxy measure’ calculated from the general characteristics of the training data to score features or feature subsets as a processing step prior to modeling. Filters are generally much faster and function independently of the induction algorithm, meaning that selected features can then be passed to any modeling algorithm. Filter methods can be roughly classified further by the filtering measures they employ, i.e. information, distance, dependence, consistency, similarity, and statistical measures (Dash and Liu, 1997; Bolón-Canedo et al., 2013; Jović et al., 2015). Examples include information gain (Hunt et al., 1966), chi-square (Jin et al., 2006), and Relief (Kira and Rendell, 1992b).

  2. Wrapper Methods: employ any stand-alone modeling algorithm to train a predictive model using a candidate feature subset. The testing performance on a hold-out set is typically used to score the feature subset. Alternatively, in a modeling algorithm like a random forest, estimated feature importance scores can be applied to select a feature subset (Menze et al., 2009). In any wrapper method, a new model must be trained to test any subsequent feature subset; therefore wrapper methods are typically iterative and computationally intensive, but can identify the best performing feature set for that specific modeling algorithm (Guyon and Elisseeff, 2003; Bolón-Canedo et al., 2013; Jović et al., 2015). In each iteration of the wrapper, the feature subset is generated based on the selected search strategy, e.g. forward or backward selection (Kittler, 1978; Langley, 1994) or a heuristic feature subset selection (Van Laarhoven and Aarts, 1987; Holland, 1992). Examples include wrappers for Naïve Bayes (Cortizo and Giraldez, 2006), Support Vector Machines (SVM) (Bradley and Mangasarian, 1998), and most any modeling algorithm combined with a feature subset generation approach. Thus a wrapper method is defined by both the selected induction algorithm and the feature subset search strategy. However, due to the computational complexity of wrappers, only the simplest modeling methods can be used efficiently.

  3. Embedded Methods: perform feature selection as a part of the modeling algorithm’s execution. These methods tend to be more computationally efficient than wrappers because they simultaneously integrate modeling with feature selection. This can be done, for instance, by optimizing a two-part objective function with (1) a goodness-of-fit term and (2) a penalty for a larger number of features. As with wrappers, the features selected by embedded methods are induction algorithm dependent (Guyon and Elisseeff, 2003; Bolón-Canedo et al., 2013; Jović et al., 2015). Examples include Lasso (Tibshirani, 1996), Elastic Net (Zou and Hastie, 2005), and various decision tree based algorithms, e.g. CART (Breiman et al., 1984), C4.5 (Quinlan, 1993), and most recently, XGBoost (Chen and Guestrin, 2016).

Many hybrid methods have also been proposed that seek to combine the advantages of wrappers and filters (Jović et al., 2015).

Lastly, feature selection approaches have also been broadly categorized as relying on either individual evaluation or subset evaluation (Yu and Liu, 2004; Bolón-Canedo et al., 2013). Individual evaluation (i.e. feature weighting/ranking) assesses individual features and assigns them weights/scores according to their degrees of relevance (Blum and Langley, 1997; Yu and Liu, 2004). Subset evaluation instead assesses candidate feature subsets that are selected based on a given search strategy (Bolón-Canedo et al., 2013). Filter, wrapper, or embedded methods can rely on either subset or individual evaluation.

The remainder of this paper will focus on the family of Relief-based feature selection methods referred to here as Relief-Based Algorithms (RBAs), which can be characterized as individual evaluation filter methods. For reviews of feature selection methods in general, we refer readers to (Langley, 1994; Dash and Liu, 1997; Guyon and Elisseeff, 2003; Belanche and González, 2011; Bolón-Canedo et al., 2013; Tang et al., 2014; Chandrashekar and Sahin, 2014; Jović et al., 2015; Mlambo et al., 2016).

1.2 Why Focus on Relief-based Feature Selection?

One advantage of certain wrapper or embedded methods is that, by relying on subset evaluation, they have the potential to capture feature dependencies in predicting the endpoint, i.e. interactions (Bolón-Canedo et al., 2013). In contrast, very few filter methods claim to be able to handle feature interactions; notable exceptions such as FOCUS (Almuallim and Dietterich, 1991) and INTERACT (Zhao and Liu, 2009) are subset evaluation filters. The most reliable but naive approach for identifying feature interactions is to exhaustively search over all subsets of the given feature set, e.g. FOCUS. This quickly becomes computationally intractable in problems with larger feature spaces, because such approaches must explicitly search through combinations of features. Alternatively, the Relief algorithm and its derivatives are, to the best of our knowledge, the only individual evaluation filter algorithms capable of detecting feature dependencies. These algorithms do not search through feature combinations, but rather use the concept of nearest neighbors to derive feature statistics that indirectly account for interactions. Furthermore, RBAs retain the generalized advantages of filter algorithms: they are relatively fast (with an asymptotic time complexity of O(n²·a), where n is the number of training instances and a is the number of features), and selected features are not induction algorithm dependent. The ability to confidently utilize selected features with different induction algorithms may save further downstream computational effort when applying more than a single modeling technique. This is relevant considering the ‘no free lunch’ theorem proposed by Wolpert and Macready (1997), which suggests no one modeling algorithm can be optimal for all problems, and the widely acknowledged value of ensemble methods, reviewed by Rokach (2010), that combine input from multiple statistical or machine learning induction methods to make the best informed predictions. Lastly, individual evaluation approaches, including RBAs, offer a greater flexibility of use. Specifically, individual feature weights may be applied not only to select ‘top’ features, but can also be applied as expert knowledge to guide stochastic machine learning algorithms such as evolutionary algorithms (Urbanowicz et al., 2012). Furthermore, when selecting features, feature sets of different sizes can be selected based on whatever criterion is desired for feature inclusion from a ranked feature list.

The following two subsections address important considerations related to our assertion that RBAs deserve particular attention. A closer look at the strengths and weaknesses of the Relief algorithm is given in Section 2.1.1.

1.2.1 Feature Construction

An alternative or supplemental approach to facilitate the detection and modeling of interactions is to apply feature construction (see Figure 1), also known as constructive induction or feature extraction. Feature construction methods, e.g. principal component analysis or linear discriminant analysis (Martínez and Kak, 2001), define new features as a function of two or more other features (Michalski, 1983). This subset of constructed features can be added to the original feature space, or analyzed in its place (achieving dimensionality reduction). A common side effect of most any feature construction method is that the original features are no longer recognizable, leading to challenges in downstream model interpretability.

One feature construction method geared specifically towards capturing feature interactions is multifactor dimensionality reduction (MDR) (Ritchie et al., 2001). Another more general example is polynomial feature construction, which is able to detect multiplicative interactions (Sutton and Matheus, 1991) (see the sketch following this paragraph). These approaches attempt to combine individual features that may be interacting and construct a single feature that can be more easily identified as relevant using any simple feature selection or induction method. There are many possible feature construction approaches to choose from, and some can be quite computationally expensive. Notably, applying feature construction does not necessarily preclude the need for feature selection. Thus, assuming that a feature selection and modeling approach has been chosen that is sensitive to a target interaction dimensionality (e.g. 2-way or 3-way), it may be most efficient to skip feature construction, particularly if downstream model interpretation is critical. While feature construction certainly has its own utility, further discussion is outside the scope of this review.
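As a small illustration of the polynomial approach, a hedged sketch using scikit-learn’s PolynomialFeatures to append pairwise product terms, which can expose a multiplicative 2-way interaction to an otherwise myopic selector:

  import numpy as np
  from sklearn.preprocessing import PolynomialFeatures

  X = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
  constructed = PolynomialFeatures(degree=2, interaction_only=True,
                                   include_bias=False).fit_transform(X)
  # columns: the original features followed by the products x1*x2, x1*x3, x2*x3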

1.2.2 Redundancy

Relevant features can be more restrictively defined as any feature that is neither irrelevant nor redundant to the target concept (Koller and Sahami, 1996; Dash and Liu, 1997). Feature redundancy is explored further by Yu and Liu (2004). Some feature selection methods seek to remove redundant features while others do not. Caution should be used when removing presumably redundant features, because unless two features are perfectly correlated (i.e. truly redundant) there may still be information to be gained from including them both (Guyon and Elisseeff, 2003). One repeatedly noted drawback of RBAs is that they do not remove feature redundancies, i.e. they seek to select all features relevant to the endpoint regardless of whether some features are strongly correlated with others (Kira and Rendell, 1992b; Belanche and González, 2011; Flórez-López, 2002). However, except for features that are perfectly correlated, it is not always clear whether useful information is being lost when ‘redundant’ features are removed. For example, it has been suggested that preserving redundant features can be a benefit, as it “may point to meaningful clusters of correlated phenotypes” (Todorov, 2016). If removing redundancy is clearly important to success in a given problem domain, many effective methods are available that can be applied before, after, or integrated with RBA feature selection to remove feature redundancies (Bins and Draper, 2001; Flórez-López, 2002; Yang and Li, 2006; Sun, 2007; Challita et al., 2015; Agre and Dzhondzhorov, 2016; Liu et al., 2015; Guyon et al., 2003).

1.3 Paper Summary

In the text that follows, we (1) introduce RBAs from the perspective of the original Relief algorithm noting key concepts and intuitions, (2) examine the contributions of the landmark ReliefF algorithm, (3) differentiate thematically distinct branches of RBA research, (4) review methodological expansions and advancements introduced by derivative members of the RBA family in the wake of Relief and ReliefF, (5) consider RBA evaluations, and (6) summarize software availability. This review was prepared to complement a comprehensive research comparison of ‘core’ RBAs presented by Urbanowicz et al. (2018).

2 Introduction to Relief

In this section we provide algorithmic and conceptual descriptions of the original Relief algorithm relevant to understanding all members of the RBA family.

2.1 Relief

Kira and Rendell (1992b, a) formulated the original Relief algorithm inspired by instance-based learning (Aha et al., 1991; Callan et al., 1991). As an individual evaluation filtering feature selection method, Relief calculates a proxy statistic for each feature that can be used to estimate feature ‘quality’ or ‘relevance’ to the target concept (i.e. predicting endpoint value). These feature statistics are referred to as feature weights (W[A] = the weight of feature ‘A’), or more casually as feature ‘scores’, which can range from -1 (worst) to +1 (best). Notably, the original Relief algorithm was limited to binary classification problems and had no mechanism to handle missing data. Strategies to extend Relief to multi-class or continuous endpoint problems are not detailed here, but are described in the respective works cited in the review section of this paper.

  Input: for each training instance, a vector of feature values and the class value
     n := the number of training instances
     a := the number of features (i.e. attributes)
     Parameter: m := the number of random training instances out of n used to update W

  initialize all feature weights W[A] := 0.0
  for i := 1 to m do
     randomly select a ‘target’ instance R_i
     find a nearest hit ‘H’ and nearest miss ‘M’ (instances)
     for A := 1 to a do
        W[A] := W[A] - diff(A, R_i, H)/m + diff(A, R_i, M)/m
     end for
  end for
  return the vector W of feature scores that estimate the quality of features
Algorithm 1 Pseudo-code for the original Relief algorithm

As summarized by the pseudo-code in Algorithm 1, the Relief algorithm cycles through m random training instances (R_i), selected without replacement, where m is a user-defined parameter. Each cycle, R_i is the ‘target’ instance and the feature score vector W is updated based on feature value differences observed between the target and neighboring instances. Therefore each cycle, the distance between the ‘target’ instance and all other instances is calculated. Relief identifies two nearest neighbor instances of the target: one with the same class, called the nearest hit (H), and the other with the opposite class, called the nearest miss (M). The last step of the cycle updates the weight of a feature A in W if the feature value differs between the target instance R_i and either the nearest hit or the nearest miss (see Figure 2). Features that have a different value between R_i and M support the hypothesis that they are informative of outcome, so the quality estimation W[A] is increased. Conversely, features with differences between R_i and H provide evidence to the contrary, so the quality estimation W[A] is decreased. The diff function in Algorithm 1 calculates the difference in value of feature A between two instances I1 and I2, where I1 = R_i and I2 is either H or M, when performing weight updates (Robnik-Šikonja and Kononenko, 2001). For discrete (e.g. categorical or nominal) features, diff is defined as:

  diff(A, I1, I2) = 0 if value(A, I1) = value(A, I2); 1 otherwise        (1)

and for continuous (e.g. ordinal or numerical) features, diff is defined as:

  diff(A, I1, I2) = |value(A, I1) - value(A, I2)| / (max(A) - min(A))        (2)

The maximum and minimum values of A are determined over the entire set of instances. This normalization ensures that weight updates fall between 0 and 1 for both discrete and continuous features. Additionally, in updating W[A], dividing the output of diff by m guarantees that all final weights will be normalized within the interval [-1, 1].
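To make Algorithm 1 concrete, below is a minimal runnable sketch in Python (our illustrative code, not from the original publication), assuming numerically encoded features in a NumPy array X, binary labels y, and the m = n convention discussed later; neighbors are found with the Manhattan distance over diff values, as described below.

  import numpy as np

  def relief(X, y, m=None, seed=0):
      """Relief feature weights for binary-class data; X is an (n, a) array."""
      n, a = X.shape
      m = n if m is None else m                  # m = n: every instance is a target once
      rng = np.random.default_rng(seed)
      span = X.max(axis=0) - X.min(axis=0)       # max(A) - min(A) per feature, for Eq. (2)
      span[span == 0] = 1.0                      # guard against constant features
      W = np.zeros(a)
      for i in rng.choice(n, size=m, replace=False):
          d = np.abs(X - X[i]) / span            # per-feature diff to every instance
          dist = d.sum(axis=1)                   # Manhattan distance (sum of diffs)
          dist[i] = np.inf                       # the target is not its own neighbor
          same = y == y[i]
          H = np.where(same, dist, np.inf).argmin()   # nearest hit
          M = np.where(~same, dist, np.inf).argmin()  # nearest miss
          W += (d[M] - d[H]) / m                 # miss diffs raise W[A], hit diffs lower it
      return W

For 0/1-encoded discrete features, |x1 - x2|/span reduces exactly to the discrete diff of Equation 1.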

Figure 2: Relief updating W[A] for a given target instance when it is compared to its nearest miss and hit. In this example, features are discrete with possible values of X, Y, or Z, and the endpoint is binary with a value of 0 or 1. Notice that when the value of a feature is different, the corresponding feature weight increases by 1/m for the nearest miss, and decreases by 1/m for the nearest hit.

The diff function is also used to calculate the distance between instances when finding nearest neighbors. The total distance is simply the sum of diff distances over all attributes (i.e. Manhattan distance). Technically, the original Relief algorithm used Euclidean distance rather than Manhattan distance, i.e. the diff terms were squared during instance distance measurements and feature weighting. However, experiments by Kononenko et al. (1997) indicated no significant difference between results using diff or squared diff, thus the simplified description of the Relief algorithm has become standard. It has also been suggested that any valid distance metric could be used by Relief (Todorov, 2016). Thus, determining the best distance metric remains an open research question. While the above diff function performs well when features are either uniformly discrete or continuous, it has been noted that given a dataset with a mix of discrete and continuous features, this diff function can underestimate the quality of the continuous features (Kononenko and Šikonja, 2008). One proposed solution to this problem is a ramp function that naively assigns a full diff of 0 or 1 if continuous feature values are, respectively, less than some user-defined minimum distance or more than some user-defined maximum distance apart from one another, and a linear function of the distance from these boundaries otherwise (Hong, 1997; Robnik-Šikonja and Kononenko, 2003; Kononenko and Šikonja, 2008). However, since this approach adds two additional user-defined parameters requiring problem-dependent optimization, it may be challenging to apply in practice.
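As a small illustration, a hedged sketch of such a ramp-style diff for a continuous feature, where t_eq and t_diff stand in for the two user-defined boundary parameters (the names are ours):

  def ramp_diff(x1, x2, t_eq, t_diff):
      d = abs(x1 - x2)
      if d <= t_eq:                          # closer than t_eq: treated as equal
          return 0.0
      if d >= t_diff:                        # farther than t_diff: fully different
          return 1.0
      return (d - t_eq) / (t_diff - t_eq)    # linear ramp between the two boundaries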

2.1.1 Strengths and Limitations

Regarding strengths, Relief has been presented as being both non-myopic (Kononenko and Šikonja, 2008), i.e. it estimates the quality of a given feature in the context of other features, and non-parametric (Todorov, 2016), i.e. it makes no assumptions regarding the population distribution or sample size. The efficiency of the algorithm has been attributed to the fact that it doesn’t explicitly explore feature subsets and does not attempt to identify an optimal minimum feature subset size (Kira and Rendell, 1992b). Instead, Relief was originally “intended as a screener to identify a subset of features that may not be the smallest and may still include some irrelevant and redundant features, but that is small enough to use with more refined approaches in a detailed analysis” (Todorov, 2016). Consider that an exhaustive search for interactions between all feature pairs alone would have a time complexity that is quadratic in the number of features a, while Relief boasts a time complexity of O(m·n·a), or O(n²·a) whenever m = n. Furthermore, it has been suggested that Relief could be viewed as an anytime algorithm, i.e. one that can be stopped and yield results at any time, but it is presumed that with more time or data it will improve the results (Robnik-Šikonja and Kononenko, 2003).

Regarding limitations, the original Relief analysis suggests that the algorithm can be fooled by insufficient training cycles (i.e. too small an m). The original paper also suggests that Relief is fairly noise-tolerant and unaffected by feature interactions. However, later work identified that Relief was susceptible to noise interfering with the selection of nearest neighbors (Kononenko, 1994). Further, research into RBAs has, until recently, been limited to considering 2-way feature interactions only. Therefore, it was unclear if RBAs could detect feature interactions with a dimensionality beyond 2 features. Research paired with this review suggests that only specific RBAs have the ability to detect higher order interactions (e.g. 3-way, 4-way, and 5-way); thus RBAs are only universally reliable in detecting 2-way interactions (Urbanowicz et al., 2018). Relief has also been noted to have a reduced power to identify relevant non-monotonic features (e.g. features with a Gaussian distribution) (Bins and Draper, 2002). Most importantly, it has been repeatedly demonstrated both empirically and theoretically that core RBA performance deteriorates as the number of irrelevant features becomes ‘large’ (Robnik-Šikonja and Kononenko, 2003; Moore and White, 2007; Eppstein and Haake, 2008; Todorov, 2016). This deterioration of performance in identifying interacting features is primarily due to the fact that Relief’s computation of neighbors and weights becomes increasingly random as the number of features increases; this is an example of the curse of dimensionality. The iterative RBAs reviewed in Section 3.4 have been demonstrated to improve RBA performance in these types of large feature spaces. Differently, deteriorating performance in detecting main effects in very large feature spaces is primarily due to feature scores being based on feature value differences between a subset of neighboring instances, rather than differences from all instances. Thus the main effect signal in RBA scores is not expected to be as pronounced, and given a very large feature space, such a less pronounced main effect score is less likely to stand out. In general, it is likely that myopic feature selection algorithms that compute scores by comparing the feature values of all training instances will have the most power to detect simple main effects (McKinney et al., 2013). In addition to an iterative RBA approach, another way to address lost main effect performance could involve running both an RBA and a myopic feature selection algorithm, then selecting the top non-redundant set of features combined from both algorithms (as sketched below). Unfortunately, as of yet, there are no clear guidelines regarding the size of the feature space where (1) myopic methods would be expected to outperform RBAs in detecting main effects, or (2) interaction effects can’t be distinguished from random background noise. Simulation studies such as those reviewed in Section 3.6 offer some insight; however, in real-world applications many factors are expected to impact success (e.g. number of training instances, type of signal, signal strength, feature distributions, feature type, etc.).
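A hedged sketch of that combined strategy, pairing the relief() sketch from Section 2.1 with scikit-learn’s mutual_info_classif as the myopic scorer; the union rule and top_k value are our illustrative choices rather than an established protocol:

  import numpy as np
  from sklearn.feature_selection import mutual_info_classif

  def combined_top_features(X, y, top_k=50):
      rba_top = np.argsort(relief(X, y))[::-1][:top_k]               # top RBA-ranked features
      myopic_top = np.argsort(mutual_info_classif(X, y))[::-1][:top_k]
      return np.union1d(rba_top, myopic_top)       # de-duplicated union of both top-k sets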

Lastly, it is notable that Relief scores do not reflect the nature of an association. For example, Relief does not tell you which attributes might be interaction partners, or whether a score is high due to a linear effect or an interaction. This is left to downstream modeling. Furthermore, there is no established way to assess how many of the high scoring selected features may be false discoveries. It is possible this issue could be addressed through permutation testing, as suggested by McKinney et al. (2013).

2.2 Feature Subset Selection

The original description of Relief specified an automated strategy for feature subset selection (Kira and Rendell, 1992b). Specifically, a relevance threshold (τ) was defined such that any feature with a relevance weight W[A] ≥ τ would be selected. Kira and Rendell demonstrated that “statistically, the relevance level of a relevant feature is expected to be larger than zero and that of an irrelevant one is expected to be zero (or negative)”. Therefore, generally the threshold should be selected such that τ > 0. More precisely, they proposed the bounds 0 < τ ≤ 1/√(αm), based on Chebyshev’s inequality, where α is the probability of accepting an irrelevant feature as relevant (i.e. making a Type I error). If τ is set too high, there is an increased chance that one or more relevant features will fail to be selected. Alternatively, if τ is set too low, it is expected that an increased number of irrelevant features will be selected. Like any significance threshold, the choice of τ is somewhat arbitrary. Not all features with a weight above the selected threshold will necessarily be relevant, because it is expected that some irrelevant features will have a positive weight by chance.

In practice, rather than choosing a value of τ, it is often more practical to choose some number of features to be selected a priori based on the functional, computational, or run time limitations of the downstream modeling algorithms that will be applied. Ultimately the goal is to provide the best chance that all relevant features are included in the selected set for modeling, but at the same time remove as many of the irrelevant features as possible to facilitate modeling, reduce over-fitting, and make the task of induction tractable.
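Both strategies are straightforward to express over a Relief weight vector W; a hedged sketch, where the threshold variant follows the 0 < τ ≤ 1/√(αm) bound above and n_keep is the a priori feature count (function names are ours):

  import numpy as np

  def select_by_threshold(W, m, alpha=0.05):
      tau = 1.0 / np.sqrt(alpha * m)        # Chebyshev upper bound; any 0 < tau <= this is valid
      return np.flatnonzero(W >= tau)       # indices of features scoring at least tau

  def select_top_n(W, n_keep):
      return np.argsort(W)[::-1][:n_keep]   # indices of the n_keep highest-scoring features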

2.3 Intuition, Interpretation, and Interactions

Relief algorithms often appear simple at first glance, but understanding how to interpret feature weights, and how feature dependencies (i.e. interactions) can be detected without explicitly considering feature subsets, is not always straightforward. The key idea behind Relief is to estimate feature relevance according to how well feature values distinguish concept (endpoint) values among instances that are similar (near) to each other. Two complementary interpretations of Relief feature weights have been derived and presented: (1) a probabilistic interpretation (Kononenko, 1994; Kononenko et al., 1996, 1997; Robnik-Šikonja and Kononenko, 2001, 2003) and (2) an interpretation as the portion of the explained concept changes (Robnik-Šikonja and Kononenko, 2001, 2003). Next, we summarize these interpretations and why they explain Relief’s ability to detect interactions.

2.3.1 Probabilistic Interpretation

The first interpretation is that the Relief weight estimate W[A] of feature A is an approximation of the following difference of probabilities:

  W[A] = P(different value of A | nearest instance from a different class)
         - P(different value of A | nearest instance from the same class)        (3)

Consider that as the number of nearest neighbors used in scoring increases from 1 and approaches the total number of instances n, this effectively eliminates the condition that instances used in scoring be ‘near’. Notably, if we were to eliminate the ‘near’ requirement from Equation 3, the formula becomes:

  W'[A] = P(different value of A | different class) - P(different value of A | same class)        (4)

As derived by Robnik-Šikonja and Kononenko (2003), it can be shown that, without the ‘near’ condition, Relief weights would be strongly correlated with impurity functions such as Gini-index gain. Impurity functions, including information gain (Hunt et al., 1966), gain ratio (Quinlan, 1993), Gini-index (Breiman et al., 1984), distance measure (De Mántaras, 1991), j-measure (Smyth and Goodman, 1992), and MDL (Kononenko, 1995), have often been used as myopic filter feature selection algorithms that assume features to be conditionally independent given the class.

Thus, it is the ‘nearest instance’ condition in Equation 3, and the resulting fact that Relief weights are averaged over local estimates in smaller parts of the instance subspace (rather than globally over all instances), that enables Relief algorithms to take into account the context of other features and detect interactions (Kononenko et al., 1997; Kononenko and Šikonja, 2008). It has been demonstrated by Robnik-Šikonja and Kononenko (2003) that as the number of neighbors used in scoring approaches n, the ability of Relief to detect feature dependencies disappears, since scoring is no longer limited to ‘near’ instances.

2.3.2 Concept Change Interpretation

The second interpretation of Relief weights has been argued to be more comprehensible and communicable than the probabilistic one (Robnik-Šikonja and Kononenko, 2001). The authors demonstrate that Relief relevance weights W[A] can be interpreted as the ratio between the number of explained changes in the concept and the number of examined instances. If a particular change can be explained in multiple ways, all ways share the credit for it in the quality estimate. Also, if several features are involved in one way of the explanation, all of them get the credit in their quality estimate (Robnik-Šikonja and Kononenko, 2003). To illustrate this idea, Table 1 presents a simple Boolean problem where the class is determined by the expression Class = A1 ∧ (A2 ∨ A3), such that all three features (A1, A2, and A3) are relevant.

Instance | A1  A2  A3 | Class | Responsible Features          | Score Change: A1    A2      A3
1        | 1   1   1  | 1     | A1                            |               1     0       0
2        | 1   1   0  | 1     | A1 or A2                      |               1/2   1/2     0
3        | 1   0   1  | 1     | A1 or A3                      |               1/2   0       1/2
4        | 1   0   0  | 0     | A2 or A3                      |               0     1/2     1/2
5        | 0   1   1  | 0     | A1                            |               1     0       0
6        | 0   1   0  | 0     | A1                            |               1     0       0
7        | 0   0   1  | 0     | A1                            |               1     0       0
8        | 0   0   0  | 0     | (A1 and A2) or (A1 and A3)    |               1     1/2     1/2
Total (responsibility / 8 instances)                          |               0.75  0.1875  0.1875
Table 1: Tabular dataset description of the Boolean problem Class = A1 ∧ (A2 ∨ A3), including the responsibility of each feature for yielding an expected class change. Adapted from Robnik-Šikonja and Kononenko (2003).

In the first instance of Table 1 it can be said that A1 is responsible for class assignment, because changing its value would be the only single feature value change necessary to make Class = 0. In the second instance, changing either A1 or A2 would make Class = 0, thus they share the responsibility. Similar responsibility assignments can be made for instances 3 to 7, while in instance 8, changing only one feature value isn’t enough for the class to change; however, there are two pairs of feature value changes that can. As detailed by Robnik-Šikonja and Kononenko (2003), adding up the responsibility for each feature results in a score estimate of 0.75 for A1, and an estimate of 0.1875 for both A2 and A3. This result makes sense given that A1 clearly has a stronger linear association with the class (i.e. a main effect), but both A2 and A3 contribute to a lesser extent, interacting with A1 in a subset of instances. This conceptual example was validated empirically by Robnik-Šikonja and Kononenko (2003), finding in a dataset with these three relevant features along with five random binary features that the Relief relevance estimate for A1 converges near 0.75, and estimates for A2 and A3 converge near 0.1875, with an increasing number of training instances.

2.3.3 Breaking Down Interaction Detection

To further clarify how Relief detects interactions, Table 2 offers a simple example dataset that we will use to walk through Relief scoring. In this example, A1 and A2 are relevant features: when they have different values, Class = 1; otherwise, Class = 0. This is an example of a ‘pure’ interaction, where no individual feature has an association with endpoint. A3 is an irrelevant feature.

Instance | A1  A2  A3 | Class
I1       | 1   0   1  | 1
I2       | 1   0   0  | 1
I3       | 0   1   1  | 1
I4       | 0   1   0  | 1
I5       | 0   0   1  | 0
I6       | 0   0   0  | 0
I7       | 1   1   1  | 0
I8       | 1   1   0  | 0
Table 2: Example dataset with an interaction between A1 and A2. A3 is irrelevant. Adapted from (Kononenko et al., 1997).

Table 3 breaks down how scoring would proceed over 8 cycles, with each instance getting to be the respective target. For each target, we see which instance is the nearest hit and miss, as well as which feature has a different value between the instances (given in parentheses), and thus is relevant to scoring. If there is a tie for nearest neighbor, both instances are listed with their respective different-valued feature. For example, when I1 is the target, its nearest hit is I2. The only feature with a different value between these two instances is A3. The nearest miss for I1 is a tie between I5 and I7, which have feature value differences at A1 and A2, respectively.

Target | Nearest Hit | Nearest Miss
I1     | I2 (A3)     | I5 (A1), I7 (A2)
I2     | I1 (A3)     | I6 (A1), I8 (A2)
I3     | I4 (A3)     | I7 (A1), I5 (A2)
I4     | I3 (A3)     | I8 (A1), I6 (A2)
I5     | I6 (A3)     | I1 (A1), I3 (A2)
I6     | I5 (A3)     | I2 (A1), I4 (A2)
I7     | I8 (A3)     | I3 (A1), I1 (A2)
I8     | I7 (A3)     | I4 (A1), I2 (A2)
Table 3: Breakdown of Relief nearest neighbors (i.e. hits and misses) and corresponding feature value differences given in parentheses when a given instance from Table 2 is the target.

Table 4 summarizes the resulting number of nearest hit and miss score contributions from Table 3. When there is a tie between instances for nearest neighbor, we give each feature difference half credit, since only one can contribute at a time. We can see from Table 3 that among nearest hits we observe no feature value differences for A1 or A2, but a total of 8 of them for A3 across all 8 cycles. Among nearest misses, we observe 8 feature value differences for both A1 and A2; however, since it would be only one or the other in each scoring iteration, they each receive a total of 4. Lastly, the Relief scoring scheme applies negative scoring to nearest hits, and positive scoring to nearest misses. As seen in this simple example, Relief easily differentiates between the relevant interacting features A1 and A2 (each with a final score of +4) and the irrelevant feature A3 (with a final score of -8).

Relief Scoring     | A1  | A2  | A3  | Weight
Nearest Hit        | 0   | 0   | 8   | -1
Nearest Miss       | 4   | 4   | 0   | +1
Relief Score Total | +4  | +4  | -8  |
Table 4: Summary of score contributions in the 2-way epistasis problem yielding Relief scores.
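As a check on this walkthrough, running the relief() sketch from Section 2.1 on the Table 2 dataset (with m = n = 8) reproduces these totals after the 1/m normalization: 4/8 = 0.5 for A1 and A2, and -8/8 = -1.0 for A3. Note that the sketch’s argmin tie-breaking credits only one of the two tied misses per cycle, but across the eight cycles the credit splits evenly, matching the half-credit bookkeeping above.

  X = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0],
                [0, 0, 1], [0, 0, 0], [1, 1, 1], [1, 1, 0]], dtype=float)
  y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
  print(relief(X, y))   # -> [ 0.5  0.5 -1. ]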

3 A Review of Relief-based Algorithms

In this section, we offer a comprehensive review of the RBAs inspired by Relief highlighting major research themes, advancements, concepts, and adaptations.

3.1 ReliefF: The Best Known Variant

The original Relief algorithm (Kira and Rendell, 1992b) is rarely applied in practice anymore and has been supplanted by ReliefF (Kononenko, 1994) as the best known and most utilized RBA to date. Notably, the ’F’ in ReliefF refers to the sixth algorithm variation (from A to F) proposed by Kononenko (1994). The ReliefF algorithm has been detailed in a number of other publications (Kononenko et al., 1996, 1997; Robnik-Šikonja and Kononenko, 2003). Here we highlight four key ways that ReliefF differs from Relief. First, ReliefF relies on a ‘number of neighbors’ user parameter k that specifies the use of k nearest hits and k nearest misses in the scoring update for each target instance (rather than a single hit and miss). This change increased weight estimate reliability, particularly in noisy problems. A k of 10 was suggested based on preliminary empirical testing and has been widely adopted as the default setting. This algorithm variation was originally proposed under the name ReliefA.

Second, three different strategies were proposed to handle incomplete data (i.e. missing data values). These strategies were proposed under the names Relief(B-D). When encountering a missing value, the ‘best’ approach (ReliefD) sets the diff function equal to the class-conditional probability that two instances have different values for the given feature. This is implicitly an interpolation approach.

Third, two different strategies were proposed to handle multi-class endpoints. These strategies were proposed under the names ReliefE and ReliefF. ReliefF, which inherited the changes proposed in ReliefA and ReliefD, was selected as the ‘best’ approach. During scoring in multi-class problems, ReliefF finds k nearest misses from each ‘other’ class, and averages the weight update based on the prior probability of each class. Conceptually, this encourages the algorithm to estimate the ability of features to separate all pairs of classes regardless of which two classes are closest to one another. Lastly, since it is expected that as the parameter m approaches the total number of instances n the quality of the weight estimates becomes more reliable, Kononenko (1994) proposed the simplifying assumption that m = n. In other words, every instance in the dataset gets to be the target instance one time (i.e. instances are selected without replacement). We adopt this assumption in deriving the time complexity of RBAs below. This is why the asymptotic time complexity of ‘core’ Relief algorithms is given as O(n²·a), rather than O(m·n·a).
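For concreteness, a hedged sketch of this multi-class weight update for a single target instance i, reusing the numeric encoding and span normalization of the earlier relief() sketch; the helper name, in-place update, and the assumption that every class has more than k members are ours:

  import numpy as np

  def relieff_update(W, X, y, i, k, span):
      d = np.abs(X - X[i]) / span                   # per-feature diff to every instance
      dist = d.sum(axis=1)
      dist[i] = np.inf                              # exclude the target itself
      n = len(y)
      classes, counts = np.unique(y, return_counts=True)
      priors = dict(zip(classes, counts / n))
      hits = np.argsort(np.where(y == y[i], dist, np.inf))[:k]
      W -= d[hits].sum(axis=0) / (n * k)            # k nearest hits lower the weights
      for c in classes:
          if c == y[i]:
              continue
          misses = np.argsort(np.where(y == c, dist, np.inf))[:k]
          # k nearest misses from each 'other' class, weighted by that class's
          # prior rescaled over the probability mass of all non-target classes
          W += (priors[c] / (1.0 - priors[y[i]])) * d[misses].sum(axis=0) / (n * k)
      return W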

Algorithm (Reference(s)) | Focus | Handles (beyond discrete features and binary class) | Description/Contribution (Closest Parent Algorithm)
Relief (Kira and Rendell, 1992b, a) | C | continuous features | The first ‘filter’ feature selection algorithm sensitive to feature dependencies.
ReliefA (Kononenko, 1994) | C | continuous features | Introduced k nearest neighbor scoring to address noisy data. (Relief)
Relief(B-D) (Kononenko, 1994) | D | continuous features, missing data | Strategies for handling incomplete data; ReliefD selected as ‘best’. (ReliefA)
Relief(E-F) (Kononenko, 1994) | D | continuous features, multi-class, missing data | Strategies for multi-class endpoints; ReliefF became the standard. (ReliefD)
RReliefF (Kononenko et al., 1996; Robnik-Šikonja and Kononenko, 1997) | D, C | continuous features, regression, missing data | Adapts Relief to regression problems; adopts exponential instance weighting by distance from target. (ReliefF)
Relieved-F (Kohavi and John, 1997) | C, D | continuous features, multi-class, missing data | Deterministic neighbor selection and incomplete data handling. (ReliefF)
Iterative Relief (Draper et al., 2003) | C, I | continuous features | Addresses bias against non-monotonic features; the first iterative approach. Neighbors uniquely determined by a radius, and instances weighted by distance from target. (Relief)
I-RELIEF (Sun and Li, 2006; Sun, 2007; Sun et al., 2010) | C, I | continuous features, multi-class | All instances sigmoidally weighted by distance from target, i.e. no defined neighbors; proposed an online learning variant, and local-learning updates between iterations for improved convergence. (Iterative Relief)
TuRF (Moore and White, 2007) | I | none | Addresses noise and large feature spaces with iterative removal of a fixed percent of the lowest scoring features. (ReliefA)
Evaporative Cooling ReliefF (McKinney et al., 2007, 2009; Le et al., 2017) | I | continuous features | Addresses noise and large feature spaces with iterative ‘evaporative’ removal of the lowest quality features via ReliefA and mutual information (or random forest scores); privacy protection variant. (ReliefA)
EReliefF (Park and Kwon, 2007) | D | continuous features, multi-class, missing data | Seeks to address issues related to incomplete and/or multi-class data. (ReliefF)
VLSReliefF (Eppstein and Haake, 2008; Lee et al., 2015) | E | none | Efficient interaction detection in large feature spaces by scoring random feature subsets. (ReliefA)
iVLSReliefF (Eppstein and Haake, 2008) | I, E | none | Iterative, TuRF-like extension of VLSReliefF. (VLSReliefF & TuRF)
ReliefMMS (Chikhi and Benhammada, 2009) | C | none | Feature weight relative to the average feature diff between instance pairs. (ReliefA)
SURF (Greene et al., 2009) | C | none | Threshold-based nearest neighbors for scoring. (ReliefA & Iterative Relief)
SURF* (Greene et al., 2010) | C | none | Introduces ‘far’ scoring to improve detection of epistatic interactions. (SURF)
SWRF* (Stokes and Visweswaran, 2012) | C | none | Extends SURF* with sigmoid weighting taking distance from the threshold into account; introduces a modular framework for Relief development (MoRF). (SURF*)
LH-RELIEF (Cai and Ng, 2012) | C, I | continuous features, multi-class | Feature weighting by measuring the margin between the sample and its hyperplane. (I-RELIEF)
MultiSURF* (Granizo-Mackenzie and Moore, 2013) | C | none | Target-instance-defined neighbor threshold and dead-band no-score zone. (SURF*)
ReliefSeq (McKinney et al., 2013) | C | continuous features, multi-class, missing data | Feature-wise adaptive k; the choice of k impacts differential detection of main effects vs. interactions. (ReliefA)
MultiSURF (Urbanowicz et al., 2018) | C, D | continuous features, multi-class, regression, missing data | Removed ‘far’ scoring from MultiSURF* to recover main effects; added strategies to address data types as part of ReBATE; RBAs succeed with heterogeneity. (MultiSURF*)
(Focus codes: C = core scoring algorithm, I = iterative approach, E = efficiency, D = data type handling. Variables used throughout this review: n = number of training instances; a = number of features; m = user-specified number of target instances; k = user-specified number of nearest hits/misses; i = number of iterations, determined by user parameters; f_s = number of features per subset and s = number of subsets for VLSReliefF; k_max = the maximum k considered by ReliefSeq.)
Table 5: Summary of key Relief-based algorithms.

3.2 Organizing RBA Research

Following ReliefF, a number of variations and improvements have been proposed. Table 5 chronologically organizes summary information on key RBAs dealing with fundamental feature selection problems. Brief descriptions of the algorithms and their contributions are given along with our designation of the closest parent algorithm in parentheses. Parent algorithms may deviate from what was described in the respective publication(s). This is due to inconsistent nomenclature (e.g. ReliefA implementations being more generically referred to as ReliefF even if they did not include extensions for missing or multi-class data handling) and some previously missed citations of relevant work. In the sections following this review, we will adopt the name ‘ReliefF’ for any Relief algorithm that uses k nearest neighbors and makes the m = n assumption (regardless of any data type handling implementations), as has become common in the literature (Moore, 2015; Todorov, 2016).

All of the single-pass ‘core’ RBAs in Table 5 share the asymptotic time complexity of O(n²·a), which makes their run times easy to compare at the order-of-magnitude level; iterative variants multiply this cost by the number of iterations i, while VLSReliefF instead depends on the number (s) and size (f_s) of the feature subsets it scores. Beyond this, approximations of complete algorithm time complexities can be derived from the algorithmic descriptions in the respective publications, and these reveal more subtle run time differences. Such complete expressions decompose into additive terms: one for initializing the feature weights to 0.0, one for calculating all unique pairwise distances between instances, one for finding the nearest instances (or separating nearest from furthest in the case of SURF* and MultiSURF*), and one for updating the feature weights, plus algorithm-specific terms where required. Given the assumption that m = n, it is computationally more efficient to pre-compute all pairwise distances rather than computing them on a target-by-target basis as proposed in the original Relief algorithm.

Todorov (2016) suggested that there are two primary directions of RBA development: (1) strategies for selecting and/or weighting neighbors in scoring (i.e. what we call ‘core algorithm’ developments), and (2) strategies for moving beyond a single pass over the data to ‘iterative’ implementations. In Table 5, the ‘Focus’ designation identifies the respective research direction(s) of the corresponding algorithm, going beyond the two suggested by Todorov. These include: (1) ‘C’ for core algorithm, i.e. variants impacting a single run through the training data, such as variations in neighbor selection or scoring; (2) ‘I’ for iterative approach, i.e. variants designed to iteratively apply a core Relief algorithm for multiple cycles through the training data; (3) ‘E’ for efficiency, i.e. variants seeking to improve computational efficiency; and (4) ‘D’ for data type handling, i.e. variants that seek to address the challenges of different data types, including continuous feature values, multi-class endpoints, continuous endpoints (i.e. regression), or missing data values. The ‘Handles’ designation in Table 5 indicates whether the corresponding algorithm explicitly considered or implemented algorithm extensions to handle any of these four data types, beyond discrete features and binary classes.

We can make some basic observations from this table. First, relatively little attention has been paid to adapting RBAs to regression problems. Second, the majority of proposed variations have focused on data with discrete-valued features and a binary endpoint. Notably, many of these works have been application driven, focusing on feature selection in genomics problems with single nucleotide polymorphisms (SNPs) as features that can have one of three discrete values (0, 1, or 2) and a binary endpoint representing sick vs. healthy subjects (Moore and White, 2007; McKinney et al., 2007; Eppstein and Haake, 2008; Greene et al., 2009; McKinney et al., 2009; Greene et al., 2010; Stokes and Visweswaran, 2012; Granizo-Mackenzie and Moore, 2013). RBAs are particularly appealing in this domain since the number of features (a) in respective datasets is typically much larger than the number of available training instances (n), and RBAs have a linear time complexity with respect to a, but a quadratic time complexity with respect to n. However, core RBAs aimed at SNP analysis, such as SURF (Greene et al., 2009), SURF* (Greene et al., 2010), SWRF* (Stokes and Visweswaran, 2012), and MultiSURF* (Granizo-Mackenzie and Moore, 2013), were not originally extended to handle other basic data types. Table 5 concludes with our recently proposed core algorithm named MultiSURF (Urbanowicz et al., 2018). MultiSURF performed most consistently across a variety of problem types (e.g. 2-way and 3-way interactions as well as heterogeneous associations) in comparison with ReliefF, SURF, SURF*, MultiSURF*, and a handful of other non-RBA feature selection methods. The work by Urbanowicz et al. (2018) also extended MultiSURF along with ReliefF, SURF, SURF*, and MultiSURF* to handle a variety of different data type issues under a unified implementation framework called ReBATE.

The following subsections go into greater depth describing notable RBAs that fall into our ‘core’, ‘iterative’, ‘efficiency’ or ‘data type’ categories, as well as peripheral RBA research directions not included in this table.

3.3 Neighbor Selection and Instance Weighting

This section references algorithms in Table 5 with a core focus (C). How do we select nearest hits and misses? What number of neighboring instances should be used in feature scoring? Is there information to be gained from considering ‘far’ instance pairs? How should the scoring contribution of those neighboring instances be weighted (also referred to as observation weighting by Todorov (2016))? These are the primary questions that have been asked in the context of core RBAs. Note that instance weighting refers to the weight placed on an instance during the scoring update. By default, most RBAs (including ReliefF) assign neighboring instances a weight of 1, and all others a weight of 0. Figure 3 illustrates how a variety of RBAs (arranged chronologically) differ with respect to neighbor selection and instance weighting. For every RBA, we assume that each instance in the training data gets the opportunity to be the target instance during feature scoring. Note that for ReliefF in Figure 3, a k of 3 is chosen for illustration simplicity, but a k of 10 is most common. Figure 3 includes RBAs that adopt a ‘distance-from-target’ instance weighting scheme, i.e. Iterative Relief, I-RELIEF, and SWRF*, where instance weight ranges from 0 to 1. For all other RBAs in the figure, instances that are identified as either near or far have a full weight of 1, while all others have a zero instance weight in feature scoring. Three of the RBAs (i.e. I-RELIEF, SURF*, and SWRF*) are unique in giving all instances besides the target some weight each scoring cycle.

Figure 3: Illustrations of RBA neighbor selection and/or instance weighting schemes. Methods with a red/yellow gradient adopt an instance weighting scheme while other methods identify instances as ‘near’ or ‘far’ which then contribute fully to feature weight updates. These illustrations are conceptual and are not drawn to scale.

The original Relief algorithm used two nearest neighbors (i.e. one nearest hit and one nearest miss), each with an equal instance weighting (Kira and Rendell, 1992b). ReliefA through ReliefF used k nearest neighbors with equal instance weighting (Kononenko, 1994). Iterative Relief was the first to specify a radius around the target instance that would define the cutoff for which instances would be considered neighbors (Draper et al., 2003). Additionally, while RRelief (Kononenko et al., 1996) was the first to suggest differentially weighting instances based on their distance from the target instance in regression, Iterative Relief was the first to suggest this for discrete class problems (Draper et al., 2003). The effect was that the closest neighbors had a greater impact on feature weighting than those out towards the edge of the radius. I-RELIEF proposed forgoing the determination of neighbors entirely, instead using an instance weighting function over the entire set of hit and miss instances, again so that the closest neighbors had the greatest impact on feature weighting (Sun and Li, 2006; Sun, 2007; Sun et al., 2010). Similar to Iterative Relief, SURF employed a distance threshold T to define instances as neighbors (where T is equal to the average distance between all instance pairs in the data) (Greene et al., 2009). However, in contrast with Iterative Relief, SURF utilizes equal instance weights for all instances defined as neighbors.

The SURF* expansion introduced the concept of instances that were near vs. far from the target instance (Greene et al., 2010) (see Figure 3). Applying the same T from SURF, any instance within the threshold was considered near, and those outside were far. SURF* was similar to I-RELIEF in that all other instances besides the target contributed to scoring; this is reflected in the complete time complexity of the two algorithms, where the feature scoring term covers all instance pairs for both. However, SURF* weights all ‘near’ instances equally, and all ‘far’ instances in a similarly equal, but opposite way. Specifically, for far instances the scoring strategy is reversed relative to what is presented in Figure 2. Note that in mathematics the ’*’ indicates opposite; therefore RBAs that utilize ’far’ scoring have been given this affix. Some publications have instead used the affix ’STAR’ (e.g. SURFSTAR).

SWRF* integrated concepts from SURF* and I-RELIEF, preserving the definition of near and far established in SURF*, but adopting a sigmoid instance weighting function from I-RELIEF, so that the nearest neighbors have the greatest standard scoring weight, while the farthest ‘far’ instances have the greatest opposite scoring weight; instances near the threshold T have the smallest scoring weights. The width of the SWRF* sigmoid function is proportional to the standard deviation of all pairwise instance distances. In contrast with SWRF*, MultiSURF* took an alternate approach to discounting instances near T by introducing a dead-band zone on both the near and far side of T (Granizo-Mackenzie and Moore, 2013). Any instances that fell within this ‘middle’ distance zone were excluded from scoring (i.e. neither near nor far). Another major difference is that MultiSURF* defined T as the mean pairwise distance between the target instance and all others, as opposed to the mean of all instance pairs in the data. This adapts the definition of near/far to a given part of the feature space. Similarly, the width of the dead-band zone is the standard deviation of pairwise distances between the target instance and all others. One final difference between MultiSURF* and SURF* is that the ‘far’ scoring logic was inverted to save computational time. Specifically, in SURF*, differences in feature values in hits yielded a reduction in feature score, and an increase in misses. Since different feature values are expected to be more frequent in far individuals, in MultiSURF*, same feature values in hits yielded an increase in feature score, and a decrease in misses. Also recognizing the importance of neighbor selection, ReliefSeq proposed the concept of an adaptive k for each feature (McKinney et al., 2013). ReliefSeq effectively examines all possible values of k up to a maximum k_max and, for each feature, picks the k that yields the largest feature weight in the final scoring. While more computationally intensive, the authors claim that varying k on a feature by feature basis provides greater flexibility in detecting either main or interaction effects. Notably, ReliefSeq was applied to the analysis of RNA-Seq expression data. Most recently, MultiSURF was proposed, preserving most aspects of MultiSURF* but eliminating the ‘far’ scoring (Urbanowicz et al., 2018). This was due to the fact that while ‘far’ scoring improved the detection of 2-way interactions, it also greatly deteriorated the ability of RBAs to detect simple main effect associations. MultiSURF is claimed to balance performance with respect to its (1) ability to detect main or interaction effects, (2) computational efficiency, (3) ease of use (i.e. no parameters to set), and (4) applicability to a variety of data types.
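To make the two threshold definitions from this section concrete, a hedged sketch contrasting SURF’s single global T with MultiSURF’s per-target mean and dead-band half-width; the function names are ours, and D is assumed to be a precomputed n x n pairwise distance matrix:

  import numpy as np

  def surf_threshold(D):
      iu = np.triu_indices_from(D, k=1)      # each unique instance pair counted once
      return D[iu].mean()                    # one global T shared by every target

  def multisurf_threshold(D, i):
      d = np.delete(D[i], i)                 # distances from target i to all others
      return d.mean(), d.std() / 2.0         # per-target T_i and dead-band half-width

With the convention above that the dead-band’s total width equals one standard deviation, instances whose distance falls within the half-width on either side of T_i are excluded from scoring.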

3.4 Iterative and Efficiency Approaches

This section references algorithms in Table 5 with an iterative (I) or efficiency (E) focus. As noted earlier, core RBA performance is understood to degrade as the number of irrelevant features becomes ‘large’, particularly in noisy problems. This has been observed or noted in a number of works (Robnik-Šikonja and Kononenko, 2003; Sun and Li, 2006; Sun, 2007; Moore and White, 2007; Eppstein and Haake, 2008; Greene et al., 2009; Todorov, 2016). As pointed out by Sun and Li (2006), this is because a core RBA defines nearest neighbors in the original feature space, and these are highly unlikely to be the same neighbors as in the weighted space (i.e. the space where features least likely to be relevant have been assigned low weights). To deal with this issue, iterative and efficiency approaches have been proposed that are wrapped around or integrated into core RBAs.

Iterative Relief introduced the idea of running the core RBA more than once, each time using the feature weights from the previous iteration to update the pairwise distance calculations, such that a low scoring feature from the previous iteration has less influence on instance distance in the current iteration (Draper et al., 2003) (see Figure 4). These ‘temporary’ feature weights were referred to as parameters by Todorov (2016), designated by a separate variable in his notation to distinguish them from the final output feature weights. Iteratively updating the distance weights can cause certain samples to enter and leave the neighborhoods of other samples. To reduce discontinuities in the feature weight estimates that arise from changing neighborhoods, Iterative Relief also introduced a radius to define neighborhood membership rather than a set number of instances, as illustrated in Figure 3. Iterations continue until the weights converge, or until some maximum number of iterations is reached. It is important to be aware of stop criteria, since iterative approaches can become quite computationally expensive.
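
The distance-reweighting idea can be sketched as follows (our own minimal Python illustration; theta is our name for the temporary distance-weighting parameters, and the Manhattan-style metric on discrete data is an assumption).

  import numpy as np

  def weighted_distances(X, theta):
      # Pairwise distances in which each feature is scaled by its current
      # weight from the previous iteration, so that features scored low so
      # far contribute little to neighbor selection.
      w = np.clip(theta, 0, None)              # negative weights contribute nothing
      diff = X[:, None, :] != X[None, :, :]    # per-feature differences, all pairs
      return (diff * w).sum(axis=2)            # weighted Manhattan-style distance

  # Each iteration: score features with the core RBA, recompute these
  # distances, redefine neighborhoods (e.g. by a fixed radius), and repeat
  # until the weights stop changing appreciably.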

Figure 4: Illustrations of the basic concepts behind key iterative and efficiency approaches including TuRF, Iterative Relief/I-RELIEF, and VLSReliefF. Features are represented as squares, where darker shading indicates a lower feature weight/score.

Sun and Li independently introduced another iterative Relief method known as I-RELIEF (Sun and Li, 2006). I-RELIEF adopts an iterative approach similar to Iterative Relief, but mathematically derives Relief as an online algorithm that solves a convex optimization problem with a margin-based objective function (Sun and Li, 2006; Sun, 2007). As such, I-RELIEF has been described as an outlier removal scheme, since the margin averaging is sensitive to large variations (Cai and Ng, 2012). Later, Local Learning I-RELIEF (our name for the unnamed algorithm) applied the concept of local learning to improve iterative convergence by promoting sparse feature weighting (Sun et al., 2010). ‘Sparse’ refers to there being a minimal number of converged feature weights with a value greater than zero. This was achieved by introducing the ℓ1-norm penalty (as in lasso) into the optimization of I-RELIEF.

TuRF presents a much simpler iterative approach that can easily be wrapped around any other core RBA, despite the fact that it was originally designed to be used with ReliefF (Moore and White, 2007) (see Algorithm 2). TuRF is essentially a recursive feature elimination approach. Each iteration, a fixed fraction of the lowest scoring features is eliminated from further consideration with respect to both distance calculations and feature weight updates (see Figure 4). However, selecting the number of iterations is not trivial. Evaporative Cooling ReliefF offers another novel approach that employs simulated annealing to iteratively remove the lowest relevance features, where relevance is a function of both ReliefF and (myopic) mutual information scores (McKinney et al., 2007), or alternatively of ReliefF and transformed random forest importance scores (McKinney et al., 2009). Most recently, the evaporative cooling concept was adapted to the challenge of patient privacy preservation, and was extended to continuous feature analysis (i.e. fMRI network data) (Le et al., 2017).

  a := number of attributes (i.e. features)
  Parameter: n := number of iterations
  for i := 1 to n do
     run ReliefF and estimate feature weights W[A]
     sort features by weight
     remove the a/n remaining features with the lowest weights
  end for
  return last ReliefF weight estimates for remaining features
Algorithm 2 Pseudo-code of TuRF algorithm
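
Algorithm 2 translates directly into a small wrapper. The Python sketch below is our own rendering under stated assumptions: relief_scorer stands in for any core RBA returning one weight per supplied feature, and removing an equal share (a/n) per iteration is one reading of the removal schedule.

  import numpy as np

  def turf(X, y, relief_scorer, n_iterations=10):
      # Hypothetical TuRF-style wrapper around a core RBA scorer.
      remaining = np.arange(X.shape[1])          # indices of surviving features
      drop = max(1, X.shape[1] // n_iterations)  # features removed per iteration
      for _ in range(n_iterations):
          if len(remaining) <= drop:
              break
          weights = relief_scorer(X[:, remaining], y)
          order = np.argsort(weights)            # ascending: worst first
          remaining = np.sort(remaining[order[drop:]])
      return remaining, relief_scorer(X[:, remaining], y)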

It was noted by Todorov (2016) that, for TuRF or any other iterative approach that ‘removes’ features from consideration by giving them a weight of 0 in the distance calculation, it is still possible to estimate a relevance score W[A] for such a feature (thus perhaps giving it the opportunity to be reintroduced as relevant later). It was warned that this could lead to undesirable oscillatory behavior and poor convergence of scoring. It should also be mentioned that any of the iterative strategies for updating parameters (e.g. I-RELIEF) could be combined with a core RBA other than the one it was originally implemented with (e.g. the iterative component of Iterative Relief could be wrapped around the core SWRF* approach).

Despite the fact that core Relief algorithms are relatively fast, they can still be slow in very large feature spaces (common in bioinformatics), or more importantly, when large training sets are available (because RBAs scale quadratically with the number of instances). One of the more unique RBA proposals focuses on improving algorithm efficiency with regard to both run time and performance. Specifically, VLSReliefF targets the detection of feature interactions in very large feature spaces (Eppstein and Haake, 2008) (see Figure 4). The principle behind VLSReliefF is simply that weights estimated by ReliefF are more accurate when applied to smaller feature sets. Therefore it individually applies ReliefF to a number of randomly selected feature subsets, each of a fixed size, with the expectation that at least one subset will contain all of the interacting features that are associated with endpoint (and will thus have elevated weights for those features). The partial ReliefF results are integrated by setting the ‘global’ feature weight W[A] to the maximum ‘local’ weight for a given feature across all ReliefF runs. With regard to detecting feature interactions, the risk of this approach is that if all relevant interacting features do not appear together in at least one of the random feature subsets, then the interaction will likely be missed. That is why properly setting the number and size of the subsets is critical to maximizing the probability of success. Furthermore, knowing the desired order of interaction to be sought (e.g. 2-way, 3-way) is needed to calculate the number of subsets required. The VLSReliefF concept was inspired by work proposing a Random Chemistry ReliefF algorithm, an iterative approach that ran ReliefF on random feature subsets (Eppstein et al., 2007). The VLSReliefF concept could also be integrated with other core RBAs, and an iterative, TuRF-like version has been proposed as well (Eppstein and Haake, 2008).
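
A hedged sketch of the VLSReliefF integration step (ours; relief_scorer again stands in for any core RBA, and the parameter names are illustrative).

  import numpy as np

  def vls_relieff(X, y, relief_scorer, n_subsets, subset_size, seed=None):
      # Run the core RBA on random feature subsets and keep, for each
      # feature, the maximum 'local' weight it received across all runs.
      rng = np.random.default_rng(seed)
      global_w = np.full(X.shape[1], -np.inf)  # features never sampled stay -inf
      for _ in range(n_subsets):
          subset = rng.choice(X.shape[1], size=subset_size, replace=False)
          local_w = relief_scorer(X[:, subset], y)
          global_w[subset] = np.maximum(global_w[subset], local_w)
      return global_w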

3.5 Other Relief-based methods

This section references algorithms in Table 5 with a data type focus (D). In the interest of breadth, it also summarizes ancillary Relief expansions not included in Table 5. In a number of studies, emphasis has been placed on handling data types beyond discrete features and binary classes. Beyond RReliefF (Kononenko et al., 1996; Robnik-Šikonja and Kononenko, 1997), little attention has been paid to handling continuous endpoints (i.e. regression), other than examples like FARelief (Bins, 2000) or RM-RELIEF (Li, 2014). Alternative methods of handling continuous features beyond those originally introduced in Relief were described by Demšar (2010) and Blessie and Karthikeyan (2011). An alternative method for handling a multi-class endpoint was described by Ye et al. (2008). Little else has been proposed for handling multi-class endpoints or missing data beyond adaptations of the strategies from the original ReliefF algorithm (Kononenko, 1994).

The Relief concept has also been adapted to a variety of specific data problems. The most common is the removal of redundant features, as discussed earlier. Previously, a handful of stand-alone RBAs, or combinations of an existing RBA with a redundancy removal heuristic, have been proposed to deal with feature redundancy (Bins and Draper, 2001; Flórez-López, 2002; Guyon et al., 2003, 2005; Yang and Li, 2006; Chang, 2010; Mhamdi and Mhamdi, 2013; Zeng et al., 2013; Liu et al., 2015; Challita et al., 2015; Agre and Dzhondzhorov, 2016). Another popular area of investigation is the adaptation of Relief to multi-label learning, i.e. where instances can each have more than one class label assigned to them (Kong et al., 2012; Spolaôr et al., 2012, 2013; Slavkov et al., 2013; Pupo et al., 2013; Reyes et al., 2015). Other problems to which RBAs have been adapted include: multiple instance learning, i.e. bags of not clearly labeled instances (Zafra et al., 2010, 2012), non-monotonic relationships (Bins and Draper, 2002; Draper et al., 2003), survival data (i.e. data exploring the duration of time until one or more events happen) (Beretta and Santaniello, 2011), imbalanced data (Robnik-Šikonja, 2003), clustering (Dash and Ong, 2011), and feature extraction (Sun and Wu, 2008).

Other notable Relief methodological variations include approaches for feature set evaluation (Arauzo-Azofra et al., 2004), instance selection (Dash and Yee, 2007), and ensemble learning (Saeys et al., 2008; Zhou et al., 2014). Attempts at parallelizing RBAs for run-time efficiency have been proposed by Lee et al. (2015) and Eiras-Franco et al. (2016). Many other works applying RBAs or drawing inspiration from them exist in the literature, but are beyond the scope of this methodological review. Earlier reviews in the form of book chapters include (1) Kononenko and Šikonja’s focused examination of their own ReliefF and RReliefF contributions, (2) Moore’s brief review of ReliefF and select RBAs in the context of epistasis analysis, and (3) Todorov’s more recent summary overview of target RBAs and advancements in the context of detecting gene-environment interactions (Kononenko and Šikonja, 2008; Moore, 2015; Todorov, 2016).

3.6 RBA Evaluations

The datasets chosen to test, evaluate, and compare RBAs in previous studies have often focused on (1) a small sample of simulated or toy benchmark datasets (Kira and Rendell, 1992b; Chikhi and Benhammada, 2009), (2) a set of real-world benchmarks (e.g. from the UCI repository) (Bins and Draper, 2001; Flórez-López, 2002; Qamar and Gaussier, 2012; Song et al., 2013; Gore and Govindaraju, 2016), (3) some real data analysis that is new or yet to be established as a benchmark (Dessì et al., 2013), or (4) some combination of these three (Kononenko, 1994; Robnik-Šikonja and Kononenko, 2003; Sun and Li, 2006; Sun, 2007; McKinney et al., 2007; Sun et al., 2010; Cai and Ng, 2012; Agre and Dzhondzhorov, 2016; Dorani and Hu, 2018).

Some RBAs have been compared across a spectrum of simulated datasets capturing a greater breadth of problem scenarios. This was true for TuRF, SURF, SURF*, SWRF*, and MultiSURF*, each developed with the bioinformatic detection of epistatic interactions in mind (Moore and White, 2007; Greene et al., 2009, 2010; Stokes and Visweswaran, 2012; Granizo-Mackenzie and Moore, 2013). In each of these studies, RBAs were evaluated on datasets with purely epistatic 2-way interactions (i.e. no main effects) with varying numbers of training instances (e.g. 200 to 3200) as well as different heritabilities (e.g. 0.01 to 0.4). Heritability is a genetics term that indicates how much of the endpoint variation is due to the genetic features. In the present context heritability can be viewed as the signal magnitude, where a heritability of 1 is a ‘clean’ dataset (i.e. with the correct model, endpoint values will always be correctly predicted from feature values), and a heritability of 0 would be a completely noisy dataset with no meaningful endpoint associations. All features were simulated as single nucleotide polymorphisms (SNPs) that could take a discrete value of 0, 1, or 2, representing the possible genotypes. In each dataset, two features were predictive (i.e. relevant) of a binary class while the remaining 998 features were randomly generated based on genetic guidelines of expected genotype frequencies, yielding a total of 1000 features. Similarly, VLSReliefF explored SNP simulations with 2-way epistasis and varying heritability like the other studies, but fixed the datasets at 1600 instances and simulated either 5,000 or 100,000 total features (Eppstein and Haake, 2008). It should be noted that most of these studies sought to compare core RBAs to their respective iterative TuRF expansions, which is why larger feature spaces were simulated.

Another recent investigation compared ReliefF, TuRF, SURF, chi-square, logistic regression, and odds ratio in their ability to rank features in SNP data simulated to include 15 epistatic feature pairs, each contributing additively to class determination (Dorani and Hu, 2018). RBAs again performed best, both in this simulation and in identifying interacting SNPs from a real-world genome-wide association study (GWAS), as confirmed by exhaustive calculation of information gain.

Beyond the simulated genetic analyses described above, there are only a couple of examples of comparative evaluations of RBAs over a reasonably diverse set of synthetic datasets, including one of Relief (Belanche and González, 2011) and another of ReliefF (Bolón-Canedo et al., 2013), each in comparison with other feature selection approaches. Notably, in both studies the selected RBA stood out as the more reliable and successful feature selection algorithm, except when the task involved removing redundant features. Most recently, a much wider comparison of core RBA algorithms was completed over a broad spectrum of simulated datasets with various properties and underlying patterns, including main effects, interactions, and patterns of genetic heterogeneity (Urbanowicz et al., 2018). That study (1) confirmed the utility of RBA methods over chi-square, ANOVA, mutual information, and random forest based approaches for feature selection, (2) illustrated performance differences between a number of core RBAs (i.e. ReliefF, SURF, SURF*, MultiSURF*), and (3) introduced MultiSURF and novel implementations of ReliefF.

Clearly, the ultimate goal in developing feature selection methods is to apply them to real world problems and, ideally, to facilitate the modeling of previously unknown patterns of association in the data. However, as similarly argued by Robnik-Šikonja and Kononenko (2003), Belanche and González (2011), Bolón-Canedo et al. (2013), and Olson et al. (2017), to properly evaluate and compare methodologies, diverse simulation studies should first be designed and applied. This is because: (1) uniquely, a simulation study can be designed by systematically varying key experimental conditions, e.g. varying noise, the number of irrelevant features, or the underlying pattern of association in the data. This allows us to explicitly identify generalizable strengths and weaknesses of the methods and to draw more useful conclusions; (2) the ground truth of the dataset is known, e.g. we know which features are relevant vs. irrelevant, we know the pattern of association between relevant features and endpoint, and we know how much signal is in the dataset (i.e. we know what testing accuracy should be achievable in downstream modeling). This knowledge of ground truth allows us to perform power analyses over simulated dataset replicates to directly evaluate the success rate of our methodologies.

3.7 Software Availability

ReliefF (Kononenko, 1994) and its counterpart for dealing with regression data, i.e. RReliefF (Kononenko et al., 1996), are currently the most widely implemented RBAs. They can be found in the following freely available data mining software packages: CORElearn (Robnik-Sikonja and Savicky, 2012) (in C++), Weka (Hall et al., 2009) (in Java), Orange (Demšar et al., 2013), and R (Ihaka and Gentleman, 1996) (within the dprep and CORElearn packages). A C++ version of ReliefF is also available as part of Evaporative Cooling ReliefF (https://github.com/insilico/EC/blob/master/src/library/ReliefF.cpp).

Separately, implementations of ReliefF (Kononenko, 1994), SURF (Greene et al., 2009), SURF* (Greene et al., 2010), and MultiSURF* (Granizo-Mackenzie and Moore, 2013), as well as the iterative TuRF algorithm (Moore and White, 2007), were made available in the open source Multifactor Dimensionality Reduction (MDR) (Ritchie et al., 2001) software package (http://sourceforge.net/projects/mdr). These Java implementations are computationally efficient, but can only handle ‘complete data’ (i.e. no missing values) with discrete features and a binary endpoint. Python 2.7 versions of these algorithms were later implemented and made available within the open source Extended Supervised Tracking and Classifying System (ExSTraCS) (https://github.com/ryanurbs/ExSTraCS_2.0) (Urbanowicz et al., 2014; Urbanowicz and Moore, 2015). These implementations were less computationally efficient, but extended each algorithm to handle different data types, including continuous features, multi-class endpoints, regression, and missing data. Other C implementations of ReliefF, SURF*, and SWRF* were made available as part of the modular framework for Relief development (MoRF) (https://github.com/mattstokes42/MoRF) (Stokes and Visweswaran, 2012). A C implementation of ReliefSeq (http://insilico.utulsa.edu/ReliefSeq) is also available (McKinney et al., 2013). Most recently, ReliefF, SURF, SURF*, MultiSURF*, MultiSURF, and TuRF were all implemented within the Relief-Based Algorithm Training Environment (ReBATE). These ReBATE implementations were coded more efficiently in Python (2 and 3) and similarly extended to handle the aforementioned data types. Stand-alone ReBATE software (https://github.com/EpistasisLab/ReBATE) and a scikit-learn (Pedregosa et al., 2011) compatible format (https://github.com/EpistasisLab/scikit-rebate) were both recently made available.
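
For reference, a minimal usage sketch of the scikit-learn compatible scikit-rebate release (parameter values are illustrative, and the random data below merely stands in for a real SNP matrix).

  import numpy as np
  from skrebate import ReliefF

  X = np.random.randint(0, 3, size=(200, 100)).astype(float)  # SNP-style features
  y = np.random.randint(0, 2, size=200)                       # binary endpoint

  fs = ReliefF(n_features_to_select=10, n_neighbors=100)
  fs.fit(X, y)
  print(fs.feature_importances_)  # one Relief weight per feature
  print(fs.top_features_[:10])    # indices of the top-ranked features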

4 Conclusion

In this work we have placed Relief-based algorithms (RBAs) in the context of other feature selection methods, provided an in-depth introduction to the Relief-algorithm concept, described four general branches of RBA research, and reviewed key methodological differences within these branches. This work highlights a number of conclusions that can be made about RBAs, including: (1) they are generally proficient at detecting not only univariate effects but 2-way interactions as well, (2) they scale linearly with the number of features, but quadratically with the number of training instances, (3) iterative and efficiency approaches offer a solution for scaling RBAs up to very large feature spaces, (4) RBAs are ‘anytime’ algorithms, (5) the choice of instance neighbors is a critical aspect of RBA success, setting these methods apart from other feature selection approaches, (6) the individual feature weights output by an RBA can be used to probabilistically guide downstream machine learning methods (i.e. feature weighting), (7) RBAs have already been flexibly adapted to an array of data types and specific application domains, and (8) implementations of a variety of RBAs are readily available.

This promising area of feature selection will likely benefit from future research focusing on: (1) identifying the most effective and reliable instance weighting approach (e.g. classic ‘full’ instance weighting, or ‘distance-from-target’-based instance weighting), (2) optimizing the number of neighbors and the neighbor selection scheme in a problem-dependent manner, (3) improved strategies (e.g. iterative or efficiency approaches) for scaling RBAs to large-scale data (i.e. many features and/or many instances), (4) adapting RBAs to new problem domains (e.g. temporal data), (5) limiting or eliminating user-defined RBA run parameters (to make them easier to apply and to require less prior knowledge about the problem domain to set correctly), and (6) new strategies for ensemble feature selection.

RBAs represent a powerful family of feature selection approaches that strike a key balance between the ability to detect complex patterns, the flexibility to handle different data types, and computational efficiency. While ReliefF has been the staple, go-to algorithm of the family for many years, many advancements have since been made. Understanding these advancements is key to selecting the best approach for a given application, as well as to guiding the development of even better feature selection approaches.

Acknowledgements

We thank the reviewers for their thoughtful comments. Special thanks to Brian Cole for his constructive feedback. This work was supported by National Institutes of Health grants AI116794, DK112217, ES013508, EY022300, HL134015, LM009012, LM010098, LM011360, TR001263, and the Warren Center for Network and Data Science.

References

  • Agre and Dzhondzhorov (2016) Agre, G., Dzhondzhorov, A., 2016. A weighted feature selection method for instance-based classification. In: International Conference on Artificial Intelligence: Methodology, Systems, and Applications. Springer, pp. 14–25.
  • Aha et al. (1991) Aha, D. W., Kibler, D., Albert, M. K., 1991. Instance-based learning algorithms. Machine learning 6 (1), 37–66.
  • Almuallim and Dietterich (1991) Almuallim, H., Dietterich, T. G., 1991. Learning with many irrelevant features. In: AAAI. Vol. 91. pp. 547–552.
  • Arauzo-Azofra et al. (2004) Arauzo-Azofra, A., Benitez, J. M., Castro, J. L., 2004. A feature set measure based on relief. In: Proceedings of the fifth international conference on Recent Advances in Soft Computing. pp. 104–109.
  • Belanche and González (2011) Belanche, L. A., González, F. F., 2011. Review and evaluation of feature selection algorithms in synthetic problems. arXiv preprint arXiv:1101.2320.
  • Beretta and Santaniello (2011) Beretta, L., Santaniello, A., 2011. Implementing relieff filters to extract meaningful features from genetic lifetime datasets. Journal of biomedical informatics 44 (2), 361–369.
  • Bins (2000) Bins, J., 2000. Feature selection of huge feature sets in the context of computer vision. Ph.D. thesis, Computer Science, Colorado State University, Fort Collins, CO.
  • Bins and Draper (2001) Bins, J., Draper, B. A., 2001. Feature selection from huge feature sets. In: Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. Vol. 2. IEEE, pp. 159–165.
  • Bins and Draper (2002) Bins, J., Draper, B. A., 2002. Evaluating feature relevance: Reducing bias in relief. In: JCIS. pp. 757–760.
  • Blessie and Karthikeyan (2011) Blessie, E. C., Karthikeyan, E., 2011. Relief-disc: An extended relief algorithm using discretization approach for continuous features. In: Emerging Applications of Information Technology (EAIT), 2011 Second International Conference on. IEEE, pp. 161–164.
  • Blum and Langley (1997) Blum, A. L., Langley, P., 1997. Selection of relevant features and examples in machine learning. Artificial intelligence 97 (1), 245–271.
  • Bolón-Canedo et al. (2013) Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., 2013. A review of feature selection methods on synthetic data. Knowledge and information systems 34 (3), 483–519.
  • Bradley and Mangasarian (1998) Bradley, P. S., Mangasarian, O. L., 1998. Feature selection via concave minimization and support vector machines. In: ICML. Vol. 98. pp. 82–90.
  • Breiman et al. (1984) Breiman, L., Friedman, J., Stone, C. J., Olshen, R. A., 1984. Classification and regression trees. CRC press.
  • Cai and Ng (2012) Cai, H., Ng, M., 2012. Feature weighting by relief based on local hyperplane approximation. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, pp. 335–346.
  • Callan et al. (1991) Callan, J. P., Fawcett, T., Rissland, E. L., 1991. Cabot: An adaptive approach to case-based search. In: IJCAI. Vol. 12. pp. 803–808.
  • Challita et al. (2015) Challita, N., Khalil, M., Beauseroy, P., 2015. New technique for feature selection: Combination between elastic net and relief. In: Technological Advances in Electrical, Electronics and Computer Engineering (TAEECE), 2015 Third International Conference on. IEEE, pp. 262–267.
  • Chandrashekar and Sahin (2014) Chandrashekar, G., Sahin, F., 2014. A survey on feature selection methods. Computers & Electrical Engineering 40 (1), 16–28.
  • Chang (2010) Chang, C.-C., 2010. Generalized iterative relief for supervised distance metric learning. Pattern Recognition 43 (8), 2971–2981.
  • Chen and Guestrin (2016) Chen, T., Guestrin, C., 2016. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, pp. 785–794.
  • Chikhi and Benhammada (2009) Chikhi, S., Benhammada, S., 2009. Reliefmss: a variation on a feature ranking relieff algorithm. International Journal of Business Intelligence and Data Mining 4 (3-4), 375–390.
  • Cortizo and Giraldez (2006) Cortizo, J. C., Giraldez, I., 2006. Multi criteria wrapper improvements to naive bayes learning. In: IDEAL. Vol. 4224. Springer, pp. 419–427.
  • Dash and Liu (1997) Dash, M., Liu, H., 1997. Feature selection for classification. Intelligent data analysis 1 (1-4), 131–156.
  • Dash and Ong (2011) Dash, M., Ong, Y.-S., 2011. Relief-c: Efficient feature selection for clustering over noisy data. In: Tools with Artificial Intelligence (ICTAI), 2011 23rd IEEE International Conference on. IEEE, pp. 869–872.
  • Dash and Yee (2007) Dash, M., Yee, O. C., 2007. extrarelief: improving relief by efficient selection of instances. Lecture Notes in Computer Science 4830, 305.
  • De Mántaras (1991) De Mántaras, R. L., 1991. A distance-based attribute selection measure for decision tree induction. Machine learning 6 (1), 81–92.
  • Demšar (2010) Demšar, J., 2010. Algorithms for subsetting attribute values with relief. Machine learning 78 (3), 421–428.
  • Demšar et al. (2013) Demšar, J., Curk, T., Erjavec, A., Gorup, Č., Hočevar, T., Milutinovič, M., Možina, M., Polajnar, M., Toplak, M., Starič, A., et al., 2013. Orange: data mining toolbox in python. The Journal of Machine Learning Research 14 (1), 2349–2353.
  • Dessì et al. (2013) Dessì, N., Pascariello, E., Pes, B., 2013. A comparative analysis of biomarker selection techniques. BioMed research international 2013.
  • Dorani and Hu (2018) Dorani, F., Hu, T., 2018. Feature selection for detecting gene-gene interactions in genome-wide association studies. In: International Conference on the Applications of Evolutionary Computation. Springer, pp. 33–46.
  • Draper et al. (2003) Draper, B., Kaito, C., Bins, J., 2003. Iterative relief. In: Computer Vision and Pattern Recognition Workshop, 2003. CVPRW’03. Conference on. Vol. 6. IEEE, pp. 62–62.
  • Eiras-Franco et al. (2016) Eiras-Franco, C., Bolón-Canedo, V., Ramos, S., González-Domínguez, J., Alonso-Betanzos, A., Touriño, J., 2016. Multithreaded and spark parallelization of feature selection filters. Journal of Computational Science 17, 609–619.
  • Eppstein and Haake (2008) Eppstein, M. J., Haake, P., 2008. Very large scale relieff for genome-wide association analysis. In: Computational Intelligence in Bioinformatics and Computational Biology, 2008. CIBCB’08. IEEE Symposium on. IEEE, pp. 112–119.
  • Eppstein et al. (2007) Eppstein, M. J., Payne, J. L., White, B. C., Moore, J. H., 2007. Genomic mining for complex disease traits with “random chemistry”. Genetic Programming and Evolvable Machines 8 (4), 395–411.
  • Flórez-López (2002) Flórez-López, R., 2002. Reviewing relief and its extensions: a new approach for estimating attributes considering high-correlated features. In: Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE, pp. 605–608.
  • Gore and Govindaraju (2016) Gore, S., Govindaraju, V., 2016. Feature selection using cooperative game theory and relief algorithm. In: Knowledge, Information and Creativity Support Systems: Recent Trends, Advances and Solutions. Springer, pp. 401–412.
  • Granizo-Mackenzie and Moore (2013) Granizo-Mackenzie, D., Moore, J. H., 2013. Multiple threshold spatially uniform relieff for the genetic analysis of complex human diseases. In: European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Springer, pp. 1–10.
  • Greene et al. (2010) Greene, C. S., Himmelstein, D. S., Kiralis, J., Moore, J. H., 2010. The informative extremes: using both nearest and farthest individuals can improve relief algorithms in the domain of human genetics. In: European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Springer, pp. 182–193.
  • Greene et al. (2009) Greene, C. S., Penrod, N. M., Kiralis, J., Moore, J. H., 2009. Spatially uniform relieff (surf) for computationally-efficient filtering of gene-gene interactions. BioData mining 2 (1), 5.
  • Guyon et al. (2003) Guyon, I., Bitter, H.-M., Ahmed, Z., Brown, M., Heller, J., 2003. Multivariate non-linear feature selection with kernel multiplicative updates and gram-schmidt relief. In: BISC Flint-CIBI 2003 Workshop. Berkeley. pp. 1–11.
  • Guyon et al. (2005) Guyon, I., Bitter, H.-M., Ahmed, Z., Brown, M., Heller, J., 2005. Multivariate non-linear feature selection with kernel methods. In: Soft computing for information processing and analysis. Springer, pp. 313–326.
  • Guyon and Elisseeff (2003) Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of machine learning research 3 (Mar), 1157–1182.
  • Hall et al. (2009) Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. H., 2009. The weka data mining software: an update. ACM SIGKDD explorations newsletter 11 (1), 10–18.
  • Holland (1992) Holland, J. H., 1992. Adaptation in natural and artificial systems (first published 1975). Ann Arbor, MI: University of Michigan Press.
  • Hong (1997) Hong, S. J., 1997. Use of contextual information for feature ranking and discretization. IEEE transactions on knowledge and data engineering 9 (5), 718–730.
  • Hunt et al. (1966) Hunt, E. B., Marin, J., Stone, P. J., 1966. Experiments in induction. Academic Press.
  • Ihaka and Gentleman (1996) Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. Journal of computational and graphical statistics 5 (3), 299–314.
  • Inza et al. (2000) Inza, I., Larrañaga, P., Etxeberria, R., Sierra, B., 2000. Feature subset selection by bayesian network-based optimization. Artificial intelligence 123 (1-2), 157–184.
  • Jin et al. (2006) Jin, X., Xu, A., Bie, R., Guo, P., 2006. Machine learning techniques and chi-square feature selection for cancer classification using sage gene expression profiles. In: International Workshop on Data Mining for Biomedical Applications. Springer, pp. 106–115.
  • Jović et al. (2015) Jović, A., Brkić, K., Bogunović, N., 2015. A review of feature selection methods with applications. In: Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2015 38th International Convention on. IEEE, pp. 1200–1205.
  • Kira and Rendell (1992a) Kira, K., Rendell, L. A., 1992a. The feature selection problem: Traditional methods and a new algorithm. In: AAAI. Vol. 2. pp. 129–134.
  • Kira and Rendell (1992b) Kira, K., Rendell, L. A., 1992b. A practical approach to feature selection. In: Proceedings of the ninth international workshop on Machine learning. pp. 249–256.
  • Kittler (1978) Kittler, J., 1978. Feature set search algorithms. Pattern recognition and signal processing.
  • Kohavi and John (1997) Kohavi, R., John, G. H., 1997. Wrappers for feature subset selection. Artificial intelligence 97 (1-2), 273–324.
  • Koller and Sahami (1996) Koller, D., Sahami, M., 1996. Toward optimal feature selection. Tech. rep., Stanford InfoLab.
  • Kong et al. (2012) Kong, D., Ding, C., Huang, H., Zhao, H., 2012. Multi-label relieff and f-statistic feature selections for image annotation. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, pp. 2352–2359.
  • Kononenko (1994) Kononenko, I., 1994. Estimating attributes: analysis and extensions of relief. In: European conference on machine learning. Springer, pp. 171–182.
  • Kononenko (1995) Kononenko, I., 1995. On biases in estimating multi-valued attributes. In: Ijcai. Vol. 95. pp. 1034–1040.
  • Kononenko et al. (1996) Kononenko, I., Robnik-Sikonja, M., Pompe, U., 1996. Relieff for estimation and discretization of attributes in classification, regression, and ilp problems. Artificial intelligence: methodology, systems, applications, 31–40.
  • Kononenko and Šikonja (2008) Kononenko, I., Šikonja, M. R., 2008. Non-myopic feature quality evaluation with (r) relieff. Computational Methods of Feature Selection, 169–191.
  • Kononenko et al. (1997) Kononenko, I., Šimec, E., Robnik-Šikonja, M., 1997. Overcoming the myopia of inductive learning algorithms with relieff. Applied Intelligence 7 (1), 39–55.
  • Langley (1994) Langley, P., 1994. Selection of relevant features in machine learning. In: Proceedings of the AAAI Fall symposium on relevance. Vol. 184. pp. 245–271.
  • Le et al. (2017) Le, T. T., Simmons, W. K., Misaki, M., Bodurka, J., White, B. C., Savitz, J., McKinney, B. A., 2017. Differential privacy-based evaporative cooling feature selection and classification with relief-f and random forests. Bioinformatics 33 (18), 2906–2913.
  • Lee et al. (2015) Lee, K.-Y., Liu, P., Leung, K.-S., Wong, M.-H., 2015. Very large scale relieff algorithm on gpu for genome-wide association study. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA). The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), p. 78.
  • Li (2014) Li, L., 2014. Relief for regression with missing data in variable selection. Ph.D. thesis, Citeseer.
  • Liu et al. (2015) Liu, X., Wang, X., Su, Q., 2015. Feature selection of medical data sets based on rs-relieff. In: Service Systems and Service Management (ICSSSM), 2015 12th International Conference on. IEEE, pp. 1–5.
  • Martínez and Kak (2001) Martínez, A. M., Kak, A. C., 2001. Pca versus lda. IEEE transactions on pattern analysis and machine intelligence 23 (2), 228–233.
  • McKinney et al. (2009) McKinney, B. A., Crowe Jr, J. E., Guo, J., Tian, D., 2009. Capturing the spectrum of interaction effects in genetic association studies by simulated evaporative cooling network analysis. PLoS genetics 5 (3), e1000432.
  • McKinney et al. (2007) McKinney, B. A., Reif, D. M., White, B. C., Crowe, J., Moore, J. H., 2007. Evaporative cooling feature selection for genotypic data involving interactions. Bioinformatics 23 (16), 2113–2120.
  • McKinney et al. (2013) McKinney, B. A., White, B. C., Grill, D. E., Li, P. W., Kennedy, R. B., Poland, G. A., Oberg, A. L., 2013. Reliefseq: a gene-wise adaptive-k nearest-neighbor feature selection tool for finding gene-gene interactions and main effects in mrna-seq gene expression data. PloS one 8 (12), e81527.
  • Menze et al. (2009) Menze, B. H., Kelm, B. M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., Hamprecht, F. A., 2009. A comparison of random forest and its gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC bioinformatics 10 (1), 213.
  • Mhamdi and Mhamdi (2013) Mhamdi, F., Mhamdi, H., 2013. A new algorithm relief hybrid (hrelief) for biological motifs selection. In: Bioinformatics and Bioengineering (BIBE), 2013 IEEE 13th International Conference on. IEEE, pp. 1–5.
  • Michalski (1983) Michalski, R. S., 1983. A theory and methodology of inductive learning. Artificial intelligence 20 (2), 111–161.
  • Mlambo et al. (2016) Mlambo, N., Cheruiyot, W. K., Kimwele, M. W., 2016. A survey and comparative study of filter and wrapper feature selection techniques. International Journal of Engineering and Science (IJES) 5 (8), 57–67.
  • Moore (2015) Moore, J. H., 2015. Epistasis analysis using ReliefF. Springer.
  • Moore and White (2007) Moore, J. H., White, B. C., 2007. Tuning relieff for genome-wide genetic analysis. In: European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Springer, pp. 166–175.
  • Narendra and Fukunaga (1977) Narendra, P. M., Fukunaga, K., 1977. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers 9 (C-26), 917–922.
  • Ni (2012) Ni, W., 2012. A review and comparative study on univariate feature selection techniques. Ph.D. thesis, University of Cincinnati.
  • Olson et al. (2017) Olson, R. S., La Cava, W., Orzechowski, P., Urbanowicz, R. J., Moore, J. H., 2017. Pmlb: A large benchmark suite for machine learning evaluation and comparison. arXiv preprint arXiv:1703.00512.
  • Park and Kwon (2007) Park, H., Kwon, H.-C., 2007. Extended relief algorithms in instance-based feature filtering. In: Advanced Language Processing and Web Information Technology, 2007. ALPIT 2007. Sixth International Conference on. IEEE, pp. 123–128.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
  • Pupo et al. (2013) Pupo, O. G. R., Morell, C., Soto, S. V., 2013. Relieff-ml: an extension of relieff algorithm to multi-label learning. In: Iberoamerican Congress on Pattern Recognition. Springer, pp. 528–535.
  • Qamar and Gaussier (2012) Qamar, A. M., Gaussier, E., 2012. Relief algorithm and similarity learning for k-nn. International Journal of Computer Information Systems and Industrial Management Applications (IJCISIM) 4, 445–458.
  • Quinlan (1993) Quinlan, J. R., 1993. C4.5: Programs for machine learning. Morgan Kaufmann.
  • Reyes et al. (2015) Reyes, O., Morell, C., Ventura, S., 2015. Scalable extensions of the relieff algorithm for weighting and selecting features on the multi-label learning context. Neurocomputing 161, 168–182.
  • Ritchie et al. (2001) Ritchie, M. D., Hahn, L. W., Roodi, N., Bailey, L. R., Dupont, W. D., Parl, F. F., Moore, J. H., 2001. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. The American Journal of Human Genetics 69 (1), 138–147.
  • Robnik-Šikonja (2003) Robnik-Šikonja, M., 2003. Experiments with cost-sensitive feature evaluation. In: European Conference on Machine Learning. Springer, pp. 325–336.
  • Robnik-Šikonja and Kononenko (1997) Robnik-Šikonja, M., Kononenko, I., 1997. An adaptation of relief for attribute estimation in regression. In: Machine Learning: Proceedings of the Fourteenth International Conference (ICML’97). pp. 296–304.
  • Robnik-Šikonja and Kononenko (2001) Robnik-Šikonja, M., Kononenko, I., 2001. Comprehensible interpretation of relief’s estimates. In: Machine Learning: Proceedings of the Eighteenth International Conference on Machine Learning (ICML’2001), Williamstown, MA, USA. San Francisco: Morgan Kaufmann. pp. 433–40.
  • Robnik-Šikonja and Kononenko (2003) Robnik-Šikonja, M., Kononenko, I., 2003. Theoretical and empirical analysis of relieff and rrelieff. Machine learning 53 (1-2), 23–69.
  • Robnik-Sikonja and Savicky (2012) Robnik-Sikonja, M., Savicky, P., 2012. Corelearn—classification, regression, feature evaluation and ordinal evaluation. The R Project for Statistical Computing.
  • Rokach (2010) Rokach, L., 2010. Ensemble-based classifiers. Artificial Intelligence Review 33 (1), 1–39.
  • Saeys et al. (2008) Saeys, Y., Abeel, T., Van de Peer, Y., 2008. Robust feature selection using ensemble feature selection techniques. Machine learning and knowledge discovery in databases, 313–325.
  • Saeys et al. (2007) Saeys, Y., Inza, I., Larrañaga, P., 2007. A review of feature selection techniques in bioinformatics. bioinformatics 23 (19), 2507–2517.
  • Slavkov et al. (2013) Slavkov, I., Karcheska, J., Kocev, D., Kalajdziski, S., Dzeroski, S., 2013. Extending relieff for hierarchical multi-label classification. Machine learning 4, 13.
  • Smyth and Goodman (1992) Smyth, P., Goodman, R. M., 1992. An information theoretic approach to rule induction from databases. IEEE transactions on Knowledge and data engineering 4 (4), 301–316.
  • Song et al. (2013) Song, Q., Ni, J., Wang, G., 2013. A fast clustering-based feature subset selection algorithm for high-dimensional data. IEEE transactions on knowledge and data engineering 25 (1), 1–14.
  • Spolaôr et al. (2012) Spolaôr, N., Cherman, E. A., Monard, M. C., Lee, H. D., 2012. Filter approach feature selection methods to support multi-label learning based on relieff and information gain. In: Advances in Artificial Intelligence-SBIA 2012. Springer, pp. 72–81.
  • Spolaôr et al. (2013) Spolaôr, N., Cherman, E. A., Monard, M. C., Lee, H. D., 2013. Relieff for multi-label feature selection. In: Intelligent Systems (BRACIS), 2013 Brazilian Conference on. IEEE, pp. 6–11.
  • Stokes and Visweswaran (2012) Stokes, M. E., Visweswaran, S., 2012. Application of a spatially-weighted relief algorithm for ranking genetic predictors of disease. BioData mining 5 (1), 20.
  • Sun (2007) Sun, Y., 2007. Iterative relief for feature weighting: algorithms, theories, and applications. IEEE transactions on pattern analysis and machine intelligence 29 (6).
  • Sun and Li (2006) Sun, Y., Li, J., 2006. Iterative relief for feature weighting. In: Proceedings of the 23rd international conference on Machine learning. ACM, pp. 913–920.
  • Sun et al. (2010) Sun, Y., Todorovic, S., Goodison, S., 2010. Local-learning-based feature selection for high-dimensional data analysis. IEEE transactions on pattern analysis and machine intelligence 32 (9), 1610–1626.
  • Sun and Wu (2008) Sun, Y., Wu, D., 2008. A relief based feature extraction algorithm. In: Proceedings of the 2008 SIAM International Conference on Data Mining. SIAM, pp. 188–195.
  • Sutton and Matheus (1991) Sutton, R. S., Matheus, C. J., 1991. Learning polynomial functions by feature construction. In: ML. pp. 208–212.
  • Tang et al. (2014) Tang, J., Alelyani, S., Liu, H., 2014. Feature selection for classification: A review. Data Classification: Algorithms and Applications, 37.
  • Tibshirani (1996) Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
  • Todorov (2016) Todorov, A., 2016. Statistical Approaches to Gene X Environment Interactions for Complex Phenotypes. MIT Press, Ch. An Overview of the RELIEF Algorithm and Advancements, pp. 95–116.
  • Urbanowicz et al. (2014) Urbanowicz, R. J., Bertasius, G., Moore, J. H., 2014. An extended michigan-style learning classifier system for flexible supervised learning, classification, and data mining. In: International Conference on Parallel Problem Solving from Nature. Springer, pp. 211–221.
  • Urbanowicz et al. (2012) Urbanowicz, R. J., Granizo-Mackenzie, D., Moore, J. H., 2012. Using expert knowledge to guide covering and mutation in a michigan style learning classifier system to detect epistasis and heterogeneity. In: International Conference on Parallel Problem Solving from Nature. Springer, pp. 266–275.
  • Urbanowicz and Moore (2015) Urbanowicz, R. J., Moore, J. H., 2015. Exstracs 2.0: description and evaluation of a scalable learning classifier system. Evolutionary intelligence 8 (2-3), 89–116.
  • Urbanowicz et al. (2018) Urbanowicz, R. J., Olson, R. S., Schmitt, P., Meeker, M., Moore, J. H., 2018. Benchmarking relief-based feature selection methods for bioinformatics data mining. Journal of biomedical informatics.
  • Van Laarhoven and Aarts (1987) Van Laarhoven, P. J., Aarts, E. H., 1987. Simulated annealing. In: Simulated annealing: Theory and applications. Springer, pp. 7–15.
  • Wolpert and Macready (1997) Wolpert, D. H., Macready, W. G., 1997. No free lunch theorems for optimization. IEEE transactions on evolutionary computation 1 (1), 67–82.
  • Yang and Li (2006) Yang, J., Li, Y.-P., 2006. Orthogonal relief algorithm for feature selection. In: International Conference on Intelligent Computing. Springer, pp. 227–234.
  • Ye et al. (2008) Ye, K., Feenstra, K. A., Heringa, J., IJzerman, A. P., Marchiori, E., 2008. Multi-relief: a method to recognize specificity determining residues from multiple sequence alignments using a machine-learning approach for feature weighting. Bioinformatics 24 (1), 18–25.
  • Yu and Liu (2004) Yu, L., Liu, H., 2004. Efficient feature selection via analysis of relevance and redundancy. Journal of machine learning research 5 (Oct), 1205–1224.
  • Zafra et al. (2010) Zafra, A., Pechenizkiy, M., Ventura, S., 2010. Feature selection is the relieff for multiple instance learning. In: Intelligent Systems Design and Applications (ISDA), 2010 10th International Conference on. IEEE, pp. 525–532.
  • Zafra et al. (2012) Zafra, A., Pechenizkiy, M., Ventura, S., 2012. Relieff-mi: An extension of relieff to multiple instance learning. Neurocomputing 75 (1), 210–218.
  • Zeng et al. (2013) Zeng, X., Wang, Q., Zhang, C., Cai, H., 2013. Feature selection based on relieff and pca for underwater sound classification. In: Computer Science and Network Technology (ICCSNT), 2013 3rd International Conference on. IEEE, pp. 442–445.
  • Zhao and Liu (2009) Zhao, Z., Liu, H., 2009. Searching for interacting features in subset selection. Intelligent Data Analysis 13 (2), 207–228.
  • Zhou et al. (2014) Zhou, Q., Ding, J., Ning, Y., Luo, L., Li, T., 2014. Stable feature selection with ensembles of multi-relieff. In: Natural Computation (ICNC), 2014 10th International Conference on. IEEE, pp. 742–747.
  • Zou and Hastie (2005) Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2), 301–320.