Machine learning algorithms are popular tools for classifying observations. These algorithms can attain high classification accuracy for datasets from a wide variety of applications and with complex behavior. In addition, through automated parameter tuning, it is possible to grow powerful models that can successfully predict class affiliations of future observations. A disadvantage, however, is that models can become overly complicated and, as a result, hard to interpret and expensive to evaluate for large datasets. Ideally we would like to generate models that are quick to build, cheap to evaluate, and that give users insight into the data, similar to how the size of coefficients in a linear regression model can be used to understand attribute-response relationships and dependencies.
Ensemble methods are a class of machine learning algorithms that develop simple and fast algorithms by combining many elementary models, called base learners, into a larger model. The larger model captures more behavior than each base learner captures by itself and so collectively the base learners can model the population and accurately predict class labels. Classical decision tree ensemble methods, such as bagging and boosting, are well known and have been tested and refined on many datasets [1, 8, 22]. In one such study, Banfield et al. studied the accuracy of boosting and bagging on a variety of public datasets and found that in general neither bagging nor boosting was a statistically significantly stronger method.
In this paper, we modify, extend, and test an implementation of the rule ensemble method proposed by Friedman and Popescu for binary classification with bagging and with boosting. The Friedman and Popescu rule ensemble method is attractive, as it combines the rule weighting or variable importance that regression provides with the quick decision tree methods and collective decision making of many simple base learners. The method builds rules that take the form of products of indicator functions defined on hypercubes in parameter space. The rules are fit by growing decision trees, as each inner node of a tree takes the desired form of a rule. The method then performs a penalized regression to combine the rules into a sparse model. The entire method resembles a linear regression model, but with different terms. Many ensemble methods provide little insight into what variables are important to the behavior of the system, but by combining the rules with regression, the rule ensemble method prunes rules of little utility and ranks remaining rules in order of importance.
We also modified the rule ensemble method to use various coefficient solving methods on a set of binary and multi-class problems. Previous implementations of this algorithm are either currently unavailable or have not been fully tested on a wide set of problems. We extended the rule ensemble method to multiple class classification problems with one versus all classification and tested it on classical machine learning datasets from the UC Irvine machine learning repository. These datasets were chosen because they have been used to test previous tree ensembles [1, 8, 18, 22, 29] and countless other machine learning algorithms. Finally, we look at different methods that can be used to solve for the coefficients and show how one can use the rule ensemble method to reduce the dimension of a problem. We give an example of identifying important attributes in a large scientific dataset by applying our techniques to a set of images of potential supernovas.
1.1 Overview of Rule Ensemble Method
Suppose we are given a set of data points $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$ denotes the $i$th observation, with label $y_i$. Each of the observations, $\mathbf{x}_i \in \mathbb{R}^{n}$, has $n$ attributes or feature values that we measure for each observation. The matrix $X$ will denote the entire set of all $\mathbf{x}_i$’s. The $j$th feature of the $i$th observation is the scalar $x_{ij}$. Our goal then is to be able to predict what class a future unlabeled observation $\mathbf{x}$ belongs to. The method below focuses specifically on the binary decision problem where $y$ can be one of only two classes, $y \in \{-1, +1\}$. To classify observations we seek to construct a function $F(\mathbf{x})$ that maps an observation $\mathbf{x}$ to an output variable $\hat{y} = F(\mathbf{x})$ that predicts the true label $y$.
Define the risk of using any function $F(\mathbf{x})$ that maps observations to labels as
$$R(F) = E_{\mathbf{x},y}\, L(y, F(\mathbf{x})), \tag{1}$$
where $E_{\mathbf{x},y}$ is the expectation operator.
Here $L(y, F(\mathbf{x}))$ is a chosen loss function that defines the cost of predicting a label $F(\mathbf{x})$ for an observation $\mathbf{x}$ when the true label is $y$. While various loss functions have been developed, in practice we will use the ramp loss function as it is particularly well suited to the binary classification problem we consider [14, 16]. Within this framework we seek to find a function, $F^{*}(\mathbf{x})$, that minimizes the risk over all such functions
$$F^{*}(\mathbf{x}) = \arg\min_{F} E_{\mathbf{x},y}\, L(y, F(\mathbf{x})).$$
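The optimal $F^{*}(\mathbf{x})$ is defined on the entire population. However, we only have a training sample of observed data so we will construct an approximation $\hat{F}(\mathbf{x})$ to $F^{*}(\mathbf{x})$ that minimizes the expected loss on this training set. We assume that the model $\hat{F}(\mathbf{x})$ has the form of a linear combination of $K$ base learners $\{f_k(\mathbf{x})\}_{k=1}^{K}$:
$$\hat{F}(\mathbf{x}) = a_0 + \sum_{k=1}^{K} a_k f_k(\mathbf{x}).$$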
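The next step is to find coefficients $\mathbf{a} = (a_0, a_1, \dots, a_K)$ that minimize the risk (1). Like $F^{*}(\mathbf{x})$, the risk is defined over the entire population, so we will use the approximation $\hat{\mathbf{a}}$ that minimizes the risk over the given sample set of observations $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$. In particular, we take $\hat{\mathbf{a}}$ to be the solution of
$$\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \frac{1}{N} \sum_{i=1}^{N} L\Big(y_i,\; a_0 + \sum_{k=1}^{K} a_k f_k(\mathbf{x}_i)\Big). \tag{5}$$
If the loss function, $L$, is taken to be the mean squared error then this is simply a linear regression problem.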
In many cases, a solution to equation (5) is not the best for constructing a sparse interpretable model or a predictive model that is not overfit to the training data. Instead, one would like to have a solution that has as few components as possible. To achieve a sparse solution, a penalty term can be included that prevents less influential terms from entering the model. Here, we use the $\ell_1$ (lasso) penalty and the approximation $\hat{\mathbf{a}}$, which is the solution to the penalized problem
$$\hat{\mathbf{a}} = \arg\min_{\mathbf{a}} \frac{1}{N} \sum_{i=1}^{N} L\Big(y_i,\; a_0 + \sum_{k=1}^{K} a_k f_k(\mathbf{x}_i)\Big) + \lambda \sum_{k=1}^{K} |a_k|, \tag{6}$$
and enables both estimation of the coefficients and coefficient selection.
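The lasso-penalized sample risk described above can be evaluated directly for a candidate set of coefficients. The following is a minimal Python sketch, not the implementation used in this study. It assumes the squared-error ramp loss, in which the model output is clipped to [-1, 1] before squaring the error; the function names and the argument `lam` (the penalty weight) are hypothetical.

```python
def ramp_loss(y, f):
    """Assumed squared-error ramp loss: (y - clip(f, -1, 1))**2 for y in {-1, +1}."""
    clipped = max(-1.0, min(1.0, f))
    return (y - clipped) ** 2


def model_output(a0, coefs, learner_values):
    """Linear combination of base learner outputs plus an intercept."""
    return a0 + sum(a * f for a, f in zip(coefs, learner_values))


def penalized_risk(a0, coefs, learner_matrix, labels, lam):
    """Mean loss over the sample plus lam times the l1 norm of the coefficients.

    learner_matrix[i][k] holds the value of base learner k on observation i.
    The intercept a0 is not penalized.
    """
    n = len(labels)
    data_term = sum(
        ramp_loss(y, model_output(a0, coefs, row))
        for row, y in zip(learner_matrix, labels)
    ) / n
    penalty = lam * sum(abs(a) for a in coefs)
    return data_term + penalty
```

With two learners that perfectly separate two observations, the data term vanishes and only the penalty remains, which illustrates how a larger `lam` pushes coefficients toward zero.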
This section provided a brief introduction to the methods used in this study that were developed by Friedman and Popescu [14, 15, 16]. Other papers provide more details and justification of the rule ensemble method [14, 16] as well as the method that is used to assemble the rules in the latter part of the algorithm. Additional sources also provide more details for the other algorithms that we employed to compute the coefficients [13, 17, 28].
2 Base Learners
The base learners in the linear model of Section 1.1 can be of many different forms. Decision trees, which have been used alone as classification models, have been used as base learners in ensemble methods such as bagging, random forests, and boosting. Decision trees are a natural choice to use for a learner, as many small trees (meaning each tree has few leaves) can be built quickly and then combined into a larger model. The bagging method grows many trees, then combines them with equal weights. Boosting is more sophisticated as it tries to build the rules in an intelligent manner, but it still gives each tree an equal weight in the ensemble.
2.1 Using Rules as Base Learners
In the rule ensemble method, simple rules denoted by $r_k(\mathbf{x})$ are used as the base learners and take the form
$$r_k(\mathbf{x}) = \prod_{j=1}^{n} I(x_j \in s_{jk}), \tag{7}$$
where $I(\cdot)$ is an indicator function. The indicator function evaluates to $1$ if the observed attribute value $x_j$ is in the parameter space defined by $s_{jk}$, and $0$ if the observation is not in that space. Each $s_{jk}$ is a constraint that the $k$th rule assigns to the $j$th attribute. For convenience we will denote
$$\mathbf{p}_k = (s_{1k}, s_{2k}, \dots, s_{nk})$$
to be the vector of parameter constraints that an observation must meet to have the $k$th rule evaluate to $1$. Note that a given rule can have multiple constraints on a single attribute, as well as a different number of constraints (indicator functions) than other rules. To emphasize that each rule is defined by a set of parameters we can write $r_k(\mathbf{x}) = r(\mathbf{x}; \mathbf{p}_k)$.
To fit a model we need to generate rules by computing parameter sets $\{\mathbf{p}_k\}_{k=1}^{K}$. In this study, we will use decision trees to generate rules, where each internal and terminal node (not the root node) of a decision tree takes the form of a simple rule defined by (7). Having $r_k(\mathbf{x}_i) = 1$ means that the $k$th rule is obeyed by the $i$th observation and that it was sorted into the $k$th node of the decision tree that generated the rule.
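To make the rule form concrete, the sketch below evaluates a rule as a product of indicator functions. It is illustrative only: each constraint is assumed to be a half-open interval `lower <= x[j] < upper` on one attribute, and the names `make_rule` and `rule_matrix` are hypothetical.

```python
def make_rule(constraints):
    """Build a rule from (attribute_index, lower, upper) interval constraints.

    The rule returns 1 when every constraint holds (the product of the
    indicators is 1) and 0 otherwise. An attribute may appear more than once.
    """
    def rule(x):
        return int(all(lower <= x[j] < upper for j, lower, upper in constraints))
    return rule


def rule_matrix(rules, observations):
    """Entry [i][k] is the value of rule k on observation i."""
    return [[rule(x) for rule in rules] for x in observations]
```

For example, `make_rule([(0, 0.0, 1.0), (2, 5.0, float("inf"))])` corresponds to a tree node reached by splitting on attribute 0 and then on attribute 2. The resulting 0/1 matrix is what the later regression step would use as its design matrix.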
2.2 Tree Construction - Rule Generation
Let $T_m$ denote the set of rules contained in the $m$th tree, which has $t_m$ terminal nodes. Let $T_m(\mathbf{x}_i)$ denote the prediction that the $m$th tree makes for observation $\mathbf{x}_i$; it is the evaluation of the rules in $T_m$ on $\mathbf{x}_i$.
Each tree is built on a random subset of observations $S_m(\eta) \subset \{\mathbf{x}_i\}_{i=1}^{N}$, as training on the entire dataset can be expensive and can overfit the tree. Here $\eta$ is a parameter that controls the diversity of the rules by defining the number of observations chosen to be in the subset $S_m(\eta)$. As subset size decreases, diversity increases with potentially less global behavior getting extracted. Diversity between the trees can also be increased by varying the final size of each tree. Clearly larger trees include more precise rules defining terminal nodes and thus are inclined to overtrain, but confining the size of a tree too strictly can prevent it from capturing more subtle behavior within the dataset. To avoid under- or overfitting, we grow each tree until it has $t_m$ terminal nodes, where $t_m$ is drawn from an exponential distribution. The distribution has mean $\bar{L}$, which does have to be selected a priori. The size of a tree is determined by growing each branch until no further nodes can be split because one of the following termination conditions has been met:
The number of observations in a terminal node is less than some selected cutoff,
The impurity of a node is less than a selected cut off,
The total number of nodes in the tree is greater than .
The splitting attribute and value are chosen as the split that minimizes the sum of the impurities (variance of the node) of the two child nodes if that split were taken. For each split only a random sample of attributes is considered in order to both increase the diversity of learners and decrease training time for huge datasets.
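The split selection and tree sizing just described can be sketched as follows. This is a simplified illustration under stated assumptions: candidate thresholds are midpoints between consecutive distinct attribute values, the score is the unweighted sum of the two child variances, and the tree size is taken as two plus the integer part of an exponential draw (a convention borrowed from Friedman and Popescu that requires a mean above 2). All function names are hypothetical.

```python
import random


def variance(values):
    """Population variance, used here as the node impurity."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(X, targets, n_candidate_attrs, rng):
    """Return (impurity_sum, attribute, threshold) for the best split, or None.

    Only a random sample of attributes is examined at each split.
    """
    n_attrs = len(X[0])
    candidates = rng.sample(range(n_attrs), min(n_candidate_attrs, n_attrs))
    best = None
    for j in candidates:
        values = sorted(set(row[j] for row in X))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(X, targets) if row[j] <= threshold]
            right = [t for row, t in zip(X, targets) if row[j] > threshold]
            score = variance(left) + variance(right)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    return best


def sample_tree_size(mean_leaves, rng):
    """Draw a number of terminal nodes; assumes mean_leaves > 2."""
    return 2 + int(rng.expovariate(1.0 / (mean_leaves - 2)))
```

An attribute with a single distinct value contributes no candidate thresholds, so `best_split` returns `None` when no sampled attribute can be split.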
2.3 Gradient Boosting
To avoid simply regrowing overlapping rules, with no further predictive capability, we use gradient boosting to intelligently generate diverse rules. With gradient boosting, each tree is trained on the pseudo residuals of the risk function evaluated on the training subsample rather than training directly on the data. The $i$th element of the pseudo residual vector in the $m$th iteration is given by
$$\tilde{y}_i^{(m)} = -\left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F = F_{m-1}} \tag{8}$$
for all $\mathbf{x}_i$ in the subsample $S_m(\eta)$. Each $\tilde{\mathbf{y}}^{(m)}$ is a vector with as many entries as there are observations in the subsample on which it is evaluated. $F_{m-1}(\mathbf{x})$ is the memory function at the $(m-1)$th iteration. It gives a label prediction based on all the previous learners (trees) that were built. Note that $F_m(\mathbf{x})$ is an intermediate model of trees that is used in rule generation, while $\hat{F}(\mathbf{x})$ is the final prediction model that has rules as linear terms. Training on the pseudo residuals allows one to account for what the previous trees were unable to capture. This method is similar to the method of regressing on residuals in multidimensional linear regression. Using pseudo residuals also provides another termination condition. If the pseudo residuals shrink below a chosen value, enough behavior has been captured and no further rules are generated. A shrinkage parameter, $\nu$, controls the dependency of the prediction on the previously built learners. Using $\nu = 0$ results in no dependence on past calculations, so that the next rule is built directly on the data labels, with no part of the labeled value “accounted for” by previous calculations.
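|Rule Generation Algorithm|
|Input: data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$ where $\mathbf{x}_i \in \mathbb{R}^{n}$ and $y_i \in \{-1, +1\}$|
|select random subset $S_m(\eta)$|
|select number of terminal nodes $t_m$|
|calculate pseudo residuals $\tilde{\mathbf{y}}^{(m)}$ with (8)|
|build tree $T_m$ on $\tilde{y}_i^{(m)}$ for all $\mathbf{x}_i \in S_m(\eta)$|
|End if $\tilde{\mathbf{y}}^{(m)}$ small enough|
|or total rules $\geq$ max|
|Return: rules $\{r_k(\mathbf{x})\}_{k=1}^{K}$|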
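The boosting loop behind rule generation can be sketched as below. This is a hedged illustration rather than the algorithm as implemented: it assumes the squared-error ramp loss, omits the random subsampling and tree-size draw, and takes the base learner as a caller-supplied function `fit_learner` (a hypothetical stand-in for tree construction). The argument `shrinkage` plays the role of the shrinkage parameter.

```python
def ramp_pseudo_residual(y, f):
    """Negative derivative of (y - clip(f, -1, 1))**2 with respect to f."""
    if -1.0 < f < 1.0:
        return 2.0 * (y - f)
    return 0.0  # the clipped loss is flat outside (-1, 1)


def boost(X, labels, fit_learner, n_learners, shrinkage, tol=1e-8):
    """Fit learners sequentially on pseudo residuals of the memory function.

    fit_learner(X, residuals) must return a callable mapping x to a number.
    Returns the fitted learners and the final memory values on X.
    """
    memory = [0.0] * len(labels)  # memory function starts at zero
    learners = []
    for _ in range(n_learners):
        residuals = [ramp_pseudo_residual(y, f) for y, f in zip(labels, memory)]
        if max(abs(r) for r in residuals) < tol:
            break  # pseudo residuals are small enough; stop generating
        learner = fit_learner(X, residuals)
        learners.append(learner)
        memory = [f + shrinkage * learner(x) for f, x in zip(memory, X)]
    return learners, memory
```

Setting `shrinkage` to zero leaves the memory function unchanged, so every learner is fit to the same residuals and does not depend on earlier learners.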
3 Weighting Rules
To combine the rules into a linear model, we need to approximate the coefficients $\hat{\mathbf{a}}$ defined in equation (6). Here we implement a method that approximates $\hat{\mathbf{a}}$ with an accelerated gradient descent method developed by Friedman and Popescu and summarized in Figure 2. We will refer to this method as Pathbuild, as it does not solve (6) explicitly, but rather constructs $\hat{\mathbf{a}}$ by starting with a null solution and then incrementing along a constrained gradient descent path, distinguished by a parameter $\tau$. Alternative algorithms for approximating $\hat{\mathbf{a}}$ will be discussed and compared later.
We would like to find a value for the lasso penalty $\lambda$ that yields the sparsest solution to (6) while maintaining a model with high accuracy. We initialize the coefficients to $a_k = 0$ for $k = 1, \dots, K$ and find the constant intercept $a_0$ by the value that minimizes
$$a_0 = \arg\min_{a} \sum_{i=1}^{N} L(y_i, a).$$
This may be better understood by considering that $a_0$ will be the mean of $y$ when the loss is mean squared error. We approximate $\hat{\mathbf{a}}$ iteratively and calculate the $(\ell+1)$st iteration, $\mathbf{a}^{(\ell+1)}$, by taking
$$k^{*} = \arg\max_{k} |g_k^{(\ell)}|.$$
We update the coefficients by
$$a_{k^{*}}^{(\ell+1)} = a_{k^{*}}^{(\ell)} - \delta\, g_{k^{*}}^{(\ell)}, \tag{9}$$
where $\mathbf{g}^{(\ell)}$ is the gradient of the risk calculated with the $\ell$th iteration and evaluated on the entire dataset. The scaling parameter $\delta$ can be set constant or chosen at each step in a clever manner. Note that in equation (9) only a single component of the coefficient vector $\mathbf{a}$ is updated at any iteration and thus only a single rule is able to enter the model at an iteration. The method only progresses in the direction of rules which have a large effect on the predictive capability and avoids steps that are of trivial effect. This condition may be relaxed by incrementing all of the components of $\mathbf{a}$ that have a sufficiently large gradient
$$a_k^{(\ell+1)} = a_k^{(\ell)} - \delta\, g_k^{(\ell)} \quad \text{for all } k \text{ such that } |g_k^{(\ell)}| \geq \tau \max_{j} |g_j^{(\ell)}|.$$
The parameter $\tau \in [0, 1]$ controls how large a component of the gradient must be relative to the largest component in order for a coefficient to be updated. Computing the gradient is expensive, but reorganizations and intelligent approximations to accelerate the computation are presented for three different loss functions in the appendix. The tricks used for this “fast” method are most effective for ramp loss and make Pathbuild a particularly attractive method.
In sections 6.2-6.4 we will compare Pathbuild with three different algorithms that can be used to solve for the coefficients. Each algorithm uses a slightly different formulation of the problem defined in equation (6) and a different technique to encourage a sparse solution that also has little risk. The three algorithms also use mean squared error to define loss rather than the ramp loss function that we use in Pathbuild.
|Pathbuild: Gradient Regularized Descent Algorithm|
|For max iterations|
|Stop if risk increased|
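A thresholded gradient descent of the kind Pathbuild uses can be sketched in a few lines. The sketch below is not the Pathbuild implementation: it swaps in mean squared error for the ramp loss, uses a constant step size, and monitors the risk on the training data. The names `pathbuild_sketch`, `tau`, and `step` are hypothetical.

```python
def pathbuild_sketch(R, y, tau, step, max_iter):
    """Thresholded gradient descent on mean squared error.

    R[i][k] is the value of rule k on observation i and y holds the labels.
    Only coefficients whose gradient magnitude is at least tau times the
    largest gradient magnitude are updated at each iteration.
    """
    n, k = len(y), len(R[0])
    a0 = sum(y) / n  # best constant under squared error
    a = [0.0] * k    # null solution

    def predict(coefs, row):
        return a0 + sum(c * r for c, r in zip(coefs, row))

    def risk(coefs):
        return sum((yi - predict(coefs, row)) ** 2 for row, yi in zip(R, y)) / n

    current = risk(a)
    for _ in range(max_iter):
        resid = [yi - predict(a, row) for row, yi in zip(R, y)]
        grad = [(-2.0 / n) * sum(res * row[j] for res, row in zip(resid, R))
                for j in range(k)]
        largest = max(abs(g) for g in grad)
        if largest == 0.0:
            break
        trial = [c - step * g if abs(g) >= tau * largest else c
                 for c, g in zip(a, grad)]
        new = risk(trial)
        if new > current:
            break  # stop if the risk increased
        a, current = trial, new
    return a0, a
```

With `tau = 1` only the coefficients tied for the largest gradient move, so rules enter the model one at a time; with `tau = 0` every coefficient is updated, which is ordinary gradient descent.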
4 Datasets and Methods for Experiments
To test the behavior of the rule ensemble method on a binary classification problem, we used a dataset of images taken by a telescope [4, 20, 23], the goal being to identify potential supernovas. The initial data had three images for each observation. Those images were processed to yield 39 statistics for each observation that described the distribution and color of individual pixels within the original three images. These statistics became the attributes for the dataset and the observations were labeled with +1, -1 if they were or were not, respectively, an image of a supernova-like object. The dataset contains a total of 5,000 positive and 19,988 negative observations.
To test how the rule ensemble works on the binary classification problem, we use a procedure that first randomly selects 2,500 positive observations and 2,500 negative observations for a training set, and then uses the remaining data for the testing set. This selection process is repeated 10 times for cross-validation. False positive and false negative error rates were used to assess the accuracy of the methods in addition to the overall error rate. The false positive rate is the ratio of observations misclassified as positive to the total number of negative observations in the test set, while the false negative rate is the ratio of observations misclassified as negative to the number of positive observations in the test set. The overall error rate is the ratio of observations misclassified to the total number of observations in the test set. The experiments show the effect of the rule complexity (tree depth), number of rules available (tree size), and thresholding in Pathbuild on the accuracy of the method. We also consider the effect of substituting different coefficient solvers in place of Pathbuild.
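The three error rates defined above can be computed as in the following sketch. It assumes labels coded as +1 and -1 and a test set that contains at least one observation of each class; the function name `error_rates` is hypothetical.

```python
def error_rates(true_labels, predicted_labels):
    """False positive, false negative, and overall error rates for +1/-1 labels."""
    pairs = list(zip(true_labels, predicted_labels))
    n_pos = sum(1 for t, _ in pairs if t == 1)
    n_neg = sum(1 for t, _ in pairs if t == -1)
    false_pos = sum(1 for t, p in pairs if t == -1 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == -1)
    return {
        # misclassified-as-positive over all negatives in the test set
        "false_positive_rate": false_pos / n_neg,
        # misclassified-as-negative over all positives in the test set
        "false_negative_rate": false_neg / n_pos,
        # all misclassifications over all observations
        "overall_error_rate": (false_pos + false_neg) / len(pairs),
    }
```

Because the test set here contains far more negative than positive observations, the overall error rate is dominated by the false positive rate, which is why the two class-wise rates are reported separately.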
To assess the overall utility of the rule ensemble we extend our numerical experiments to multi-class problems, which are described in section 5. We compare the rule ensemble with classical bagging and boosting methods by testing all three algorithms on 10 datasets from the UC Irvine Machine Learning Data Repository with five 2-fold cross-validation tests. A 2-fold cross-validation test is similar to the method described above except that the dataset is split into two equally sized subsets with the proportion of observations in each class the same in both subsets. Then one set is used for training and the other for testing, and then the sets are switched and retrained and retested. The datasets are briefly described in Table 1. The UC Irvine sets are chosen since they have been used in many machine learning studies [8, 22] and are used by Banfield et al. to compare bagging with boosting. The UC Irvine sets are taken from a wide variety of applications, so they also present a good breadth of data to test the versatility of methods.
Experiments using the rule ensemble method were run using Matlab™ 7.10 on a MacBook Pro with a 2.66 GHz Intel Core i7 processor.
5 Multiple Class Classification Results
The rule ensemble method is designed for binary classification problems, but many datasets contain multiple classes that one needs to identify. To be applicable to classification in general, we need to extend the rule ensemble to multi-class problems. Decision trees easily extend to multiple classes but the regression performed to assemble the rules in the rule ensemble prevents the rule ensemble from being extended to classification problems where the classes are not ordered. To identify multiple classes with the rule ensemble method we use the one-versus-all (OVA) classification technique that has been used for successfully extending many binary classification algorithms into multi-class algorithms [19, 25]. Other methods for extending binary classification algorithms to multiple class problems exist, such as all-versus-all classification. However, these methods require a large number of models to be built and are thus more expensive than OVA and frequently provide no more utility than the OVA classification method.
For a problem with $C$ classes, OVA classification performs $C$ binary tests, where the $c$th test checks if an observation is a member of the $c$th class or not. Each observation gets a vector label prediction $\hat{\mathbf{y}} = (\hat{y}_1, \dots, \hat{y}_C)$, where each entry $\hat{y}_c$ is from the binary test classifying the $c$th class versus any other class. The prediction is a vector of -1’s with a single positive entry. The index of the positive entry is the class that the observation is predicted to be from.
To extend the rule ensemble method we perform $C$ binary tests and each test returns a real valued prediction $\hat{F}_c(\mathbf{x})$. In the binary problem the label is predicted to be the sign of the real value returned. However, in this setting it is possible that one of the binary models will misclassify the observation and result in $\hat{F}_c(\mathbf{x})$ being positive for more than one value of $c$. If we just took the sign of each $\hat{F}_c(\mathbf{x})$ then we would have a vector with multiple positive entries, indicating the observation was in multiple classes. In the event that $\hat{F}_c(\mathbf{x})$ is positive for more than one value of $c$, we take the prediction to be the class that has the most definitive prediction, i.e. the class where $\hat{F}_c(\mathbf{x})$ is greater than any other class label prediction. Choosing the largest label prediction is sensible, since the more confident the algorithm is that the observation is in a certain class, the closer to $+1$ the label prediction will be. The closer to $0$ a class prediction is, the less certain the algorithm is of the observation’s class affinity.
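The tie-breaking step of the OVA scheme amounts to taking the class with the largest real-valued output. A minimal sketch follows; the function names are hypothetical, and applying the same largest-output choice when no binary model returns a positive value is an assumption that goes beyond the case discussed above.

```python
def ova_predict(scores):
    """Index of the class whose binary model gives the largest output.

    scores[c] is the real-valued prediction of the model for class c
    versus all other classes.
    """
    return max(range(len(scores)), key=lambda c: scores[c])


def ova_label_vector(scores):
    """Vector of -1 entries with a single +1 at the predicted class."""
    winner = ova_predict(scores)
    return [1 if c == winner else -1 for c in range(len(scores))]
```

For instance, if two binary models both return positive values of 0.2 and 0.9, the observation is assigned to the class with the output 0.9.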
Here we compare the rule ensemble method, using Pathbuild, with results from bagging and boosting tree ensemble methods. To compare we employ 10 datasets from the UC Irvine data repository and the testing method parameters previously used to compare various ensemble methods. Bagging uses 1000 trees, boosting uses 50 and both employ random forests for growing trees in five 2-fold cross validations. Tree ensemble labels can be estimated by a voting procedure, in which the prediction is the class that most of the trees predict the observation to be part of, or by an averaging procedure, in which the label is the average of the predictions made by all the trees. Results for both methods are presented. Minimal tuning was used to run the rule ensemble method on different datasets.
5.1 Results Using OVA Classification on Vehicle Dataset
Figure 3 compares using the rule ensemble and bagging on the vehicle dataset. Bagging here is used in an OVA classification scheme rather than in its standard, direct multiple classification method. The error at predicting any given label in the set is shown. As can be seen in Figure 3, the rule ensemble beats bagging for the majority of the classes. Figure 3 also shows the varying level of success that the ensemble techniques had at predicting each class. Some classes are easier to identify than others (e.g. “opel” is easier to distinguish than “van”). Different ensembles were better suited to one class versus another, and which ensemble was better for a class was not consistent for all classes in a dataset.
5.2 Results Using OVA Classification on All Datasets
The results of the multiple class tests are given in Table 2. The rule ensemble is much stronger than the tree ensembles if averaging of each tree’s label prediction is used for classification. However, if the trees vote on which class label is best, then the rule ensemble is better on some datasets but not others. Voting clearly was better at label prediction than averaging base learner predictions, but neither boosting nor bagging provided a universal win over the rule ensemble, as can be seen in Figure 4. What is not apparent in Table 2 is that the rule ensemble was a much better predictor for binary labels than the tree ensembles. This result is apparent in Figure 3 where nearly every individual class is better predicted by the rule ensemble method. Figure 5 shows the accuracy of the rule ensemble method with different coefficient solvers. Some datasets are easier to classify (larger percent of data correctly classified) while others, such as the #2 dataset glass, were more difficult to classify for all the methods.
|Number of wins||1||1||1||3||0||5||0|
6 Binary Classification Results
6.1 Rule Ensemble with Pathbuild
Our implementation of the algorithm Pathbuild for approximating the rule coefficients in the rule ensemble method is described in Figure 2. The coefficients are found by solving equation (5) with a constrained gradient descent method. In this method, each iteration only advances in directions where the components of the gradient have magnitude at least some fraction $\tau$ of the absolute value of the largest gradient component. Note that the set of directions we advance in,
$$\left\{ k : |g_k^{(\ell)}| \geq \tau \max_{j} |g_j^{(\ell)}| \right\},$$
where $g_k^{(\ell)}$ is the $k$th component of the gradient at iteration $\ell$, can change at every iteration. By not advancing in directions that have little change in the risk function, the expense of updating coefficients for variables of little importance is avoided. Not updating rules of little importance prevents the coefficient value for that rule from “stepping” off zero, so that variable is effectively kept out of the model, allowing for a simpler model. Lower values of $\tau$ should include more rules in the model. The most inclusive model is when $\tau = 0$, which is equivalent to using a basic gradient descent method to get a standard regression. Larger values of $\tau$ decrease the total number of rules used in the model. The most constrained model occurs when $\tau = 1$.
Effect of Number of Rules and Tree Size
In Figure 6 we see how the size of the trees and the number of rules used for the model affect the accuracy of the model. The decision trees are used to generate rules. Larger decision trees yield more complex rules than small trees because large trees have nodes that are deeper. Nodes deep in a tree capture subtle interactions within the training data since they depend on more splits and are more complex than nodes that are closer to the root node. Figure 6 shows that ensembles built with larger trees have higher error rates than ensembles that use smaller trees. The increase in error rate when larger trees are built shows that when the model uses more complex rules, the model overfits the training data. However, the size of the trees does not have a strong effect on how large an error rate the rule ensemble has. Further, the accuracy of the rule ensemble is highly variable and the variance increases when larger trees are built. Ensembles built with trees that average 40 leaves had 4-7% error, which is a large range when one considers that the mean classification error is only about 5.5%. This error is larger than and has more variance than the error when trees with an average of 5 leaves are built, which is 3-4.2% error. It is not clear why there is so much variance in the error rate in general. One should recall that the number of terminal nodes in the decision trees is exponentially distributed and only the mean of the distribution is changed, so there is a variety of tree sizes in each ensemble and of rule complexity in each ensemble. Because there is a variety of tree sizes there is some stability in the error rate as the mean size of the trees is changed.
The bottom of Figure 6 also shows that using more rules can decrease the mean error rate of the rule ensemble method as well as the variance in the error rate. Increasing the number of rules built from 100 to 600 allowed the ensemble to capture more behavior and, as a result, nearly halved the error rate of the method. However, the error rate only decreases down to a certain point, after which adding more rules does not improve the model. For our data set, the error decreases to under 5.0% when 600 rules are built, but does not continue to decrease substantially when more than 600 rules are used. We also see that the error rates between ensembles that are built on more rules have less variance than the error rates from ensembles that are built out of fewer rules. This result is reasonable, as having more rules gives the ensemble a better chance of finding good rules that successfully separate the data into classes.
In the initial tree building phase, a subsample of data is selected and a tree is grown on each random subsample. Our initial experiments took subsamples of 2,500 observations (50% of the total number of observations in the training set). When we decreased the subsample size to 500 observations (10% of the training set), error rates did not significantly change even for a variety of tree sizes that had between 5 and 80 terminal nodes. The lack of significant difference indicates that 500 observations give us a large enough sample to catch the same amount of behavior that is captured when larger subsamples of data are used to build each tree.
Effect of Using Rules Versus Linear Terms
In Figure 7 we see the effect of allowing the model to have linear dependencies on individual features. If only linear terms are used, then the model is a standard multiple linear regression. Allowing the model to be built with both linear terms and the rules generated by the trees yields a mixed model. Using rules for the regression terms provides a clear advantage over the standard regression model by reducing the error rate from nearly 30% error to less than 5%. The linear regression is also more biased in its error than the rule ensemble. This bias can be seen by the false negative rate being close to zero; this means nearly all the error is caused by mislabeling observations with negative labels. We would not expect a linear regression to capture any of the complex nonlinear behavior in the dataset, and the error rates show that such a conjecture is correct; rules are needed to get significant predictive capability.
Effect of Using the Threshold as Penalty
The variable $\tau$ controls how many directions are updated at each iteration of Pathbuild in the thresholded gradient descent method. The results of increasing $\tau$ are shown in Figure 8. The model becomes less accurate and the variance of the error rate increases as $\tau$ increases. An increase in $\tau$ causes a higher threshold that results in fewer terms being included in each iteration of the coefficient finding method and an ensemble model that is less accurate. It is interesting to note that within a certain range, decreasing $\tau$ further does not offer much increase in the predictive capability of the model. In this example, we see that when $\tau$ is between 0 and 0.3 there isn’t a large increase in error rate. This indicates that using a nonzero threshold within this range will not significantly compromise the accuracy of our model. This is a good result, as using a larger threshold decreases the computational expense of each iteration of the gradient descent method. The result that a threshold as large as $\tau = 0.3$ produces error rates similar to those for $\tau = 0$ means that we can get the same accuracy with less computation.
6.2 Rule Ensemble with Glmnet
In this experiment we use the Glmnet package, which returns approximations to solutions of elastic-net regularized general linear models, to solve for the coefficients within the rule ensemble method. Glmnet approximates a solution to the least squared error regression subject to an elastic net penalty, which is
$$\min_{\mathbf{a}} \frac{1}{2N} \sum_{i=1}^{N} \Big(y_i - a_0 - \sum_{k=1}^{K} a_k f_k(\mathbf{x}_i)\Big)^2 + \lambda P_{\alpha}(\mathbf{a}), \tag{10}$$
with a coordinate-wise gradient descent method. The elastic net is defined as
$$P_{\alpha}(\mathbf{a}) = (1 - \alpha)\, \tfrac{1}{2} \|\mathbf{a}\|_2^2 + \alpha \|\mathbf{a}\|_1$$
for $0 \leq \alpha \leq 1$. When $\alpha = 0$ the problem is referred to as ridge regression, and when we set $\alpha = 1$ we get the same problem as in equation (6). The coordinate-wise gradient descent method starts with the null solution, similar to Pathbuild, then cycles over the coefficients and uses partial residuals and a soft-thresholding operator to update each coefficient one by one. Glmnet has some modifications that also allow some parameters to be associated with and updated at the same time as neighboring parameters. The null solution corresponds to solving equation (10) with $\lambda = \lambda_{\max}$. As the coefficients are updated, $\lambda$ is decreased exponentially until the lower bound $\lambda_{\min}$, the desired and pre-specified penalty weight, is met. Glmnet calculates a set of coefficients along each increment of the path to $\lambda_{\min}$ and uses the previous solution as a “warm start” to approximate the next solution. Note that $\lambda_{\min}$ should be small enough to prevent the penalty from being so large that it causes the vector $\mathbf{a}$ to be overly sparse. However, $\lambda_{\min}$ should also be positive and large enough to ensure a sparse solution that is robust to the training data. A robust solution includes terms for interactions that are inherent to the application generating the data, not interactions that are only figments of the subset selected for training. It is not clear how to pick the penalty weight $\lambda_{\min}$ to maintain sparsity of the solution and prevent overfitting while also capturing enough characteristics of the dataset.
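The soft-thresholding update at the heart of coordinate-wise descent can be sketched for the pure lasso case. This is an illustrative sketch, not Glmnet itself: it assumes the objective is the squared error divided by twice the number of observations plus `lam` times the sum of absolute coefficients, omits the intercept, and runs at a single penalty value rather than along a path with warm starts. The function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma, returning 0 when |z| <= gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(A, y, lam, n_sweeps=100):
    """Cycle over coefficients, updating each from its partial residual.

    A[i][k] is the value of term k on observation i.
    """
    n, k = len(y), len(A[0])
    a = [0.0] * k  # null solution
    for _ in range(n_sweeps):
        for j in range(k):
            # residual with the contribution of coefficient j left out
            partial = [
                yi - sum(a[l] * row[l] for l in range(k) if l != j)
                for row, yi in zip(A, y)
            ]
            rho = sum(row[j] * p for row, p in zip(A, partial)) / n
            scale = sum(row[j] ** 2 for row in A) / n
            a[j] = soft_threshold(rho, lam) / scale if scale > 0 else 0.0
    return a
```

A coefficient whose correlation with the partial residual is below `lam` in magnitude is set exactly to zero, which is how the penalty removes weak terms from the model.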
Here we use the rules generated in the previous experiment with Glmnet and build models using the coefficients that are generated at each step of the path. Figure 9 shows how the accuracy of the method changes as the weight of the penalty used to find the coefficients changes. The solution with Glmnet when $\lambda$ is small results in slightly less error than the solution with Pathbuild when $\tau$ is small. The variance in the error rates from solutions found with Pathbuild is less than the variance of error rates from solutions found with Glmnet. Both solutions yield false positive rates that are more than twice as large as the false negative rates; this is probably a result of the ratio of positive to negative observations in the test set being small. The error rate slowly decreases as $\lambda$ decreases, but then the error rate stabilizes when $\lambda$ is very small. It is interesting that the variance in error rates of the solutions is relatively constant as $\lambda$ changes.
6.3 Rule Ensemble with Spgl1
In this experiment, we used the Spgl1 (sparse projected-gradient 1) Matlab™ package to solve for the coefficients in
At each iteration of the algorithm, a convex optimization problem is constructed, whose solution yields derivative information that can be used by a Newton-based root-finding algorithm . Each iteration of the Spgl1 method has an outer/inner iteration structure, where each outer iteration first computes an approximation to . The inner iteration then uses a spectral gradient-projection method to approximately minimize a least-squares problem with an explicit one-norm constraint specified by . Some advantages of the Spgl1 method are that only matrix-vector operations are required and numerical experience has shown that it scales well to large problems.
The results using Spgl1 are shown in Figure 10. The accuracy of the Spgl1 solution increases when increases. The error rates are similar to those found by Pathbuild and Glmnet, but slightly higher than Glmnet even when is large.
6.4 Rule Ensemble with Fpc
In this experiment, we used a fixed point continuation method (Fpc)  that approximates the solution in
This problem formulation seeks to minimize the weighted sum of the norm of the coefficients and the error of the solution, the left and right terms respectively. The sparsity of is controlled by the size of the weighting parameter . Increasing places more importance on minimizing the error, and reduces the ratio of the penalty to the error. The reduction of penalty importance allows more coefficients to become non-zero (the norm of the coefficients to increase) and thus find a closer fit to the problem. Equation (12) is simply a reformulation of problem (6) with the lasso penalty, and is referred to as a basis pursuit problem in signal processing. The relation of the two problems can clearly be seen if, for any value, is chosen to be
and equation (12) is multiplied by . Fpc was developed for compressing signals by extracting the central components of the signal.
Fpc exploits the properties of the norm and declares three equivalent conditions for reaching an optimal solution. Fpc uses the reformulations of the optimality conditions to declare a shrinkage operator , where is a shrinkage parameter that has both an effect on the speed of convergence and how many non-zero entries has. The operator acts on a supplied initial value (which we chose to be the null solution) and finds our solution through a fixed point iteration
The given condition for the threshold of is
Fpc forms a path of solutions that starts with initialized to (where is a ratio of possible optimal square error at the next step to the square error at the current step). The parameter is altered at each step, which forces the shrinkage parameter to expand and contract but the upper bound for is supplied by the user. All results presented here use Fpc with projected gradient steps and optionally using a variant of Barzilai-Borwein steps .
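A fixed point iteration with a shrinkage operator and a continuation path can be sketched as follows. This is a minimal illustration under stated assumptions, not the Fpc package itself. The objective is taken to be the one-norm of the coefficients plus `mu / 2` times the squared residual norm, the shrinkage parameter is taken to be `nu = tau / mu`, and the names `tau`, `mu`, `nu`, `growth`, and `inner_iters` are chosen for the example. The geometric growth of `mu` is a simplification; the text above describes a different rule for altering the parameter at each step.

```python
import math


def shrink(v, nu):
    """Component-wise soft-thresholding: sign(v_j) * max(|v_j| - nu, 0)."""
    return [math.copysign(max(abs(vj) - nu, 0.0), vj) for vj in v]


def residual_gradient(A, x, b):
    """Gradient of 0.5 * ||A x - b||^2, i.e. A^T (A x - b)."""
    r = [sum(aij * xj for aij, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]


def fpc(A, b, mu_final, tau, mu_start=None, growth=4.0, inner_iters=200):
    """Approximately minimize ||x||_1 + (mu_final / 2) * ||A x - b||^2.

    Starts from the null solution and, if mu_start is given, follows a path
    of increasing weights mu, warm-starting each stage from the previous one.
    """
    x = [0.0] * len(A[0])
    mu = mu_final if mu_start is None else min(mu_start, mu_final)
    while True:
        nu = tau / mu  # assumed relation between step length and threshold
        for _ in range(inner_iters):
            g = residual_gradient(A, x, b)
            # Gradient step on the error term, then shrinkage.
            x = shrink([xj - tau * gj for xj, gj in zip(x, g)], nu)
        if mu >= mu_final:
            return x
        mu = min(mu * growth, mu_final)
```

On an identity matrix with `b = [3.0, 0.5]` and `tau = 1.0`, a weight of 1 leaves only the first coefficient non-zero, while a weight of 4 lets the second coefficient step off zero as well, which is the sparsity behavior described above for increasing weight on the error term.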
The results of solutions generated by Fpc are shown in Figure 11. They are roughly as accurate as the solutions generated with the previous solvers. Fpc also has an explicit display of the thresholding as seen in Figure 13; the norm of the coefficients increases dramatically then asymptotically approaches a certain value. The asymptotic behavior is caused by the threshold constricting the coefficients and essentially preventing another coefficient from stepping off of zero. The thresholding is also seen in the error rate, which decreases as the weight on the mean squared error is increased but stabilizes once the training set is reasonably fit. The value of where the error stabilizes is the value needed to build the model, but unfortunately it is not clear how to choose this value of a priori. The need to select the penalty parameter is one of the difficulties that Fpc, Spgl1, and Glmnet share. Pathbuild has a similar problem in the need to select the gradient descent constriction parameter .
6.5 Identifying Important Attributes Via Rule Importance
Figure 11 shows that the rule ensemble method is quite successful at correctly classifying observations when all of the attributes are used to generate rules and build the model. Attributes have variable importance in the model and we suspect that not all of the 39 attributes in the full dataset are needed to model and correctly predict class labels. We want to use the rule ensemble method to select only the attributes that are important and save the expense of considering the other less important variables.
The importance of a rule is indicated by the magnitude of the coefficient for that rule. The larger a coefficient is in magnitude, the more important the corresponding rule is, as that rule will have a larger contribution to the model. To sift out the most important attributes, we look at which rules Fpc considered important at different values of . Rules are ordered by the magnitude of their corresponding coefficient and if a rule is one of the top 20 most important in a solution generated with a certain (13 values of we considered), then that rule receives a vote. An example of ordering the rules is in Table 3 where the 5 most important rules from one test with a given are ordered. Figure 14 shows for how many values of each rule was considered to be in the top 20 most important; this indicates that certain rules are important in solutions with all values of tried while others are considered important only when certain are used. This process is continued for 5 different cross-validation sets, which yields 5 sets of rules that were in the top 20 most important rules for at least one value of . The sets of rules are decomposed into sets of the attributes that were used to make up the rules in each set. Then we let the 5 repetitions vote on which attributes are needed to make the most influential rules and keep only the attributes that are in the set of important attributes for at least 3 out of the 5 repetitions. This set of attributes forms a smaller subset of the total attributes available in the initial dataset; it is the subset of attributes that are used in at least one of the most important rules in at least 3 of the 5 repetitions.
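The voting procedure can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the experiments: the data layout (each repetition given as a list of attribute tuples per rule plus one coefficient vector per penalty value), the function names, and the choice to ignore rules whose coefficient is exactly zero are assumptions.

```python
from collections import Counter


def top_rules(coefs, k=20):
    """Indices of the k largest-magnitude coefficients, skipping exact zeros."""
    ranked = sorted(range(len(coefs)), key=lambda r: abs(coefs[r]), reverse=True)
    return {r for r in ranked[:k] if coefs[r] != 0.0}


def rule_votes(coef_path, k=20):
    """For one repetition, count in how many solutions each rule is in the top k."""
    votes = Counter()
    for coefs in coef_path:
        votes.update(top_rules(coefs, k))
    return votes


def important_attributes(repetitions, k=20, min_votes=3):
    """Keep attributes used by a top-k rule in at least min_votes repetitions.

    Each repetition is a pair (rule_attrs, coef_path): rule_attrs[r] lists the
    attributes rule r is defined on, and coef_path holds one coefficient
    vector per penalty value tried.
    """
    attr_votes = Counter()
    for rule_attrs, coef_path in repetitions:
        attrs = set()
        for r in rule_votes(coef_path, k):  # rules in the top k at least once
            attrs.update(rule_attrs[r])
        attr_votes.update(attrs)  # one vote per attribute per repetition
    return {a for a, v in attr_votes.items() if v >= min_votes}
```

With `k=20` and `min_votes=3` over 5 repetitions this mirrors the thresholds described above; `rule_votes` gives the per-rule counts of the kind plotted in Figure 14.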
For the supernova dataset, the smaller subset of attributes included only 21 of the 39 original attributes. Tests were repeated using the same cross-validation sets and method parameters as were used in Figure 11, but using only the smaller subset of 21 attributes to train on rather than all 39 attributes. Figure 15 compares the error rate of the method when 21 attributes were used with the error rate of the method when all 39 attributes were used. The results show that the accuracy of the method improves when we reduce the number of attributes used in the model. The method successfully ranks rules and identifies more important attributes. The method loses accuracy when the less important features are included; in essence, the extra attributes act as noise. After the method identifies these attributes as less important and we remove them, the method is able to return an even more accurate model along with insight into which attributes are not adding predictive capability to the model. Garnering better accuracy with fewer attributes may allow the extra attributes to be excluded from the data collection, which will save time in collecting data, save space in storing data, and allow an overall better analysis.
We compared several variations of a rule ensemble method with some well-known tree ensemble methods, namely boosting and bagging, on a variety of multi-class problems. We extended the rule ensemble to work on multi-class problems by using the OVA technique and found that with this extension the rule ensemble method performed comparably to the tree methods on a set of 10 classical datasets. This result highlights the power of the rule ensemble method, as we had expected the tree ensemble methods to do better on multi-class problems. Tree ensembles can use multi-class decision trees, which provide what one would think is a more natural extension to multi-class problems than using the OVA method. However, the rule ensemble method returned comparable rates of accuracy on most datasets and even performed better on some of the datasets. The discrepancy between the tree ensembles with voting and the rule ensemble was larger on problems that had a relatively large number of labels, such as the pendigits dataset, which had the most labels out of all the datasets, than on datasets with fewer labels. To improve the accuracy of the rule ensemble on problems with many classes, we would like to try using multi-class decision trees to build the rules and then relabel the nodes for each binary problem. This technique might yield better rules as it would allow for differentiation between the classes in the rule building phase. Better rules would then allow for a clearer separation of binary labels in the regression phase. This technique would also make the training phase more efficient as it would only require one set of rules to be constructed rather than as many sets of rules as there are classes.
We also looked at using 4 different methods to find coefficients to assemble the rules. All 4 methods present the challenge of needing to select a constraint parameter that controls the sparsity/accuracy trade-off of the solution that they return. If each parameter is chosen correctly then the methods are capable of producing coefficients that allow for similar accuracy in the model. The different approaches that the methods take for finding the coefficients do result in slightly different rankings of the rules. The difference in coefficients that each method considers important is shown in Figure 16. Ideally all solvers would select the same terms to be the most significant and would order the terms by importance the same way. Figure 16 shows that some rules that one method considers important are not considered to be important to another method. Fpc and Spgl1 order coefficients similarly, which is indicated by Spgl1 giving a significant magnitude to coefficients that Fpc also gives a significant magnitude to. Glmnet’s and Pathbuild’s ordering share less similarity with Fpc and Spgl1 as indicated by coefficients such as 9 and 18 that Glmnet and Pathbuild give a significant magnitude to, but both Fpc and Spgl1 give trivial values to. The difference in methods is also reflected in the sparsity of the solutions that they return. To achieve similar accuracy (taken here at 96% accuracy) Pathbuild returns a solution with 40-50% of the coefficients non-zero while the other methods return much sparser solutions that have only 12-19% of the coefficients non-zero. In general, Spgl1 returned the sparsest solutions and Pathbuild returned the least sparse solutions for models with similar error rates.
As a final step, we showed the utility of the rule ensemble method for identifying important attributes in a dataset containing images of potential supernovas. The rule ensemble method has the benefit over tree methods of providing insight into a dataset by returning weighted rules. Rules with large weights have a larger effect on the model and thus can be thought of as more important than other rules. We used the importance of such rules to alert us to the more significant features in the dataset by looking at which features the important rules are defined on. This technique allowed us to select 21 attributes out of the 39 available and reduce the error rate of the model by building models only on the reduced set of attributes. Traditional algorithms that use ensembles of decision trees, such as boosting and bagging, are not able to provide this insight into the importance of certain variables of a dataset because they do not rank or weight rules.
The rule ensemble method has the advantage over some other methods by being able to identify relationships and hierarchies between variables to a certain extent when building the decision trees. The rules in the decision trees get more complex the deeper the tree is grown and also are able to have limited support in the parameter space, so they only affect certain observations that fall in that space. By including more variables, complex rules can be seen as resembling discrete correlations, and the post-processing of the rules allows for overly simplified correlations (that precede more complex rules in depth) to be removed from the model. The post-processing also allows for overly complex rules to be pruned from the model. Thus some variable interactions can be captured by the rule ensemble method without any a priori assumption that they exist, as is needed in standard regression models, and excessive computation is not spent considering correlations that do not exist.
We do not compare the computational efficiency of the rule ensemble method with tree ensemble methods here, since it is currently written in Matlab™, while the tree ensemble methods used are written in C. However, we do not expect that the rule ensemble method will reduce the amount of time necessary for the training portion of the algorithm to run because it must perform the coefficient solving method in addition to the tree growing. If the rule ensemble method is able to prune a substantial number of repetitive or unnecessary rules, then it is likely to run substantially more quickly than the tree methods. Comparing the time efficiency of the rule ensemble with other tree methods and other machine learning techniques will be part of future work. We do not present the computational efficiency of the coefficient solving methods used in the rule ensemble method for the same reason. Each solver is written in a different programming language, and each will have to be implemented in the same language and level of optimization before a meaningful study can be performed.
We would like to thank Sean Peisert and Peter Nugent for their valuable comments and suggestions.
Here we discuss the gradient method Pathbuild, which is described in section 3, in greater detail. Simplifications of the gradient method are presented and considered as the “fast method”.
Appendix A Derivation of the Negative Gradient of Risk
The negative gradient of the loss on the observations is found by taking partial derivatives of the sum of the loss on each observation with respect to each coefficient. The components of the negative gradient are given by
where . Note that as is the constant intercept that minimizes the risk when and all the other coefficients have not moved off their initial zero value. are the non trivial components of the gradients.
Note that the second term is easily computed from the linear form of and is given by
A.1 Negative gradient when squared error ramp loss is used
The previous discussion has been generalized for the use of any loss function . Now consider the case when the loss function is given by
which is the squared error ramp loss for the -th observation. We want to find the derivative with respect to for this loss function. Begin by taking a partial derivative with respect to
Rearranging, switching the order of summation, and evaluating at the -th step in the approximation of we can write the gradient at the -th step as
A.2 Negative gradient with auxiliary functions ,
We need to keep track of the dependencies and update properly at each iteration. The goal of the method is to update the coefficients . We take a step with respect to and then update everything, so let act as the independent variable. Recall that is the index over the observations so is the attribute values for the -th observation and is the predicted value for that observation. This leaves us with
Defining the indicators
we can define a new function by
where and are scalars and . Using the two functions the negative gradient at the -th step (20) can be written in a simpler form
Appendix B Fast Algorithm
To “step” we move proportional to the largest component of the negative gradient (13). Let be the largest absolute component of the gradient
at the -th step. Then call the length of the next step and update the coefficients with . The coefficients at the -th step are
After a step the gradient must be recomputed before another step can be taken. Rather than fully recomputing an update can be applied only to the components of the gradient that are affected by the step. There are two cases of how the update to the gradient can be made. One update occurs when the step in the coefficients has caused indicator functions to change; this update requires more work and is expensive. The other update is cheap and is given as follows.
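Before turning to the two cases, the step and both ways of refreshing the gradient can be sketched in Python. This is a hedged illustration, not the Pathbuild code. It assumes the squared error ramp loss clips the prediction to the interval from -1 to 1, so that observation `i` contributes to the gradient only while `abs(F[i]) < 1`; it assumes the risk is the mean loss, which gives the factor `2 / n`; and it reads "move proportional to the largest component" as a step of length `dnu * g[j]`. The matrix `R` of rule values, the intercept `a0`, and all function names are hypothetical.

```python
def predictions(R, a0, a):
    """Model output a0 + sum_k a[k] * R[i][k] for every observation i."""
    return [a0 + sum(rik * ak for rik, ak in zip(row, a)) for row in R]


def indicators(F):
    """1.0 where the clipped prediction is not saturated, else 0.0."""
    return [1.0 if abs(Fi) < 1.0 else 0.0 for Fi in F]


def negative_gradient(R, y, F):
    """Full recomputation of the negative gradient of the mean ramp loss."""
    n, K = len(R), len(R[0])
    on = indicators(F)
    return [2.0 / n * sum(on[i] * (y[i] - F[i]) * R[i][k] for i in range(n))
            for k in range(K)]


def step(a, g, dnu=0.01):
    """Move only the coefficient with the largest absolute gradient component."""
    j = max(range(len(g)), key=lambda k: abs(g[k]))
    delta = dnu * g[j]
    a = list(a)
    a[j] += delta
    return a, j, delta


def cheap_update(R, F_before, g, j, delta):
    """Gradient after the step, valid only if no indicator changed.

    Each prediction moves by delta * R[i][j], so component k of the gradient
    drops by delta * (2 / n) * sum_i on[i] * R[i][j] * R[i][k].
    """
    n = len(R)
    on = indicators(F_before)
    return [gk - delta * 2.0 / n * sum(on[i] * R[i][j] * R[i][k] for i in range(n))
            for k, gk in enumerate(g)]
```

When the indicators are unchanged by a step, `cheap_update` agrees with a full call to `negative_gradient` on the new predictions, and it only needs inner products between rule columns, which can be cached.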
B.1 Case when indicators do not change
The step size should be small; in practice it is taken to be 0.01. The idea is that with a small stepsize will not exceed 1 “often.” On the steps where this is true the indicators do not change so do not change and the negative gradient at the -th step is found by substituting (23) into (22)
B.2 Case when indicators change - adjustments
If the assumption fails and the indicators change on a step, then and (24) does not hold. To find , consider the cases of how can change and and define the variable
can be thought of as adding in observations where the indicators have turned on and subtracting observations where indicators have turned off. Using , can be adjusted
With a little more rearrangement the update to the gradient can be written as
-  R.E. Banfield, L.O. Hall, K.W. Bowyer, and D. Bhadoria. A comparison of ensemble creation techniques. Lecture Notes in Computer Science, 3077:223–32, 2004.
-  J. Barzilai and J. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141–8, 1988.
-  C.L. Blake and C.J. Merz. UCI Repository of Machine Learning Databases. UC Irvine, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
-  JS Bloom, JW Richards, PE Nugent, RM Quimby, MM Kasliwal, DL Starr, D. Poznanski, EO Ofek, SB Cenko, NR Butler, et al. Automating discovery and classification of transients and variable stars in the synoptic survey era. Arxiv preprint arXiv:1106.5491, 2011. http://arxiv.org/abs/1106.5491.
-  L. Breiman. Bagging predictors. Machine learning, 24(2):123–40, 1996.
-  L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and regression trees. Chapman & Hall/CRC, 1998.
-  R. Brun and F. Rademakers. Root - an object oriented data analysis framework. Nucl. Inst. & Meth. in Phys. Res., pages 81–6, 1997. http://root.cern.ch/.
-  T.G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2):139–57, 2000.
-  D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: asymptopia? (with discussion). Journal Royal Statistical Society, 57(2):201–37, 1995.
-  Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning- International Workshop then Conference, pages 148–56, 1996.
-  J.H. Friedman. RuleFit with R. Technical report, Department of Statistics Stanford University, 2005. http://www-stat.stanford.edu/~jhf/R-RuleFit.html.
-  J.H. Friedman, T. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–32, 2007.
-  J.H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Technical report, Department of Statistics Stanford University, 2008. http://www-stat.stanford.edu/~jhf/ftp/glmnet.pdf.
-  J.H. Friedman and B.E. Popescu. Importance sampled learning ensembles. Technical report, Department of Statistics Stanford University, 2003. http://www-stat.stanford.edu/~jhf/ftp/isle.pdf.
-  J.H. Friedman and B.E. Popescu. Gradient directed regularization. Technical report, Department of Statistics Stanford University, 2004. http://www-stat.stanford.edu/~jhf/ftp/pathlite.pdf.
-  J.H. Friedman and B.E. Popescu. Predictive learning via rule ensembles. Annals of Applied Statistics, 2(3):916–54, 2008.
-  E. Hale, W. Yin, and Y. Zhang. A fixed-point continuation method for $\ell_1$-regularized minimization with applications to compressed sensing. Technical report, Rice University CAAM, 2007. http://www.caam.rice.edu/~zhang/reports/tr0707.pdf.
-  T. Hastie, R. Tibshirani, and J.H. Friedman. Elements of Statistical Learning. Springer, 2001.
-  C.W. Hsu and C.J. Lin. A comparison of methods for multiclass support vector machines. Neural Networks, IEEE Transactions on, 13(2):415–25, 2002.
-  N.M. Law and S.R. Kulkarni et al. The Palomar Transient Factory: System Overview, Performance, and First Results. Publications of the Astronomical Society of the Pacific, 121:1395–408, 2009. http://adsabs.harvard.edu/abs/2009PASP..121.1395L.
-  J. Meza and M. Woods. A numerical comparison of rule ensemble methods and support vector machines. Technical report, Lawrence Berkeley National Laboratory, 2009.
-  J.R. Quinlan. Bagging, boosting, and C4.5. Proceedings Thirteenth American Association for Artificial Intelligence National Conference on Artificial Intelligence, pages 725–30, 1996.
-  A. Rau and S.R. Kulkarni et al. Exploring the Optical Transient Sky with the Palomar Transient Factory. Publications of the Astronomical Society of the Pacific, 121:1334–51, 2009. http://adsabs.harvard.edu/abs/2009PASP..121.1334R.
-  R. Rifkin and A. Klautau. In defense of one-vs-all classification. The Journal of Machine Learning Research, 5:101–41, 2004.
-  A.C. Tan, D. Gilbert, and Y. Deville. Multi-class protein fold classification using a new ensemble machine learning approach. Genome informatics series, pages 206–17, 2003.
-  R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58(1):267–88, 1996.
-  E. van den Berg and M. P. Friedlander. SPGL1: A solver for large-scale sparse reconstruction, 2007. http://www.cs.ubc.ca/labs/scl/spgl1.
-  E. van den Berg and M. P. Friedlander. Probing the pareto frontier for basis pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890–912, 2008. http://link.aip.org/link/?SCE/31/890.
-  I.H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann Pub, 2005.