Iterative subtraction method for Feature Ranking

06/13/2019 ∙ by Paul Glaysher, et al. ∙ CERN DESY

Training features used to analyse physical processes are often highly correlated, and determining which ones are most important for the classification is a non-trivial task. For the use case of a search for a top-quark pair produced in association with a Higgs boson decaying to bottom-quarks at the LHC, we compare feature ranking methods for a classification BDT. Ranking methods such as the BDT Selection Frequency, commonly used in High Energy Physics, and the Permutational Performance are compared with the computationally expensive Iterative Addition and Iterative Removal procedures. The latter was found to perform best.

1 Introduction

Many measurements and searches for new phenomena performed by the experiments at the Large Hadron Collider (LHC) use a Boosted Decision Tree (BDT) to discriminate the physics process of interest (signal) from other physics processes with a similar signature (background). The input variables (called features) to these BDTs are reconstructed from detector signals at different levels of sophistication, hence forming low-level and high-level features. The variables are usually chosen based on the understanding of the physical processes. The BDTs are typically trained by supervised learning on labelled events of signal and background processes simulated using the Monte Carlo (MC) technique. The resulting trained BDT is applied to unlabelled data to obtain measurement results.


Knowing the relative importance of the input variables, i.e. ranking the features, helps in various respects. Firstly, it allows unnecessary dimensionality to be reduced, which is particularly important when dealing with small training samples. This is often the case when machine learning algorithms are used to classify physics processes that are CPU expensive to simulate, so that only a limited sample size exists for training and testing. Reducing dimensionality also makes training faster. For example, the runtime complexity, i.e. the CPU time needed to construct a decision tree, scales linearly with the number of training variables witten . While this may still be manageable for BDTs, experience shows that the training time for other machine learning (ML) algorithms such as neural networks may increase significantly with the number of input variables used.

Feature ranking is also used as one way to gain insight into the underlying model of a physical process, i.e. the importance of the selected variables for the analysis. It supports analysis optimisation, for example by indicating which inputs most need their modelling validated. Often a potential training bias of the BDT response due to the particular MC generator used is estimated by using alternative MC simulations, which lead to a slightly modified BDT response that is then propagated into the uncertainty of the measurements. Feature ranking can lead to a better understanding of the source of this difference and help reduce the measurement uncertainties.

However, the question of which training features are most important for the classification may not have a unique answer, in particular when the input variables are highly correlated. Ranking variables to reduce dimensionality can be probed by training BDTs on a subset of variables, with algorithms optimised to find the optimal subset. The importance of a variable for a given BDT classification, in contrast, might better be probed by ranking algorithms that estimate the effect of single variables on the classification of the BDT trained with the full set of input variables.

This paper studies various existing and new algorithms to select the best variables to be used for training.

2 Input variables and set-up

The current study is inspired by the example of a classification BDT used in the search for top-quark-pair production in association with a Higgs boson (ttH) performed by the ATLAS experiment at the LHC ttHbb . This search was performed in the decay channel where the Higgs boson decays to a pair of bottom (b) quarks. The signal events contain one electron or muon, at least six jets and 4 b-quark jets. The dominant background is top-quark pair production in association with a b-quark pair from gluon splitting, which contains the same final state objects but with slightly different kinematic properties.

We use MC samples provided by the HepSim Group opendata . The ttH signal sample was generated with MadGraph Alwall:2011uj matched to the Herwig6 parton shower Corcella_2001 . Two background samples were generated: top-pair production with additional light quarks using MadGraph matched to Herwig6, and top-pair production with additional b-quarks using MadGraph matched to Pythia6 Sjostrand:2006za . The two background samples are orthogonal and are merged into one background sample with the different processes weighted by their cross section. The ATLAS detector response was simulated using the Delphes simulation delphes . For this study, reconstructed jets and b-quark jets (called b-jets in the following) are used. The reconstructed b-jets have a 70% tagging probability. The corresponding light jet/c-jet rejection probability is parameterised according to Aaboud:2018xwy .

Events selected for the BDT training were required to fulfil the following criteria:

  • one electron or muon with transverse momentum above 20 GeV

  • at least 5 jets with transverse momentum above 25 GeV

  • at least 3 b-jets.

After this selection 700 000 signal events and 275 000 background events remain. Two thirds of these events are used to train a BDT and one third to test it.

The choice of training variables is inspired by the reference analysis ttHbb , with a few additional variables and without those variables that could not easily be reconstructed from the available information. In total 26 input variables are considered, ranging from basic quantities like the angular distance between different jets or leptons, the mass of various jet and/or lepton systems and the scalar sum of the transverse momenta of jets and leptons, to the full event topology. The complete list of variables is given in Table 1. Figure 1 shows distributions of input variables in the signal and the background sample. The separation, defined as the integral over the absolute value of the difference between signal and background, varies between 1% and 8%. Figure 2 shows the correlation of the variables, ranging from almost no correlation to very high (anti-)correlation.

The TMVA tmva implementation of the BDT code is used with 400 trees, a maximal depth ("MaxDepth") of 5, the Ada boosting algorithm ("AdaBoostBeta=0.15") and 80 cuts ("nCuts=80").
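As an illustration, the settings quoted above could be collected into a TMVA-style option string as sketched below. The option names `NTrees` and `BoostType=AdaBoost` are assumptions inferred from "400 trees" and "Ada boosting algorithm"; only `MaxDepth`, `AdaBoostBeta` and `nCuts` are named in the text.

```python
# Hyperparameters quoted in the text. "NTrees" and "BoostType" are assumed
# option names; the other three are spelled as in the text.
bdt_options = {
    "NTrees": 400,
    "MaxDepth": 5,
    "BoostType": "AdaBoost",
    "AdaBoostBeta": 0.15,
    "nCuts": 80,
}

# TMVA-style option strings are colon-separated key=value pairs.
option_string = ":".join(f"{key}={value}" for key, value in bdt_options.items())
print(option_string)
```

Such a string would typically be passed as the option argument when booking the BDT method in TMVA. The booking call itself is not shown here because it depends on the ROOT version and analysis set-up.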

Variable          Description
dRbb_avg          average dR of all b-jet pairs
dRbb_MaxPt        dR of the b-jet pair with the highest sum of pT
dRbb_MaxM         dR of the b-jet pair with the highest invariant mass
dRlb1-dRlb3       dR of the charged lepton and the b-jet with the 1st-3rd largest pT
dRlbb_MindR       dR of the charged lepton and the total b-jet pair system which has the smallest dR
dRlj_MindR        minimum dR between the charged lepton and any jet
Mbb_MaxM          maximum invariant mass of any b-jet pair
Mbb_MindR         invariant mass of the b-jet pair which has the smallest dR
Mbj_MaxPt         invariant mass of the two jets with the largest pT sum, where exactly one of the jets is a b-jet
Mjjj_MaxPt        invariant mass of the three jets with the largest pT sum
pT_lep            transverse momentum of the charged lepton
HT_jets           sum of transverse momentum of all jets
HT_all            sum of transverse momentum of all jets and the charged lepton
nJets_Pt40        number of jets with pT > 40 GeV
nbTag             number of b-jets
nHiggsbb30        number of b-jet pairs with an invariant mass within 30 GeV of the Higgs boson mass of 125 GeV
MET               missing transverse energy
dEtajj_MaxdEta    largest difference in longitudinal angle of any two jets
Centrality_all    ratio of momentum sum over the energy sum of all objects
H_all, H2_jets    1st-5th Fox-Wolfram transverse moments fox of all objects
Table 1: Input variables used for the BDT.
Figure 1: Input variables to the BDT from signal (blue) and background (red) samples, for variable definition see text.
Figure 2: Matrix of linear correlation coefficients of the input variables to the BDT.

In the following, the feature rankings of the input variables obtained with different methods are compared. For each method, BDTs are trained on the set of the n highest ranked variables and the area under the receiver operating characteristic curve (AUROC) is taken as the performance measure for the comparison. For a given fixed number of variables, the difference in ranking may lead to a different subset of the full set of variables. However, for each method the list of variables used builds up sequentially, i.e. exactly one variable is added to the existing set when going from n to n+1 variables, which defines the ranking. Only for the random selection do the subsets for different numbers of variables not need to overlap, since for each n the selection is computed from all combinations for n ≤ 3, and from 1000 randomly selected trials for n > 3.
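For orientation, the AUROC can be read as the probability that a randomly chosen signal event receives a higher BDT score than a randomly chosen background event. A minimal sketch of that definition follows. The function name, the optional event weights and the pairwise implementation are choices made for this illustration, not taken from the analysis code.

```python
import numpy as np

def auroc(scores, labels, weights=None):
    """Weighted fraction of (signal, background) pairs ranked correctly.

    labels: 1/True for signal, 0/False for background. Ties count as half.
    Uses an n_signal x n_background comparison matrix, so it is only
    suitable for small illustrative samples.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if weights is None:
        weights = np.ones_like(scores)
    weights = np.asarray(weights, dtype=float)

    sig, w_sig = scores[labels], weights[labels]
    bkg, w_bkg = scores[~labels], weights[~labels]

    # 1 where signal scores higher, 0.5 for ties, 0 otherwise.
    wins = (sig[:, None] > bkg[None, :]) + 0.5 * (sig[:, None] == bkg[None, :])
    pair_weights = w_sig[:, None] * w_bkg[None, :]
    return float((wins * pair_weights).sum() / pair_weights.sum())

# Two signal and two background events; three of the four pairs are ordered correctly.
print(auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

A perfectly separating score gives 1.0 and a random score gives 0.5 on average, which is the scale on which the ranking methods are compared.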

3 Feature ranking algorithms

Different algorithms for ranking the importance of a feature (i.e. input variable) exist, and they differ widely in their methods. Some methods evaluate the variable importance by adding input variables to, or removing them from, a set of reference variables and measuring the change in BDT performance. Other methods estimate the importance for a given set of variables based on the information used in the training of the BDT trained on all features. The choice of method may also depend on the particular use case. The methods also differ widely in their computing needs, and some are computationally very expensive.

Random Selection: For the first three variables (n ≤ 3), all possible combinations are considered and the one with the best AUROC is selected ("maximum"). This corresponds to the best possible AUROC for the given number of variables. However, since the number of combinations rises quickly, for n > 3 a random selection out of all combinations is drawn 1000 times, to limit the CPU consumption of the BDT trainings. The median and the best AUROC of all trials are reported to serve as a reference.
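The growth in the number of combinations can be made concrete with a short sketch. The function name, the placeholder variable names and the fixed seed are hypothetical, and the switch from exhaustive to random subsets above three variables is an assumption based on the caption of Table 2.

```python
import itertools
import math
import random

def candidate_subsets(features, n, exhaustive_up_to=3, n_trials=1000, seed=0):
    """Return the variable subsets of size n that would be trained on.

    All combinations are enumerated for small n; otherwise n_trials random
    subsets are drawn (independently, so repeats are possible).
    """
    if n <= exhaustive_up_to:
        return list(itertools.combinations(features, n))
    rng = random.Random(seed)
    return [tuple(rng.sample(features, n)) for _ in range(n_trials)]

features = [f"var{i}" for i in range(26)]  # placeholder names for 26 inputs
for n in (1, 2, 3, 13):
    print(n, math.comb(len(features), n), len(candidate_subsets(features, n)))
```

For 26 variables there are 26, 325 and 2600 combinations of one, two and three variables, while for 13 variables there are more than ten million, which is why only a sample of 1000 is affordable there.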

Separation based: Variables are ranked by how little their signal and background distributions overlap, i.e. by the integral over the absolute value of their difference. This method does not involve any BDT training. For the comparisons presented here, the AUROC values are calculated from a BDT trained with the selected variables.
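A minimal sketch of such a separation-based ranking is given below, assuming unit-normalised histograms with a common binning. The function names, the number of bins and the absence of any extra normalisation factor are assumptions; a constant factor would not change the ranking.

```python
import numpy as np

def separation(signal, background, bins=50):
    """Integral of |s(x) - b(x)| for unit-normalised histograms s and b."""
    lo = min(np.min(signal), np.min(background))
    hi = max(np.max(signal), np.max(background))
    edges = np.linspace(lo, hi, bins + 1)
    s, _ = np.histogram(signal, bins=edges, density=True)
    b, _ = np.histogram(background, bins=edges, density=True)
    return float(np.sum(np.abs(s - b) * np.diff(edges)))

def rank_by_separation(signal_by_var, background_by_var, bins=50):
    """Order variable names from largest to smallest separation."""
    sep = {
        name: separation(signal_by_var[name], background_by_var[name], bins)
        for name in signal_by_var
    }
    return sorted(sep, key=sep.get, reverse=True)
```

With this convention identical distributions give 0 and fully disjoint distributions give 2. Each variable is treated on its own, so correlations between variables play no role in the ranking.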

Correlation based: The variables are ranked by their correlation with the BDT score computed with all variables. This is computationally cheap, as it involves only one BDT training with all variables.
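The idea can be sketched in a few lines, assuming that the linear (Pearson) correlation coefficient is meant and that its absolute value is used, so that strongly anti-correlated variables also rank highly. Both are assumptions of this illustration, as is the function name.

```python
import numpy as np

def rank_by_score_correlation(X, bdt_score, names):
    """Order variables by |Pearson correlation| with the full-model BDT score.

    X: array of shape (n_events, n_variables); bdt_score: array of shape (n_events,).
    """
    X = np.asarray(X, dtype=float)
    bdt_score = np.asarray(bdt_score, dtype=float)
    corr = [abs(np.corrcoef(X[:, j], bdt_score)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(corr)[::-1]  # largest correlation first
    return [names[j] for j in order]
```

Here `bdt_score` would come from the single BDT trained on all variables, so no further training is needed to obtain the ranking.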

BDT Selection Frequency: The BDT is trained on all variables and the variables are ranked by how often each one provided the optimal decision in the BDT tmva . This is computationally cheap, as it involves only one BDT training. It is the default ranking procedure implemented in the TMVA BDT code used here.
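The counting idea can be illustrated with a toy tree representation. The nested-dictionary format below is hypothetical and is not how TMVA stores its trees. As far as the TMVA documentation describes it, the built-in ranking also weights each split occurrence, whereas this sketch only counts.

```python
from collections import Counter

def count_splits(node, counts):
    """Recursively count which variable each internal node splits on."""
    if node is None:  # a leaf carries no split variable
        return
    counts[node["var"]] += 1
    count_splits(node.get("left"), counts)
    count_splits(node.get("right"), counts)

def rank_by_selection_frequency(trees):
    """Order variables by how often they are used for a split in the forest."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    return [name for name, _ in counts.most_common()]

# Toy forest of two trees: "a" is used twice, "b" once.
toy_forest = [
    {"var": "a", "left": {"var": "b", "left": None, "right": None}, "right": None},
    {"var": "a", "left": None, "right": None},
]
print(rank_by_selection_frequency(toy_forest))  # ['a', 'b']
```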

Permutation Importance: To avoid the high CPU cost of the iterative removal while still gaining insight into a black-box estimator for a given set of variables, this method calculates the feature importance by replacing a feature with random noise instead of removing it. The random noise is drawn from the same distribution as the original feature values, by taking the feature's values from other events. This avoids out-of-range values, which might cause the algorithm to fail, while the values are random and uncorrelated with the rest of the event permperf .
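A minimal sketch of this procedure for a generic trained model is shown below. The callables `predict` and `metric`, the number of repeats and the seed are assumptions of this illustration. In the setting of this paper `predict` would be the BDT trained on all variables and `metric` the AUROC.

```python
import numpy as np

def permutation_importance(predict, metric, X, y, n_repeats=5, seed=0):
    """Mean drop in the metric when one column of X is shuffled across events.

    predict: callable mapping an (n_events, n_variables) array to scores.
    metric:  callable (y_true, scores) -> float, larger is better.
    The model is never retrained; only its inputs are perturbed.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    baseline = metric(y, predict(X))
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Values stay within the feature's own distribution but are
            # decoupled from the other features of the same event.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - metric(y, predict(X_perm)))
        importance[j] = np.mean(drops)
    return importance
```

Ranking the variables by decreasing importance then requires only the one training of the full model plus repeated evaluations, which is what makes the method cheap compared with retraining for every candidate subset.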

Iterative Addition: The idea is to measure the importance by looking at how much the score increases when a feature is added. The method starts with the single input variable with the highest AUROC and successively adds, from the remaining variables, the one that yields the highest AUROC, as is done e.g. in Aaboud:2018psm . At each step this involves training a BDT for each candidate combination to determine the AUROC and find the best performance, so the total number of BDTs to be trained grows quadratically with the number of variables. However, this ignores potential correlations between the added variables. For example, two correlated variables might only provide separation power when both are present in the training.
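The greedy procedure can be sketched as follows. The callable `evaluate` is a hypothetical stand-in for "train a BDT on this list of variables and return its test AUROC".

```python
def iterative_addition(features, evaluate):
    """Greedy forward selection.

    evaluate: callable taking a list of variable names and returning a score
    (larger is better). Returns the ranking and the number of evaluations.
    """
    selected = []
    remaining = list(features)
    n_trainings = 0
    while remaining:
        scores = {}
        for candidate in remaining:
            scores[candidate] = evaluate(selected + [candidate])
            n_trainings += 1
        best = max(remaining, key=scores.get)  # first candidate wins ties
        selected.append(best)
        remaining.remove(best)
    return selected, n_trainings

# Toy score: each variable contributes independently.
toy_weights = {"a": 3.0, "b": 2.0, "c": 1.0}
ranking, n_trainings = iterative_addition(
    ["c", "a", "b"], lambda subset: sum(toy_weights[v] for v in subset)
)
print(ranking, n_trainings)  # ['a', 'b', 'c'] 6
```

In this sketch N variables require N + (N-1) + ... + 1 = N(N+1)/2 evaluations, i.e. 351 for 26 variables, under the assumption that every candidate at every step is trained once.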

Iterative Removal: The idea is to measure the feature importance by looking at how much the score decreases when a feature is removed. This way, correlations between variables are better taken into account than in the additive method. However, since the BDT is retrained on each reduced set of variables, the method shows what may be important within the dataset, not necessarily what is important within a concrete trained model.

This method starts with training on all variables and successively removes the variable whose removal degrades the performance the least. As for the iterative addition, this involves training a BDT for each of the candidate combinations to determine the AUROC and find the best performance, leading in total to the same number of trainings as for the iterative addition. However, since the method starts with a larger number of variables in the BDT, the overall CPU consumption is even higher than for the iterative addition. The code is publicly available code .
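For illustration only, the backward elimination can be sketched as below. This is not the published code referred to in the text. The callable `evaluate` is a hypothetical stand-in for "train a BDT on this list of variables and return its test AUROC", and counting one initial training on the full set is a choice made in this sketch.

```python
def iterative_removal(features, evaluate):
    """Greedy backward elimination.

    At each step the variable whose removal leaves the highest score is
    dropped. Returns the ranking (most important first) and the number of
    evaluations.
    """
    remaining = list(features)
    removed = []
    evaluate(remaining)  # reference training on the full set
    n_trainings = 1
    while len(remaining) > 1:
        scores = {}
        for candidate in remaining:
            subset = [v for v in remaining if v != candidate]
            scores[candidate] = evaluate(subset)
            n_trainings += 1
        # Highest score without the candidate means the candidate matters least.
        least_important = max(remaining, key=scores.get)
        remaining.remove(least_important)
        removed.append(least_important)
    return remaining + removed[::-1], n_trainings

# Toy score: each variable contributes independently.
toy_weights = {"a": 3.0, "b": 2.0, "c": 1.0}
ranking, n_trainings = iterative_removal(
    ["c", "a", "b"], lambda subset: sum(toy_weights[v] for v in subset)
)
print(ranking, n_trainings)  # ['a', 'b', 'c'] 6
```

With this counting, 26 variables again require 351 evaluations, but most of them are performed on large variable sets, which is where the additional CPU cost comes from.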

4 Results

Figure 3: Comparison of the performance of different feature ranking algorithms. The area under the receiver operating characteristic curve (AUROC) is shown as a function of the number of variables. For details of the selected number of training variables, see text.
Rank  Iterative Removal  Permutation Importance  BDT Selection Frequency  Best
1     dRbb_avg           dRbb_avg                dRbb_avg                 dRbb_avg
2     HT_jets            HT_all                  Mbb_MaxM                 Mbb_MaxM
3     nHiggsbb30         H0_all                  HT_jets                  nHiggsbb30
4     Mbb_MaxM           Mbb_MindR               H0_all                   -
5     nbTag              dRlj_MindR              nJets_Pt40               -
6     Mbb_MindR          Mbb_MaxM                dRlb2                    -
7     dRlb3              Centrality_all          Mjjj_MaxPt               -
8     H2_jets            Mbj_MaxPt               pT_lep                   -
9     H0_all             HT_jets                 dEtajj_MaxdEta           -
10    Mjjj_MaxPt         H2_jets                 dRlb1                    -
Table 2: Highest ranked variables for Iterative Removal, Permutation Importance, BDT Selection Frequency and the best combination for up to 3 input variables. Even though the best combination is determined by considering all combinations for each number of variables, the resulting best combination included the previously ranked variables.

All algorithms show similar performance if 24 out of 26 variables are used. However, the different algorithms approach this plateau of maximal AUROC differently as a function of the number of variables. There are two groups of algorithms with similar performance. The first group consists of the Iterative Removal, the Iterative Addition, the BDT Selection Frequency and the Permutation Importance. These algorithms start with a higher AUROC and approach the plateau faster. Among these, the newly proposed Iterative Removal performs best over the full range and reaches 99% of the plateau already with 12 variables. This is better than the similar Iterative Addition algorithm and the Permutation Importance, which reach this performance only with 16 variables, or the BDT Selection Frequency, which needs 17 variables. The largest differences between these methods are observed between 5 and 16 variables. The better performance of the Iterative Removal comes at a high CPU cost, and it is interesting to note that the Permutation Importance, which is computationally cheap, has the second-best performance overall and yields results similar to those of the Iterative Removal for more than 16 variables. The BDT Selection Frequency, which is also computationally cheap, is only slightly worse than the Permutation Importance.

The second group consists of the median of the Random Selection and the separation based and correlation based selections, which start with a low AUROC and only slowly approach the plateau at 24 variables. Among these algorithms, the separation based selection is the poorest ranking method over the full range, as might be expected since this method ignores the correlation between the variables. The correlation based selection outperforms the random choice when approaching the plateau and has similar performance at a low number of variables.

The maximum of the Random Selection apparently depends strongly on the randomly selected variables for 4 up to approximately 18 variables. 1000 trials are not enough to approximate the best result, and the dependence on the particular selection of variables is still large, hence more variables sometimes yield a smaller AUROC. The better reference is the median, which shows a steadily rising AUROC. It is interesting to note that the separation based selection is lower than the median of the Random Selection for almost all numbers of variables below n = 15, indicating that, contrary to intuition, separation is not a good quantity for feature ranking.

When selecting the highest ranked variables to reduce dimensionality, it is interesting to know how much the different ranking algorithms overlap. Table 2 lists the highest ranked variables for the best performing algorithms. It is worth noting that they all agree on the highest ranked variable, but differ widely in the order of the remaining variables. The variations persist in the best 5 variables, but the Iterative Removal and the Permutation Importance have 6 variables in common among the best 10. Two out of the 4 variables that were not included in the Permutation Importance list have high linear correlation coefficients with variables that were in the Iterative Removal list, so that effectively there is a large overlap for 8 out of 10 variables. Similar conclusions hold for the comparison of the BDT Selection Frequency with the Iterative Removal.
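The quoted overlap can be checked directly against the two columns of Table 2. The sketch below uses the variable names of Table 1 and assumes that the rank-1 and rank-6 entries of the Iterative Removal column denote dRbb_avg and Mbb_MindR. The helper name `overlap` is hypothetical.

```python
# Top-10 lists transcribed from Table 2, with names spelled as in Table 1
# (assumption: the rank-6 Iterative Removal entry is Mbb_MindR).
iterative_removal_top10 = [
    "dRbb_avg", "HT_jets", "nHiggsbb30", "Mbb_MaxM", "nbTag",
    "Mbb_MindR", "dRlb3", "H2_jets", "H0_all", "Mjjj_MaxPt",
]
permutation_importance_top10 = [
    "dRbb_avg", "HT_all", "H0_all", "Mbb_MindR", "dRlj_MindR",
    "Mbb_MaxM", "Centrality_all", "Mbj_MaxPt", "HT_jets", "H2_jets",
]

def overlap(ranking_a, ranking_b, k):
    """Variables that appear in the top k of both rankings."""
    return sorted(set(ranking_a[:k]) & set(ranking_b[:k]))

print(len(overlap(iterative_removal_top10, permutation_importance_top10, 10)))  # 6
print(overlap(iterative_removal_top10, permutation_importance_top10, 5))  # ['dRbb_avg']
```

Under that reading, the two methods share 6 of their best 10 variables but only the top-ranked variable among their best 5.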

5 Conclusion

Different methods for ranking input variables for BDT classification were compared. The computationally most expensive method, Iterative Removal, showed the best classification power measured in terms of AUROC. However, when selecting 16 out of 26 variables, other methods such as Permutation Importance and BDT Selection Frequency, which are computationally very cheap, give very similar results. Interestingly, these methods select the same variable as being best. Differences in performance between 2 and 16 variables for the 3 methods are of the order of 1-2% in AUROC for the same number of variables and should be weighed against the computational costs for the specific use case. The variables selected as the best 10 agree to a large extent between Iterative Removal and the cheaper methods, which may allow dimensionality to be reduced with a less CPU expensive method. But the best 5 variables vary widely, which will make it difficult to draw firm conclusions on the impact of the first few variables on the classification power.

References