1 Introduction
XGBoost is an advanced gradient-tree-boosting-based software package that can efficiently handle large-scale Machine Learning tasks[1]. Owing to its superior performance and affordable time and memory complexity, it has been widely applied in a variety of research fields since it was proposed, ranging from cancer diagnosis[2] and medical record analysis[3] to credit risk assessment[4] and metagenomics[5]. Also, because of its easy-to-use Python interface and explainable nature, it has become the de facto method of first choice for a majority of data science problems. For instance, in the famous Kaggle competitions (https://www.kaggle.com/competitions), many winning teams built their models on XGBoost and expressed positive views on the method and its variations[1][6][7]. It can be tentatively predicted that XGBoost and its variations will remain among the most-applied methods in the data science community in the near future.
On the other hand, although XGBoost has achieved considerable success on both regression and classification problems, its performance often becomes less clear-cut under label-imbalanced classification. There have been mixed reports on the capability of XGBoost in handling label-imbalanced data. For example, (Zhao et al., 2018)[8]
demonstrated through their experiments that XGBoost can outperform other methods on skewed data sets, while the figures in (Luo et al., 2018)[9] suggested that vanilla XGBoost must be combined with other ensembling methods to achieve satisfactory results. Notably, XGBoost is not designed for label-imbalanced data, and to be fair, most 'vanilla' Machine Learning algorithms suffer a performance decline when the ratio between labels becomes skewed. However, given the popularity of XGBoost and the fact that label-skewed data is, unfortunately, commonly encountered in practice, this performance decay still has a significant negative effect on related research and applications.
This paper introduces Imbalance-XGBoost, an XGBoost-based Python package that addresses the label-imbalance issue in the binary-label regime by implementing weighted cross-entropy and focal losses on the boosting machine. Weighted cross-entropy loss is one of the simplest algorithm-level cost-sensitive methods[10] for learning from imbalanced data. It follows the straightforward idea of increasing the penalty for misclassifying certain classes, and it has been widely applied to adapt vanilla machine learning algorithms to the label-imbalanced domain[11]. In contrast, focal loss[12] is a relatively novel method that originated from research in object detection. The idea of the method is to add a factor $(1-\hat{y})^{\gamma}$ to the cross-entropy function (where $\hat{y}$ is the prediction of $y$), which reduces the importance of the well-classified data points. Compared with weighted cross-entropy, focal loss enjoys a more robust parameter configuration, as the method will work in our favor as long as $\gamma > 0$.
To the best of the authors' knowledge, no significant publication has previously discussed the implementation of the two losses on XGBoost. Existing studies on XGBoost under label-imbalanced scenarios usually adopt data-level approaches such as resampling[13] and/or cost-sensitive losses with non-trainable a priori modifications[14]. (Chen et al., 2017)[15] mentioned weighted XGBoost in their work, but did not present implementation details. A major challenge in applying the two loss functions to XGBoost is that, to approximate the incremental learning objective, the first- and second-order derivatives of the loss function with respect to the predictions must be provided (see section 3 for details). An algebraic contribution of this paper is the derivation and implementation of the derivatives that enable the two losses to be run with XGBoost.
The package is written in Python with hard dependencies on XGBoost, Numpy[16], and Scikit-learn[17]. The losses are integrated into XGBoost through the software's customized-loss framework, given the derived expressions of the derivatives. Since the major methods in the program are supplied by the dependencies, the core of the package is small, with only a few hundred lines of Python code. Nevertheless, the derivation and implementation of the derivatives and their significance in practical applications make the work non-trivial. The main class (containing the methods) is designed as a child class of BaseEstimator and ClassifierMixin of Scikit-learn, which enables most data science methods in Scikit-learn to be applied to the corresponding object with trivial effort. The software has been released on Github (https://github.com/jhwjhw0123/ImbalanceXGBoost) and PyPI (https://pypi.org/project/imbalancexgboost/), and it has started to attract users, with a considerable number of downloads.
The rest of the paper is organized as follows. Section 2 introduces the package from the perspectives of toolkit designers and users; section 3 provides the theoretical foundation of the second-order approximation of gradient boosting trees used in XGBoost and the first- and second-order derivatives of the two losses; related studies are surveyed and discussed in section 4; the performance of the package is empirically examined on Parkinson's disease diagnosis data in section 5; and finally, section 6 concludes the paper.
2 Design and Usage of Imbalance-XGBoost
2.1 Code Design
Though XGBoost has implementations in multiple languages, Python is chosen as the language of choice for its wide recognition and application in data science. The code follows the PEP8 standard, and the project is open-source, with the code hosted on its Github page. The authors strive to keep the program consistent with 'standard' practices in Python-based data science, as this makes it easier for users to become familiar with the package and to integrate it into their own projects. The input data is expected to be a Numpy array[16], but with an explicit np.array() conversion, data types compatible with Numpy arrays (e.g. Pandas DataFrame[18]) will also work with the package. As the project is small, its usage is fully presented in the Readme file, and no additional documentation is required.
The overall program consists of three classes: one main class, imbalance_xgboost, which contains the methods the users will apply, and two customized-loss classes, Weight_Binary_Cross_Entropy and Focal_Binary_Loss, on which the imbalanced losses are based. The loss functions are designed as separate classes for the convenience of parameter tuning, and they are not supposed to be called by the users. When an imbalance_xgboost object is initialized, the keyword special_objective is recorded by the __init__() method. Then, when the fit() function is executed, the corresponding loss function object is instantiated with the given parameter, and the built-in xgb.train() method of XGBoost fits the model based on the customized loss function. Figure 1 illustrates the overall structure of the program.
Listing 1 demonstrates a sample usage of the package to fit a dataset without parameter tuning. It can be observed from the listing that the type of XGBoost is specified during the instantiation of the object, while parameters are fed when calling the fitting function. The fitting function also has an exception handling mechanism: if the corresponding parameter ($\alpha$ or $\gamma$) is not provided for a specific type of special_objective, a ValueError will be raised with the information that the essential parameter is missing.
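The parameter check described above can be sketched as follows. This is a minimal illustration rather than the package's own code: the function name resolve_loss_parameter is hypothetical, and the keyword names imbalance_alpha and focal_gamma are assumed names for $\alpha$ and $\gamma$.

```python
# Hypothetical sketch: these names are assumptions, not the package source.
REQUIRED_PARAMETER = {
    "weighted": "imbalance_alpha",  # assumed keyword for alpha
    "focal": "focal_gamma",         # assumed keyword for gamma
}


def resolve_loss_parameter(special_objective, **params):
    """Return (name, value) of the parameter the chosen objective needs.

    None stands for vanilla XGBoost, which needs no extra parameter.
    """
    if special_objective is None:
        return None
    if special_objective not in REQUIRED_PARAMETER:
        raise ValueError(f"unknown special_objective: {special_objective!r}")
    name = REQUIRED_PARAMETER[special_objective]
    if params.get(name) is None:
        # Mirrors the behaviour described in the text: fail early and
        # tell the user which essential parameter is missing.
        raise ValueError(
            f"special_objective={special_objective!r} requires parameter {name!r}"
        )
    return name, float(params[name])
```

Under these assumptions, resolve_loss_parameter("focal", focal_gamma=2) returns the pair ("focal_gamma", 2.0), while resolve_loss_parameter("weighted") raises a ValueError.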
As stated before, the package is designed to be an estimator class of the Scikit-learn toolkit. This scheme enables the model and parameter selection methods in Scikit-learn, such as GridSearchCV() and RandomizedSearchCV(), to be directly applied to find the best parameters for imbalanced XGBoosts. In practical data science tasks, this feature is crucial, as the optimal models rely heavily on parameter tuning and selection. Also, an estimator in Scikit-learn can be combined with other estimators (transformers) by integrating them into a Pipeline object[17]. This allows the weighted- and focal-XGBoost to be easily combined with other preprocessing methods, such as resampling, to produce more robust results. Section 2.2 provides more details on how to tune parameters and perform cross-validation with Scikit-learn.
Table 1 lists the major methods/functions of the package. They can be categorized into three groups: model fitting, model prediction, and evaluation scores. The 'basic' methods are formed by overriding functions in Scikit-learn estimators (e.g. fit()), and some methods for extensions and variations are named in a 'literal' style (e.g. predict_sigmoid()). To offer a 'downward compatible' solution, the package also allows users to call vanilla XGBoost by not specifying the objective function. The output of the prediction method, by default, is the 'raw logits' without being processed by the Sigmoid function. Thus, the evaluation functions have been modified accordingly. Multiple evaluation functions are provided for convenient evaluation, and details of the evaluation functions are provided in section
2.3.
|  | Function | Description |
| --- | --- | --- |
| Model Fitting | fit | train the XGBoost model |
| Model Prediction | predict | predict the raw logits (without Sigmoid transformation) |
|  | predict_sigmoid | predict the Sigmoid output |
|  | predict_determine | predict the label (0 or 1) |
|  | predict_two_classes | predict the label with one-hot encoding |
| Evaluation Score | score | overriding accuracy score |
|  | score_eval_func | flexible evaluation score with multiple metrics |
|  | correct_eval_func | collecting prediction correctness for cross-validation |
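As an illustration of how the prediction variants in Table 1 relate to one another, the sketch below post-processes raw logits in plain Python. The helper names (to_probability, to_label, to_one_hot) and the 0.5 threshold are assumptions made for this example; they are not the package's implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_probability(raw_logits):
    """Raw logits -> Sigmoid outputs in (0, 1)."""
    return [sigmoid(z) for z in raw_logits]


def to_label(raw_logits, threshold=0.5):
    """Raw logits -> hard labels (0 or 1); the threshold is an assumption."""
    return [1 if sigmoid(z) > threshold else 0 for z in raw_logits]


def to_one_hot(raw_logits, threshold=0.5):
    """Raw logits -> one-hot rows [is_class_0, is_class_1]."""
    return [[1 - label, label] for label in to_label(raw_logits, threshold)]
```

Because the default output is the raw logit, any accuracy-style evaluation has to apply the sigmoid and the threshold first, which is why the evaluation functions wrap these steps.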
2.2 Model Optimization and Evaluation with Scikit-learn
Listing 2 illustrates an example of parameter tuning and cross-validation with Scikit-learn for Imbalance-XGBoost. As with common classifiers in Scikit-learn, the best classifier/parameter can be obtained through exhaustive or random search with functions such as GridSearchCV(). Note that after fitting the model, it is possible to retrieve the 'plain' booster by accessing the member opt_booster.booster, and the object will be an XGBoost object (instead of an Imbalance-XGBoost object). This makes it possible for the user to train the model on a machine where Imbalance-XGBoost is available, save the model as 'plain' XGBoost, and run it on another machine where only the original XGBoost package is installed.
The cross-validation evaluation part is, like the parameter tuning, very similar to that of an 'ordinary' classifier, and most of the usage guidelines can be found in the documentation of Scikit-learn. The only point to notice is that listing 2 actually combines parameter selection with cross-validation evaluation, and a new booster is instantiated after the best parameters are obtained. The reason for not using the optimal booster provided by GridSearchCV() is that one wants the XGBoost to be trained from a randomized state to ensure a fair evaluation.
2.3 Built-in Evaluation Score
As one can observe in table 1, there are three evaluation functions in the package. The overriding score() function evaluates prediction accuracy directly from the default prediction format, i.e. pre-sigmoid values (in the range $(-\infty, +\infty)$), by wrapping the sigmoid transformation and accuracy checking together. In comparison, score_eval_func() is the method that returns metrics other than accuracy. In label-imbalanced binary classification, accuracy cannot reliably reveal the performance quality on its own, as the metric can be 'tricked' by predicting all the instances as the majority class. This type of prediction leads to high accuracy, yet the classifier has actually learned nothing. Thus, metrics that take 'preciseness' into account, such as precision, recall, $F_1$ score and Matthews Correlation Coefficient (MCC)[19], are often applied in this scenario[20]. Function score_eval_func() provides implementations of all the metrics mentioned above, and the metric is selected by fixing the partial argument 'mode' (which can be accomplished with functools.partial()).
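The weakness of accuracy on skewed labels can be made concrete with a small sketch. The code below is an illustration in plain Python, not the package's score_eval_func(); the function names are hypothetical, and undefined ratios are mapped to 0 by convention.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# A classifier that always predicts the majority class (label 0)
# on a 90:10 data set: high accuracy, but zero recall, F1 and MCC.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
majority_only = metrics(*confusion_counts(y_true, y_pred))
```

Here majority_only has an accuracy of 0.9 while recall, $F_1$ score and MCC are all 0, which is the behavior the additional metrics are meant to expose.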
In the cases of leave-one-out/leave-few-out cross validation, any metric other than accuracy will likely become ill-defined. For example, for the precision metric in leave-one-out cross validation, if the prediction is 0 for the single instance, then it is meaningless to compute the 'precision'. In such a situation, one would instead wish to collect the classification correctness of each prediction, sum up the evaluations, and compute the metrics from the confusion matrix accumulated over every 'test instance'. To make this possible, the package provides the function correct_eval_func. The function is configured through the 'mode' argument, and the four choices TP, FP, TN and FN represent True Positive, False Positive, True Negative and False Negative, respectively. Note that the four modes should be used together to produce a complete confusion matrix; a wrapper that combines them into one function can be a future extension of the package.
3 Theories and Derivatives
In this section, the mathematical foundations and derivations of the loss functions are discussed. At a high level, since XGBoost adopts an additive learning scheme with a second-order approximation, the first-order derivative (abbreviated as 'gradient') and second-order derivative (denoted as 'hessian', although this is somewhat of a misnomer) of the loss functions with respect to the prediction are required for fitting the model. To make the mechanism clear, section 3.1 first reviews the second-order approximation of additive tree boosting. Subsequently, the derivations of the gradients and hessians of the weighted and focal losses are discussed in sections 3.2 and 3.3, respectively.
The notations used in this section are as follows. We use $m$ to denote the number of data points and $n$ the number of features. The 'raw prediction' before the sigmoid function is denoted as $z$, and the probabilistic prediction is $\hat{y} = \sigma(z)$, where $\sigma(\cdot)$ represents the sigmoid function. It is important to keep in mind that there is a discrepancy between the notations of this paper and the original XGBoost literature[1], as the $\hat{y}$ in their analysis is denoted as $z$ here. $y$ denotes the true label, and $\alpha$ and $\gamma$ are the parameters of the weighted and focal loss functions, respectively. The expressions of the gradients/hessians are written in a merged format independent of the value of $y_i$, as this simplifies the program implementation and helps vectorization in other related programs.
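3.1 Second-order Approximation of Gradient Boosting Tree
According to [1], the additive learning objective used in practice is:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{m} l\left(y_i, z_i^{(t-1)} + f_t(\mathbf{x}_i)\right) + \Omega(f_t) \tag{1}$$
where $t$ denotes the $t$-th iteration of the training process. Notice that the replacement of the notations has been applied in the equation. Applying second-order Taylor expansion to equation 1, one gets:
$$\begin{aligned} \mathcal{L}^{(t)} &\approx \sum_{i=1}^{m} \left[ l\left(y_i, z_i^{(t-1)}\right) + g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^{2}(\mathbf{x}_i) \right] + \Omega(f_t) \\ \tilde{\mathcal{L}}^{(t)} &= \sum_{i=1}^{m} \left[ g_i f_t(\mathbf{x}_i) + \frac{1}{2} h_i f_t^{2}(\mathbf{x}_i) \right] + \Omega(f_t) \end{aligned} \tag{2}$$
The last line comes from the fact that the term $l(y_i, z_i^{(t-1)})$ can be removed from the learning objective, as it is unrelated to the fitting of the model in the $t$-th iteration. In equation 2, there are $g_i = \partial l / \partial z_i^{(t-1)}$ and $h_i = \partial^{2} l / \partial (z_i^{(t-1)})^{2}$, which are the 'gradient' and 'hessian' terms mentioned before. Notice that both $g_i$ and $h_i$ are scalars, as individual boosting trees only deal with binary problems. Multi-class classification tasks are usually processed by an ensemble of binary classification trees (the so-called one-vs-all scheme)[21][22]. This is also the reason why the authors consider the terms somewhat of a misnomer.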
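The quality of the second-order approximation can be illustrated numerically. The sketch below uses the ordinary (unweighted) cross-entropy loss as a stand-in example, with the well-known gradient $\hat{y} - y$ and hessian $\hat{y}(1-\hat{y})$; the function names are chosen for this illustration only.

```python
import math


def sigmoid(z):
    """Logistic function (adequate for the moderate z used here)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, z):
    """Binary cross-entropy of label y at raw prediction z."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def grad_hess(y, z):
    """First and second derivative of log_loss with respect to z."""
    p = sigmoid(z)
    return p - y, p * (1.0 - p)


def second_order_estimate(y, z_prev, step):
    """Taylor estimate of the loss after adding `step` to z_prev.

    `step` plays the role of the new tree's output f_t(x_i).
    """
    g, h = grad_hess(y, z_prev)
    return log_loss(y, z_prev) + g * step + 0.5 * h * step ** 2


exact = log_loss(1, 0.2 + 0.1)
approx = second_order_estimate(1, 0.2, 0.1)
```

For a small step such as 0.1, exact and approx differ only by the third-order remainder, which is why supplying the pair (gradient, hessian) is sufficient for the boosting update.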
Since XGBoost does not provide automatic differentiation, hand-derived derivatives are essential. Meanwhile, the derived expressions have further potential to be applied to other machine learning tasks. Therefore, the derivatives are discussed in sections 3.2 and 3.3. For both loss functions, sigmoid is selected as the activation, and the following basic property of sigmoid is used consistently in the derivations:
$$\frac{\partial \hat{y}}{\partial z} = \frac{\partial \sigma(z)}{\partial z} = \sigma(z)\left(1 - \sigma(z)\right) = \hat{y}(1 - \hat{y}) \tag{3}$$
3.2 Weighted Cross-entropy Loss
The weighted cross-entropy loss for binary classification can be denoted as follows:
$$L_w = -\sum_{i=1}^{m} \left( \alpha\, y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right) \tag{4}$$
where $\alpha$ indicates the 'imbalance parameter'. Intuitively, if $\alpha$ is greater than 1, extra loss is incurred for 'classifying 1 as 0'; on the other hand, if $\alpha$ is less than 1, the loss function puts relatively more weight on whether data points with label 0 are correctly identified.
The first-order derivative is as follows:
$$\frac{\partial L_w}{\partial z_i} = -\alpha^{y_i} (y_i - \hat{y}_i) \tag{5}$$
The derivative is similar to that of the ordinary cross-entropy loss. A notable difference is that a term $\alpha^{y_i}$ is added to control the presence of the parameter.
Taking the derivative with respect to $z_i$ again and plugging in equation 3, one gets the second-order derivative:
$$\frac{\partial^{2} L_w}{\partial z_i^{2}} = \alpha^{y_i}\, \hat{y}_i (1 - \hat{y}_i) \tag{6}$$
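The derivatives of the weighted loss can be checked numerically. The sketch below writes the per-sample loss $-(\alpha\, y \log\hat{y} + (1-y)\log(1-\hat{y}))$, a merged gradient $-\alpha^{y}(y-\hat{y})$ and a merged hessian $\alpha^{y}\hat{y}(1-\hat{y})$ obtained by differentiating that loss directly, and compares them with central finite differences. The function names are hypothetical and this is not the package's Weight_Binary_Cross_Entropy class.

```python
import math


def sigmoid(z):
    """Logistic function (adequate for the moderate z used here)."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_loss(y, z, alpha):
    """Per-sample weighted cross-entropy at raw prediction z."""
    p = sigmoid(z)
    return -(alpha * y * math.log(p) + (1 - y) * math.log(1 - p))


def weighted_grad(y, z, alpha):
    """Merged first derivative: alpha only acts when y == 1."""
    p = sigmoid(z)
    return -(alpha ** y) * (y - p)


def weighted_hess(y, z, alpha):
    """Merged second derivative, using d p / d z = p (1 - p)."""
    p = sigmoid(z)
    return (alpha ** y) * p * (1.0 - p)


def central_difference(f, z, eps=1e-5):
    """Numerical derivative of a scalar function f at z."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)
```

With $\alpha = 3$, for instance, the gradient of a positive sample is three times that of ordinary cross-entropy, while the gradient of a negative sample is unchanged, which is the intuition stated for the imbalance parameter.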
3.3 Focal Loss
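According to [12], the binary focal loss can be denoted as:
$$L_f = -\sum_{i=1}^{m} \left( y_i (1 - \hat{y}_i)^{\gamma} \log(\hat{y}_i) + (1 - y_i)\, \hat{y}_i^{\gamma} \log(1 - \hat{y}_i) \right) \tag{7}$$
As one can observe, if one sets $\gamma = 0$, the equation becomes the ordinary cross-entropy loss. Taking equation 3 into consideration, the first derivative of the focal loss can be denoted as: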
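$$\frac{\partial L_f}{\partial z_i} = \gamma \left[ (\hat{y}_i + y_i - 1) \left( y_i + (-1)^{y_i} \hat{y}_i \right)^{\gamma} \log\left( 1 - y_i - (-1)^{y_i} \hat{y}_i \right) \right] + (-1)^{y_i} \left( y_i + (-1)^{y_i} \hat{y}_i \right)^{\gamma + 1} \tag{8}$$
And if $\gamma$ is set to 0 in equation 8, the derivative is the same as that of the cross-entropy loss. The equation follows a clear structure, but it is still lengthy. To simplify the expression, one can set the following shorthand variables:
$$\eta_1 = \hat{y}_i (1 - \hat{y}_i), \quad \eta_2 = y_i + (-1)^{y_i} \hat{y}_i, \quad \eta_3 = \hat{y}_i + y_i - 1, \quad \eta_4 = 1 - y_i - (-1)^{y_i} \hat{y}_i \tag{9}$$
Plugging these representations into equation 8, the first-order derivative can be denoted as:
$$\frac{\partial L_f}{\partial z_i} = \gamma\, \eta_3\, \eta_2^{\gamma} \log(\eta_4) + (-1)^{y_i} \eta_2^{\gamma + 1} \tag{10}$$
Finally, taking the derivative with respect to $z_i$, and combining with equations 3 and 10, one gets the second-order derivative ('hessian'), which can be denoted as:
$$\frac{\partial^{2} L_f}{\partial z_i^{2}} = \eta_1 \left\{ \gamma \left[ \left( \eta_2^{\gamma} + \gamma (-1)^{y_i} \eta_3\, \eta_2^{\gamma - 1} \right) \log(\eta_4) - \frac{(-1)^{y_i} \eta_3\, \eta_2^{\gamma}}{\eta_4} \right] + (\gamma + 1)\, \eta_2^{\gamma} \right\} \tag{11}$$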
Again, if $\gamma = 0$, the second-order derivative becomes $\hat{y}_i (1 - \hat{y}_i)$, which matches the formula of the second-order derivative of ordinary cross-entropy.
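A merged form of the focal-loss derivatives can be checked in the same numerical way. The sketch below differentiates the per-sample focal loss $-(y(1-\hat{y})^{\gamma}\log\hat{y} + (1-y)\hat{y}^{\gamma}\log(1-\hat{y}))$ directly; the shorthand names eta1 to eta4 and the function names are chosen for this illustration, and the expressions are one algebraically valid merged form rather than a copy of the package's Focal_Binary_Loss code.

```python
import math


def sigmoid(z):
    """Logistic function (adequate for the moderate z used here)."""
    return 1.0 / (1.0 + math.exp(-z))


def focal_loss(y, z, gamma):
    """Per-sample binary focal loss at raw prediction z."""
    p = sigmoid(z)
    return -(y * (1 - p) ** gamma * math.log(p)
             + (1 - y) * p ** gamma * math.log(1 - p))


def _shorthands(y, p):
    sign = (-1) ** y           # -1 when y == 1, +1 when y == 0
    eta1 = p * (1 - p)         # derivative of the sigmoid
    eta2 = y + sign * p        # 1 - p when y == 1, p when y == 0
    eta3 = p + y - 1           # p when y == 1, p - 1 when y == 0
    eta4 = 1 - y - sign * p    # p when y == 1, 1 - p when y == 0
    return sign, eta1, eta2, eta3, eta4


def focal_grad(y, z, gamma):
    """Merged first derivative of focal_loss with respect to z."""
    sign, _, eta2, eta3, eta4 = _shorthands(y, sigmoid(z))
    return gamma * eta3 * eta2 ** gamma * math.log(eta4) + sign * eta2 ** (gamma + 1)


def focal_hess(y, z, gamma):
    """Merged second derivative of focal_loss with respect to z."""
    sign, eta1, eta2, eta3, eta4 = _shorthands(y, sigmoid(z))
    log_part = (eta2 ** gamma
                + gamma * sign * eta3 * eta2 ** (gamma - 1)) * math.log(eta4)
    frac_part = sign * eta3 * eta2 ** gamma / eta4
    return eta1 * (gamma * (log_part - frac_part) + (gamma + 1) * eta2 ** gamma)


def central_difference(f, z, eps=1e-5):
    """Numerical derivative of a scalar function f at z."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)
```

Setting gamma to 0 in these functions recovers $\hat{y} - y$ and $\hat{y}(1-\hat{y})$, the gradient and hessian of ordinary cross-entropy, which is a convenient sanity check for any implementation.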
4 Related Work
The paper is built on the foundation of the original papers of XGBoost[1] and focal loss[12], and the methodology for programming a customized loss function is provided on the software's Github page (https://github.com/dmlc/xgboost/blob/master/demo/guidepython/custom_objective.py). XGBoost is based on the gradient tree boosting algorithm[23], which had been regarded as a powerful Machine Learning technique long before XGBoost was born[24]. Besides XGBoost, there are other implementations of gradient boosting, such as pGBRT[25], LightGBM[26], and CatBoost[27]. Some of these implementations have additional features and are able to outperform XGBoost on specific problems, but XGBoost remains the method of first choice in the data science community at large. As for the recently proposed focal loss, related studies are usually affiliated with Neural Networks and Deep Learning[28][29][30]. The loss function is usually applied in an end-to-end manner with automatic differentiation, and to the best of the authors' knowledge, there has not been any notable publication comprehensively discussing the derivatives of the loss function (although the first-order derivative was briefly discussed and presented in another form in the original paper[12]).
Previous applications of XGBoost in label-imbalanced scenarios focus mostly on data-level algorithms. For example, (Kabir et al., 2018)[13] applied several commonly-used data resampling methods before using XGBoost for label-imbalanced breast cancer classification, and (He et al., 2018)[31] utilized a more advanced undersampling method called BalanceCascade[32] with XGBoost for credit scoring. Among the limited number of publications discussing algorithm-level modifications of XGBoost for imbalanced classification, (Xia et al., 2017)[14] used an a priori modification of the sigmoid activation to achieve a better result, but the loss function was unchanged. As mentioned in section 1, (Chen et al., 2017)[15] is, to the best of the authors' knowledge, so far the only work that explicitly applied a weighted function to XGBoost. It is worth noting that a Tensorflow-based gradient boosting implementation called Tf Boosted Trees[33] is able to run with the loss functions without the derivatives provided in this paper, as it has an automatic differentiation mechanism. Nevertheless, it is a less popular package without support for large-scale Machine Learning or compatibility with the Scikit-learn toolkit.
As an issue frequently encountered in practice, label-imbalanced classification has been intensively studied by researchers, and there are multiple existing software programs designed to handle the problem. For example, (Lemaitre et al., 2017)[34] provides an integrated Python package called Imbalanced-learn for data-level resampling in imbalanced classification, and it has counterparts in other programming languages, such as ROSE in R[35]. It is worth noting that the Imbalanced-learn package can be considered an extension of Scikit-learn, and the Machine Learning toolkit itself also provides elementary methods to deal with label-imbalanced problems[17]. Other software programs concerning label-imbalanced classification include popular Data Mining toolkits, such as KEEL[36] and WEKA[37]. In addition, (Zhang et al., 2019)[38] provides software containing a set of algorithms specifically for multi-class label-imbalanced problems, serving as one of the most recent studies on this topic.
5 Experiments
In this section, experimental results based on Parkinson's disease classification data are discussed. The results suggest that the special XGBoost methods implemented in the package outperform the best existing approaches known to the authors on the same task in terms of $F_1$ score, and that the pattern of the predictions meets expectations. The dataset and experiment setup are first discussed in section 5.1, and results and discussions are presented in section 5.2.
5.1 Dataset and Setup
To begin, the Parkinson's Disease (PD) classification data[39] (publicly available at https://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification) is introduced. It is a recently collected dataset with 757 features categorized into 7 specific groups (two originally separate groups, Bandwidth and Formant, are merged in the experiments). The data was gathered from 188 Parkinson's disease patients and 64 healthy individuals at the Department of Neurology in Cerrahpaşa Faculty of Medicine, Istanbul University[39]. Each individual corresponds to 3 records, and because of the difference between the numbers of participants on the two sides, a label imbalance ratio of 188:64 (roughly 3:1) emerges.
As the dataset is new, the best known classification results are those demonstrated in [39] with seven traditional classification algorithms and two ensemble approaches. It is noticeable that the figures reported in [39] are weaker than the best existing performances on other Parkinson's disease data. The authors of that paper explained that since the experiments were conducted with leave-one-object-out cross validation, the classification task becomes more challenging, as no information about the same object (person) can be found in the training data (unlike leave-one-record-out). To remain consistent with the original system, the same cross validation technique is applied in the experiments of this paper. Furthermore, the results in [39] show a high accuracy and a relatively lower $F_1$ score, indicating that the classifiers failed to separate the two classes clearly and likely achieved their performance by overwhelmingly predicting the majority class. This is an unfavourable behavior in label-imbalanced classification, and as one will see in the following sections, one advantage of Imbalance-XGBoost is that it does not suffer from this problem.
As mentioned, parameters $\alpha$ and $\gamma$ affect the performance of the weighted and focal losses, and a parameter search is often deemed necessary in Machine Learning models. Therefore, in our experiments, grid search is applied through the GridSearchCV() of Scikit-learn to explore the optimal models, with a search range for $\alpha$ and a set of candidate values for $\gamma$ fixed in advance. Notice that $\alpha$ is set to less than 1 in the experiments, since the data points with label '1' (patients) form the majority class of the dataset. To conduct the leave-one-object-out cross validation, the correctness collection function mentioned in section 2.3 is applied. By collecting the results of True_Positive (TP), True_Negative (TN), False_Positive (FP), and False_Negative (FN), the confusion matrix can be obtained, and accuracy and $F_1$ score are computed accordingly. The records are evaluated in a per-record manner, which means the 3 records of one object (patient/healthy individual) are evaluated individually, and 3 counts of correctness are added.
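The evaluation protocol above can be sketched in plain Python. The code below is an illustration under stated assumptions, not the experiment script: leave_one_object_out, pooled_confusion and accuracy_and_f1 are hypothetical names, and predict_fold stands for any routine that trains on the training indices and returns hard labels for the held-out records.

```python
from collections import defaultdict


def leave_one_object_out(subject_ids):
    """Yield (train_idx, test_idx), holding out all records of one subject."""
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)
    for test_idx in by_subject.values():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(subject_ids)) if i not in held_out]
        yield train_idx, test_idx


def pooled_confusion(y_true, subject_ids, predict_fold):
    """Sum TP/TN/FP/FN over all folds, one count per record."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for train_idx, test_idx in leave_one_object_out(subject_ids):
        predictions = predict_fold(train_idx, test_idx)
        for i, predicted in zip(test_idx, predictions):
            if predicted == 1:
                counts["TP" if y_true[i] == 1 else "FP"] += 1
            else:
                counts["FN" if y_true[i] == 1 else "TN"] += 1
    return counts


def accuracy_and_f1(counts):
    """Metrics computed once, from the pooled confusion matrix."""
    total = sum(counts.values())
    accuracy = (counts["TP"] + counts["TN"]) / total
    denom = 2 * counts["TP"] + counts["FP"] + counts["FN"]
    f1 = 2 * counts["TP"] / denom if denom else 0.0
    return accuracy, f1
```

Holding out every record of a subject at once prevents records of the same person from appearing in both the training and test sets, and pooling the counts before computing the metrics avoids the ill-defined per-fold precision and recall discussed in section 2.3.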
5.2 Classification Results and Discussion
Accuracy and $F_1$ score on the test set with 6 sets of features are presented in Table 2, where 'Best in [39]' indicates the best accuracy and $F_1$ score retrieved from that paper.
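| Method | Baseline features (Accuracy) | Baseline features ($F_1$ score) | MFCC (Accuracy) | MFCC ($F_1$ score) | Wavelet features (Accuracy) | Wavelet features ($F_1$ score) |
| --- | --- | --- | --- | --- | --- | --- |
| Best in [39] | 0.79 | 0.75 | 0.84 | 0.83 | 0.78 | 0.74 |
| Weighted-XGBoost | 0.76 | 0.85 | 0.80 | 0.87 | 0.75 | 0.85 |
| Focal-XGBoost | 0.76 | 0.85 | 0.82 | 0.89 | 0.75 | 0.85 |

| Method | Bandwidth + Formant (Accuracy) | Bandwidth + Formant ($F_1$ score) | Intensity-Based (Accuracy) | Intensity-Based ($F_1$ score) | Vocal Fold-Based (Accuracy) | Vocal Fold-Based ($F_1$ score) |
| --- | --- | --- | --- | --- | --- | --- |
| Best in [39] | 0.77 | 0.72 | 0.77 | 0.74 | 0.77 | 0.74 |
| Weighted-XGBoost | 0.74 | 0.85 | 0.75 | 0.85 | 0.75 | 0.84 |
| Focal-XGBoost | 0.75 | 0.85 | 0.75 | 0.85 | 0.76 | 0.85 |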
Without exception, a slight decline in accuracy can be observed for weighted-XGBoost and focal-XGBoost, but both classifiers achieve a significantly higher $F_1$ score. The increase in $F_1$ score and the decrease in accuracy suggest that the previously obtained higher accuracy is a consequence of overlooking the minority class, so it is reasonable for our classifiers to appear to 'sacrifice' accuracy in order to give impartial recognition results on both classes. Furthermore, for almost all the feature groups, the highest $F_1$ score is obtained by focal-XGBoost. This observation can be explained from an algorithmic perspective: focal loss is more robust to its parameter, while weighted loss is prone to the effect of suboptimal parameters even if a parameter search is applied.
To eliminate potential impacts on the classification performance due to intrinsic characteristics of individual sets of features, a classifier with the 50 top-ranked features selected by mRMR (minimum Redundancy-Maximum Relevance)[40] was applied in [39] as well. The feature selection method is based on the principle of maximizing the joint dependency of the top-ranking variables on the target variable while reducing the redundancy among them[40][41]. For comparison, this paper employs the same technique with an available Python interface (https://github.com/fbrundu/pymrmr) and produces a subset of the top-50 features to run with Imbalance-XGBoost. The performance of weighted- and focal-XGBoost on the top-50 features can be observed in table 3.
| Method | Top 50 Features (Accuracy) | Top 50 Features ($F_1$ score) |
| --- | --- | --- |
| Best in [39] | 0.86 | 0.84 |
| Weighted-XGBoost | 0.82 | 0.88 |
| Focal-XGBoost | 0.83 | 0.89 |
Consistent with the performance on individual groups of features, the focal-XGBoost classifier has the highest $F_1$ score, slightly better than weighted-XGBoost. Both weighted- and focal-XGBoost outperform the best classifier in [39] in $F_1$ score by a clear margin, and since the top-50 features can be regarded as a 'master subset', the superiority of the methods implemented in Imbalance-XGBoost is further corroborated.
6 Conclusion
This paper presents a novel Python-based package, namely Imbalance-XGBoost, for binary label-imbalanced classification with XGBoost. The package implements weighted cross-entropy and focal loss functions on XGBoost, and it is fully compatible with the popular Scikit-learn package in Python. The design and usage of the package are introduced, and the discussion of methods and the code listing examples provide clear and comprehensive user guidance for the package. The theories and derivatives essential to the package are further discussed, and experiments on Parkinson's disease classification data are conducted, with state-of-the-art performance illustrated. Overall, the package presented in this paper successfully combines XGBoost with popular label-imbalance-robust loss functions and provides one of the most competitive performances to date.
In summary, this paper makes three main contributions. Firstly, the paper introduces a novel package that leverages the power of weighted and focal loss functions for XGBoost, and it has great potential to be applied to a variety of real-life binary classification problems. Secondly, the paper studies the theoretical foundations of the second-order approximation of XGBoost and provides the essential derivatives of the loss functions to be applied. The derivatives can also be applied to other fields in Machine Learning, and the equations in the merged form are convenient to vectorize. Finally, the paper offers new state-of-the-art performance on the Parkinson's disease classification data, and the emphasis on its imbalanced nature provides a new perspective for studying the dataset.
In the future, the authors plan to keep maintaining the package and to improve its quality by adding new features and further optimizing the code. The software is open-source, and every member of the community is welcome to contribute their own revisions of the program. Furthermore, the authors intend to add more evaluation score functions in coming versions. An extension to multi-class label-imbalanced classification problems is a longer-term plan.
Acknowledgment
The authors would like to thank the Github users icegrid and shaojunchao for reporting and correcting errors in previous versions, and the Github users olivierverdier and braingineer for providing LaTeX solutions for code listings. In addition, the authors would like to thank Noel Chao of Dow Jones, who provided writing suggestions and proofreading for the paper.
References
[1] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.
[2] Ching-Wei Wang, Yu-Ching Lee, Evelyne Calista, Fan Zhou, Hongtu Zhu, Ryohei Suzuki, Daisuke Komura, Shumpei Ishikawa, and Shih-Ping Cheng. A benchmark for comparing precision medicine methods in thyroid cancer diagnosis using tissue microarrays. Bioinformatics, 34(10):1767–1773, 2017.
[3] Chen Wang, Suzhen Wang, Fuyan Shi, and Zaixiang Wang. Robust propensity score computation method based on machine learning with label-corrupted data. arXiv preprint arXiv:1801.03132, 2018.
[4] Yung-Chia Chang, Kuei-Hu Chang, and Guan-Jhih Wu. Application of extreme gradient boosting trees in the construction of credit risk assessment models for financial institutions. Applied Soft Computing, 73:914–920, 2018.
[5] Jyotsna Talreja Wassan, Haiying Wang, Fiona Browne, and Huiru Zheng. A comprehensive study on predicting functional role of metagenomes using machine learning methods. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 16(3):751–763, 2019.
[6] Kamil Belkhayat Abou Omar. XGBoost and LGBM for Porto Seguro's Kaggle challenge: A comparison. Preprint Semester Project, 2018.
[7] Didrik Nielsen. Tree boosting with XGBoost - why does XGBoost win "every" machine learning competition? Master's thesis, NTNU, 2016.
[8] Zhixun Zhao, Hui Peng, Chaowang Lan, Yi Zheng, Liang Fang, and Jinyan Li. Imbalance learning for the prediction of N6-methylation sites in mRNAs. BMC Genomics, 19(1):574, 2018.
[9] Ruisen Luo, Songyi Dian, Chen Wang, Peng Cheng, Zuodong Tang, Yan-Mei Yu, and Shixiong Wang. Bagging of XGBoost classifiers with random under-sampling and Tomek link for noisy label-imbalanced data. In IOP Conference Series: 3rd International Conference on Automation, Control and Robotics Engineering (CACRE 2018), volume 428, page 012004. IOP Publishing, 2018.

 [10] Yanmin Sun, Andrew KC Wong, and Mohamed S Kamel. Classification of imbalanced data: A review. International Journal of Pattern Recognition and Artificial Intelligence, 23(04):687–719, 2009.
 [11] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5375–5384, 2016.
 [12] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
 [13] Md Faisal Kabir and Simone Ludwig. Classification of breast cancer risk factors using several resampling approaches. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1243–1248. IEEE, 2018.
 [14] Yufei Xia, Chuanzhe Liu, and Nana Liu. Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending. Electronic Commerce Research and Applications, 24:30–49, 2017.
 [15] Wenbin Chen, Kun Fu, Jiawei Zuo, Xinwei Zheng, Tinglei Huang, and Wenjuan Ren. Radar emitter classification for large data set based on weighted-xgboost. IET Radar, Sonar & Navigation, 11(8):1203–1207, 2017.
 [16] Stefan Van Der Walt, S Chris Colbert, and Gael Varoquaux. The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22, 2011.
 [17] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
 [18] Wes McKinney. pandas: a foundational python library for data analysis and statistics. Python for High Performance and Scientific Computing, 14, 2011.
 [19] Brian W Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2):442–451, 1975.
 [20] David Martin Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, 2011.
 [21] Erin L Allwein, Robert E Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of machine learning research, 1(Dec):113–141, 2000.
 [22] Günther Eibl and Karl-Peter Pfeiffer. Multiclass boosting for weak classifiers. Journal of Machine Learning Research, 6(Feb):189–210, 2005.
 [23] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
 [24] Alexey Natekin and Alois Knoll. Gradient boosting machines, a tutorial. Frontiers in neurorobotics, 7:21, 2013.
 [25] Stephen Tyree, Kilian Q Weinberger, Kunal Agrawal, and Jennifer Paykin. Parallel boosted regression trees for web search ranking. In Proceedings of the 20th international conference on World wide web, pages 387–396. ACM, 2011.

 [26] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pages 3146–3154, 2017.
 [27] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pages 6638–6648, 2018.
 [28] Xiaoliang Wang, Peng Cheng, Xinchuan Liu, and Benedict Uzochukwu. Focal loss dense detector for vehicle surveillance. In 2018 International Conference on Intelligent Systems and Computer Vision (ISCV), pages 1–5. IEEE, 2018.
 [29] Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 172–181. IEEE, 2018.
 [30] Tianqi Zhang, Li-Ying Hao, and Ge Guo. A feature enriching object detection framework with weak segmentation loss. Neurocomputing, 335:72–80, 2019.
 [31] Hongliang He, Wenyu Zhang, and Shuai Zhang. A novel ensemble method for credit scoring: Adaption of different imbalance ratios. Expert Systems with Applications, 98:105–117, 2018.
 [32] Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2008.
 [33] Natalia Ponomareva, Soroush Radpour, Gilbert Hendry, Salem Haykal, Thomas Colthurst, Petr Mitrichev, and Alexander Grushetsky. Tf boosted trees: A scalable tensorflow based framework for gradient boosting. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 423–427. Springer, 2017.
 [34] Guillaume Lemaître, Fernando Nogueira, and Christos K Aridas. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research, 18(1):559–563, 2017.
 [35] Nicola Lunardon, Giovanna Menardi, and Nicola Torelli. Rose: A package for binary imbalanced learning. R journal, 6(1), 2014.
 [36] Jesús Alcalá-Fdez, Alberto Fernández, Julián Luengo, Joaquín Derrac, Salvador García, Luciano Sánchez, and Francisco Herrera. Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing, 17, 2011.
 [37] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
 [38] Chongsheng Zhang, Jingjun Bi, Shixin Xu, Enislay Ramentol, Gaojuan Fan, Baojun Qiao, and Hamido Fujita. Multi-imbalance: An open-source software for multi-class imbalance learning. Knowledge-Based Systems, 174:137–143, 2019.
 [39] C Okan Sakar, Gorkem Serbes, Aysegul Gunduz, Hunkar C Tunc, Hatice Nizam, Betul Erdogdu Sakar, Melih Tutuncu, Tarkan Aydin, M Erdem Isenkul, and Hulya Apaydin. A comparative analysis of speech signal processing algorithms for parkinson’s disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing, 74:255–263, 2019.
 [40] Hanchuan Peng, Fuhui Long, and Chris Ding. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence, 27(8):1226–1238, 2005.
 [41] Salvador García, Julián Luengo, and Francisco Herrera. Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowledge-Based Systems, 98:1–29, 2016.