1 Introduction
Gradient boosted decision trees (GBDT) [16] is one of the most popular machine learning algorithms, as it provides high-quality models for a large number of machine learning problems with heterogeneous features, noisy data, and complex dependencies [31]. Gradient boosting achieves state-of-the-art results in many fields, e.g., search engines [34, 3], recommendation systems [30], and other applications [36, 4].
One problem of GBDT is the computational cost of the learning process. GBDT may be described as an iterative process of constructing decision tree models, each of which estimates the negative gradients of the examples' errors. At each step, GBDT greedily builds a tree: it scores every possible feature split and chooses the best one, which requires computational time proportional to the number of data instances. Since most GBDT models are ensembles of many trees, learning time grows with the number of examples, which restricts the use of GBDT models on large industrial datasets.
Another problem is the trade-off between the capacity of the GBDT model and its generalization ability. One of the most critical parameters that influence the capacity of boosting is the number of iterations, i.e., the size of the ensemble. The more components the algorithm uses, the more complex the dependencies it can model. However, increasing the number of models in the ensemble does not always increase accuracy and, moreover, can decrease generalization ability [28]. Therefore, boosting algorithms are usually equipped with regularization methods to prevent overfitting.
A common approach to both problems described above is to use random subsampling [17] (or bootstrap) at every iteration of the algorithm. Before fitting the next tree, we select a subset of training objects that is smaller than the original training dataset and run the learning algorithm on the chosen subsample. The fraction of chosen objects is called the sample rate. Studies in this area show that this approach performs well in terms of both learning time and quality [17]. It speeds up the learning of each decision tree, as the tree uses less data. Accuracy can also increase: although the variance of each component of the ensemble goes up, the pairwise correlation between trees of the ensemble decreases, which can reduce the total variance of the model.
In this paper, we propose a new approach to the theoretical analysis of random sampling in GBDT. In GBDT, random subsamples are used to evaluate candidate splits when each next decision tree is constructed. Random sampling decreases the size of the active training dataset and makes the training procedure noisier, which can decrease quality. Therefore, a sampling algorithm should select the most informative training examples, given a constraint on the number of instances it is allowed to choose. We propose a mathematical formulation of this optimization problem in Stochastic Gradient Boosting (SGB), where the accuracy of the estimated scores of candidate splits is maximized. For every fixed sample rate (ratio of sampled objects), we propose a solution to this sampling problem and provide a novel algorithm, Minimal Variance Sampling (MVS). MVS relies on the distribution of loss derivatives and assigns the probabilities and weights with which the sampling should be done. This makes the procedure adaptive to any data distribution and allows it to significantly outperform state-of-the-art SGB methods while using far fewer data instances.
2 Background
In this section, we introduce the necessary definitions and notation. We start with the GBDT algorithm and then describe two popular modifications of it that use data subsampling: Stochastic Gradient Boosting [17] and Gradient-Based One-Side Sampling (GOSS) [24].
2.1 Gradient Boosting
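Consider a dataset $\{(\vec{x}_i, y_i)\}_{i=1}^{N}$ sampled from some unknown distribution $p(\vec{x}, y)$. Here $\vec{x}_i \in \mathcal{X}$ is a vector from the $d$-dimensional vector space. Value $y_i \in \mathbb{R}$ is the response to the input $\vec{x}_i$ (or target). Given a loss function $L: \mathbb{R}^2 \to \mathbb{R}^+$, the supervised learning task is to find a function $F: \mathcal{X} \to \mathbb{R}$ that minimizes the empirical risk:

$$\hat{\mathcal{L}}(F) = \sum_{i=1}^{N} L(F(\vec{x}_i), y_i).$$

Gradient boosting (GB) [16] is a method of constructing the desired function $F$ in the form

$$F_T(\vec{x}) = \alpha \sum_{m=0}^{T} f_m(\vec{x}),$$

where $T$ is the number of iterations, i.e., the number of base functions $f_m$ chosen from a simple parametric family $\mathcal{F}$, such as linear models or decision trees with small depth. The learning rate, or step size in functional space, is denoted by $\alpha$. Base learners $\{f_m\}$ are learned sequentially in the following way. Given a function $F_{m-1} = \alpha \sum_{i=0}^{m-1} f_i$, the goal is to construct the next member $f_m$ of the sequence $f_0, \dots, f_{m-1}$ such that:

$$f_m = \arg\min_{f \in \mathcal{F}} \hat{\mathcal{L}}(F_{m-1} + f) = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} L(F_{m-1}(\vec{x}_i) + f(\vec{x}_i), y_i). \tag{1}$$

Gradient boosting constructs a solution of Equation 1 by calculating first-order derivatives (gradients) $g_i^m(\vec{x}_i, y_i) = \left.\frac{\partial L(\hat{y}_i, y_i)}{\partial \hat{y}_i}\right|_{\hat{y}_i = F_{m-1}(\vec{x}_i)}$ of $\hat{\mathcal{L}}(F)$ at point $F_{m-1}$ and performing a negative gradient step in the functional space of examples $\{(\vec{x}_i, y_i)\}_{i=1}^{N}$. The latter means that $f_m$ is learned using $\{-g_i^m(\vec{x}_i, y_i)\}_{i=1}^{N}$ as targets and is fitted by the least-squares approximation:

$$f_m = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} \left(f(\vec{x}_i) - \left(-g_i^m(\vec{x}_i, y_i)\right)\right)^2. \tag{2}$$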
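As an illustration of the least-squares step in Equation 2, the following minimal sketch boosts depth-1 trees (stumps) on one-dimensional inputs. It assumes the squared loss $L(\hat{y}, y) = (\hat{y} - y)^2 / 2$, whose gradient is $\hat{y} - y$; the function names `fit_stump` and `gradient_boosting` are hypothetical and do not come from any library.

```python
def fit_stump(xs, targets):
    """Least-squares fit of a depth-1 tree on 1-D inputs (assumed setting)."""
    best = None
    for v in sorted(set(xs))[1:]:  # candidate thresholds; both sides non-empty
        left = [t for x, t in zip(xs, targets) if x < v]
        right = [t for x, t in zip(xs, targets) if x >= v]
        cl, cr = sum(left) / len(left), sum(right) / len(right)
        err = sum((t - cl) ** 2 for t in left) + sum((t - cr) ** 2 for t in right)
        if best is None or err < best[0]:
            best = (err, v, cl, cr)
    if best is None:  # all inputs identical: fall back to a constant
        c = sum(targets) / len(targets)
        return lambda x: c
    _, v, cl, cr = best
    return lambda x: cl if x < v else cr


def gradient_boosting(xs, ys, n_iter=50, alpha=0.1):
    """Fit each new tree to the negative gradients of the squared loss."""
    trees = []
    preds = [0.0] * len(xs)
    for _ in range(n_iter):
        grads = [p - y for p, y in zip(preds, ys)]   # dL/dpred for squared loss
        tree = fit_stump(xs, [-g for g in grads])    # targets are -g_i
        trees.append(tree)
        preds = [p + alpha * tree(x) for p, x in zip(preds, xs)]
    return lambda x: alpha * sum(t(x) for t in trees)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boosting(xs, ys)
```

With a small learning rate the ensemble approaches the step function gradually, since each iteration removes only a fraction `alpha` of the remaining residual.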
If a subfamily of decision tree functions is taken as the set of base functions $\mathcal{F}$ (e.g., all decision trees of depth 5), the algorithm is called Gradient Boosted Decision Trees (GBDT) [16]. A decision tree divides the original feature space $\mathcal{X}$ into disjoint areas, also called leaves, with a constant value in each region. In other words, the result of decision tree learning is a disjoint union of subsets $\{X_1, X_2, \dots, X_q : \bigsqcup_{j=1}^{q} X_j = \mathcal{X}\}$ and a piecewise constant function $f(\vec{x}) = \sum_{j=1}^{q} \mathbb{I}\{\vec{x} \in X_j\}\, c_j$. The learning procedure is recursive. It starts from the whole set $\mathcal{X}$ as the only region. For each of the already built regions, the algorithm examines all split candidates by one feature and assigns a score to each split. The score is usually a measure of accuracy gain based on the target distributions in the regions before and after splitting. The process continues until a stopping criterion is reached.
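Besides the classical gradient descent approach to GBDT defined by Equation 2, we also consider a second-order method based on calculating the diagonal elements of the Hessian of the empirical risk, $h_i^m(\vec{x}_i, y_i) = \left.\frac{\partial^2 L(\hat{y}_i, y_i)}{\partial \hat{y}_i^2}\right|_{\hat{y}_i = F_{m-1}(\vec{x}_i)}$.
The rule for choosing the next base function in this method [8] is:

$$f_m = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{N} h_i^m(\vec{x}_i, y_i) \left(f(\vec{x}_i) - \left(-\frac{g_i^m(\vec{x}_i, y_i)}{h_i^m(\vec{x}_i, y_i)}\right)\right)^2, \tag{3}$$

where $g_i^m$ are the first-order gradients whose negatives serve as targets in Equation 2.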
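To make the second-order rule concrete, the sketch below computes the gradient and the Hessian diagonal for one common choice of loss and the constant that minimizes the weighted least-squares objective over a single leaf. The logistic loss is an assumption made for illustration, and both function names are hypothetical.

```python
import math


def logloss_derivatives(raw_score, label):
    """Gradient g and Hessian diagonal h of the logistic loss at raw_score."""
    prob = 1.0 / (1.0 + math.exp(-raw_score))
    return prob - label, prob * (1.0 - prob)


def newton_leaf_value(grads, hess):
    """Constant c minimizing sum_i h_i * (c + g_i / h_i)^2 over one leaf.

    Setting the derivative 2 * (c * sum(h) + sum(g)) to zero gives the value.
    """
    return -sum(grads) / sum(hess)


g, h = logloss_derivatives(0.0, 1.0)
leaf = newton_leaf_value([-0.5, 0.25], [0.25, 0.25])
```

The leaf value is the negated ratio of the gradient sum to the Hessian sum, which is the usual Newton step restricted to the objects of the leaf.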
2.2 Stochastic Gradient Boosting
Stochastic Gradient Boosting (SGB) is a randomized version of the standard gradient boosting algorithm. Motivated by Breiman's work on adaptive bagging [2], Friedman [17] proposed adding randomness to the tree building procedure by subsampling the full dataset. At each iteration of the boosting process, the sampling algorithm of SGB selects $s \cdot N$ random objects uniformly and without replacement, where $N$ is the dataset size and $s$ is the sample rate. This effectively reduces the complexity of each iteration by a factor of $s$. Experiments [17] also show that, in some cases, SGB improves the quality of the learned model.
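The uniform sampling step of SGB can be sketched in a few lines. The helper name `sgb_sample` and the rounding of the subsample size are assumptions made for illustration.

```python
import random


def sgb_sample(n_objects, sample_rate, rng):
    """Pick round(sample_rate * n_objects) indices uniformly, without replacement."""
    k = max(1, round(sample_rate * n_objects))
    return sorted(rng.sample(range(n_objects), k))


indices = sgb_sample(100, 0.3, random.Random(0))
```

Every selected object keeps weight 1, so the subsample is a plain uniform draw and no reweighting is needed.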
2.3 GOSS
SGB selects all objects with equal probability. However, different objects have different impacts on the learning process. Gradient-based one-side sampling (GOSS) implements the idea that objects with a larger absolute value of the gradient are more important than those with smaller gradients. A large gradient value indicates that the model can be improved significantly with respect to the object, so it should be sampled with higher probability than well-trained instances with small gradients. GOSS therefore takes the most important objects with probability 1 and chooses a random sample of the other objects. To avoid distribution bias, GOSS reweights the selected samples by assigning higher weights to the examples with smaller gradients. More formally, the training sample consists of $\mathrm{top\_rate} \times N$ instances with the largest $|g_i|$, each with weight 1, and of $\mathrm{other\_rate} \times N$ instances from the rest of the data, each with weight $\frac{1 - \mathrm{top\_rate}}{\mathrm{other\_rate}}$.
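The GOSS scheme described above can be sketched as follows. This is an illustrative reimplementation rather than the LightGBM code; the function name `goss_sample` is hypothetical, and the parameters `top_rate` and `other_rate` denote the fractions of large-gradient and randomly chosen objects.

```python
import random


def goss_sample(grads, top_rate, other_rate, rng):
    """Return {index: weight} for a GOSS-style subsample (illustrative sketch)."""
    n = len(grads)
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    n_top = round(top_rate * n)
    n_other = round(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    chosen_rest = rng.sample(rest, min(n_other, len(rest)))
    # Amplify small-gradient objects so that they represent the whole rest.
    small_weight = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: small_weight for i in chosen_rest})
    return weights


grads = [float(i) for i in range(100)]
weights = goss_sample(grads, top_rate=0.2, other_rate=0.1, rng=random.Random(0))
```

In this example the 20 largest gradients are kept with weight 1 and 10 of the remaining 80 objects are kept with weight 8, so the total weight still equals the dataset size.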
3 Related work
A common approach to randomizing the learning of a GBDT model is to use some kind of SGB, where instances are sampled uniformly, i.e., with equal probability. This idea has been implemented in different ways. Originally, Friedman proposed to sample a subset of objects of a fixed size without replacement [17]. However, in today's practice, other similar techniques are applied, where the size of the subset can be stochastic. For example, the objects can be sampled independently using a Bernoulli process [24], or a bootstrap procedure can be applied [14]. To the best of our knowledge, GOSS, proposed in [24], is the only weighted (non-uniform) sampling approach applied to GBDT. It is based on intuitive ideas, but its design is empirical. In our experiments, the theoretically grounded MVS method outperforms GOSS.
Although there is a surprising lack of non-uniform sampling methods for GBDT, adaptive weighted approaches [13] have been proposed for AdaBoost, another popular boosting algorithm. These methods mostly rely on the weights of instances defined in the loss function at each iteration of boosting [15, 35, 9, 26, 20]. These papers mostly focus on the accurate estimation of the loss function, whereas subsamples in GBDT are used to estimate the scores of candidate splits; therefore, the sampling methods of both our paper and GOSS are based on the values of gradients. GBDT algorithms do not apply adaptive weighting of training instances, and the methods proposed for AdaBoost cannot be directly applied to GBDT.
One of the most popular sampling methods based on the target distribution is Importance Sampling [37], widely used in deep learning [21]. The idea is to choose objects with larger loss gradients with higher probability than objects with smaller ones. This reduces the variance of the mini-batch gradient estimate and has a positive effect on model performance. Unfortunately, Importance Sampling performs poorly for the task of building decision trees in GBDT, because the score of a split is a ratio function that depends on both the sum of gradients and the sample sizes in the leaves, and the variances of both estimates affect the accuracy of the GBDT algorithm. The rest of this paper is devoted to a theoretically grounded method that overcomes these limitations.
4 Minimal Variance Sampling
4.1 Problem setting
As mentioned in Section 2.1, training a decision tree is a recursive process of selecting the best data partition (or split), which is based on the value of some feature. So, given a subset $A$ of the original feature space $\mathcal{X}$, a split is a pair of a feature $f$ and its value $v$ such that the data is partitioned into two sets: $A_1 = \{\vec{x} \in A : x^f < v\}$ and $A_2 = \{\vec{x} \in A : x^f \geq v\}$. Every split is evaluated by some score, which is used to select the best one among them.
There are various scoring metrics, e.g., the Gini index and the entropy criterion [32] for classification tasks, and mean squared error (MSE) and mean absolute error (MAE) for regression trees. Most GB implementations (e.g., [8]) use the Hessian when learning the next tree (second-order approximation). The solution to Equation 3 in a leaf $l$ is the constant equal, up to sign, to the ratio $\frac{\sum_{i \in l} g_i}{\sum_{i \in l} h_i}$ of the sum of gradients and the sum of Hessian diagonal elements. The score $S(f, v)$ of a split $(f, v)$ is calculated as

$$S(f, v) := \sum_{l \in L} \frac{\left(\sum_{i \in l} g_i\right)^2}{\sum_{i \in l} h_i}, \tag{4}$$

where $L$ is the set of obtained leaves, and leaf $l$ consists of the objects that belong to this leaf. This score is, up to a common constant, the opposite of the value of the functional minimized in Equation 3 when we add this split to the tree. For classical GB based on first-order gradient steps, according to Equation 2, the score should be calculated by setting $h_i = 1$ in Equation 4.
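The split score (the sum over leaves of the squared gradient sum divided by the Hessian sum) can be computed directly. The sketch below assumes a single numeric feature and a split of the form `x < threshold`; the function name `split_score` is hypothetical.

```python
def split_score(xs, grads, hess, threshold):
    """Sum over the two leaves of (sum of g)^2 / (sum of h)."""
    sums = {True: [0.0, 0.0], False: [0.0, 0.0]}  # leaf -> [sum g, sum h]
    for x, g, h in zip(xs, grads, hess):
        leaf = sums[x < threshold]
        leaf[0] += g
        leaf[1] += h
    return sum(G * G / H for G, H in sums.values() if H > 0)


xs = [1, 2, 3, 4]
grads = [-1.0, -1.0, 1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]      # h_i = 1 gives the first-order score
good = split_score(xs, grads, hess, threshold=3)
bad = split_score(xs, grads, hess, threshold=2)
```

The split at 3 separates negative from positive gradients and therefore receives the higher score.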
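To formulate the problem, we first describe the general sampling procedure, which generalizes SGB and GOSS. Sampling from a set of size $N$ may be described as a sequence of random variables $(\xi_1, \xi_2, \dots, \xi_N)$, where $\xi_i \sim \mathrm{Bernoulli}(p_i)$, and $\xi_i = 1$ indicates that the $i$-th example was sampled and should be used to estimate the scores $S(f, v)$ of different candidates $(f, v)$. Let $n_{\mathrm{sampled}}$ be the number of selected instances. By sampling with sampling ratio $s$, we denote any sequence $(\xi_1, \xi_2, \dots, \xi_N)$ that samples $s \times 100\%$ of the data on average:

$$\mathbb{E}(n_{\mathrm{sampled}}) = \mathbb{E}\sum_{i=1}^{N} \xi_i = \sum_{i=1}^{N} p_i = N \cdot s. \tag{5}$$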
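To make all key statistics (the sum of gradients and the sum of Hessians in a leaf) unbiased, we perform inverse probability weighting estimation (IPWE) [18], which assigns weight $w_i = \frac{1}{p_i}$ to instance $i$. In GB with sampling, the score $S(f, v)$ is approximated by

$$\hat{S}(f, v) := \sum_{l \in L} \frac{\left(\sum_{i \in l} \frac{1}{p_i} \xi_i g_i\right)^2}{\sum_{i \in l} \frac{1}{p_i} \xi_i h_i}, \tag{6}$$

where the numerator and the denominator are estimators of $\left(\sum_{i \in l} g_i\right)^2$ and $\sum_{i \in l} h_i$, respectively.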
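A sampled version of the split score with inverse probability weights can be sketched as below. Each object $i$ is kept independently with probability $p_i$ and its gradient and Hessian are multiplied by $1 / p_i$; the function name `sampled_split_score` and the single-feature split `x < threshold` are assumptions made for illustration.

```python
import random


def sampled_split_score(xs, grads, hess, probs, threshold, rng):
    """Split score estimated from a Bernoulli subsample with weights 1 / p_i."""
    sums = {True: [0.0, 0.0], False: [0.0, 0.0]}  # leaf -> [sum g/p, sum h/p]
    for x, g, h, p in zip(xs, grads, hess, probs):
        if rng.random() < p:          # xi_i ~ Bernoulli(p_i)
            leaf = sums[x < threshold]
            leaf[0] += g / p
            leaf[1] += h / p
    return sum(G * G / H for G, H in sums.values() if H > 0)


xs = [1, 2, 3, 4]
grads = [-1.0, -1.0, 1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]
full = sampled_split_score(xs, grads, hess, [1.0] * 4, 3, random.Random(0))
```

With all probabilities equal to 1 every object is kept, so the estimate coincides with the exact score; with smaller probabilities it becomes a random variable whose deviation from the exact score is the quantity studied next.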
We aim to minimize the squared deviation $\Delta^2 = \left(\hat{S}(f, v) - S(f, v)\right)^2$, under the assumption that the previous splits of the tree are fixed and are the same for the subsampled and the full data. The deviation $\Delta$ is a random variable due to the randomness of the sampling procedure (the randomness of $\xi_i$). Therefore, we consider the minimization of the expectation $\mathbb{E}\Delta^2$.
Theorem 1.
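The expected squared deviation $\mathbb{E}\Delta^2$ can be approximated by

$$\mathbb{E}\Delta^2 \approx \sum_{l \in L} c_l^2 \left(4\,\mathrm{Var}(x_l) - 4 c_l\,\mathrm{Cov}(x_l, y_l) + c_l^2\,\mathrm{Var}(y_l)\right), \tag{7}$$

where $x_l := \sum_{i \in l} \frac{1}{p_i} \xi_i g_i$, $y_l := \sum_{i \in l} \frac{1}{p_i} \xi_i h_i$, and $c_l := \frac{\sum_{i \in l} g_i}{\sum_{i \in l} h_i}$ is the value in the leaf $l$ that would be assigned if $l$ were a terminal node of the tree.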
The proof of this theorem is available in the Supplementary Materials.
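The term $-4 c_l\,\mathrm{Cov}(x_l, y_l)$ in Equation 7 has an upper bound of $4\,\mathrm{Var}(x_l) + c_l^2\,\mathrm{Var}(y_l)$. Using Theorem 1, we come to an upper bound minimization problem

$$\sum_{l \in L} c_l^2 \left(4\,\mathrm{Var}(x_l) + c_l^2\,\mathrm{Var}(y_l)\right) \to \min. \tag{8}$$

Note that we do not have the values of $c_l$ for all possible leaves of all possible candidate splits in advance, when we perform the sampling procedure. A possible approach to Problem 8 is to substitute all $c_l^2$ by a universal constant value, which is a parameter of the sampling algorithm. Also, note that $\mathrm{Var}(x_l)$ is $\sum_{i \in l} \frac{1}{p_i} g_i^2$ and $\mathrm{Var}(y_l)$ is $\sum_{i \in l} \frac{1}{p_i} h_i^2$ up to constants that do not depend on the sampling procedure. In this way, we come to the following form of Problem 8:

$$\sum_{i=1}^{N} \frac{1}{p_i} g_i^2 + \lambda \sum_{i=1}^{N} \frac{1}{p_i} h_i^2 \to \min, \quad \text{w.r.t.} \quad \sum_{i=1}^{N} p_i = N \cdot s \quad \text{and} \quad \forall i \in \{1, \dots, N\}\ \ p_i \in [0, 1]. \tag{9}$$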
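The two facts used in this passage can be checked with a short calculation. The sketch below assumes that the indicators $\xi_i \sim \mathrm{Bernoulli}(p_i)$ are independent, writes $x_l = \sum_{i \in l} \xi_i g_i / p_i$ and $y_l = \sum_{i \in l} \xi_i h_i / p_i$ for the weighted sums in leaf $l$, and writes $c_l$ for the leaf value; the final identification of the coefficient $\lambda$ with $c^2 / 4$ is our reading of the substitution, not a statement taken from the text.

$$
\begin{aligned}
\mathrm{Var}(x_l) &= \sum_{i \in l} \frac{g_i^2}{p_i^2}\,\mathrm{Var}(\xi_i)
= \sum_{i \in l} \frac{g_i^2}{p_i^2}\, p_i (1 - p_i)
= \sum_{i \in l} \frac{g_i^2}{p_i} - \sum_{i \in l} g_i^2, \\
-4 c_l\,\mathrm{Cov}(x_l, y_l) &\le 2 \cdot 2\sqrt{\mathrm{Var}(x_l)} \cdot |c_l| \sqrt{\mathrm{Var}(y_l)}
\le 4\,\mathrm{Var}(x_l) + c_l^2\,\mathrm{Var}(y_l), \\
\sum_{l \in L} c^2 \left(4\,\mathrm{Var}(x_l) + c^2\,\mathrm{Var}(y_l)\right)
&= 4 c^2 \left(\sum_{i=1}^{N} \frac{g_i^2}{p_i} + \frac{c^2}{4} \sum_{i=1}^{N} \frac{h_i^2}{p_i}\right) + \mathrm{const}.
\end{aligned}
$$

The first line shows that the variance depends on the sampling probabilities only through $\sum_i g_i^2 / p_i$ (and likewise for $y_l$ with $h_i$). The second line combines the Cauchy-Schwarz inequality with $2ab \le a^2 + b^2$. The third line replaces every $c_l^2$ by one constant $c^2$, so that the sum over leaves becomes a sum over all objects.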
4.2 Theoretical analysis
Here we show that Problem 9 has a simple solution and leads to an effective sampling algorithm. First, we discuss its meaning in the case of first-order optimization, where we have $h_i = 1$.
The first term of the minimized expression is responsible for the gradient distribution over the leaves of the decision tree, while the second one is responsible for the distribution of sample sizes. Coefficient $\lambda$ controls the relative magnitude of the two components. It can be seen as a trade-off between the variance of a single model and the variance of the ensemble. The variance of the ensemble consists of the individual variances of every single algorithm and the pairwise correlations between models. On the one hand, it is crucial to reduce the individual variance of each model; on the other hand, the more dissimilar the subsamples are, the lower the total variance of the ensemble is. This is reflected in the dependence of accuracy on the number of sampled examples: a slight reduction of this number usually increases quality, as the variance of each model does not grow much, but when the sample size becomes small, the sum of variances prevails over the reduction in correlations and the accuracy decreases dramatically.
It is easy to derive that setting $\lambda$ to 0 yields the Importance Sampling procedure. As mentioned before, the applicability of this procedure in GBDT is limited, since it is still important to estimate the number of instances in each node of the tree accurately. Besides, Importance Sampling suffers from numerical instability when dealing with small gradients close to zero, which usually happens at later gradient boosting iterations. In this case, the second part of the expression may be interpreted as a regularization term that prohibits enormous weights.
Setting $\lambda$ to $\infty$ yields the SGB algorithm.
For arbitrary $\lambda$, the general solution is given by the following theorem (we leave the proof to the Supplementary Materials):
Theorem 2.
There exists a value $\mu$ such that $p_i = \min\left\{1, \frac{\sqrt{g_i^2 + \lambda h_i^2}}{\mu}\right\}$ is a solution to Problem 9.
For brevity, everywhere below we refer to the expression $\hat{g}_i = \sqrt{g_i^2 + \lambda h_i^2}$ as the regularized absolute value. The number $\mu$ defined above is a threshold for deciding whether to pick an example deterministically or by coin flipping. From the solution we see that, for any data instance, the weight is always bounded by some number, so the estimator is more computationally stable than IPWE usually is.
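Given a threshold, the sampling probabilities of Theorem 2 follow directly from the regularized absolute values $\sqrt{g_i^2 + \lambda h_i^2}$. In the sketch below, `lam` stands for the regularization coefficient and `mu` for the threshold; the function name `mvs_probabilities` is hypothetical.

```python
import math


def mvs_probabilities(grads, hess, lam, mu):
    """p_i = min(1, sqrt(g_i^2 + lam * h_i^2) / mu) for every object."""
    reg = [math.sqrt(g * g + lam * h * h) for g, h in zip(grads, hess)]
    return [min(1.0, r / mu) for r in reg]


grads = [3.0, 0.0, 0.6]
hess = [1.0, 1.0, 1.0]
plain = mvs_probabilities(grads, hess, lam=0.0, mu=2.0)
regularized = mvs_probabilities(grads, hess, lam=0.64, mu=2.0)
```

With `lam = 0` the object with zero gradient gets probability 0 and can never be sampled, whereas with a positive `lam` its probability stays away from zero, so its weight $1 / p_i$ stays bounded.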
4.3 Algorithm
Now we are ready to derive the MVS algorithm from Theorem 2, which can be directly applied to the general scheme of Stochastic Gradient Boosting. First, for a given sample rate $s$, MVS finds the threshold $\mu$ that decides which gradients are considered large. An example with regularized absolute value $\hat{g}_i = \sqrt{g_i^2 + \lambda h_i^2}$ higher than the chosen $\mu$ is sampled with probability 1. Every object with a small gradient is sampled independently with probability $p_i = \frac{\hat{g}_i}{\mu}$ and is assigned weight $w_i = \frac{1}{p_i}$. Still, it is not apparent how to find a threshold $\mu$ that gives the required sampling ratio $s$.
A brute-force algorithm relies on the fact that the sampling ratio has an inverse dependence on the threshold: the higher the threshold, the lower the fraction of sampled instances. First, we sort the data by regularized absolute value $\hat{g}_i$ in descending order. Note that now, given a threshold $\mu$, the sampling ratio can be calculated as $\frac{1}{N}\left(k + \frac{1}{\mu} \sum_{i=k+1}^{N} \hat{g}_i\right)$, where $k + 1$ is the index of the first element in the sorted sequence that is less than $\mu$. Then binary search is applied to find a threshold $\mu$ for which this ratio equals the required $s$. To speed up this algorithm, the cumulative sums of regularized absolute values $\sum_{i=k+1}^{N} \hat{g}_i$ are precalculated for every $k$, so the calculation of the sampling ratio at each step of the binary search has $O(1)$ time complexity. The total complexity of this procedure is $O(N \log N)$, due to the sorting at the beginning. For comparison, the SGB and GOSS algorithms have $O(N)$ complexity for sampling.
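The sorting-based search can be sketched as follows. The sketch assumes strictly positive regularized absolute values and a sample rate in $(0, 1]$, and it performs the binary search over positions in the sorted sequence rather than over real-valued thresholds, which is one possible way to organize it; the function name `mvs_threshold_sorted` is hypothetical.

```python
def mvs_threshold_sorted(reg_values, sample_rate):
    """Threshold mu with sum_i min(1, r_i / mu) = sample_rate * N, via sorting."""
    r = sorted(reg_values, reverse=True)
    n = len(r)
    target = sample_rate * n
    # suffix[k] = r[k] + ... + r[n - 1], suffix[n] = 0 (precomputed sums)
    suffix = [0.0] * (n + 1)
    for k in range(n - 1, -1, -1):
        suffix[k] = suffix[k + 1] + r[k]

    def expected_size(k):
        # Expected sample size if the threshold were exactly r[k]; O(1).
        return (k + 1) + suffix[k + 1] / r[k]

    # Smallest k whose expected size reaches the target (monotone in k).
    lo, hi = 0, n - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_size(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    # Exactly lo objects get probability 1; the rest share target - lo.
    return suffix[lo] / (target - lo)


mu = mvs_threshold_sorted([10.0, 1.0, 1.0, 1.0, 1.0], sample_rate=0.6)
```

In the example the largest value is taken with probability 1 and the four remaining objects share the other two expected slots, each with probability 0.5.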
We propose a more efficient algorithm, which is similar to the quickselect algorithm [27]. In the beginning, the algorithm randomly selects a gradient, which is a candidate for the threshold. The data is partitioned in such a way that all the instances with smaller gradients and all the instances with larger gradients are on opposite sides of the candidate. To calculate the current sample rate, it is sufficient to calculate the number of examples on the larger side and the sum of regularized absolute values on the other side. Then, the estimated sample rate is used to determine the side on which to continue the search for the desired threshold. Since a higher threshold gives a lower fraction of sampled instances, if the current sample rate is higher than required, the algorithm continues the search on the side with larger gradients; otherwise, it continues on the side with smaller gradients. The statistics calculated for each side may be reused in further steps of the algorithm, so the number of operations at each step is reduced by the number of rejected examples. The time complexity analysis can be carried out by analogy with the quickselect algorithm, which results in $O(N)$ complexity.
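A quickselect-style search for the threshold can be sketched as below. It is an illustrative reimplementation under the same assumptions as before (strictly positive regularized absolute values, sample rate in $(0, 1]$), not the code of any library; the function name `mvs_threshold_quickselect` is hypothetical.

```python
import random


def mvs_threshold_quickselect(reg_values, sample_rate, rng):
    """Threshold mu with sum_i min(1, r_i / mu) = sample_rate * N, expected O(N)."""
    target = sample_rate * len(reg_values)
    items = list(reg_values)
    n_large = 0      # rejected values known to be above the threshold
    sum_small = 0.0  # sum of rejected values known to be at or below it
    while items:
        pivot = rng.choice(items)
        larger = [v for v in items if v > pivot]
        smaller = [v for v in items if v < pivot]
        n_equal = len(items) - len(larger) - len(smaller)
        # Expected sample size if the threshold were equal to the pivot.
        size_at_pivot = (n_large + len(larger) + n_equal
                         + (sum_small + sum(smaller)) / pivot)
        if size_at_pivot < target:
            # Too few samples: the threshold lies below the pivot.
            n_large += len(larger) + n_equal
            items = smaller
        else:
            # Enough samples: the threshold is at or above the pivot.
            sum_small += sum(smaller) + n_equal * pivot
            items = larger
    return sum_small / (target - n_large)


mu = mvs_threshold_quickselect([10.0, 1.0, 1.0, 1.0, 1.0], 0.6, random.Random(0))
```

Each iteration discards at least the pivot and keeps only one side, and the counters `n_large` and `sum_small` carry the statistics of the discarded side forward, which is what keeps the expected total work linear.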
5 Experiments
Here we provide experimental results for the MVS algorithm in two popular open-source implementations of gradient boosting: CatBoost and LightGBM.
CatBoost. The default setting of CatBoost is known to achieve state-of-the-art quality on various machine learning tasks [29]. We implemented MVS in CatBoost and benchmarked MVS with a sampling ratio of 80% against default CatBoost with no sampling on 153 publicly available and proprietary binary classification datasets of different sizes, up to 45 million instances. The algorithms were compared by the ROC-AUC metric, and we counted the number of wins for each algorithm. The results show a significant improvement over the existing default: 97 wins for MVS versus 55 wins for the default setting, together with an improvement in mean ROC-AUC.
The source code of MVS is publicly available [6] and ready to be used as a default option of the CatBoost algorithm. The latter means that MVS is already acknowledged as a new benchmark among SGB implementations.
LightGBM. To perform a fair comparison with previous sampling techniques (GOSS and SGB), MVS was also implemented in LightGBM, as it is a popular open-source library that includes GOSS. The MVS source code for LightGBM may be found at [19].
The datasets used in this section are described in Table 1. All the datasets are publicly available and were preprocessed according to [5].
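Table 1: Datasets used in the experiments.

| Dataset | # Examples | # Features |
|---|---|---|
| KDD Internet [1] | 10108 | 69 |
| Adult [25] | 48842 | 15 |
| Amazon [23] | 32769 | 10 |
| KDD Upselling [11] | 50000 | 231 |
| Kick prediction [22] | 72983 | 36 |
| KDD Churn [10] | 50000 | 231 |
| Click prediction [12] | 399482 | 12 |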
We used the tuned parameters and the train-test split for each dataset from [5] as baselines, with the sampling ratio preset to 1. To tune the sampling parameters of each algorithm (the sample rate and the coefficient $\lambda$ for MVS, the large gradients fraction and the small gradients fraction for GOSS, the sample rate for SGB), we use 5-fold cross-validation on the train subset of the data. Then the tuned models are evaluated on the test subsets (each of which is 20% of the original size of the data). Here we use the score as an error measure (lower is better). To make the results more statistically significant, the evaluation is run 10 times with different seeds. The final result is defined as the mean over these 10 runs.
Here we also introduce a hyperparameter-free modification of the MVS algorithm. Since $\lambda$ (see Equation 9) is an approximation of an upper bound on the squared mean leaf value, we replace it with the squared mean of the initial leaf. As will be shown, this achieves near-optimal results and dramatically reduces the time spent on parameter tuning. Since it sets $\lambda$ adaptively at each iteration, we refer to this method as MVS Adaptive.
Quality comparison. The first experiments are devoted to testing MVS as a regularization method. We pose the following question: how much does the quality change when different sampling techniques are used? To answer it, we tuned the sampling parameters of the algorithms to get the best quality. The resulting quality scores, compared to the baseline quality, are presented in Table 2. From these results, we can see that MVS demonstrates the best generalization ability among the given sampling approaches. The best value of the parameter $\lambda$ for MVS shows good performance on most of the datasets. For GOSS, the best ratio of large and small gradients varies a lot, from a predominance of large gradients to a predominance of small ones.
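Table 2: Test error of the baseline and relative error change for each sampling method (negative values mean lower error).

| | KDD Internet | Adult | Amazon | KDD Upselling | Kick | KDD Churn | Click | Average |
|---|---|---|---|---|---|---|---|---|
| Baseline | 0.0408 | 0.0688 | 0.1517 | 0.1345 | 0.2265 | 0.2532 | 0.2655 | 0.0% |
| SGB | -1.13% | +0.81% | -1.14% | +0.03% | -0.14% | +0.14% | -0.14% | -0.22% |
| GOSS | -0.64% | -0.11% | -1.23% | +0.07% | -0.10% | +0.16% | -0.09% | -0.28% |
| MVS | -3.03% | -0.24% | -1.78% | -0.07% | -0.19% | +0.17% | -0.04% | -0.74% |
| MVS Adaptive | -2.79% | -0.13% | -1.57% | -0.28% | -0.19% | +0.07% | -0.03% | -0.70% |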
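Table 3: Average relative error change with respect to the baseline over all datasets, by sample rate.

| Sample rate | 0.02 | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 | 0.35 | 0.4 | 0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| SGB | +19.92% | +11.35% | +6.83% | +4.99% | +3.84% | +3.03% | +2.17% | +1.57% | +1.10% | +0.42% |
| GOSS | +22.37% | +12.75% | +8.00% | +5.32% | +3.39% | +2.25% | +1.41% | +0.75% | +0.23% | -0.16% |
| MVS | +13.93% | +7.76% | +3.69% | +1.91% | +0.74% | +0.14% | -0.21% | -0.43% | -0.41% | -0.45% |
| MVS Adaptive | +13.72% | +7.47% | +3.71% | +1.70% | +0.55% | -0.03% | -0.07% | -0.28% | -0.32% | -0.51% |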
The next research question is whether MVS is capable of reducing the sample size per iteration needed to achieve acceptable quality, and whether MVS harms accuracy when small subsamples are used. For this experiment, we tuned the parameters so that the algorithms achieve the baseline score (or their best score if that is not possible) using the smallest number of instances. Figure 1 shows the dependence of the error on the sample size for two datasets, together with its confidence interval. Table 3 shows the average relative error change with respect to the baseline over all datasets used in this paper. From these results, we can conclude that MVS reaches the goal of reducing the variance of the models, and that a decrease in sample size affects accuracy much less than it does for the other algorithms.
Learning time comparison. To compare the speed-up ability of MVS, GOSS and SGB, we used the runs from the previous experimental setting, i.e., the parameters were chosen to obtain the smallest sample rate with no quality loss. Among them, we chose the ones with the least training time (if it is impossible to beat the baseline, the point with the best score is chosen). The summary is shown in Table 4, which reports the average learning time gain relative to the baseline learning time (using all examples). One can see that MVS has an advantage in training time over the other methods of about 10% for the datasets presented in this paper. It is also important to mention that tuning the hyperparameters is a major part of training a model. There is one hyperparameter common to all sampling algorithms: the sample rate. GOSS has one additional hyperparameter, the ratio of large and small gradients in the subsample, and MVS has the hyperparameter $\lambda$. So tuning GOSS and MVS may potentially take more time than tuning SGB. However, the MVS Adaptive algorithm dramatically reduces tuning time thanks to its hyperparameter-free sampling procedure, and we can conclude from Tables 2 and 3 that it achieves approximately optimal results on the test data.
Large datasets. Experiments with CatBoost show that the regularization effect of MVS is effective for any size of the data. But for large datasets it is more crucial to reduce the learning time of the model. To show that MVS is efficient in accelerating training, we use the Higgs dataset [33] (11000000 instances and 28 features) and the Recsys dataset [7] (16549802 instances and 31 features). The setup of the experiment remains the same as in the previous paragraph. For the Higgs dataset, SGB is not able to achieve the baseline quality with less than 100% sample size, while GOSS and MVS managed to do this with 80% of the samples, and MVS was faster than GOSS (17.7% versus 8.5%) as it converges earlier. For the Recsys dataset, the relative learning time differences are 50.3% for SGB (sample rate 20%), 39.9% for GOSS (sample rate 20%) and 61.5% for MVS (sample rate 10%).
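Table 4: Average learning time gain relative to the baseline learning time.

| SGB | GOSS | MVS |
|---|---|---|
| 20.7% | 20.4% | 27.7% |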
6 Conclusion
In this paper, we addressed the surprisingly understudied problem of weighted sampling in GBDT. We proposed a novel technique that directly maximizes the accuracy of split scoring, a core step of the tree construction procedure. We rigorously formulated this goal as an optimization problem and derived a near-optimal closed-form solution. This solution led to a novel sampling technique, MVS. We supported our work with the necessary theoretical statements and with empirical observations that show the superiority of MVS over well-known state-of-the-art approaches to data sampling in SGB. MVS is implemented and used by default in the CatBoost open-source library. An MVS implementation for the LightGBM package is also available, and its source code is public for further research.
Acknowledgements
We are deeply indebted to Liudmila Prokhorenkova for valuable contribution to the content and helpful advice about the presentation. We are also grateful to Aleksandr Vorobev for sharing ideas and support, Anna Veronika Dorogush and Nikita Dmitriev for experiment assistance.
References
 [1] (1998)(Website) Note: https://kdd.ics.uci.edu/databases/internet_usage/internet_usage.html Cited by: Table 1.
 [2] (1999) Using adaptive bagging to debias regressions. Technical report Technical Report 547, Statistics Dept. UCB. Cited by: §2.2.
 [3] (2010) From ranknet to lambdarank to lambdamart: an overview. Learning 11 (23581), pp. 81. Cited by: §1.
 [4] (2006) An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd international conference on Machine learning, pp. 161–168. Cited by: §1.
 [5] (2018)(Website) Note: https://github.com/catboost/catboost/tree/master/catboost/benchmarks/quality_benchmarks Cited by: §5, §5, §C.
 [6] (2019)(Website) Note: https://github.com/catboost/catboost Cited by: §5.
 [7] (2015)(Website) Note: https://2015.recsyschallenge.com/challenge.html Cited by: §5.
 [8] (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §2.1, §4.1.
 [9] (2011) On adaptive regularization methods in boosting. Journal of Computational and Graphical Statistics 20 (4), pp. 937–955. Cited by: §3.
 [10] (2009)(Website) Note: https://www.kdd.org/kddcup/view/kddcup2009/Data Cited by: Table 1.
 [11] (2009)(Website) Note: http://www.kdd.org/kddcup/view/kddcup2009/Data Cited by: Table 1.
 [12] (2012)(Website) Note: http://www.kdd.org/kddcup/view/kddcup2012track2 Cited by: Table 1.
 [13] (2018) Asynchronous parallel sampling gradient boosting decision tree. arXiv preprint arXiv:1804.04659. Cited by: §3.
 [14] (2017) CatBoost: gradient boosting with categorical features support. Workshop on ML Systems at NIPS. Cited by: §3.
 [15] (2008) Stationary features and cat detection. In Journal of Machine Learning Research, pp. 2549–2578. Cited by: §3.
 [16] (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232. Cited by: §1, §2.1, §2.1.
 [17] (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis 38 (4), pp. 367–378. Cited by: §1, §2.2, §2, §3.
 [18] (1952) A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association 47 (260), pp. 663–685. Cited by: §4.1.
 [19] (2019)(Website) Note: https://github.com/ibr11/LightGBM Cited by: §5.
 [20] (2019) Faster boosting with smaller memory. arXiv preprint arXiv:1901.09047. Cited by: §3.
 [21] (2018) Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems, pp. 7265–7275. Cited by: §3.
 [22] (2012)(Website) Note: https://www.kaggle.com/c/DontGetKicked Cited by: Table 1.
 [23] (2013)(Website) Note: https://www.kaggle.com/c/amazonemployeeaccesschallenge Cited by: Table 1.
 [24] (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, pp. 3149–3157. Cited by: §2, §3.
 [25] (1996)(Website) Note: https://archive.ics.uci.edu/ml/datasets/Adult Cited by: Table 1.
 [26] (2004) On the Bayes-risk consistency of regularized boosting methods. The Annals of Statistics 32 (1), pp. 30–55. Cited by: §3.
 [27] (1995) Analysis of quickselect: an algorithm for order statistics. RAIRO-Theoretical Informatics and Applications-Informatique Théorique et Applications 29 (4), pp. 255–276. Cited by: §4.3.
 [28] (2008) Evidence contrary to the statistical view of boosting. Journal of Machine Learning Research 9 (Feb), pp. 131–156. Cited by: §1.
 [29] (2018) CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, pp. 6638–6648. Cited by: §5.
 [30] (2007) Predicting clicks: estimating the click-through rate for new ads. In Proceedings of the 16th international conference on World Wide Web, pp. 521–530. Cited by: §1.

 [31] (2005) Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 543 (2), pp. 577–584. Cited by: §1.
 [32] (1999) Families of splitting criteria for classification trees. Statistics and Computing 9 (4), pp. 309–315. Cited by: §4.1.
 [33] (2014)(Website) Note: https://archive.ics.uci.edu/ml/datasets/HIGGS Cited by: §5.
 [34] (2010) Adapting boosting for information retrieval measures. Information Retrieval 13 (3), pp. 254–270. Cited by: §1.
 [35] (2008) Weighted sampling for large-scale boosting. In BMVC, pp. 1–10. Cited by: §3.
 [36] (2015) A gradient boosting method to improve travel time prediction. Transportation Research Part C: Emerging Technologies 58, pp. 308–324. Cited by: §1.
 [37] (2015) Stochastic optimization with importance sampling for regularized loss minimization. In international conference on machine learning, pp. 1–9. Cited by: §3.
Appendices
A Proof of Theorem 1
Proof.
We estimate the expectation by representing as the function .
We use the firstorder Taylor series expansion of at the point , where and .
Without loss of generality, we further provide calculations for the case .
We have , and, therefore, .
Further, we have
∎
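The expansion behind Theorem 1 can be sketched as follows. This is a reconstruction from the definitions in the theorem, not necessarily the authors' exact derivation: the symbols $\varphi$, $\bar{x}_l$ and $\bar{y}_l$ are introduced here for illustration, $x_l$ and $y_l$ denote the inverse-probability-weighted sums of gradients and Hessians in leaf $l$, and the sampling indicators are assumed independent.

$$
\begin{aligned}
\varphi(x, y) &= \frac{x^2}{y}, \qquad
\bar{x}_l = \mathbb{E}\,x_l = \sum_{i \in l} g_i, \qquad
\bar{y}_l = \mathbb{E}\,y_l = \sum_{i \in l} h_i, \qquad
c_l = \frac{\bar{x}_l}{\bar{y}_l}, \\
\varphi(x_l, y_l) - \varphi(\bar{x}_l, \bar{y}_l)
&\approx \frac{2\bar{x}_l}{\bar{y}_l}\,(x_l - \bar{x}_l) - \frac{\bar{x}_l^2}{\bar{y}_l^2}\,(y_l - \bar{y}_l)
= 2 c_l\,(x_l - \bar{x}_l) - c_l^2\,(y_l - \bar{y}_l), \\
\mathbb{E}\left(\varphi(x_l, y_l) - \varphi(\bar{x}_l, \bar{y}_l)\right)^2
&\approx 4 c_l^2\,\mathrm{Var}(x_l) - 4 c_l^3\,\mathrm{Cov}(x_l, y_l) + c_l^4\,\mathrm{Var}(y_l).
\end{aligned}
$$

The deviation of the estimated score from the exact one is the sum of these per-leaf differences. Because different leaves contain disjoint sets of independent indicators and each linearized difference has zero mean, the cross terms between leaves vanish in expectation, and summing the last line over the leaves gives the approximation stated in the theorem.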
B Proof of Theorem 2
Proof.
Our goal is to find solution to the optimization problem:
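$$\sum_{i=1}^{N} \frac{1}{p_i} g_i^2 + \lambda \sum_{i=1}^{N} \frac{1}{p_i} h_i^2 \to \min, \quad \text{w.r.t.} \quad \sum_{i=1}^{N} p_i = N \cdot s \quad \text{and} \quad \forall i \in \{1, \dots, N\}\ \ p_i \in [0, 1]. \tag{10}$$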
Lagrange function for this problem has form:
(11) 
Necessary conditions for solution of 10 are set by Karush–Kuhn–Tucker conditions:
(12) 
Analyzing these conditions, it is easy to conclude that optimal solution has the following properties.

Since every , .

If , then and .

If , then
Putting all together, there exists a threshold , which divides sample into two parts: of size with and of size with .
Therefore, it is sufficient to find , such that . Desired value of can be found as a solution of:
(13) 
Existence and uniqueness of solution for follows from the monotonous decrease of the left side of equation as a function of .
Setting finishes the proof. ∎
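A compact version of the Karush–Kuhn–Tucker argument can be sketched as follows. The sketch assumes the constraints $\sum_i p_i = N s$ and $p_i \le 1$ (positivity of $p_i$ holds automatically, since the objective grows without bound as $p_i \to 0$ whenever $\hat{g}_i > 0$); the multiplier names $\nu$ and $\alpha_i$ are illustrative and need not match the original notation, and $\hat{g}_i^2 = g_i^2 + \lambda h_i^2$.

$$
\begin{aligned}
\mathcal{L}(p, \nu, \alpha) &= \sum_{i=1}^{N} \frac{\hat{g}_i^2}{p_i}
+ \nu \left(\sum_{i=1}^{N} p_i - N s\right)
+ \sum_{i=1}^{N} \alpha_i\,(p_i - 1), \\
\frac{\partial \mathcal{L}}{\partial p_i} &= -\frac{\hat{g}_i^2}{p_i^2} + \nu + \alpha_i = 0,
\qquad \alpha_i \ge 0, \qquad \alpha_i\,(p_i - 1) = 0, \\
p_i < 1 \ &\Rightarrow\ \alpha_i = 0,\ \ p_i = \frac{\hat{g}_i}{\sqrt{\nu}};
\qquad
p_i = 1 \ \Rightarrow\ \hat{g}_i^2 = \nu + \alpha_i \ge \nu.
\end{aligned}
$$

With $\mu = \sqrt{\nu}$, both cases combine into $p_i = \min\{1, \hat{g}_i / \mu\}$: objects whose regularized absolute value is at least $\mu$ are taken with probability 1, and the remaining ones with probability proportional to $\hat{g}_i$. The value of $\mu$ is then fixed by the constraint $\sum_i \min\{1, \hat{g}_i / \mu\} = N s$, whose left side does not increase as $\mu$ grows.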
C Experiments
We use grid search with 5-fold cross-validation to find the best sampling parameters for each algorithm and sampling ratio. For MVS, the grid for the parameter $\lambda$ is logarithmically spaced; for GOSS, the grid is over the ratio of large and small gradients. For the other parameters, we use the tuned parameters from the publicly available benchmarks [5].
To demonstrate the advantage of MVS most clearly, we provide here charts of the dependence of quality on the sampling ratio for every dataset from the main paper.