A Distributionally Robust Boosting Algorithm

Distributionally Robust Optimization (DRO) has been shown to provide a flexible framework for decision making under uncertainty and statistical estimation. For example, recent works in DRO have shown that popular statistical estimators can be interpreted as the solutions of suitably formulated data-driven DRO problems. In turn, this connection is used to optimally select tuning parameters via a principled approach informed by robustness considerations. This paper contributes to this growing literature, connecting DRO and statistics, by showing how boosting algorithms can be studied via DRO. We propose a boosting-type algorithm, named DRO-Boosting, as a procedure to solve our DRO formulation. Our DRO-Boosting algorithm recovers Adaptive Boosting (AdaBoost) as a particular case, thus showing that AdaBoost is effectively solving a DRO problem. We apply our algorithm to a financial dataset on credit card default payment prediction. We find that our approach compares favorably to alternative boosting methods which are widely used in practice.




1. Introduction

Distributional robustness in decision making under uncertainty is an important topic with a long and successful history in the Operations Research literature [2, 22, 3, 13]. In recent years, this topic has been further fueled by distributionally robust optimization (DRO) formulations applied to machine learning and statistical analysis. These formulations have been shown to produce powerful insights in terms of interpretability, parameter tuning, implementation of computational procedures and the ability to enforce performance guarantees.

For instance, the work of [28] studied connections between the statistical theory of Empirical Likelihood and distributionally robust decision making formulations. These connections have been helpful in order to obtain statistical guarantees in a large and important class of data-driven DRO problems.

The work of [15] showed how DRO can be used to establish a connection to regularized logistic regression, therefore explaining the role of regularization in the context of improving out-of-sample performance and data-driven decision-making.

The works of [6, 4] further show that square-root Lasso, support vector machines, and other estimators such as group-Lasso can be recovered exactly from DRO formulations. In turn, it is shown that exploiting the DRO formulation leads to an optimality criterion for choosing the associated regularization parameter in regularized estimators, one that is informed both by robustness and by statistical principles. Importantly, such a criterion is closely aligned with a well-developed theory of high-dimensional statistics, but it can be used even in the context of non-linear decision making problems.

The work of [12] shows that variance regularization estimation can be cast in terms of a DRO formulation which renders the implementation of such an estimator both practical and amenable to rigorous statistical and computational guarantees.

The list of recent papers that exploit DRO for the design of statistical estimators, revisiting classical ideas to improve parameter tuning, interpretation or implementation, is rapidly growing (see [6, 4, 5, 8, 7, 14, 37, 21, 12]).

This paper contributes to these rapidly expanding research activities by showing that DRO can naturally be used in the context of boosting statistical procedures. In particular, we are able to show that Adaptive Boosting (AdaBoost) can be viewed as a particular example of DRO-Boosting with a suitable loss function and size of uncertainty; see Corollary 2.


These connections, as indicated earlier, are useful to enhance interpretability. Further, as we explain, DRO provides a systematic and disciplined approach for designing boosting methods. Also, the DRO formulation can be naturally used to fit tuning parameters (such as the strength used to weight one model versus another in the face of statistical evidence) in a statistically principled way.

Finally, we provide easy-to-implement algorithms which can be used to apply our boosting methods, which we refer to as DRO-Boosting procedures.

As an application of our proposed family of procedures, we consider a classification problem in the context of a credit card default payment prediction problem. We find that our approach compares favorably to alternative boosting methods which are widely used in practice.

The rest of this paper is organized as follows. In Section 2 we provide a general discussion about boosting algorithms. This section also introduces notation which is useful to formulate our problem. Section 3 contains a precise description of the algorithm and the results validating its correctness. This section also discusses a connection to AdaBoost. Section 4 contains the statistical analysis for the optimal selection of tuning parameters (such as the weight assigned to different models in the presence of available evidence). In Section 5, we show the results of our algorithm in the context of an application to credit card default payments. In Section 6, we summarize the proofs of the technical results.

2. General Background and Problem Formulation

We start by describing a wide class of boosting algorithms. For concreteness, the reader may focus on a standard supervised classification framework. The ultimate goal is to predict the label $Y$ associated with a predictive variable $X$. We then discuss our DRO problem formulation.

2.1. General Notions for Boosting

Suppose that we have at our disposal a set $\mathcal{F}$ of different classifiers. Typically, $\mathcal{F}$ will contain finitely many elements, but the DRO-Boosting algorithm can be implemented, in principle, in the case of an infinite dimensional $\mathcal{F}$, assuming a suitable Hilbert-space structure, as we shall discuss once we present our algorithms. We will keep this in mind, but to simplify the exposition, let us assume that $\mathcal{F}$ contains finitely many elements, say $\mathcal{F} = \{f_1, \dots, f_m\}$. However, we will point out the ways in which the infinite dimensional case can be considered.

The different elements in $\mathcal{F}$ are called classifiers, also known as predicting functions or "learners". (Throughout our discussion we shall use learner to refer to a given predicting function or a classifier.) For simplicity, let us assume that the class of learners has been trained with independent data sets.

So, based on the learner $f \in \mathcal{F}$, we may have a decision rule to decide on the predicted label given the observation or predictor $X$. For example, such a rule may simply take the form $\mathrm{sign}(f(X))$ or, instead, the classifier may estimate the probability of each label based on a probabilistic model which takes $f(X)$ as input, as in the context of logistic regression.

The idea of boosting is to combine the power of the learners in $\mathcal{F}$ to create a stronger learner. [36] showed that the idea of combining relatively weak predictors or classifiers into a strong one having desirable probably approximately correct (PAC) guarantees [41] is feasible. Due to these appealing theoretical guarantees, boosting algorithms have attracted substantial attention in the community, first by considering static learning algorithms (e.g. [36, 17]) and also, inspired by Littlestone and Warmuth's reweighting algorithm, by considering adaptive algorithms, such as AdaBoost, which will be reviewed in the sequel. AdaBoost was proposed by Freund and Schapire [19] as an ensemble learning approach to classification problems, which adaptively combines multiple weak learners.

In the context of classification problems, many boosting algorithms can be viewed as a weighted majority vote of all the weak learners (where the weights are computed relative to the information carried by each of the weak learners). However, boosting-type algorithms are not only applicable to classification problems; they can also be used to combine the predictions of simple regressors to improve predictive power in the regression of non-categorical data; see, for example, [11].

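We will now focus our discussion on finding boosting learners of the form $F \in \mathrm{lin}(\mathcal{F})$, where $\mathrm{lin}(\mathcal{F})$ is the linear span generated by the class $\mathcal{F}$ of learners. In particular, if $\mathcal{F} = \{f_1, \dots, f_m\}$,

\[ \mathrm{lin}(\mathcal{F}) = \Big\{ \sum_{j=1}^{m} \alpha_j f_j : \alpha_j \in \mathbb{R} \Big\}. \]

More generally, we may assume that $\mathcal{F}$ contains elements in a subspace of a Hilbert space and then $\mathrm{lin}(\mathcal{F})$ is the closure (under the Hilbert norm) of the finite linear combinations of the elements in $\mathcal{F}$.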

As a consequence of our results, we will show that a suitable DRO formulation applied to the family of boosting learners can be used to systematically produce adaptive algorithms which can be connected to well-known procedures such as AdaBoost.

Now, let us assume that we are given a data set $\mathcal{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$. The variable $X_i$ corresponds to the predictor and the outcome $Y_i$ corresponds to the classification output or the label. Let us assume that $Y_i \in \{-1, +1\}$ for concreteness. We use $P_n$ to denote the empirical measure associated with the data set $\mathcal{D}_n$. In particular, $P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{(X_i, Y_i)}$, where $\delta_{(x, y)}$ is a point measure (or delta measure) centred at $(x, y)$. We use $E_{P_n}$ to denote the expectation operator associated to the measure $P_n$. So, in particular, for any continuous and bounded $g$, we have that

\[ E_{P_n}[g(X, Y)] = \frac{1}{n} \sum_{i=1}^{n} g(X_i, Y_i). \]

We introduce a loss function which can be used to measure the quality of our learner $F$, namely, $L(F(X), Y)$. Let us now discuss examples of loss functions of interest.

Given a learner $F$ and an observation $(X, Y)$, the margin corresponding to the observation is defined as $Y F(X)$. A loss function is said to be a margin-type cost function if there exists a function $\phi$ such that $L(F(X), Y) = \phi(Y F(X))$. In the context of regression problems, a popular loss function is the squared loss, namely, $L(F(X), Y) = (Y - F(X))^2$.
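The definitions of margin and margin-type cost can be illustrated with a minimal Python sketch. The function names and the toy scores are hypothetical, labels are assumed to be coded in {-1, +1}, and the logistic cost is included only as a second familiar example of a margin cost.

```python
import math

def margin(score, label):
    """Margin Y * F(X) for a real-valued score F(X) and a label in {-1, +1}."""
    return label * score

# Margin-type costs phi(z): the loss at (X, Y) is phi(Y * F(X)).
def exponential_cost(z):
    return math.exp(-z)

def logistic_cost(z):
    return math.log1p(math.exp(-z))

def empirical_risk(scores, labels, cost):
    """Average of cost(margin) under the empirical measure (weight 1/n each)."""
    return sum(cost(margin(s, y)) for s, y in zip(scores, labels)) / len(labels)

scores = [2.0, -0.5, 0.3]   # hypothetical values of F(X_i)
labels = [1, 1, -1]         # labels Y_i
risk = empirical_risk(scores, labels, exponential_cost)
```

A correctly classified point with a large margin contributes little to the risk, while a misclassified point (negative margin) contributes more than 1 under the exponential cost.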

A natural approach to search for a boosting learner is to solve the optimization problem

\[ \min_{F \in \mathrm{lin}(\mathcal{F})} E_{P_n}[L(F(X), Y)]. \tag{1} \]
The choice of the function $\phi$ used to define the loss function is relevant to endow the boosting procedure with robustness characteristics (see, for example, [18, 35]). While we focus on robustness as well, our approach is different. We take the loss function as a given modeling input and, as we shall explain, robustify the boosting training procedure by quantifying the impact of distributional perturbations in the empirical distribution $P_n$. Our procedure could be applied in combination with the choice of a given function $\phi$ which is chosen to mitigate outliers, for example, as is typically the case in standard robustness studies in statistics (see [26]). This combination may introduce a double layer of robustification and it may be an interesting topic of future research, but we do not pursue it here.

The search for a best linear combination of functions from a family of weaker learners to minimize the training loss, as in (1), can be understood in the vein of functional gradient descent [20]. For example, the AdaBoost algorithm is an instance of the functional gradient descent algorithm with the exponential margin cost function $\phi(z) = e^{-z}$.
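As a worked illustration of this correspondence (a standard computation, written here in assumed notation: labels $Y_i \in \{-1, +1\}$, a current combined learner $F$, and a candidate weak learner $f$ taking values in $\{-1, +1\}$), the empirical exponential risk and its derivative with respect to the value of $F$ at each data point are:

```latex
C(F) = \frac{1}{n}\sum_{i=1}^{n} \exp\bigl(-Y_i F(X_i)\bigr),
\qquad
\frac{\partial C(F)}{\partial F(X_i)} = -\frac{1}{n}\, Y_i \exp\bigl(-Y_i F(X_i)\bigr).
% Descent direction: choose f to maximize the alignment with the negative gradient.
-\sum_{i=1}^{n} \frac{\partial C(F)}{\partial F(X_i)}\, f(X_i)
  = \frac{1}{n}\sum_{i=1}^{n} w_i\, Y_i f(X_i),
\qquad
w_i = \exp\bigl(-Y_i F(X_i)\bigr).
```

Since $Y_i f(X_i) = 1 - 2 \cdot 1\{f(X_i) \neq Y_i\}$, maximizing this alignment is the same as minimizing the misclassification error weighted proportionally to $w_i$, and these are the familiar AdaBoost observation weights.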

The connection between functional gradient descent and AdaBoost will be further discussed in Section 3.

2.2. DRO-Boosting Formulation

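The DRO formulation associated to (1) takes the form

\[ \min_{F \in \mathrm{lin}(\mathcal{F})} \; \max_{P \in \mathcal{U}_\delta(P_n)} E_{P}[L(F(X), Y)], \tag{2} \]

where $\mathcal{U}_\delta(P_n)$ is the distributional uncertainty set, centred around the empirical measure $P_n$, and the size of uncertainty is $\delta > 0$.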

The motivation for the introduction of the inner maximization relative to (1) is to mitigate potential overfitting problems inherent to the training problems using directly empirical risk minimization (i.e. formulation (1)). The data-driven DRO approach (2) tries to address overfitting by alleviating the focus solely on the observed evidence. Rather than empirical risk minimization, our DRO procedure is finding the optimal decision that performs uniformly well around the empirical measure, where the uniform performance is evaluated within the distributional uncertainty set.

The data-driven DRO framework has been shown to be a valid procedure to improve the generalization performance of many machine learning algorithms. In the context of linear models, some of the popular machine learning algorithms with good generalization performance, for instance, regularized logistic regression, support vector machines (SVM), square-root LASSO and group LASSO, among others, can be recovered exactly in terms of a DRO formulation such as (2) (see [5, 14, 4, 7]). In these applications, the uncertainty set is defined in terms of the Wasserstein distance (see also [9, 15, 21, 42]).

Other application areas of data-driven DRO which have shown promise because of good empirical performance and theoretical guarantees include reinforcement learning, graphical modeling and deep learning (see [39, 16, 40]).

Finally, we also mention formulations of the distributional uncertainty set based on moment constraints, for instance [10, 24].

Our focus in this paper is on the use of the Kullback-Leibler (KL) divergence in order to describe the uncertainty set. We choose to use the KL divergence because we want to establish the connection to well-known boosting algorithms, and we wish to further add to their interpretability in terms of robustness. But we emphasize, as mentioned earlier, that a wide range of DRO formulations can be applied. Formulations of problems such as (2) based on the KL and related divergence notions have been studied in recent years; see [31, 38, 1, 23, 27, 28].


Let $\mathcal{P}_n$ denote the set of all probability distributions with support on the data set $\mathcal{D}_n$. (Any element $P \in \mathcal{P}_n$ is a random measure because $\mathcal{D}_n$ is a random set, but this issue is not relevant to implement our DRO-Boosting algorithm; we just consider $\mathcal{D}_n$ as given. For statistical guarantees, this issue is relevant, and it will be dealt with in the sequel.)

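For any distribution $P \in \mathcal{P}_n$, we define the weight vector $\omega_P = (\omega_{P,1}, \dots, \omega_{P,n})$ such that $\omega_{P,i} = P(\{(X_i, Y_i)\})$ for $i = 1, \dots, n$. In particular, for the empirical distribution $P_n$ we have $\omega_{P_n} = (1/n, \dots, 1/n)$. Note that the set $\mathcal{P}_n$ is isomorphic to the standard simplex

\[ \Delta_n = \Big\{ \omega \in \mathbb{R}^n : \omega_i \geq 0, \; \sum_{i=1}^{n} \omega_i = 1 \Big\}. \]

For any distribution $P \in \mathcal{P}_n$, the KL divergence between $P$ and $P_n$, denoted by $D(P \| P_n)$ (also known as the Relative Entropy or the Empirical Likelihood Ratio (ELR)), is defined as

\[ D(P \| P_n) = \sum_{i=1}^{n} \omega_{P,i} \log(n \, \omega_{P,i}). \]

It is a well-known consequence of Jensen's inequality that $D(P \| P_n) \geq 0$, and $D(P \| P_n) = 0$ if and only if $P = P_n$. The distributional uncertainty set that we consider is defined via

\[ \mathcal{U}_\delta(P_n) = \{ P \in \mathcal{P}_n : D(P \| P_n) \leq \delta \}. \]

By substituting the definition of the uncertainty set into (2), the DRO-Boosting model based on KL uncertainty sets is well defined given the loss function, the class $\mathcal{F}$, and the choice of $\delta$, which will be discussed momentarily.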
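A minimal Python sketch of the KL uncertainty set, assuming the divergence is taken in the direction $D(P \| P_n) = \sum_i \omega_i \log(n \omega_i)$ with the convention $0 \log 0 = 0$. The function names, the tolerance, and the example weights are hypothetical.

```python
import math

def kl_to_empirical(weights):
    """KL divergence D(P || P_n) = sum_i w_i * log(n * w_i), with 0 * log 0 = 0."""
    n = len(weights)
    return sum(w * math.log(n * w) for w in weights if w > 0.0)

def in_uncertainty_set(weights, delta, tol=1e-12):
    """True if weights lie on the simplex and within KL radius delta of uniform."""
    on_simplex = all(w >= 0.0 for w in weights) and abs(sum(weights) - 1.0) <= 1e-9
    return on_simplex and kl_to_empirical(weights) <= delta + tol

uniform = [0.25] * 4                 # the empirical weights (1/n, ..., 1/n)
tilted = [0.4, 0.3, 0.2, 0.1]        # a hypothetical reweighting of four observations
```

The uniform vector always belongs to the set, and a point mass on a single observation has divergence $\log n$, so for $\delta \geq \log n$ the set is the whole simplex.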

3. Main Results

In this section, we propose an algorithm based on functional gradient descent to solve the DRO-Boosting problem (2). As we shall see, similarly to AdaBoost as described in [20], the DRO-Boosting algorithm alternates between reweighting the worst case probability distribution and updating the predicting function $F$. Therefore, the connection between DRO-Boosting and AdaBoost is naturally established. Moreover, since different choices of the distributional uncertainty set, potentially based on different notions of discrepancy between the empirical measure $P_n$ and alternative distributions, result in different reweighting regimes, it is plausible that our proposed DRO-Boosting framework is more flexible than AdaBoost.

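For ease of notation and to introduce our functional gradient descent method, in the rest of this paper we define two functionals. For any index $i \in \{1, \dots, n\}$, the empirical loss functional is defined as $\mathcal{L}_i(F) = L(F(X_i), Y_i)$. The robust loss functional is defined as

\[ \mathcal{L}_{\mathrm{rob}}(F) = \max_{P \in \mathcal{U}_\delta(P_n)} E_P[L(F(X), Y)]. \]

Because any finitely supported distribution $P \in \mathcal{P}_n$ is characterized by its associated weights $\omega_P$, we can introduce the weight uncertainty set, a subset of the standard simplex $\Delta_n$, defined as

\[ \Omega_\delta = \Big\{ \omega \in \Delta_n : \sum_{i=1}^{n} \omega_i \log(n \, \omega_i) \leq \delta \Big\} \]

as an isomorphic counterpart of the distributional uncertainty set $\mathcal{U}_\delta(P_n)$.

Thus, using the weight vector $\omega$ and the weight uncertainty set $\Omega_\delta$, the functional $\mathcal{L}_{\mathrm{rob}}$ can be rewritten as

\[ \mathcal{L}_{\mathrm{rob}}(F) = \max_{\omega \in \Omega_\delta} \sum_{i=1}^{n} \omega_i \mathcal{L}_i(F). \tag{3} \]

We denote by $\Omega^*(F)$ the set of all maximizers of the above optimization problem, i.e.

\[ \Omega^*(F) = \arg\max_{\omega \in \Omega_\delta} \sum_{i=1}^{n} \omega_i \mathcal{L}_i(F). \]

Using these definitions, the DRO-Boosting formulation (2) admits an alternative expression, namely,

\[ \min_{F \in \mathrm{lin}(\mathcal{F})} \mathcal{L}_{\mathrm{rob}}(F); \tag{4} \]

consequently it suffices to find a best predicting function $F$ that minimizes $\mathcal{L}_{\mathrm{rob}}$. In order to guarantee that the optimization problem (4) has an optimal solution, we impose the following assumption.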

Assumption 1 (Convex Loss Functional).

The empirical loss functional is assumed to be convex and continuous. In addition, there exist , such that the level set is compact.

Next, to simplify the exposition, we impose an assumption which guarantees that the data is rich enough relative to the class $\mathcal{F}$ of learners. We will discuss how one can proceed if this assumption is violated.

Assumption 2 (Separation).

For any two different predicting functions $F, G \in \mathrm{lin}(\mathcal{F})$, there exists at least one index $i$ such that $F(X_i) \neq G(X_i)$. In other words, the observed data is rich enough to distinguish or separate two different learners in $\mathrm{lin}(\mathcal{F})$.

Observation 1: There are natural situations in which the Separation Assumption may fail to hold. For example, if $f$ is a regression tree, namely, $f(x) = \sum_{k} c_k 1\{x \in A_k\}$ where the $A_k$'s are disjoint sets forming a partition of the domain of $X$, then two distinct trees may agree on every observed data point, and the Separation Assumption may not hold. Nevertheless, we can define an equivalence relation "$\sim$" via $F \sim G$ if and only if $F(X_i) = G(X_i)$ for all $i = 1, \dots, n$, and then work with equivalence classes instead. Moreover, since the dimension of the quotient space is at most $n$, we may assume that $\mathrm{lin}(\mathcal{F})$ is a finite dimensional space without loss of generality. To ease the exposition, we state the finite dimensional assumption next.
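The equivalence classes in Observation 1 can be formed in practice by comparing learners only through their values on the observed predictors. The following Python sketch uses hypothetical one-dimensional decision stumps and a hypothetical helper name; it is an illustration under these assumptions rather than part of the algorithm itself.

```python
def group_equivalent_learners(learners, xs):
    """Group learners that produce identical predictions on every observed x."""
    classes = {}
    for name, f in learners.items():
        signature = tuple(f(x) for x in xs)   # values of f on the data only
        classes.setdefault(signature, []).append(name)
    return list(classes.values())

# Hypothetical stumps: two thresholds fall between the same pair of data points.
xs = [0.0, 1.0, 2.0]
learners = {
    "stump_0.4": lambda x: 1 if x > 0.4 else -1,
    "stump_0.6": lambda x: 1 if x > 0.6 else -1,
    "stump_1.5": lambda x: 1 if x > 1.5 else -1,
}
groups = group_equivalent_learners(learners, xs)
```

The two stumps with thresholds 0.4 and 0.6 are different functions but cannot be separated by this data set, so they fall into the same class and one representative suffices.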

Assumption 3 (Finite Representation).

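We assume that a finite dimensional basis exists for $\mathrm{lin}(\mathcal{F})$ (or has been extracted for the quotient space of equivalence classes). So, we can write $\mathrm{lin}(\mathcal{F}) = \mathrm{span}\{\varphi_1, \dots, \varphi_d\}$, where $\varphi_1, \dots, \varphi_d$ are linearly independent functions.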

If the dimension of $\mathrm{lin}(\mathcal{F})$ is infinite, then the dimension $d$ of the extracted basis will grow with the sample size $n$, and this will have consequences for the rate of convergence of the algorithm and the statistical analysis of the uncertainty set selection. But this is not important to implement and run the algorithms that we will present below.

We are now ready to summarize some properties of the robust loss functional $\mathcal{L}_{\mathrm{rob}}$ in Lemma 1 and thereafter develop a subgradient descent algorithm to find the optimal robust decision rule by solving (2).

Lemma 1.

If Assumption 1 is imposed, then the robust loss functional $\mathcal{L}_{\mathrm{rob}}$ is convex. In addition, the set of optimizers $\Omega^*(F)$ is guaranteed to be nonempty.

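The predicting function $F$ controls the value of the robust loss functional solely via $(F(X_1), \dots, F(X_n))$. Thus, to derive the functional gradient descent algorithm, one needs to construct a metric on $\mathrm{lin}(\mathcal{F})$ whose structure is determined only by $(F(X_1), \dots, F(X_n))$ as well. To this end, we consider the inner product $\langle \cdot, \cdot \rangle_{\mathcal{D}_n}$, where

\[ \langle F, G \rangle_{\mathcal{D}_n} = \frac{1}{n} \sum_{i=1}^{n} F(X_i) G(X_i). \]

Due to Assumption 2, $\langle \cdot, \cdot \rangle_{\mathcal{D}_n}$ is a well defined inner product on the space $\mathrm{lin}(\mathcal{F})$, and we denote by $\| \cdot \|_{\mathcal{D}_n}$ the induced norm on $\mathrm{lin}(\mathcal{F})$. As the dimension of the space $\mathrm{lin}(\mathcal{F})$ can be assumed to be finite due to Observation 1, the space $\mathrm{lin}(\mathcal{F})$ endowed with this inner product is a Hilbert space. Note that the topology induced by $\| \cdot \|_{\mathcal{D}_n}$ is equivalent to the Euclidean topology on the corresponding finite dimensional coordinate space.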

In order to formalize the functional subgradient algorithm, we first introduce the definition of functional subgradient and functional sub-differential.

Definition 1 (Functional Sub-Gradient (Gradient)).

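For a convex functional $\mathcal{L}$ on $\mathrm{lin}(\mathcal{F})$ and $F \in \mathrm{lin}(\mathcal{F})$, a linear functional $g$ on $\mathrm{lin}(\mathcal{F})$ is called a functional subgradient of $\mathcal{L}$ at $F$ if

\[ \mathcal{L}(G) \geq \mathcal{L}(F) + g(G - F) \quad \text{for all } G \in \mathrm{lin}(\mathcal{F}). \]

The functional sub-differential of $\mathcal{L}$ at $F$, denoted by $\partial \mathcal{L}(F)$, is defined as the set of all functional subgradients of $\mathcal{L}$ at $F$.

If the functional sub-differential $\partial \mathcal{L}(F)$ is a singleton set, then the functional $\mathcal{L}$ is said to be differentiable at $F$, in which case the only element in $\partial \mathcal{L}(F)$ is denoted by $\nabla \mathcal{L}(F)$, called the functional gradient of $\mathcal{L}$ at $F$. Intuitively, one may regard the functional gradient as the response of the functional to an infinitesimal change in the predicting function $F$.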

Proposition 1.

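Suppose that Assumptions 1-3 are imposed. Then the functional sub-differential of the robust loss functional $\mathcal{L}_{\mathrm{rob}}$ at $F$ is given by

\[ \partial \mathcal{L}_{\mathrm{rob}}(F) = \mathrm{conv}\Big( \bigcup_{\omega \in \Omega^*(F)} \sum_{i=1}^{n} \omega_i \, \partial \mathcal{L}_i(F) \Big), \tag{5} \]

where $\Omega^*(F)$ is the set of worst case weight vectors at $F$ and $\mathcal{L}_i$ are the empirical loss functionals. Here, for sets $A$ and $B$, $\mathrm{conv}(A)$ denotes the convex hull of the set $A$, and $A + B$ denotes the Minkowski sum of the sets $A$ and $B$; the sum over $i$ in (5) is understood in this sense.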

Corollary 1.

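Suppose that Assumptions 1-2 are imposed. If, for some functional $F$, the set of worst case weights is a singleton, $\Omega^*(F) = \{\omega^*\}$, and each empirical loss functional $\mathcal{L}_i$ is differentiable at $F$, then the robust loss functional $\mathcal{L}_{\mathrm{rob}}$ is also differentiable at $F$ and

\[ \nabla \mathcal{L}_{\mathrm{rob}}(F) = \sum_{i=1}^{n} \omega_i^* \, \nabla \mathcal{L}_i(F). \]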

Using Proposition 1 and Corollary 1, we can make sense of the functional sub-differential or even the functional gradient of the robust loss functional. However, in order to ensure that the trajectory of functional gradient descent lies in the space $\mathrm{lin}(\mathcal{F})$, one wants to find a weak learner in $\mathcal{F}$ to approximate the functional gradient. This simply means finding a best approximation to the functional gradient within the class $\mathcal{F}$.

Using these observations, the functional subgradient descent algorithm for solving the DRO-Boosting problem (2) is given in Algorithm 1.

Note that in Algorithm 1, we have to compute a worst case probability weight for some functional . If , we can simply pick . Otherwise, the worst case probability can be computed using the Lemma 2.

Lemma 2.

For and , define


Suppose that there exist such that , then is strictly increasing in . Furthermore, if we set , where is the unique root of , then .
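A minimal Python sketch of the worst case reweighting step, under the assumption that the weight uncertainty set is the KL ball $\{\omega : \sum_i \omega_i \log(n \omega_i) \leq \delta\}$. For a ball of this form, the maximizer of a linear objective is an exponential tilting, $\omega_i \propto \exp(\lambda \ell_i)$, with $\lambda \geq 0$ chosen so that the constraint is active. All function names are hypothetical, and bisection is used here in place of a Newton-Raphson iteration, which is possible because the divergence of the tilted weights is increasing in $\lambda$ when the losses are not all equal.

```python
import math

def tilted_weights(losses, lam):
    """Exponentially tilted weights w_i proportional to exp(lam * loss_i)."""
    top = max(losses)
    raw = [math.exp(lam * (l - top)) for l in losses]   # shift for stability
    total = sum(raw)
    return [r / total for r in raw]

def kl_to_uniform(weights):
    n = len(weights)
    return sum(w * math.log(n * w) for w in weights if w > 0.0)

def worst_case_weights(losses, delta, iters=200):
    """Maximize sum_i w_i * loss_i over the simplex s.t. KL(w || uniform) <= delta."""
    n = len(losses)
    top = max(losses)
    argmax = [i for i, l in enumerate(losses) if l == top]
    if len(argmax) == n or delta <= 0.0:
        return [1.0 / n] * n                 # all losses equal, or no robustness
    if delta >= math.log(n / len(argmax)):
        # The ball is large enough to put all mass on the worst observations.
        return [1.0 / len(argmax) if i in argmax else 0.0 for i in range(n)]
    lo, hi = 0.0, 1.0
    while kl_to_uniform(tilted_weights(losses, hi)) < delta and hi < 1e12:
        hi *= 2.0                            # bracket the root
    for _ in range(iters):                   # bisection on the increasing map
        mid = 0.5 * (lo + hi)
        if kl_to_uniform(tilted_weights(losses, mid)) < delta:
            lo = mid
        else:
            hi = mid
    return tilted_weights(losses, 0.5 * (lo + hi))

losses = [0.1, 0.5, 2.0, 1.0]                # hypothetical per-observation losses
weights = worst_case_weights(losses, delta=0.1)
```

Observations with larger loss receive larger weight, and the resulting weight vector sits on the boundary of the KL ball.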

Corollary 2 shows that Algorithm 1 exactly recovers the AdaBoost algorithm proposed in [19]. The proof of Corollary 2 is elementary.

Corollary 2 (Connection to AdaBoost).

Suppose that and

then and the worst case probability in Lemma 2 is given by .

1: Initialize the weight vector $\omega$ and the predicting function $F$.
2: for $t = 1$ to $T$ do
3:     Update $F$: with $\omega$ fixed, compute a subgradient of the robust loss functional using Proposition 1.
4:     Find a best approximation to the subgradient in the class $\mathcal{F}$ and get the update direction.
5:     Apply line search for the step size; update the function $F$ by moving along the update direction with this step size.
6:     Update $\omega$: with $F$ fixed, evaluate the loss function for the observations $(X_i, Y_i)$, $i = 1, \dots, n$.
7:     Apply the Newton-Raphson method to compute the unique root of the function defined in Lemma 2 and update the worst case probability according to Lemma 2.
8: Terminate the algorithm if no further improvement can be made or the maximal number of iterations is reached.
9: Output $F$.
Algorithm 1 DRO-Boosting
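The alternating structure of Algorithm 1 can be sketched in Python on toy one-dimensional data. This is a simplified illustration under several assumptions, not a reference implementation: the weak learners are hypothetical decision stumps, the loss is the exponential loss, the line search is a coarse grid, and the worst case weights are computed by exponential tilting with bisection rather than Newton-Raphson. All function names are hypothetical.

```python
import math

def worst_case_weights(losses, delta, iters=100):
    """KL-ball worst case via exponential tilting and bisection (assumed form)."""
    n, top = len(losses), max(losses)
    def tilt(lam):
        raw = [math.exp(lam * (l - top)) for l in losses]
        s = sum(raw)
        return [r / s for r in raw]
    def kl(w):
        return sum(wi * math.log(n * wi) for wi in w if wi > 0.0)
    if delta <= 0.0 or max(losses) - min(losses) < 1e-12:
        return [1.0 / n] * n
    lo, hi = 0.0, 1.0
    while kl(tilt(hi)) < delta and hi < 1e6:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if kl(tilt(mid)) < delta else (lo, mid)
    return tilt(hi)

def stump(threshold, sign):
    return lambda x: sign if x > threshold else -sign

def dro_boost(xs, ys, stumps, delta, rounds=20,
              steps=(0.0, 0.05, 0.1, 0.2, 0.4, 0.8)):
    n = len(xs)
    weights = [1.0 / n] * n          # start at the empirical measure
    scores = [0.0] * n               # F(X_i), with F initialised to zero
    ensemble = []                    # list of (step size, weak learner)
    for _ in range(rounds):
        # Gradient of the weighted exponential loss at each F(X_i), weights fixed.
        grad = [-w * y * math.exp(-y * s) for w, y, s in zip(weights, ys, scores)]
        # Weak learner best aligned with the negative gradient.
        best = max(stumps, key=lambda f: -sum(g * f(x) for g, x in zip(grad, xs)))
        preds = [best(x) for x in xs]
        # Grid line search on the weighted loss, weights still fixed.
        def weighted_loss(step):
            return sum(w * math.exp(-y * (s + step * p))
                       for w, y, s, p in zip(weights, ys, scores, preds))
        step = min(steps, key=weighted_loss)
        if step == 0.0:
            break                    # no improvement along this direction
        scores = [s + step * p for s, p in zip(scores, preds)]
        ensemble.append((step, best))
        # Re-weight: worst case distribution for the current per-observation losses.
        losses = [math.exp(-y * s) for y, s in zip(ys, scores)]
        weights = worst_case_weights(losses, delta)
    return ensemble, weights

def predict(ensemble, x):
    return 1 if sum(a * f(x) for a, f in ensemble) >= 0 else -1

# Toy data: the last point (x = 0.3, y = +1) is deliberately hard to fit.
xs = [0.1, 0.4, 0.35, 0.8, 0.9, 0.6, 0.3]
ys = [-1, -1, -1, 1, 1, 1, 1]
stumps = [stump(t, s) for t in (0.2, 0.5, 0.7) for s in (1, -1)]
ensemble, weights = dro_boost(xs, ys, stumps, delta=0.05)
```

Setting `delta=0.0` keeps the weights uniform throughout, so the loop reduces to plain functional gradient descent on the empirical exponential loss; a positive `delta` shifts weight toward the observations with the largest current loss.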
Remark 1 (Convergence Analysis).

Under the technical assumption that the robust loss functional has a Lipschitz continuous gradient, the convergence of Algorithm 1 follows from Theorem 2 of [30]. This assumption is typically violated in our setting. However, by introducing a soft maximum approximation to the robust loss functional, as in Theorem 2 of [8], one may apply the results of [30] directly, at the expense of a small (user-controlled) error. We verified the convergence of the algorithm empirically. The smoothing analysis will appear elsewhere.

Now, note that Algorithm 1 requires supplying the parameter $\delta$. This parameter is typically chosen using cross-validation. However, when the sample size $n \to \infty$, with the dimension $d$ of the basis fixed, we can use the work of [12, 28], which establishes a connection to Empirical Likelihood, to obtain $\delta$. This will be discussed in Section 4.

4. Optimal Selection of the Distributional Uncertainty Size

We now explain how to choose $\delta$ using statistical principles and avoiding cross-validation. The strategy is to invoke the theory of Empirical Likelihood, following the approach in [28, 5], as we shall explain. We assume that the data is i.i.d. from an underlying distribution $P^*$, which is unknown. We use $P^{*\otimes n}$ to denote the product measure governing the distribution of the whole sample $(X_1, Y_1), \dots, (X_n, Y_n)$. Moreover, in this section we will assume, in Assumption 3, that the dimension $d$ of the basis is fixed and independent of $n$.

For each probability measure $P$, there is an associated optimal predicting function $F_P$, the minimizer of the risk minimization problem $\min_{F} E_P[L(F(X), Y)]$. We assume convexity and smoothness of the loss function as discussed previously in Section 3. Following the criterion discussed in [5], we define $\Lambda_\delta(P_n) = \{F_P : P \in \mathcal{U}_\delta(P_n)\}$. The set $\Lambda_\delta(P_n)$ can be interpreted as a confidence region for the functional parameter $F_{P^*}$, and we are interested in minimizing the size of the confidence region, $\delta$, while guaranteeing a desired level of coverage for the confidence region $\Lambda_\delta(P_n)$. In other words, we want to choose $\delta$ as the solution to the problem

\[ \min \big\{ \delta : P^{*\otimes n}\big( F_{P^*} \in \Lambda_\delta(P_n) \big) \geq 1 - \alpha \big\}, \tag{7} \]

where $1 - \alpha$ is a desired confidence level, say 95%, which implies choosing $\alpha = 0.05$. This problem is challenging to solve, but we will provide an asymptotic approximation to it as $n \to \infty$.

We make the following assumptions to proceed with the discussion.

Assumption 4 (First Order Optimality Condition).

The optimal choice is characterized via the corresponding first-order optimality condition, which yields

By the definition of the functional gradient in Definition 1, we know .

The derivation of the asymptotic results relies on the central limit theorem, and we assume the existence of the second moment of the functional gradient.

Assumption 5 (Square Integrability).

We assume that

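In order to compute $\delta$ according to (7) we will provide a more convenient representation for the set $\Lambda_\delta(P_n)$. For any $F \in \mathrm{lin}(\mathcal{F})$, we can define a set $\mathcal{M}(F)$ of probability measures for which $F$ is an optimal boosting learner, that is, $P \in \mathcal{M}(F)$ if $F = F_P$. Consequently, $\Lambda_\delta(P_n) = \{F : \mathcal{M}(F) \cap \mathcal{U}_\delta(P_n) \neq \emptyset\}$.

Define the smallest Kullback-Leibler discrepancy between $P_n$ and any element of $\mathcal{M}(F)$ via

\[ R_n(F) = \min \{ D(P \| P_n) : P \in \mathcal{M}(F) \}. \tag{8} \]

It is immediate to see that $F \in \Lambda_\delta(P_n)$ if and only if there exists an element $P$ in $\mathcal{M}(F)$ such that $D(P \| P_n) \leq \delta$. Consequently, we have

\[ F \in \Lambda_\delta(P_n) \iff R_n(F) \leq \delta. \tag{9} \]

The asymptotic analysis of the object $R_n(F_{P^*})$, known as the Empirical Profile Likelihood (EPL), has been studied extensively. From (9) we have that the solution to (7) is precisely the $(1 - \alpha)$ quantile of $R_n(F_{P^*})$. Therefore, to approximate $\delta$, it suffices to estimate the asymptotic distribution of $R_n(F_{P^*})$.

We apply the techniques in [28] to derive the asymptotic distribution of $R_n(F_{P^*})$ as $n \to \infty$, and pick the $(1 - \alpha)$ quantile of the asymptotic distribution as our uncertainty size.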

Figure 1 gives an illustration of the equivalence in (9). The emphasized line corresponds to the Kullback-Leibler divergence between the empirical distribution and the manifold of distributions for which a given learner is optimal.

Figure 1. Pictorial illustration of the geometric meaning of the EPL function.

The asymptotic result for (8) is similar to those in the empirical likelihood literature, as in [32, 34, 33]. These works show that the asymptotic distribution is chi-square and that the degrees of freedom depend on the estimating equation. We state the result in Theorem 1 below.

Theorem 1 (Generalized Empirical Likelihood Theorem).

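We assume that our loss function is smooth, the functional gradient is well defined, and Assumption 4 and Assumption 5 hold. Then the EPL function defined in (8), evaluated at the optimal predicting function $F_{P^*}$, has an asymptotic chi-square distribution, i.e.

\[ 2 n \, R_n(F_{P^*}) \Rightarrow \chi^2_d, \]

where $\chi^2_d$ is the chi-square distribution with degrees of freedom equal to the dimension $d$ of the basis, and $\Rightarrow$ stands for weak convergence.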

Note that the rate of convergence is free of the data dimension and of the dimension of the functional space. Theorem 1 is a generalization of the empirical likelihood theorem for general estimating equations introduced in [34].
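A minimal Python sketch of how an uncertainty size could be read off from a chi-square limit. It assumes that the limit takes the standard empirical-likelihood form, with $2n$ times the profile divergence converging to a chi-square distribution with $d$ degrees of freedom, so that $\delta \approx \chi^2_{d, 1-\alpha} / (2n)$. The function names are hypothetical, and the quantile is computed with the Wilson-Hilferty approximation (standard library only) rather than an exact routine.

```python
from statistics import NormalDist

def chi2_quantile_wh(df, prob):
    """Wilson-Hilferty approximation to the chi-square quantile (not exact)."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3

def uncertainty_size(n, df, alpha=0.05):
    """delta = chi2_{df, 1 - alpha} / (2 n), assuming 2 n R_n converges to chi2_df."""
    return chi2_quantile_wh(df, 1.0 - alpha) / (2.0 * n)

# Illustrative values: 3000 training observations and about 30 degrees of freedom.
delta = uncertainty_size(n=3000, df=30)
```

Under this assumed scaling the uncertainty size shrinks at rate $1/n$ and grows roughly linearly in the number of degrees of freedom.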

5. Numerical Experiments

In this section, we apply our DRO-Boosting algorithm to predict default in credit card payments. We take the Taiwan credit card default payment data from the UCI machine learning repository [29]. The data has 23 predictors, and the response is binary, indicating non-default or default. There are 30000 observations in the data set, with 6636 defaults and 23364 non-defaults.

We take a binary classification tree as the weak learner (basis function), and we consider the exponential loss function for model fitting. We compare our model with the AdaBoost algorithm, which is a state-of-the-art boosting algorithm in practice.

In each run, we randomly split the data into a training set with 3000 observations and a testing set with the remaining 27000 data points. We train the model on the training set as described in Algorithm 1. We take a 5-layer classification tree as the basis function, and we assume that the dimension (or effective degrees of freedom) of the functional space is roughly 30. To pick the size of the uncertainty set, we apply the method introduced in Section 4. We use a relatively complex weak learner, the 5-layer classification tree, mainly because we wish to explore a case where the AdaBoost model is more likely to overfit the data.

We report the accuracy, true positive rate, false negative rate, and the exponential loss in Table 1, where we can observe superior performance for the DRO-Boosting model on the testing set.

Training Set Testing Set
Algorithm AdaBoost DRO-Boosting AdaBoost DRO-Boosting
False Negative Rate
True Positive Rate
Average Exponential Loss
Table 1. Numerical results for applying DRO-Boosting to credit card default payment prediction.

We can observe from our numerical experiment that the DRO formulation helps improve the performance on the testing set. The worst-case expected loss function tries to mimic the testing error and avoids overemphasis on the training set. Thus we observe a higher training error using DRO, while the advantage is reflected in the testing error.
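The evaluation metrics named above can be computed from the real-valued ensemble scores as in the following Python sketch. The helper name, the toy scores, and the label coding (+1 for default, -1 for non-default) are assumptions made for illustration; no values from the experiment are reproduced here.

```python
import math

def evaluate(scores, labels):
    """Accuracy, true positive rate, false negative rate, mean exponential loss."""
    preds = [1 if s >= 0 else -1 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 1)
    fn = sum(1 for p, y in zip(preds, labels) if y == 1 and p == -1)
    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    positives = tp + fn
    return {
        "accuracy": correct / len(labels),
        "true_positive_rate": tp / positives if positives else float("nan"),
        "false_negative_rate": fn / positives if positives else float("nan"),
        "avg_exponential_loss":
            sum(math.exp(-y * s) for s, y in zip(scores, labels)) / len(labels),
    }

scores = [1.2, -0.3, 0.4, -2.0]   # hypothetical ensemble scores F(X_i)
labels = [1, 1, -1, -1]           # +1 = default, -1 = non-default (assumed coding)
report = evaluate(scores, labels)
```

Applying the same function to the training and testing splits gives the two halves of a comparison such as Table 1.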

6. Proofs of the Technical Results

The proofs are presented in the order in which the results appear in the paper.

Proof of Lemma 1.

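For the first claim, each empirical loss functional $\mathcal{L}_i$ is a convex functional due to Assumption 1, and so is any convex combination $\sum_{i=1}^{n} \omega_i \mathcal{L}_i$. After taking the maximum over the weight uncertainty set, the robust loss functional is still convex, since a pointwise maximum of convex functionals is convex. For the second claim, the objective function on the right hand side of (3) is continuous in $\omega$ and the feasible region is compact, so the optimizer set is guaranteed to be nonempty. ∎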

Proof of Proposition 1.

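According to [25], Chapter D, Theorem 4.1.1, we have

\[ \partial \Big( \sum_{i=1}^{n} \omega_i \mathcal{L}_i \Big)(F) = \sum_{i=1}^{n} \omega_i \, \partial \mathcal{L}_i(F) \]

for all weight vectors $\omega$ in the weight uncertainty set. In addition, note that each empirical loss functional $\mathcal{L}_i$ is convex and upper semi-continuous according to Assumption 1, so $\sum_{i=1}^{n} \omega_i \mathcal{L}_i$, as a convex combination of the $\mathcal{L}_i$, is also convex and upper semi-continuous. Furthermore, the weight uncertainty set is a compact set. Consequently, using [25], Chapter D, Theorem 4.4.2, the desired result (5) is proved. ∎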

Proof of Corollary 1.

The result follows from Proposition 1 and [25], Chapter D, Corollary 2.1.4. ∎

Proof of Lemma 2.

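The strict monotonicity of the function defined in the lemma can be shown by taking its derivative and applying the Cauchy-Schwarz inequality. Using the fact that the parameter is chosen as the root of this function, it follows that the resulting weight vector is feasible for the weight uncertainty set. The optimality of this weight vector is proved by verifying the Karush-Kuhn-Tucker conditions of the convex optimization problem in (3). ∎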

Proof of Theorem 1.

The EPL function (8) is a convex optimization with constraint. We can write the optimization problem in the Lagrange form as where is the Lagrange multiplier. The optimization problem could be solved by its first order optimality condition, and it gives and We denote , , and . Then we apply Lemma 11.1 and Lemma 11.2 in [33], we have


Therefore, for the EPL function, we have

The first two equalities hold by definition, the fourth, fifth and sixth follow by applying (11) and (12), and the final step is the central limit theorem for the estimating equation. ∎