Distributionally robust learning has attracted great interest in the machine learning community over the past few years. This class of problems involves a fundamental tradeoff between bias and variance, or equivalently between approximation error and estimation error. An important issue is the scalability of the learning algorithms to very large datasets, especially since existing approaches operate on the entire collection of data samples in each iteration. To address these issues, we propose and investigate a new stochastic gradient descent (SGD) algorithm that efficiently solves general large-scale distributionally robust learning optimization problems by sub-sampling the support of the decision variables. Our method progressively increases this support so as to eventually cover the dataset, and does so by optimally balancing, in a strong statistical sense, the computational effort against the required level of accuracy. Our approach supports a general class of distance measures as part of the robust formulation. We establish various theoretical results for our approach using a combination of methods from mathematical optimization and mathematical statistics. We also present empirical results that demonstrate and quantify the significant benefits of our approach over previous work in the area. All proofs and additional technical materials are provided in the supplement.
1.1 Distributionally Robust Learning
Consider a general formulation of the distributionally robust optimization problem of active interest. Let denote a sample space,, and a parameter space. Define to be the expectation with respect to (w.r.t.)
of a loss function representing the estimation error for a learning model with parameters over data . Further define the expected worst-case loss function , which maximizes the loss over a well-defined set of measures that typically takes the form
where is a metric on the space of probability distributions on and where the constraints limit the feasible candidates to be within a distance of a base distribution, denoted by . We then seek to find a that, for a given and , solves the distributionally robust optimization problem
The solution to the above min-max formulation renders an expected loss performance that is robust w.r.t. taking any . Hence, (2) explicitly treats the identity of the true (unknown) data distribution, denoted by , as being ambiguous. Note that the likelihood of is generally quite high, and thus the loss at is likely to be higher than the loss at the optimal for , were it to be known; meanwhile, since this is rarely the case, still hedges the performance of a model to the uncertainty in . Note further that this entire approach is as opposed to only solving for , the loss performance under a fixed data distribution, often the empirical distribution.
The formulation in (2) w.r.t. the set in (1) captures numerous use cases with different metrics . In special cases of the definition of (1), the solution to the inner maximization problem (2) may be explicitly available. One example is based on specific instances of Wasserstein distance metrics, where the solution of the inner problem can be explicitly characterized, the objective function value is available in closed form, and (2) reduces to a standard stochastic optimization problem; refer to (Blanchet et al., 2016; Sinha et al., 2017) for examples along these lines. Our primary interest herein lies in the general -divergence class of distance measures
where is a non-negative convex function that takes a value of only at . While not explicitly characterizable, formulation (2) with the constraints (3) admits efficient solution procedures (see Section 2.2).
In the case of a -metric, corresponding to a -divergence with , Namkoong and Duchi (2017) analyze the formulation (2) and establish its equivalence to variance regularization of the empirical risk minimization problem. Specifically, for convex, bounded loss functions with as the empirical distribution over a large dataset of size , the following result is shown to hold with high probability
Results in a similar vein have been obtained for other -divergence metrics (Lam, 2016; Duchi et al., 2016), most notably the Kullback-Leibler (KL) divergence, which uses . Namkoong and Duchi (2017) consider specific instances of the loss function where an appropriate choice of leads to an optimal solution that has loss performance within of the (unknown) true optimal ; meanwhile, the identified by minimizing leads to a solution with loss performance.
The formulation based on directly minimizing the variance regularized risk of the form (4) over is hard to solve because of the non-convexity of the second term, even if is strongly convex. On the other hand, the formulation in (2) is a convex problem in . This, in combination with the (possibly) better statistical properties of , makes it highly desirable to efficiently solve the general min-max formulation (2). The problem formulations of primary interest in this paper are such that the optimal solution and/or optimal function value of the inner maximization problem over cannot be obtained in closed form, which appears to be the case for the important general class of -divergence distance metrics in (3). Define the vectors and of dimension . We shall henceforth focus on the case where is the empirical probability mass function (pmf) over a dataset of size , and thus the loss function and constraint set are given by
where the convex conjugate of , , is known in closed form for various , such as those corresponding to - and -divergence. Since (5) is in the form of a standard stochastic minimization problem, Ben-Tal et al. (2013) propose to apply classical SGD methods to compute its solution. However, as Namkoong and Duchi (2016) observe, the presence of in the denominator of the argument of causes SGD to become unstable as , which our experiments show is likely. An alternative approach is proposed in (Namkoong and Duchi, 2016) that interleaves one SGD step in the -space with a step in the -space. Such primal-dual steps are a result of applying stochastic mirror-descent to each set of variables. This yields a method that applies SGD-type iterations to a formulation with a composite dimension of . Each step requires solving convex proximal mapping optimization formulations, and the computational effort needed to do this makes it desirable to avoid this significant expansion in dimension.
To this end, Namkoong and Duchi (2017) propose to determine the optimal that defines directly, namely solving the problem (2) as a large deterministic gradient descent problem. This is a feasible approach for specific choices of -divergences. For the case, Namkoong and Duchi (2017) show that the inner maximization can be reduced to a one-dimensional root-finding problem, which can be solved via bisection search. The key issue is that this bisection search still requires an amount of effort (see Proposition 2) at each iteration, which can be expensive.
1.2 Our Contributions
We propose a new primal descent algorithm to solve (2) that is applicable for various -divergence distance measures (3). In Section 2.1, we (slightly) generalize the (exact) bisection-search result in (Namkoong and Duchi, 2016, 2017) for the inner maximization problem by utilizing similar results derived in (Ghosh and Lam, 2018) for -divergences and showing that this general approach can be successfully applied to other -divergence metrics. This still yields a computational effort of order for -constrained (2) and for -constraints, where is the dimension of the decision variables, in our case the size of the support of the pmf. To address this issue, instead of operating with the complete dataset for all iterations of a gradient descent algorithm, we propose the following stochastic sub-gradient descent scheme
where is variously called the step-size or gain sequence or learning rate, is a relatively small subset of the full dataset having size , and is an approximation of .
The approximation is obtained by first uniformly sampling without replacement the subset of the complete dataset of size . This approximation solves the inner maximization over pmfs on this subset. Sampling without replacement differs from the standard with-replacement approach in the stochastic optimization literature, though it is preferred by practitioners in machine/deep learning. Remark 1 below describes why this strategy is needed here.
Defining of dimension , we more precisely have the formulation
The cost of solving this problem via bisection search is . Now suppose is an optimal solution to (7). Then the vector
is a valid sub-gradient for and thus we use it in (6).
With respect to the quality of the approximation of , or more particularly that of its sub-gradient, we provide a result in Section 2.3 on the rate at which the bias in the gradient estimation depends on the sample size . Since this estimator is biased and the only control on the bias is via , our method necessarily grows as . The result specifically depends on the being sampled without replacement; sampling with replacement yields a much slower bias dropoff that makes the method computationally burdensome.
We look at sample size growth rules where the maximum size is hit after a (large but) finite number of iterations. In Section 2.4, we address the question of choosing a good sequence, and in particular balancing the added computational burden of each iteration against the expected reduction in optimality gap. We show for strongly convex loss functions that too slow a growth sequence is inefficient, while geometrically growing sequences are efficient in the sense that the expected optimality gap drops at a rate proportionate to the increase in computational budget.
This paper only treats strongly convex losses , but our analysis of bias and convergence and substantial aspects of the rate of convergence can be extended (in the spirit of Pasupathy et al. (2018); Hashemi et al. (2017)) to the cases when are convex but not -strongly convex, or more importantly non-convex, e.g. training deep learning models. The algorithm proposed in Namkoong and Duchi (2016) appears to be limited to convex . This subject is the focus of our ongoing research.
2 Algorithm and Analysis
2.1 SGD Algorithm
Our dynamically sampled subgradient descent algorithm for efficiently solving distributionally robust learning optimization problems is presented in Algorithm 1. Here we fix and increase the sub-sampling set in a geometric manner so as to statistically cover the entire dataset. We will subsequently show that these parameter settings provide the desired statistical efficiency. The algorithm stops when ; in our experiments we proceed with the full gradient (deterministic) algorithm thereafter.
Our detailed analysis in the remainder of this section starts with an exact solution to the inner maximization problem, generalizing the bisection-search result in (Namkoong and Duchi, 2016, 2017). The final two subsections establish various mathematical properties for our approach w.r.t. bias and convergence, respectively.
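The dynamically sampled scheme described above can be sketched in a few lines. The following is only an illustrative sketch under stated assumptions, not a reproduction of Algorithm 1: it assumes a squared loss for a linear model, a KL-divergence constraint relative to the uniform pmf on the sampled subset, a fixed step size, and a geometric growth factor; the names `kl_inner_max` and `dynamic_sgd` and all default values are hypothetical. It also uses the same divergence target at every iteration, whereas the analysis in Section 2.3 inflates the target for the restricted problem.

```python
import numpy as np

def kl_inner_max(z, rho, iters=60):
    """Approximate argmax_p sum(p * z) s.t. KL(p || uniform) <= rho.

    Assumed form: p_i proportional to exp(z_i / alpha), with alpha found by bisection.
    """
    m = z.size

    def tilt(alpha):
        w = np.exp((z - z.max()) / alpha)  # shift by max for numerical stability
        return w / w.sum()

    def kl(p):
        nz = p > 0
        return float(np.sum(p[nz] * np.log(m * p[nz])))

    lo, hi = 1e-12, 1.0
    while kl(tilt(hi)) > rho:  # enlarge alpha until the constraint is slack
        hi *= 2.0
    for _ in range(iters):  # KL of the tilted pmf decreases as alpha grows
        mid = 0.5 * (lo + hi)
        if kl(tilt(mid)) > rho:
            lo = mid
        else:
            hi = mid
    return tilt(hi)

def dynamic_sgd(X, y, rho, m0=8, growth=1.1, step=0.1, seed=0):
    """Sub-gradient descent with a geometrically growing subsample (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    m = m0
    while m < n:  # stop once the subsample would cover the dataset
        idx = rng.choice(n, size=m, replace=False)  # sample without replacement
        resid = X[idx] @ theta - y[idx]
        losses = 0.5 * resid ** 2                   # assumed squared loss
        p = kl_inner_max(losses, rho)               # worst-case pmf on the subsample
        grad = X[idx].T @ (p * resid)               # sum_i p_i * grad of loss_i
        theta = theta - step * grad
        m = int(np.ceil(growth * m))                # geometric growth of the subsample
    return theta
```

Once the loop exits, one could continue with full-gradient iterations on the complete dataset, as is done in the experiments.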
2.2 Solving for and
Recall the inner optimization problem expressed as subject to , , . Following (7) and (8), we restrict the support of in Algorithm 1 to a given set of indices , and only allow , while setting the remaining elements as . We then define the restricted problem
The problem (9) states its target divergence value as ; in the subsequent sections we will prescribe specific values. Denote the optimal solution to (9) as and its objective value as ; the latter is an approximation for the robust objective . Defining of dimension and writing the Lagrangian objective of (9) as
we then have ; refer to (Luenberger, 1969). The equality constraint will always be satisfied, but the -divergence inequality may not be satisfied as an equality, given the optimality direction , the constraints that , and a large enough so that the -divergence constraint allows the mass to accumulate at either of the bounds on . By complementary slackness, we have the optimal in this case.
We will use the following general procedure to solve Lagrangian formulations in the proofs of Proposition 1 and Proposition 2. This has been followed by previous work (Ben-Tal et al., 2013; Ghosh and Lam, 2018; Namkoong and Duchi, 2016, 2017), either explicitly or in the same spirit:
Case: along with constraint .
Let and . Set in (10), and then observe that an optimal solution is where , and .
If , then stop and return .
Case: constraint with .
Keeping fixed, solve for the optimal (as a function of ) that maximizes , applying the constraint .
Keeping fixed, solve for the optimal using the first order optimality condition on . Note that this is equivalent to satisfying the equation . This step usually leads to a available in closed form; see the results below.
Apply the first order optimality condition to the one-dimensional function to obtain the optimal . This is equivalent to requiring that satisfy the equation . Substitute in and return it.
For the two results below, the last step of Procedure 1 turns out to involve solving a root finding problem, where the left hand summation is a (strictly) monotonic function of . We now apply this procedure to two specific -divergences, noting that the optimal value for many other -divergences can be obtained in a similar manner. Algorithm 2 presents the solution to the -divergence constrained problem.
Proposition 1. The optimal solution to the problem (9) with a KL-divergence constraint (where ) is given by
Case : , where , ;
Case : , where solves and .
The computational effort needed to solve this problem is , where is the desired accuracy.
Proof of Proposition 1: We first handle the case when the KL-divergence constraint is not tight and . Substituting this in (10) shows that any optimal solution places mass only within the set as defined. Consider any such , and let be the solution that assigns equal mass to the support points in . We then have
where we apply Jensen’s inequality to the convex . Thus, among all optimal solutions, obtains the smallest divergence, and hence is the best optimal candidate to meet the divergence constraint with slack. Note that this applies for any convex .
For the case when the -divergence constraint is tight using , we proceed according to the corresponding three steps in Procedure 1 above.
Setting to zero the gradient of with respect to , we obtain
This solution also satisfies the non-negativity constraint on .
Setting renders , which in turn yields
To obtain , substitute the into the divergence constraint satisfied as an equality. Then, after some algebra, we conclude that must satisfy
Let and write . Then, finding is equivalent to obtaining the that satisfies . A unique root for this exists because the left hand expression is monotonic and takes on a value of at , and as . Hence, a bisection search will render the optimal .
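The two cases of the KL-constrained result can be illustrated with a short numerical sketch. It assumes the base pmf is uniform on the sampled support, that the constraint reads KL(p || uniform) <= rho, and that the tight-constraint solution takes the exponentially tilted form p_i proportional to exp(z_i / alpha), which is what the first-order conditions of the Lagrangian give under these assumptions. The function name `kl_worst_case_pmf` and the tolerance values are hypothetical.

```python
import numpy as np

def kl_worst_case_pmf(z, rho, tol=1e-10, max_iter=200):
    """Maximize sum_i p_i z_i over pmfs p with KL(p || uniform) <= rho."""
    z = np.asarray(z, dtype=float)
    m = z.size

    # Degenerate case: equal mass on the maximizers of z already meets the constraint.
    top = np.isclose(z, z.max())
    if rho >= np.log(m / top.sum()):
        return top / top.sum()

    def tilt(alpha):
        w = np.exp((z - z.max()) / alpha)  # shift by max for numerical stability
        return w / w.sum()

    def kl(p):
        nz = p > 0
        return float(np.sum(p[nz] * np.log(m * p[nz])))

    # Tight case: the divergence of the tilted pmf is decreasing in alpha,
    # so the multiplier is the root of a monotone one-dimensional equation.
    lo, hi = 1e-12, 1.0
    while kl(tilt(hi)) > rho:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if kl(tilt(mid)) > rho:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return tilt(hi)
```

Each bisection step costs one pass over the support, so the total effort scales with the support size times the logarithm of the inverse tolerance.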
Proposition 2. An optimal solution to the problem (9) with a -divergence constraint (where ) is given by
Case : , where , ;
Case : where and jointly solve: and .
Furthermore, the computational effort needed to obtain the primal-dual optimal solutions is , where is the estimation precision required.
First, order all the into the increasing sequence , where the notation denotes the index of the th smallest value. Additionally, define . Note that the objective function , and hence it is sufficient to maximize with respect to the (non-negative) vector .
Setting the gradient of with respect to to zero componentwise for each , we obtain
Let represent the index for which the following condition holds:
Let . The equality can be rewritten as
where and the index satisfy the bounds in (11). The first term is a lower semi-continuous decreasing step function of , with steps at the where for each ; recall that . The right hand side is an increasing function of . Hence, a unique exists that satisfies (12); we only need to check the mismatch at the breakpoints of the step-function to find this . A bisection search with computational effort of at most (as described in Algorithm 2) yields this point.
This last step requires the zero of the gradient of with respect to , or equivalently the that satisfies
The first term is a decreasing function of since and as . Hence a unique root exists, which can again be found via a binary search (see Algorithm 2). From (12), we know that when is large, the optimal and . Let and let the bisection algorithm search within where for some large constant .
The bisection for involves steps where is the precision required in solving (13), each of which takes steps to solve for the optimal pair. The overall computational complexity of solving for is therefore , where the second term arises from sorting the values into the vector .
To summarize, the optimization procedure to obtain the solution to the -divergence constrained problem is presented in Algorithm 2.
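For illustration, the primal-dual structure above can be mimicked with a simple nested bisection. This sketch is not Algorithm 2: it does not sort the values or search over breakpoints, and it assumes the divergence function is phi(s) = (s - 1)^2 / 2 with a uniform base pmf, so that the constraint reads (m / 2) * sum_i (p_i - 1/m)^2 <= rho and the stationarity condition gives p_i = max(0, 1/m + (z_i - lam) / (alpha * m)). The function name `chi2_worst_case_pmf` is hypothetical.

```python
import numpy as np

def chi2_worst_case_pmf(z, rho, iters=100):
    """Maximize sum_i p_i z_i s.t. (m/2) * sum_i (p_i - 1/m)^2 <= rho, p a pmf."""
    z = np.asarray(z, dtype=float)
    m = z.size

    # Degenerate case: equal mass on the maximizers of z is already feasible.
    top = np.isclose(z, z.max())
    k = top.sum()
    if rho >= 0.5 * (m / k - 1.0):
        return top / k

    def pmf_for(alpha):
        # Inner bisection on lam so that the truncated linear pmf sums to one.
        lo, hi = z.min(), z.max() + alpha
        for _ in range(iters):
            lam = 0.5 * (lo + hi)
            total = np.maximum(0.0, 1.0 / m + (z - lam) / (alpha * m)).sum()
            if total > 1.0:
                lo = lam
            else:
                hi = lam
        p = np.maximum(0.0, 1.0 / m + (z - 0.5 * (lo + hi)) / (alpha * m))
        return p / p.sum()

    def divergence(p):
        return 0.5 * m * float(np.sum((p - 1.0 / m) ** 2))

    # Outer bisection on alpha: the divergence shrinks as alpha grows.
    a_lo, a_hi = 0.0, 1.0
    while divergence(pmf_for(a_hi)) > rho:
        a_hi *= 2.0
    for _ in range(iters):
        a_mid = 0.5 * (a_lo + a_hi)
        if divergence(pmf_for(a_mid)) > rho:
            a_lo = a_mid
        else:
            a_hi = a_mid
    return pmf_for(a_hi)
```

When no component is truncated at zero, the optimal value under these assumptions equals the sample mean of z plus the square root of 2 * rho times the (population-normalized) sample variance of z, which is the variance-regularization form discussed in Section 1.1.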
2.3 Small-sample Approximation of
Algorithm 1 is proposed in the spirit of SGD methods, in that it is unnecessary to obtain precise values for the gradient especially for the initial iterations in (6). We therefore construct a sub-gradient approximation in (8) to the full-gradient , where is the optimal solution to the full-data problem (2) and in (8) is the optimal solution to the restricted problem (7) based on uniformly sampling without replacement data points from the full data set.
The primary concern with this approach is the bias induced by the subsampling of the full support, which we show in Theorem 3 to be of order . We restrict our attention to -divergences that satisfy, for a small , the continuity condition
where and are both . This continuity condition can be verified for many common -divergence measures of interest including the and KL-divergence metrics. Let and be expectations and probabilities w.r.t. the uniform sampling without-replacement producing the random set .
Theorem 3. Suppose the optimal solution to (2) is unique and in (1). Assume the -divergence satisfies (14) and define the -constraint target in (7) to be , where for constant and small constant . Then, for all with sufficiently large, we have that the sub-gradient and full-gradient satisfy as , where is a finite constant.
We first provide a sketch of the proof of Theorem 3, with the full details to follow. First construct , a restriction of the (unique) optimal solution of the full-data problem (2) onto the (random) subset of support points used in the restricted problem (9), where . The condition ensures that, with high probability, the summation in the denominator is greater than zero for a sufficiently large . We then show that, with high probability (under the -sampling measure), the pmf is a feasible solution to (9) when is inflated as assumed. Next, we establish that is of the order , where denotes the transpose of vector . Since is a feasible solution to (9), an appeal to the fundamental theorem of calculus yields the desired result.
We extensively exploit the statistical properties of sampling a finite set without replacement, and therefore provide a brief summary here. Let be a set of one-dimensional values with and . Suppose we sample of these points uniformly without replacement to construct the set . The probability that any particular set of subsamples was chosen is . Denote by the expectation under this probability measure, and let and represent the sample mean and sample variance, respectively. We then know (Wilks, 1962) that
The second term, i.e., the expectation of the sample variance, shows that the sample variance is an unbiased estimate of the true variance. Further note that the third term, i.e., the variance of the sample mean, reduces to zero as .
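These finite-population facts can be checked exactly on a small example by enumerating every subset. The sketch below assumes the population variance is normalized by N - 1 and the sample variance by M - 1 (the normalization used in the source is not visible here, so this is an assumption), in which case the classical identities are E[sample mean] = mu, E[sample variance] = S^2, and Var(sample mean) = (S^2 / M)(1 - M / N). The function name `srswor_moments` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def srswor_moments(values, m):
    """Exact moments under uniform sampling of m values without replacement.

    `values` must be integers so that all arithmetic stays exact.
    Returns (E[sample mean], E[sample variance], Var[sample mean]).
    """
    subsets = list(combinations(values, m))  # every subset is equally likely
    k = len(subsets)
    means = [Fraction(sum(s), m) for s in subsets]
    svars = [
        sum((Fraction(x) - mu) ** 2 for x in s) / (m - 1)
        for s, mu in zip(subsets, means)
    ]
    e_mean = sum(means) / k
    e_svar = sum(svars) / k
    var_mean = sum((mu - e_mean) ** 2 for mu in means) / k
    return e_mean, e_svar, var_mean

values = [1, 4, 6, 9, 10]
n, m = len(values), 3
mu = Fraction(sum(values), n)
s2 = sum((Fraction(x) - mu) ** 2 for x in values) / (n - 1)
e_mean, e_svar, var_mean = srswor_moments(values, m)
```

The factor (1 - M / N) is what drives the variance of the sample mean to zero as the subsample grows to cover the full dataset, in contrast to sampling with replacement.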
We now start by addressing the feasibility of the restriction of the (unique) optimal solution of the full-data problem onto the (randomly sampled) subset .
Proof of Lemma 4: In the notation of sampling without-replacement introduced above, define a set of scalar values . We then have
By Chebyshev’s inequality, the sample-average of an -subsample from this set satisfies
Hence, as , we have with probability at least that .
The condition ensures with high probability that the full data inner maximization (2) is tightly constrained by the constraint and a degenerate solution with (as in Case 1 of Procedure 1) does not apply. This lets us choose an such that for all . Rearranging , we obtain . Let as . Then the solution