Adaptive Sampling for Stochastic Risk-Averse Learning

10/28/2019, by Sebastian Curi, et al.

We consider the problem of training machine learning models in a risk-averse manner. In particular, we propose an adaptive sampling algorithm for stochastically optimizing the Conditional Value-at-Risk (CVaR) of a loss distribution. We use a distributionally robust formulation of the CVaR to phrase the problem as a zero-sum game between two players. Our approach solves the game using an efficient no-regret algorithm for each player. Critically, we can apply these algorithms to large-scale settings because the implementation relies on sampling from Determinantal Point Processes. Finally, we empirically demonstrate its effectiveness on large-scale convex and non-convex learning tasks.


1 Introduction

Machine learning systems are increasingly deployed in high-stakes applications. This imposes reliability requirements that are starkly at odds with how we currently train and evaluate these systems. Usually, we optimize expected performance both in training and evaluation via empirical risk minimization (Vapnik, 1992). Thus, we sacrifice occasional significant losses on “difficult” examples to perform well on average. In this work, we instead consider a risk-averse optimization criterion, namely the Conditional Value-at-Risk (CVaR), also known as the Expected Shortfall. This criterion has been used in many applications, such as portfolio optimization (Krokhmal et al., 2002) or supply chain management (Carneiro et al., 2010). In short, the $\alpha$-CVaR of a loss distribution is the average of the losses in the $\alpha$-tail of the distribution.

Unfortunately, we see in experiments that common variants of stochastic gradient descent (SGD) fail to optimize the CVaR in real-world data sets such as Fashion-MNIST and CIFAR-10. A possible reason for this failure is that Monte Carlo estimates of gradients of the CVaR have high variance.

To address this issue, we propose a novel adaptive sampling algorithm (Section 4). Our algorithm initially optimizes the mean of the losses but gradually adjusts its sampling distribution to increasingly sample tail events (difficult examples), until it eventually optimizes the CVaR (Section 4.2). Our approach naturally enables the use of standard stochastic optimizers (Section 4.3). We provide convergence guarantees of the algorithm (Section 4.4) and an efficient implementation (Section 4.5). Finally, we demonstrate the performance of our algorithm in a suite of experiments (Section 5).

2 Related Work

Risk measures

Risk aversion is a well-studied human behavior, in which agents assign more weight to adverse events than to positive ones (Pratt, 1978). There are three main approaches to modeling risk: utility functions that amplify large losses (Rabin, 2013), prospect theory, which re-weights the probabilities of events (Kahneman and Tversky, 2013), and direct optimization of coherent risk measures (Artzner et al., 1999). Rockafellar et al. (2000) introduce the CVaR as a particular case of the latter class. The CVaR is ubiquitous in applications, particularly in portfolio optimization, as it relies neither on the design of utility functions nor of weighting functions, which makes its success as a risk-averse criterion less sensitive to designer priors.

CVaR in ML

In machine learning, the CVaR criterion has been considered in several works. The $\nu$-SVM algorithm by Schölkopf et al. (2000) can be interpreted as optimizing the CVaR of the loss, as shown by Gotoh and Takeda (2016). Also related, Shalev-Shwartz and Wexler (2016) propose an adaptive sampling algorithm to minimize the maximal loss among all samples; the maximal loss is the limiting case of the CVaR as $\alpha \to 0$. Fan et al. (2017) generalize this work to the average top-$k$ loss. Although they do not mention the relationship to the CVaR, their learning criterion coincides with the definition of the CVaR for empirical measures. Furthermore, Fan et al. (2017) use an optimization algorithm proposed by Ogryczak and Tamir (2003) for minimizing the sum of the $k$ largest of a set of functions, which is the same as the algorithm proposed by Rockafellar et al. (2000) to optimize the CVaR. Recent applications of the CVaR in ML include risk-averse bandits (Sani et al., 2012), risk-averse reinforcement learning (Chow et al., 2017), and fairness (Williamson and Menon, 2019). All of these works use the original formulation of Rockafellar et al. (2000) to optimize the CVaR. One of the major shortcomings of this formulation is that mini-batch gradient estimates have high variance. In this work, we address this via adaptive sampling and develop a method that scales previous work up to larger datasets and more complex models.

Robust optimization

The dual representation of the CVaR that we use in this paper has a distributionally robust optimization (DRO) interpretation (Shapiro et al., 2009, Section 6.3). In this direction, Namkoong and Duchi (2016) generalize the work of Shalev-Shwartz and Wexler (2016) to a particular class of f-divergences, also using an adaptive sampling algorithm. Similarly, Ahmadi-Javid (2012) introduces the entropic value-at-risk by considering a different DRO set. Duchi et al. (2016); Namkoong and Duchi (2017); Esfahani and Kuhn (2018); Staib and Jegelka (2019) address related DRO problems. In this work, we use the DRO formulation of the CVaR to phrase the optimization problem as a game. To solve the game, we propose an adaptive algorithm for the learning problem. The algorithm is related to Namkoong and Duchi (2016), but we use a different type of robust set. Furthermore, we provide efficient algorithms to apply the DRO problem to large-scale datasets.

Efficient Combinatorial Bandits

A central contribution of our work is an efficient sampling algorithm based on an instance of combinatorial bandits, the $k$-set problem. In this setting, the learner must choose a subset of $k$ out of $n$ experts with maximum reward, and there are $\binom{n}{k}$ such sets. Cesa-Bianchi and Lugosi (2012) introduce this setting together with the CombBand algorithm, which attains sublinear regret when the learner receives bandit feedback. Audibert et al. (2013) prove a matching lower bound, which the algorithm of Alatur et al. (2019) attains up to a multiplicative factor. However, the computational and space complexity of CombBand are prohibitive for large $n$ and $k$. Uchiya et al. (2010) propose a more efficient sampling algorithm whose computational and space complexity are linear in $n$. Instead, we adapt the algorithm proposed by Alatur et al. (2019) using Determinantal Point Processes (Kulesza et al., 2012); our algorithm has $O(\log n)$ per-iteration computational and $O(n)$ space complexity.

3 Problem Statement

We consider supervised learning with a risk-averse learner. The learner has a data set $D = \{z_i\}_{i=1}^n$ comprised of i.i.d. samples from an unknown distribution $P$, i.e., $z_i \sim P$, and her goal is to learn a function $f_\theta$ that is parametrized by $\theta \in \Theta$. The performance of $f_\theta$ at a data point $z$ is measured by a loss function $\ell(\theta; z)$. Overloading notation, we write the random variable $L(\theta) = \ell(\theta; z)$, $z \sim P$. The learner's goal is to minimize the CVaR of the loss distribution w.r.t. the parameters $\theta$ and the (unknown) distribution $P$.

CVaR properties

The CVaR at level $\alpha \in (0, 1]$ of a random variable $L$ is defined as $\mathrm{CVaR}_\alpha(L) = \mathbb{E}[L \mid L \ge \mathrm{VaR}_\alpha(L)]$, where $\mathrm{VaR}_\alpha(L)$ is the $(1-\alpha)$-quantile of the distribution, also called the Value-at-Risk (VaR). We illustrate the mean, VaR and CVaR of a typical loss distribution in Figure 1.

[Figure: figures/cvar_plot.pdf]

Figure 1: Illustration of the CVaR of a Loss.

The CVaR of a random variable is the expected value of the same random variable but w.r.t. a different law. This law arises from the following optimization problem (Shapiro et al., 2009, Section 6.3):

(1)  $\mathrm{CVaR}_\alpha(L(\theta)) = \max_{q \in \mathcal{Q}_\alpha} \mathbb{E}_{z \sim q}[\ell(\theta; z)]$,

where $\mathcal{Q}_\alpha = \{ q \ll P : \frac{dq}{dP} \le \frac{1}{\alpha} \}$. The distribution $q^\star$ that solves Problem (1) places all the mass in the tail, i.e., the blue shaded region of Figure 1. Rockafellar et al. (2000) prove strong duality for Problem (1). The dual program is:

(2)  $\mathrm{CVaR}_\alpha(L(\theta)) = \min_{\lambda \in \mathbb{R}} \lambda + \frac{1}{\alpha} \mathbb{E}_{z \sim P}\big[(\ell(\theta; z) - \lambda)_+\big]$.
Learning with the CVaR

Problem (2) can be used to estimate the CVaR of a random variable by replacing the expectation with the empirical expectation over the data set. The learning problem is:

(3)  $\min_{\theta \in \Theta, \lambda \in \mathbb{R}} \lambda + \frac{1}{\alpha n} \sum_{i=1}^n \big(\ell(\theta; z_i) - \lambda\big)_+$.

The learning problem (3) has computable subgradients, and hence lends itself to subgradient-based optimization. Furthermore, when $\ell(\cdot; z)$ is a convex function of $\theta$, the learning problem (3) is jointly convex in $(\theta, \lambda)$.
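To make Problem (3) concrete, the following is a minimal PyTorch sketch of the empirical objective, treating the auxiliary variable $\lambda$ as an extra trainable parameter (the function and variable names are ours, not the paper's):

    import torch

    def cvar_objective(losses, lam, alpha):
        # Rockafellar-Uryasev objective of Problem (3):
        # lambda + (1 / (alpha * n)) * sum_i (loss_i - lambda)_+
        return lam + torch.relu(losses - lam).mean() / alpha

    # Usage sketch: optimize jointly over the model parameters and lam.
    losses = torch.tensor([0.1, 0.2, 3.0, 0.4])   # per-example losses l(theta; z_i)
    lam = torch.zeros(1, requires_grad=True)
    obj = cvar_objective(losses, lam, alpha=0.25)
    obj.backward()                                # subgradient w.r.t. lam (and theta, if attached)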

Next, we show that the learning problem (3) is a sensible learning rule in the sense that the empirical CVaR concentrates around the population CVaR uniformly over all functions in the class.

Proposition 1.

Let $\mathcal{F} = \{f_\theta\}$ be a finite function class with losses bounded in $[0, B]$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$\sup_{\theta} \big| \widehat{\mathrm{CVaR}}_\alpha(L(\theta)) - \mathrm{CVaR}_\alpha(L(\theta)) \big| \le O\!\left( \frac{B}{\alpha} \sqrt{\frac{\log(|\mathcal{F}|/\delta)}{n}} \right).$

Proof.

See Section A.1. ∎

The result above is easily extended to classes with finite VC (pseudo-)dimension.

Challenges for stochastic optimization

In the common case that a variant of SGD is used to optimize the learning problem (3), the expectation is approximated with a mini-batch of data. But when this batch is sampled uniformly at random from the data, only a fraction $\alpha$ of the points (in expectation) carries gradient information; the rest is truncated to zero by the $(\cdot)_+$ non-linearity. Furthermore, the gradients of the examples that do carry information are scaled by $1/\alpha$, leading to exploding gradients. These facts make stochastic optimization of Problem (3) extremely noisy, as we demonstrate empirically in Section 5.

The root of the problem is the mismatch between the sampling distribution $P$ and the unknown distribution $q^\star$, from which we would ideally want to sample. In fact, Problem (3) can be interpreted as a form of rejection sampling: samples with losses smaller than $\lambda$ are rejected. It is well known that Monte Carlo estimation of rare events suffers from high variance (Rubino and Tuffin, 2009). To address this issue, we propose a novel sampling algorithm that adaptively learns to sample tail events, approximating $q^\star$. Furthermore, the algorithm adapts to the different parameters $\theta_t$ encountered during optimization.

4 Adaptive Sampling for Empirical CVaR Learning

4.1 Reformulation of CVaR Optimization

We propose to directly address the DRO problem (1) on the empirical measure for learning. The DRO set is then $\mathcal{Q} = \{ q \in \Delta_n : q_i \le \frac{1}{\alpha n} \}$, with $\Delta_n$ the probability simplex over the $n$ data points. The learning problem becomes:

(4)  $\min_{\theta \in \Theta} \max_{q \in \mathcal{Q}} \langle q, \ell(\theta) \rangle$,

where $\ell(\theta) \in \mathbb{R}^n$ has entries $\ell_i(\theta) = \ell(\theta; z_i)$. The learning problem (4) can be interpreted as a game between a $\theta$-player (the learner), whose goal is to minimize the objective by selecting $\theta$, and a $q$-player (the sampler), whose goal is to maximize the objective by selecting $q$.

Note that for each $\theta$, the inner maximization in the game (4) is a linear program. Hence, its solution is attained at a vertex of the set $\mathcal{Q}$. Thus, the game becomes:

(5)  $\min_{\theta \in \Theta} \max_{S \subseteq [n], |S| = k} \frac{1}{k} \sum_{i \in S} \ell_i(\theta)$,

for $k = \alpha n$ (assumed integer). Then, for a fixed $\theta$, the inner optimization is easily solved by simply sorting the losses and selecting the $k$ largest. For large data sets, however, this is prohibitive, as it would require computing the losses of all data points, invalidating all benefits of stochastic optimization.
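For intuition, the inner maximization in (5) reduces to a one-line computation (a numpy sketch; the function name is ours):

    import numpy as np

    def empirical_cvar(losses, alpha):
        # Average of the k = alpha * n largest losses, i.e., the inner max of (5).
        k = max(1, int(round(alpha * len(losses))))
        return np.sort(losses)[-k:].mean()

    print(empirical_cvar(np.array([0.1, 0.2, 3.0, 0.4]), alpha=0.5))  # mean of the 2 largest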

Fan et al. (2017) directly propose the combinatorial learning problem (5), without motivating it with the CVaR. Nevertheless, to solve it they use the high-variance algorithm for Problem (3).

Instead, we propose to solve the game (4) directly. A powerful approach for this is to use online no-regret algorithms for each player (Rakhlin and Sridharan, 2013). The learning protocol in Algorithm 1 proceeds as follows. In each round $t$, the sampler-player samples a point $i_t$ from the data set using distribution $q_t$. Based on it, the learner-player updates the model $\theta_t$. Both players then observe only the loss $\ell(\theta_t; z_{i_t})$ at the sampled point for the selected model. If both players suffer sublinear regret, the game dynamics converge to the solution of Problem (4). We provide details below.

Input: Data set $D = \{z_1, \dots, z_n\}$
Input: Learning algorithms for the $q$- and $\theta$-players.
for $t = 1, \dots, T$ do
       $q$-player samples $i_t \sim q_t$.
       $\theta$-player chooses $\theta_t$.
       Both players incur cost $\ell(\theta_t; z_{i_t})$.
       Players see $i_t$ and $\ell(\theta_t; z_{i_t})$.
end for
Algorithm 1 Learning Protocol
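As an illustration of this protocol, the following is a self-contained toy instantiation on linear regression. The sampler below is only a crude stand-in for the k-DPP-based k.EXP.3 of Section 4.2 (a capped, renormalized weight vector), so the snippet shows the game dynamics, not the paper's exact sampler:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

    theta = np.zeros(5)                 # theta-player state
    w = np.ones(len(X))                 # q-player weights
    n, k, eta, lr = len(X), 20, 0.01, 0.01

    for t in range(5000):
        q = np.minimum(w / w.sum(), 1.0 / k)            # crude projection toward Q
        q = q / q.sum()
        i = rng.choice(n, p=q)                          # q-player samples a point
        theta -= lr * 2 * (X[i] @ theta - y[i]) * X[i]  # theta-player: SGD on that point only
        loss_i = (X[i] @ theta - y[i]) ** 2
        w[i] *= np.exp(np.clip(eta * loss_i / (n * q[i]), 0.0, 5.0))  # importance-weighted update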

4.2 Sampler ($q$-Player) Algorithm

From an online optimization perspective, the sampler player faces a vector of losses that the adversary sets through $\theta_t$. To avoid clutter, we write $\ell_t := \ell(\theta_t)$ and $\ell_{t,i} := \ell(\theta_t; z_i)$. The goal of the sampler player is to control its regret:

(6)  $R_T^q = \max_{q \in \mathcal{Q}} \sum_{t=1}^T \langle q, \ell_t \rangle - \sum_{t=1}^T \langle q_t, \ell_t \rangle$.

The regret measures how good the sequence of actions of the sampler is, compared to the best single action in hindsight (after seeing the sequence of iterates $\theta_1, \dots, \theta_T$).

The sampler player faces two challenges. First, it must commit to its distribution before the learner player moves. Second, it only observes the loss at the single sampled point $i_t$.

We first note that this problem is an adversarial linear bandit problem (Lattimore and Szepesvári, 2018, Chapter 27). Below, we exploit further structure, which allows us to devise more efficient algorithms than are available for general linear bandit problems. Utilizing observation (5), we can restrict the player to select a subset of $k$ elements from the ground set $[n]$ and compare against the best such subset. Thus, we face a combinatorial bandit problem (Lattimore and Szepesvári, 2018, Chapter 30).

The algorithm with the best regret bounds in this setting is Algorithm 1 of Alatur et al. (2019). Below, we introduce an algorithm, k.EXP.3, based on their approach, which enables a highly efficient implementation in Section 4.5. The main idea is to maintain and update distributions over the $k$-subsets. Naively implemented, such an approach is impractical, as it requires storing and updating a variable with $\binom{n}{k}$ entries. Instead, we make use of the special structure of k-Determinantal Point Processes.

Definition 4.1 (k-DPP, Kulesza et al. (2012)).

A $k$-Determinantal Point Process over a ground set $[n]$ is a distribution over all subsets $S \subseteq [n]$ of size $k$ such that the probability of a set is

$P(S) = \frac{\det(L_S)}{\sum_{|S'| = k} \det(L_{S'})}$,

where $L$ is a positive definite kernel matrix and $L_S$ is the submatrix of $L$ indexed by the elements of $S$.
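Definition 4.1 can be checked directly by brute force for small ground sets (an illustration only; practical samplers never enumerate all subsets):

    import numpy as np
    from itertools import combinations

    def kdpp_prob(L, S, k):
        # P(S) = det(L_S) / sum over |S'| = k of det(L_{S'}), per Definition 4.1.
        n = L.shape[0]
        Z = sum(np.linalg.det(L[np.ix_(c, c)]) for c in combinations(range(n), k))
        return np.linalg.det(L[np.ix_(S, S)]) / Z

    L = np.diag([1.0, 2.0, 3.0, 4.0])        # diagonal kernel, as in our setting
    print(kdpp_prob(L, (1, 3), k=2))         # proportional to w_2 * w_4 = 8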

We now introduce our algorithm, which we call k.EXP.3 (Algorithm 2). It is similar to the classical EXP.3 algorithm (Auer et al., 2002), in that it only stores a weight vector $w \in \mathbb{R}^n_{>0}$ and updates a single entry in each iteration. Instead of sampling proportionally to $w$, however, it samples from the marginal distribution of the k-DPP parametrized by $\mathrm{diag}(w)$. For $k = 1$, the two algorithms are identical, and for $k = n$, the induced sampling distribution is simply uniform. Crucially, the decision vector $q_t$ is proportional to the marginal distribution of the k-DPP with kernel $\mathrm{diag}(w_t)$.

Input: Learning rate $\eta$
Initialize weights $w_1 = (1, \dots, 1)$.
for $t = 1, \dots, T$ do
       Sample element $i_t \sim q_t$, where $q_{t,i} = \frac{1}{k} P(i \in S)$ under the k-DPP with kernel $\mathrm{diag}(w_t)$.
       Observe loss $\ell_{t, i_t}$.
       Build unbiased estimate $\hat{\ell}_{t,i} = \frac{\ell_{t,i}}{k\, q_{t,i}} \mathbb{1}[i = i_t]$.
       Update weights $w_{t+1, i_t} = w_{t, i_t} \exp(\eta\, \hat{\ell}_{t, i_t})$.
end for
Algorithm 2 k.EXP.3
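The following sketch reconstructs one round of k.EXP.3 from the description above; the exact form of the loss estimate and update is our reading of Algorithm 2, and the marginals are computed by brute-force enumeration (see Section 4.5 for the efficient version):

    import numpy as np
    from itertools import combinations

    def kdpp_marginals(w, k):
        # Exact marginals P(i in S) of a diagonal k-DPP, by enumeration (small n only).
        n, Z, p = len(w), 0.0, np.zeros(len(w))
        for S in combinations(range(n), k):
            pr = np.prod(w[list(S)])
            Z += pr
            for i in S:
                p[i] += pr
        return p / Z                               # entries in [0, 1]; sums to k

    def kexp3_round(w, k, eta, observe_loss, rng):
        q = kdpp_marginals(w, k) / k               # decision vector q_t in Q
        i = rng.choice(len(w), p=q)
        ell_hat = observe_loss(i) / (k * q[i])     # unbiased estimate of ell_{t,i}
        w[i] *= np.exp(eta * ell_hat)              # multiplicative single-entry update
        return w, i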
Lemma 1.

Let the sampler player play the k.EXP.3 algorithm with an appropriately tuned learning rate $\eta$. Then she suffers a sampler regret (6) of at most $O(\sqrt{T})$ (up to factors depending on $n$ and $k$).

Proof.

For a detailed proof please refer to Section A.2. Here, we only sketch it. For the iterates of k.EXP.3, we need the following three facts. First, we prove in Proposition 2 that the iterates of the algorithm effectively lie in $\mathcal{Q}$. Second, we prove in Proposition 3 that the comparator in the regret of Alatur et al. (2019) and the one in the sampler regret (6) have the same value (scaled by $1/k$). Finally, the result follows as a corollary of these propositions and Alatur et al. (2019, Lemma 1). ∎

4.3 Learner ($\theta$-Player) Algorithm

Analogous to the sampler player, the objective of the learner player is to control its regret, defined as:

(7)  $R_T^\theta = \sum_{t=1}^T \langle q_t, \ell(\theta_t) \rangle - \min_{\theta \in \Theta} \sum_{t=1}^T \langle q_t, \ell(\theta) \rangle$.

The goal of this section is to discuss a sublinear-regret algorithm for the learner player. The key observation is that this player chooses $\theta_t$ after the sampler player selects $q_t$. For this reason, the learner player can play the Be-The-Leader (BTL) algorithm, namely:

(8)  $\theta_t = \arg\min_{\theta \in \Theta} \langle \bar{q}_t, \ell(\theta) \rangle$,

where $\bar{q}_t = \frac{1}{t} \sum_{s=1}^t q_s$ is the average distribution (up to time $t$) that the sampler player proposes.

Lemma 2.

A learner player that plays the BTL algorithm suffers at most zero regret.

Proof.

See Section A.3. ∎

For each new $q_t$ that the sampler player selects, the learner player must solve a weighted empirical loss minimization in Problem (8). For convex problems, we know that it is not necessary to solve the BTL problem (8) exactly; algorithms such as online-SGD (Zinkevich, 2003) achieve no-regret guarantees. We refer the reader to Appendix B for a discussion of the convex case. The non-convex case is more challenging, as solving a non-convex optimization problem is in general NP-hard (Murty and Kabadi, 1987). Obtaining provable no-regret guarantees in the non-convex online setting seems unrealistic in general.

Despite this hardness, the empirical success of deep learning demonstrates that stochastic optimization algorithms such as SGD are able to find very good (even if not necessarily optimal) solutions for the associated non-convex problems. Hence, we approximate the BTL objective by the sequence of samples that the sampler player provides, and perform stochastic optimization (SGD or its variants) with respect to these samples. Namely, for each $t$, the learner updates $\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \ell(\theta_t; z_{i_t})$. Note that this is not online-SGD, because in BTL the loss at round $t$ is observed before choosing $\theta_t$, whereas SGD uses the loss at round $t$ to produce $\theta_{t+1}$.

4.4 Game Dynamics

We now show that if both players play the no-regret algorithms discussed above, they (approximately) solve the game (4). The minimax equilibrium of the game is the point $(\theta^\star, q^\star)$ that satisfies $\langle q, \ell(\theta^\star) \rangle \le \langle q^\star, \ell(\theta^\star) \rangle \le \langle q^\star, \ell(\theta) \rangle$ for all $\theta \in \Theta$ and $q \in \mathcal{Q}$. We assume that this point exists (e.g., when the sets $\Theta$ and $\mathcal{Q}$ are compact). The game regret is:

(9)  $R_T = \max_{q \in \mathcal{Q}} \sum_{t=1}^T \langle q, \ell(\theta_t) \rangle - \min_{\theta \in \Theta} \sum_{t=1}^T \langle q_t, \ell(\theta) \rangle$.

Theorem 1 (Game Sublinear Regret).

Let $\{\ell(\cdot; z_i)\}_{i=1}^n$ be a fixed set of loss functions. If the sampler player uses k.EXP.3 (Algorithm 2) and the learner player uses the BTL algorithm, then the game regret (9) is sublinear in $T$.

Proof.

We bound the game regret by the sum of the learner and sampler regrets, $R_T \le R_T^q + R_T^\theta$, and apply Lemma 1 and Lemma 2. ∎
Theorem 2 (Implications for learning with the CVaR).

Let $\{\ell(\cdot; z_i)\}_{i=1}^n$ be a set of loss functions sampled from a distribution $P$. Let $\theta^\star$ be the minimizer of the CVaR of the empirical distribution. Let $\theta_1, \dots, \theta_T$ be the sequence of iterates of the two-player algorithm. The average excess CVaR of the algorithm is bounded as:

$\frac{1}{T} \sum_{t=1}^T \widehat{\mathrm{CVaR}}_\alpha(L(\theta_t)) - \widehat{\mathrm{CVaR}}_\alpha(L(\theta^\star)) \le \frac{R_T}{T}.$

Proof.

The average excess CVaR is bounded by the average duality gap, which in turn is upper-bounded by the average game regret. In Theorem 1 we proved that the game regret is sublinear, hence the average excess CVaR goes to zero. ∎

4.5 Efficient Sampling from k-DPP Marginals

In Section 4.2 we proposed an algorithm that maintains and updates the diagonal elements $w$ of a k-DPP kernel, but we did not address how to sample $i_t$ nor how to compute the marginal distribution $q_t$.

The challenges in our setting are the following. First, we aim for a sampling algorithm with low computational complexity (at most logarithmic in $n$ per iteration), to maintain the computational advantages of stochastic optimization. Second, we aim for an algorithm that is numerically stable for large k-DPPs, to scale to large data sets. Finally, the k-DPP changes between iterations, hence we need a method that efficiently adapts to changing distributions.

Sample Complexity

State-of-the-art exact sampling methods for general k-DPPs rely on rejection sampling with non-trivial preprocessing (Dereziński et al., 2019), whereas approximate methods that use MCMC have mixing times polynomial in $n$ and $k$ (Li et al., 2016; Anari et al., 2016). Compared to general k-DPPs, our setting has the advantage that the k-DPP kernel is diagonal and there is no need to perform an eigendecomposition of the kernel matrix. Instead, we directly sample from the singleton-marginal distribution, which takes $O(\log n)$ time using the same sum-tree data structure as Shalev-Shwartz and Wexler (2016). The marginals of diagonal k-DPPs are:

(10)  $P(i \in S) = \frac{w_i\, e_{k-1}(w_{-i})}{e_k(w)}$,

where $e_k(w)$ is the elementary symmetric polynomial of degree $k$ for the ground set $[n]$ and $e_{k-1}(w_{-i})$ is the elementary symmetric polynomial of degree $k-1$ for the ground set $[n] \setminus \{i\}$.
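A direct, if naive, implementation of Eq. (10) via the $O(nk)$ recursion for elementary symmetric polynomials (our sketch; the paper's implementation relies on tree structures and the approximation below for scale):

    import numpy as np

    def esp(w, k):
        # Elementary symmetric polynomials e_0(w), ..., e_k(w) via the O(nk)
        # recursion e_j <- e_j + w_i * e_{j-1}, as in (Kulesza et al., 2012, Alg. 7).
        e = np.zeros(k + 1)
        e[0] = 1.0
        for wi in w:
            e[1:] = e[1:] + wi * e[:-1]   # RHS uses the old values of e
        return e

    def kdpp_marginals_esp(w, k):
        # Eq. (10): P(i in S) = w_i * e_{k-1}(w without i) / e_k(w).
        # Naive O(n^2 k); numerically unstable for large n, as noted in the text.
        ek = esp(w, k)[k]
        return np.array([w[i] * esp(np.delete(w, i), k - 1)[k - 1] / ek
                         for i in range(len(w))])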

Large-Scale Approximation

Naively computing the elementary symmetric polynomials has a complexity of $O(nk)$ using (Kulesza et al., 2012, Algorithm 7); a specialized binary-tree algorithm can amortize the recomputation after single-weight updates. Even if this computation could be performed quickly, exact computation of the elementary symmetric polynomials is numerically unstable.

Barthelmé et al. (2019) observe this issue and propose an approximation to k-DPPs, valid for large ground sets, which has better numerical properties. They show empirically that already for moderately large $n$, computing the marginals (10) leads to numerical overflow. The main idea of Barthelmé et al. (2019) is to relax the hard sample-size constraint of the k-DPP to a soft constraint, such that the expected sample size of the matched DPP is $k$. The total variation distance between the marginal probabilities of the k-DPP and the matched DPP vanishes as $n$ and $k$ grow. The marginal probabilities of this matched DPP are:

(11)  $\pi_i = \frac{\beta w_i}{1 + \beta w_i}$,

where $\beta > 0$ softly enforces the sample-size constraint $\sum_{i=1}^n \pi_i = k$. Direct sampling from these marginals still takes $O(\log n)$ time, and the numerical properties are superior to those of Eq. (10).
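Under our reconstruction of Eq. (11), the matched-DPP marginals and the coupling constraint can be implemented as follows, with $\beta$ found by bisection (function names are ours):

    import numpy as np

    def matched_dpp_marginals(w, k, iters=200):
        # Eq. (11): pi_i = beta * w_i / (1 + beta * w_i), with beta chosen so that
        # sum_i pi_i = k (the soft sample-size constraint). Bisection in log-space.
        lo, hi = 1e-12, 1e12
        for _ in range(iters):
            beta = np.sqrt(lo * hi)
            size = (beta * w / (1.0 + beta * w)).sum()
            lo, hi = (beta, hi) if size < k else (lo, beta)
        beta = np.sqrt(lo * hi)
        return beta * w / (1.0 + beta * w)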

DPP Update

The remaining challenge is how to update the approximate DPP between two iterations of the sampling algorithm: the coupling constraint $\sum_i \pi_i = k$ has no closed-form solution for $\beta$. We found three possible solutions. First, we can re-solve the scalar coupling equation for $\beta$ numerically at every iteration; in practice, this is extremely fast when using the previous solution as a warm start. Second, we can keep $\beta$ constant within an epoch and update it only every $n$ steps; this deteriorates the approximation slightly, particularly in small-scale applications. The third option is to use the implicit function theorem to compute a first-order correction of $\beta$ when a weight in the coupling constraint changes. This also introduces an approximation error, so every $n$ steps one must re-solve for $\beta$ exactly. In our experiments, we observe no performance difference between the first and third strategies.

5 Experiments


Figure 2: Results for Classification Tasks. The top row shows the CVaR and the average of the test losses; the bottom row shows the test-loss VaR and the classifier accuracy. We normalize the CVaR, VaR, and loss plots between data sets for visual comparison.

Figure 3: Results for Regression Tasks. The left plot shows the CVaR of the test losses; the right plot shows their average. We normalize the CVaR and loss plots between data sets for visual comparison.


Figure 4: Results for Vision Data Sets. In the top row we plot the CVaR, average, and VaR of the loss, normalized to one to compare between data sets. In the bottom row we plot the accuracy, F1-score, and accuracy-to-CVaR ratio. Full bars indicate the mean and error bars one standard deviation. For CIFAR-10, soft yields numerical overflow, thus we omit its results.

Figure 5: Learning Dynamics on the Fashion-MNIST data. In the top row we plot the CVaR, mean, and VaR of the loss. In the bottom row we plot the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

5.1 Experimental Setup

We optimize the CVaR of the loss at level $\alpha$, hence $k = \alpha n$, where $n$ is the size of the data set. In classification tasks, we use the cross-entropy loss as a surrogate of the 0/1 loss; in regression tasks, we use the squared loss.

We test our algorithm on eight different UCI classification data sets, three UCI regression data sets and three synthetic regression data sets (Dua and Graff, 2017). We use linear classifiers/regressors to ensure a convex optimization setting.

We also test our algorithm on large-scale data sets with non-linear function approximators, yielding non-convex problems. We use the LeNet-5 neural network (LeCun et al., 1995) for MNIST (LeCun et al., 1995), adding dropout (Hinton et al., 2012) for Fashion-MNIST (Xiao et al., 2017); for CIFAR-10 (Krizhevsky et al., 2014) we use a VGG network (Simonyan and Zisserman, 2014) with batch-norm (Ioffe and Szegedy, 2015).

Our approach (adaptive) uses the adaptive sampling algorithm to select data points for the learner player, who uses a variant of SGD to optimize $\theta$. We compare it to three baselines: first, an i.i.d. sampling scheme that optimizes Problem (3) (cvar); second, an i.i.d. sampling scheme that optimizes the mean of the losses (mean); third, an i.i.d. sampling scheme that uses a mini-batch relaxation smoothing the $(\cdot)_+$ non-linearity with a temperature parameter, as proposed by Nemirovski and Shapiro (2006) (soft). For optimizing $\theta$ we employ ADAM (Kingma and Ba, 2014). For more details, please refer to Appendix C.
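For concreteness, a plausible form of the soft baseline replaces the hinge $(\cdot)_+$ in Problem (3) with a temperature-smoothed softplus; the exact surrogate used in the paper is not recoverable from this text, so the following is only an assumed instantiation:

    import torch
    import torch.nn.functional as F

    def soft_cvar_objective(losses, lam, alpha, tau):
        # Assumed smoothing: tau * softplus(x / tau) -> (x)_+ as tau -> 0.
        return lam + tau * F.softplus((losses - lam) / tau).mean() / alpha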

5.2 Convex Learning Results

We show results for classification tasks in Figure 2 and for regression tasks in Figure 3. Here, we only use linear function approximators, so we evaluate the different algorithms in a controlled convex setting, where the cvar algorithm has convergence guarantees.

Regression Tasks

In regression tasks, our adaptive algorithm outperforms all other algorithms in terms of the CVaR of the loss. Furthermore, the average loss of adaptive is also lower than that of cvar. In fact, adaptive is competitive with mean, and on the normal and pareto data sets it even outperforms it. This shows that our adaptive algorithm benefits from the initial iterates of the sampling algorithm, when it learns about the average loss; as the optimization advances, the sampling algorithm focuses on harder examples, reducing the CVaR as well.

Classification Tasks

In classification tasks, the original cvar algorithm performs best in terms of the CVaR of the surrogate loss. However, this comes at a considerable cost in accuracy, which for cvar is worse than that of the other algorithms on almost all data sets. Also, the average loss of the cvar algorithm is considerably higher than that of the other algorithms.

Our adaptive algorithm achieves the best of both worlds, also in classification tasks. In the first few epochs, when samples come from the uniform distribution, it optimizes the accuracy and, once the accuracy is good, it learns about the extreme events. adaptive has accuracy comparable to mean in almost all data sets and outperforms it in terms of the CVaR. The soft algorithm also yields good accuracy but usually a much higher CVaR than the adaptive algorithm.

The Value-at-Risk is also a (non-coherent) risk measure that is commonly used in practice: it is the $(1-\alpha)$-quantile of the distribution. Quantiles are non-differentiable functions of the samples, and there are no easy algorithms for minimizing the quantile of a loss. Instead, the CVaR is a tight convex upper bound of the VaR (Nemirovski and Shapiro, 2006). Thus, the VaR evaluation criterion is a natural use case for our algorithm. We see that, in convex settings, the cvar and our adaptive algorithm attain similar VaR.

These experiments suggest that in the convex setting, our adaptive algorithm is competitive with the cvar algorithm and outperforms the soft variant. Furthermore, our algorithm benefits from the initial stages, where the sampling distribution is close to uniform, to achieve expected loss competitive with the mean algorithm.

5.3 Large-Scale Non-Convex Learning

Figure 4 shows results for Fashion-MNIST, MNIST, and CIFAR-10, and Figure 5 illustrates the learning dynamics on Fashion-MNIST. To obtain confidence intervals in the plots, we repeat the experiments with five different random seeds. The multi-class F1-score is the harmonic mean of the minimum one-vs.-all precision and recall. For the CIFAR-10 data set, the soft algorithm yielded numerical overflow during training.

CVaR-Accuracy Tradeoff

In the left panels of Figure 4, we see that the cvar algorithm has the lowest CVaR but much worse accuracy, particularly on the Fashion-MNIST and CIFAR-10 data sets. This agrees with the results on the UCI data sets. Our adaptive algorithm outperforms all others in the accuracy-to-CVaR ratio (bottom-right panels of Figure 4).

In Figure 5, we see that our adaptive algorithm starts by optimizing the average loss (like the mean algorithm), but then follows the optimization of the CVaR (like the cvar algorithm). Ultimately, it simultaneously reaches the accuracy of mean and the CVaR of cvar.

Value-at-Risk and Average Loss

In terms of the VaR, our adaptive algorithm outperforms all other algorithms on Fashion-MNIST, and on CIFAR-10 it is competitive with the mean algorithm. The cvar algorithm performs poorly.

6 Conclusions

We consider the CVaR of the loss distribution as a risk-averse learning criterion. We observe that the typical way of optimizing it does not scale to modern machine learning tasks due to the high variance of its gradient estimates. To address this issue, we propose an adaptive sampling algorithm based on a distributionally robust formulation of the CVaR. We provably solve the resulting game by applying SGD to the sequence of examples that the adaptive sampling algorithm provides. Furthermore, we provide an efficient implementation of the adaptive sampling algorithm based on DPPs. Finally, we demonstrate in a range of experiments that our adaptive algorithm is superior to directly optimizing Problem (3) in regression and classification tasks, in both convex and non-convex learning settings.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 815943. It also received funding from a Sloan Research Fellowship and the Defense Advanced Research Projects Agency (grant number YFA17 N66001-17-1-4039). The views, opinions, and/or findings contained in this article are those of the authors and should not be interpreted as representing the official views or policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the Department of Defense.

References

  • A. Ahmadi-Javid (2012) Entropic value-at-risk: a new coherent risk measure. Journal of Optimization Theory and Applications 155 (3), pp. 1105–1123. Cited by: §2.
  • P. Alatur, K. Y. Levy, and A. Krause (2019) Multi-player bandits: the adversarial case. arXiv preprint arXiv:1902.08036. Cited by: §A.2, §2, §4.2, §4.2.
  • N. Anari, S. O. Gharan, and A. Rezaei (2016) Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Conference on Learning Theory, pp. 103–115. Cited by: §4.5.
  • P. Artzner, F. Delbaen, J. Eber, and D. Heath (1999) Coherent measures of risk. Mathematical finance 9 (3), pp. 203–228. Cited by: §2.
  • J. Audibert, S. Bubeck, and G. Lugosi (2013) Regret in online combinatorial optimization. Mathematics of Operations Research 39 (1), pp. 31–45. Cited by: Appendix B, §2.
  • P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire (2002) The nonstochastic multiarmed bandit problem. SIAM journal on computing 32 (1), pp. 48–77. Cited by: §4.2.
  • S. Barthelmé, P. Amblard, N. Tremblay, et al. (2019) Asymptotic equivalence of fixed-size and varying-size determinantal point processes. Bernoulli 25 (4B), pp. 3555–3589. Cited by: §4.5.
  • A. Beck and M. Teboulle (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31 (3), pp. 167–175. Cited by: Appendix B.
  • D. B. Brown (2007) Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters 35 (6), pp. 722–730. Cited by: §A.1.
  • C. Brownlees, E. Joly, G. Lugosi, et al. (2015) Empirical risk minimization for heavy-tailed losses. The Annals of Statistics 43 (6), pp. 2507–2536. Cited by: Appendix C.
  • M. C. Carneiro, G. P. Ribas, and S. Hamacher (2010) Risk management in the oil supply chain: a cvar approach. Industrial & Engineering Chemistry Research 49 (7), pp. 3286–3294. Cited by: §1.
  • N. Cesa-Bianchi and G. Lugosi (2012) Combinatorial bandits. Journal of Computer and System Sciences 78 (5), pp. 1404–1422. Cited by: §2.
  • Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone (2017) Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18 (1), pp. 6070–6120. Cited by: §2.
  • M. Dereziński, D. Calandriello, and M. Valko (2019) Exact sampling of determinantal point processes with sublinear time preprocessing. arXiv preprint arXiv:1905.13476. Cited by: §4.5.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: Appendix C, §5.1.
  • J. Duchi, P. Glynn, and H. Namkoong (2016) Statistics of robust optimization: a generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425. Cited by: §2.
  • J. P. Eaton and C. A. Haas (1995) Titanic, triumph and tragedy. WW Norton & Company. Cited by: Appendix C.
  • P. M. Esfahani and D. Kuhn (2018) Data-driven distributionally robust optimization using the wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming 171 (1-2), pp. 115–166. Cited by: §2.
  • Y. Fan, S. Lyu, Y. Ying, and B. Hu (2017) Learning with average top-k loss. In Advances in Neural Information Processing Systems, pp. 497–505. Cited by: Appendix C, §2, §4.1.
  • J. Gotoh and A. Takeda (2016) CVaR minimizations in support vector machines. Financial Signal Processing and Machine Learning, pp. 233–265. Cited by: §2.
  • E. Hazan et al. (2016) Introduction to online convex optimization. Foundations and Trends® in Optimization 2 (3-4), pp. 157–325. Cited by: Appendix B.
  • G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §5.1.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §5.1.
  • D. Kahneman and A. Tversky (2013) Prospect theory: an analysis of decision under risk. In Handbook of the fundamentals of financial decision making: Part I, pp. 99–127. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.1.
  • A. Krizhevsky, V. Nair, and G. Hinton (2014) The cifar-10 dataset. online: http://www. cs. toronto. edu/kriz/cifar. html 55. Cited by: Appendix C, §5.1.
  • P. Krokhmal, J. Palmquist, and S. Uryasev (2002) Portfolio optimization with conditional value-at-risk objective and constraints. Journal of risk 4, pp. 43–68. Cited by: §1.
  • A. Kulesza, B. Taskar, et al. (2012) Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5 (2–3), pp. 123–286. Cited by: §2, §4.5, Definition 4.1.
  • T. Lattimore and C. Szepesvári (2018) Bandit algorithms. preprint. Cited by: §4.2.
  • Y. LeCun, Y. Bengio, et al. (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: §5.1.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: Appendix C.
  • C. Li, S. Sra, and S. Jegelka (2016) Fast mixing markov chains for strongly rayleigh measures, dpps, and constrained sampling. In Advances in Neural Information Processing Systems, pp. 4188–4196. Cited by: §4.5.
  • K. G. Murty and S. N. Kabadi (1987) Some np-complete problems in quadratic and nonlinear programming. Mathematical programming 39 (2), pp. 117–129. Cited by: §4.3.
  • K. G. Murty (1983) Linear programming. Springer. Cited by: §A.2.
  • H. Namkoong and J. C. Duchi (2016) Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems, pp. 2208–2216. Cited by: §2.
  • H. Namkoong and J. C. Duchi (2017) Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pp. 2971–2980. Cited by: §2.
  • A. Nemirovski and A. Shapiro (2006) Convex approximations of chance constrained programs. SIAM Journal on Optimization 17 (4), pp. 969–996. Cited by: §5.1, §5.2.
  • W. Ogryczak and A. Tamir (2003) Minimizing the sum of the k largest functions in linear time. Information Processing Letters 85 (3), pp. 117–122. Cited by: §2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop. Cited by: Appendix C.
  • J. W. Pratt (1978) Risk aversion in the small and in the large. In Uncertainty in Economics, pp. 59–79. Cited by: §2.
  • M. Rabin (2013) Risk aversion and expected-utility theory: a calibration theorem. In Handbook of the Fundamentals of Financial Decision Making: Part I, pp. 241–252. Cited by: §2.
  • S. Rakhlin and K. Sridharan (2013) Optimization, learning, and games with predictable sequences. In Advances in Neural Information Processing Systems, pp. 3066–3074. Cited by: §4.1.
  • R. T. Rockafellar, S. Uryasev, et al. (2000) Optimization of conditional value-at-risk. Journal of risk 2, pp. 21–42. Cited by: §2, §2, §3.
  • G. Rubino and B. Tuffin (2009) Rare event simulation using monte carlo methods. John Wiley & Sons. Cited by: §3.
  • A. Sani, A. Lazaric, and R. Munos (2012) Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 3275–3283. Cited by: §2.
  • B. Schölkopf, A. J. Smola, R. C. Williamson, and P. L. Bartlett (2000) New support vector algorithms. Neural computation 12 (5), pp. 1207–1245. Cited by: §2.
  • S. Shalev-Shwartz and Y. Wexler (2016) Minimizing the maximal loss: how and why.. In ICML, pp. 793–801. Cited by: §2, §2, §4.5.
  • A. Shapiro, D. Dentcheva, and A. Ruszczyński (2009) Lectures on stochastic programming: modeling and theory. SIAM. Cited by: §2, §3.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.1.
  • M. Staib and S. Jegelka (2019) Distributionally robust optimization and generalization in kernel methods. arXiv preprint arXiv:1905.10943. Cited by: §2.
  • T. Uchiya, A. Nakamura, and M. Kudo (2010) Algorithms for adversarial bandit problems with multiple plays. In International Conference on Algorithmic Learning Theory, pp. 375–389. Cited by: §2.
  • V. Vapnik (1992) Principles of risk minimization for learning theory. In Advances in neural information processing systems, pp. 831–838. Cited by: §1.
  • R. Williamson and A. Menon (2019) Fairness risk measures. In International Conference on Machine Learning, pp. 6786–6797. Cited by: §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: Appendix C, §5.1.
  • M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 928–936. Cited by: Appendix B, §4.3.


Appendix A Proofs

A.1 Proof of Proposition 1

Proposition 1.

Let $\mathcal{F} = \{f_\theta\}$ be a finite function class with losses bounded in $[0, B]$ (the extension to classes with finite VC (pseudo-)dimension follows from standard uniform-convergence arguments). Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$\sup_{\theta} \big| \widehat{\mathrm{CVaR}}_\alpha(L(\theta)) - \mathrm{CVaR}_\alpha(L(\theta)) \big| \le O\!\left( \frac{B}{\alpha} \sqrt{\frac{\log(|\mathcal{F}|/\delta)}{n}} \right).$

Proof of Proposition 1.

Brown (2007) proves that, for a single $f_\theta$, an upper and a lower deviation inequality of this form hold jointly with probability $1 - \delta$. Taking the union bound over all $f_\theta \in \mathcal{F}$ replaces $\delta$ by $\delta / |\mathcal{F}|$. The proposition follows by taking the maximum of the lower and upper bounds. ∎

A.2 Proof of Lemma 1

Lemma 1.

Let the sampler player play the k.EXP.3 algorithm with an appropriately tuned learning rate $\eta$. Then she suffers a sampler regret (6) of at most $O(\sqrt{T})$ (up to factors depending on $n$ and $k$).

In order to prove this, we first need to show that k.EXP.3 is a valid algorithm for the sampler player. This we do next.

Proposition 2.

The marginals of any k-DPP with a diagonal kernel matrix, scaled by $1/k$, lie in the set $\mathcal{Q}$.

Proof.

For any $w \in \mathbb{R}^n_{>0}$, the scaled marginals of the k-DPP with kernel $\mathrm{diag}(w)$ are:

(12)  $q_i = \frac{1}{k} \cdot \frac{w_i\, e_{k-1}(w_{-i})}{e_k(w)}$.

From eq. (12), clearly $q_i \le 1/k$, since each marginal $P(i \in S)$ is a probability and thus at most one. Summing eq. (12) over $i$, and using the fact that the marginals of a k-DPP sum to $k$, we get $\sum_{i=1}^n q_i = 1$. This shows that $q \in \mathcal{Q}$. ∎

Proposition 3.

Let $\ell \in \mathbb{R}^n$. Let $\Delta_{n,k}$ be the set of distributions over the subsets of size $k$ of the ground set $[n]$. Then:

(13)  $\max_{q \in \mathcal{Q}} \langle q, \ell \rangle = \frac{1}{k} \max_{p \in \Delta_{n,k}} \mathbb{E}_{S \sim p} \Big[ \sum_{i \in S} \ell_i \Big]$.

Proof.

Both the left- and right-hand sides of (13) are linear programs over convex polytopes, hence their solutions are attained at vertices (Murty, 1983). The vertices of $\mathcal{Q}$ are the vectors $q^S$, $|S| = k$, that have $1/k$ in coordinate $i$ if $i$ belongs to the set $S$ and $0$ otherwise. The vertices of the simplex $\Delta_{n,k}$ are the point masses $\delta_S$, one per subset $S$ of size $k$. For every such pair of vertices, $\langle q^S, \ell \rangle = \frac{1}{k} \sum_{i \in S} \ell_i = \frac{1}{k} \mathbb{E}_{S' \sim \delta_S} [\sum_{i \in S'} \ell_i]$. Hence both programs optimize over the same finite set of values, and their optima coincide. ∎

Proof of Lemma 1.

Write the sampler regret (6) in terms of distributions over $k$-sets:

$R_T^q = \max_{q \in \mathcal{Q}} \sum_{t=1}^T \langle q, \ell_t \rangle - \sum_{t=1}^T \langle q_t, \ell_t \rangle = \frac{1}{k} \left( \max_{p \in \Delta_{n,k}} \sum_{t=1}^T \mathbb{E}_{S \sim p} \Big[ \sum_{i \in S} \ell_{t,i} \Big] - \sum_{t=1}^T \mathbb{E}_{S \sim p_t} \Big[ \sum_{i \in S} \ell_{t,i} \Big] \right).$

The first equality uses Proposition 3, and the second uses Proposition 2 together with the fact that the iterates $q_t$ come from the k.EXP.3 algorithm, whose decision vector is the scaled marginal of a distribution $p_t$ over $k$-sets. The final bound then follows from Alatur et al. (2019, Lemma 1). ∎

A.3 Proof of Lemma 2

Lemma 2.

A learner player that plays the BTL algorithm suffers at most zero regret.

Proof.

Write $f_t(\theta) := \langle q_t, \ell(\theta) \rangle$. We show by induction that $\sum_{t=1}^T f_t(\theta_t) \le \min_\theta \sum_{t=1}^T f_t(\theta)$. For $T = 1$, the claim holds by the definition of $\theta_1$ as the minimizer of $f_1$. Assume the claim holds up to $T - 1$; then

$\sum_{t=1}^T f_t(\theta_t) \le \min_\theta \sum_{t=1}^{T-1} f_t(\theta) + f_T(\theta_T) \le \sum_{t=1}^{T-1} f_t(\theta_T) + f_T(\theta_T) = \min_\theta \sum_{t=1}^T f_t(\theta)$,

where the last equality uses the definition of $\theta_T$ in the BTL update (8). Hence the regret (7) is at most zero. ∎

Appendix B Learner Player Algorithm for Convex Losses

In the convex setting, there are online learning algorithms with no-regret guarantees, and there is no need to play the BTL algorithm (8) exactly. Instead, online (stochastic) gradient descent (SGD) (Zinkevich, 2003) and online mirror descent (OMD) (Beck and Teboulle, 2003) both enjoy no-regret guarantees. We focus on SGD, but for certain geometries of $\Theta$ and appropriate mirror maps, OMD has exponentially better regret guarantees (in terms of the dimension of the problem).

Lemma 3.

Let $f_1, \dots, f_T$ be any sequence of convex losses with gradients bounded by $G$, over a domain of diameter $D$. Then a learner player that plays the SGD algorithm with appropriate step sizes suffers regret at most $O(GD\sqrt{T})$.

Proof.

Hazan et al. (2016, Chapter 3). ∎

Note that even though there are algorithms for the strongly convex or exp-concave cases that achieve $O(\log T)$ regret, this brings no advantage in our case, as the $O(\sqrt{T})$ term in the sampler regret dominates and is unavoidable (Audibert et al., 2013).

Corollary 1.

Let $f_1, \dots, f_T$ be any sequence of convex losses. If the learner player plays SGD (or OMD with an appropriate mirror map) and the sampler player plays Algorithm 2, then the game has regret $O(C\sqrt{T})$, where $C$ is a problem-dependent constant.

Appendix C Experimental Setup

Implementation:

We implement all our experiments using PyTorch (Paszke et al., 2017).

Datasets:

For classification we use the Adult, Australian Credit Approval, German Credit Data, Monks-Problems-1, Spambase, and Splice-junction Gene Sequences datasets from the UCI repository (Dua and Graff, 2017) and the Titanic Disaster dataset from (Eaton and Haas, 1995). For regression we use the Boston Housing, Abalone, and Energy Efficiency datasets from the UCI repository; the sinc dataset is a synthetic dataset recreated from (Fan et al., 2017), and the normal and pareto datasets are synthetic datasets recreated from (Brownlees et al., 2015), with Gaussian and Pareto noise, respectively. For vision datasets, we use MNIST (LeCun et al., 1998), Fashion-MNIST (Xiao et al., 2017), and CIFAR-10 (Krizhevsky et al., 2014).

Dataset Preparation:

We split the UCI datasets into train and test sets using an 80/20 split and use 10-fold cross-validation on the training set. For image classification, we use the training images as a validation set, without applying data augmentation. For discrete categorical data, we use a one-hot encoding; we normalize continuous data.

Hyper-Parameter Search:

For UCI datasets, we ran a grid search over the hyperparameters of all algorithms with a single random seed, and then selected the set of hyperparameters with the highest cross-validation score. The hyperparameters are:

  1. ADAM initial learning rate,

  2. batch size,

  3. adaptive algorithm learning rate, relative to the optimal learning rate,

  4. mixing with the uniform distribution,

  5. adaptive algorithm learning-rate decay schedule,

  6. sampling scheme,

  7. soft algorithm temperature.

For the image classification data sets, we perform the same hyperparameter search and select the configuration with the highest validation score, but on a single data fold. For the test results, we repeat the experiments with five different random seeds, leaving the hyperparameters unchanged.

Evaluation Metrics:

We report results on the test set. In regression tasks, we evaluate primarily the CVaR of the loss, and secondarily the mean and the VaR (the $(1-\alpha)$-quantile) of the loss. In classification tasks, the CVaR of the 0/1 loss is not a useful metric: if the misclassification rate is larger than $\alpha$, the CVaR of the 0/1 loss is one, and if it is smaller, it is just the misclassification rate scaled up by $1/\alpha$. Therefore, we consider both the average accuracy and the CVaR w.r.t. the cross-entropy loss.
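As a reference for these metrics, the empirical VaR and CVaR of a vector of test losses can be computed as follows (a sketch; tie-breaking conventions may differ from the paper's):

    import numpy as np

    def var_cvar(losses, alpha):
        # Empirical VaR (the (1 - alpha)-quantile) and CVaR (mean of the alpha-tail).
        losses = np.sort(np.asarray(losses))
        k = max(1, int(np.ceil(alpha * len(losses))))
        return losses[-k], losses[-k:].mean()

    var, cvar = var_cvar(np.random.default_rng(0).exponential(size=1000), alpha=0.05)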

Further Experimental Results:

In Figures 6–13 we plot the learning dynamics on the classification data sets. In Figures 14–20 we plot the learning dynamics on the regression data sets.

Figure 6: Learning Dynamics on the Adult data. In the top row we plot the CVaR, mean, and VaR of the loss; in the bottom row, the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 7: Learning Dynamics on the Australian Credit Approval data. In the top row we plot the CVaR, mean, and VaR of the loss; in the bottom row, the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 8: Learning Dynamics on the German Credit Data data. In the top row we plot the CVaR, mean, and VaR of the loss; in the bottom row, the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 9: Learning Dynamics on the Monks-Problems data. In the top row we plot the CVaR, mean, and VaR of the loss; in the bottom row, the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 10: Learning Dynamics on the Phoneme data. In the top row we plot the CVaR, mean, and VaR of the loss; in the bottom row, the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 11: Learning Dynamics on the Spambase data. In the top row we plot the CVaR, mean, and VaR of the loss; in the bottom row, the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 12: Learning Dynamics on the Splice-junction Gene Sequences data. In the top row we plot the CVaR, mean, and VaR of the loss; in the bottom row, the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 13: Learning Dynamics on the Titanic data. In the top row we plot the CVaR, mean, and VaR of the loss; in the bottom row, the accuracy, F1-score, and accuracy-to-CVaR ratio. All metrics are computed on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 14: Learning Dynamics on the Abalone data. We plot the CVaR, mean, and VaR of the loss on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 15: Learning Dynamics on the Boston Housing data. We plot the CVaR, mean, and VaR of the loss on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 16: Learning Dynamics on the CPU-SMALL data. We plot the CVaR, mean, and VaR of the loss on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 17: Learning Dynamics on the Normal data. We plot the CVaR, mean, and VaR of the loss on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 18: Learning Dynamics on the Pareto data. We plot the CVaR, mean, and VaR of the loss on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 19: Learning Dynamics on the Sinc data. We plot the CVaR, mean, and VaR of the loss on the validation set. Solid lines show the mean and shaded regions one standard deviation.

Figure 20: Learning Dynamics on the Energy Efficiency data. We plot the CVaR, mean, and VaR of the loss on the validation set. Solid lines show the mean and shaded regions one standard deviation.