Tilted Empirical Risk Minimization

07/02/2020 ∙ by Tian Li, et al. ∙ Facebook, Carnegie Mellon University

Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work, we explore them through a unified framework—tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt. We provide several interpretations of the resulting framework: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic first-order optimization methods for solving TERM, and show that the problem can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.

1 Introduction

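Many statistical estimation procedures rely on the concept of empirical risk minimization (ERM), in which the parameter of interest, $\theta$, is estimated by minimizing an average loss over the data $\{x_1, \dots, x_N\}$:

$$\bar{R}(\theta) := \frac{1}{N} \sum_{i \in [N]} f(x_i; \theta). \tag{1}$$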

While ERM is widely used and offers nice statistical properties, it can also perform poorly in practical situations where average performance is not an appropriate surrogate for the objective of interest. Significant research has thus been devoted to developing alternatives to traditional ERM for diverse applications, such as learning in the presence of noisy/corrupted data or outliers [30, 25], performing classification with imbalanced data [37, 38], ensuring that subgroups within a population are treated fairly [56, 36, 42], or developing solutions with favorable out-of-sample performance [43].

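In this paper, we suggest that deficiencies in ERM can be flexibly addressed via a unified framework, tilted empirical risk minimization (TERM). TERM encompasses a family of objectives, parameterized by a real-valued hyperparameter, $t$. For $t \neq 0$, the $t$-tilted loss (TERM objective) is given below; the value at $t = 0$ is defined in (14) as the limit of the objective as $t \to 0$.

$$\widetilde{R}(t; \theta) := \frac{1}{t} \log\left( \frac{1}{N} \sum_{i \in [N]} e^{t f(x_i; \theta)} \right). \tag{2}$$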

TERM generalizes ERM, as the $0$-tilted loss recovers the average loss (Lemma 2, Appendix A.2). It also recovers other common alternatives: $t \to +\infty$ recovers the max-loss, and $t \to -\infty$ the min-loss (Lemma 2, Appendix A.2). For $t > 0$, the objective is a common form of exponential smoothing, used to approximate the max [31, 49]. A more general notion of “tilting” has also been studied in statistics, though for very different purposes, such as importance sampling and large deviations theory [12, 3, 66] (Appendix B).
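As a rough illustration of these limiting cases, the following minimal sketch evaluates a tilted loss on a handful of fixed per-sample losses. It assumes the exponential-smoothing form described above, i.e., (1/t) times the log of the average of exp(t · loss); the function name `tilted_loss` and the loss values are hypothetical and chosen only for illustration.

```python
import math


def tilted_loss(losses, t):
    """t-tilted aggregate of per-sample losses: (1/t) * log(mean(exp(t * loss)))."""
    if t == 0:
        # Defined as the t -> 0 limit, which is the plain average (ERM).
        return sum(losses) / len(losses)
    # Shift by the max (t > 0) or min (t < 0) loss so every exponent is <= 0.
    shift = max(losses) if t > 0 else min(losses)
    mean_exp = sum(math.exp(t * (l - shift)) for l in losses) / len(losses)
    return shift + math.log(mean_exp) / t


# Illustrative per-sample losses; the value 5.0 plays the role of an outlier.
losses = [0.1, 0.2, 0.3, 5.0]
for t in (-50.0, -1.0, 0.0, 1.0, 50.0):
    print(f"t={t:+.1f}  tilted loss={tilted_loss(losses, t):.4f}")
```

For large negative `t` the printed value approaches the smallest loss, at `t = 0` it equals the average, and for large positive `t` it approaches the largest loss. The shift inside the function is only a numerical-stability device and does not change the value being computed.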

To highlight how the TERM objective can help with issues such as outliers or imbalanced classes, we discuss three motivating examples below, which are illustrated in Figure 1.

Figure 1: Toy examples illustrating TERM as a function of $t$: (a) finding a point estimate from a set of 2D samples, (b) linear regression with outliers, and (c) logistic regression with imbalanced classes. While positive values of $t$ magnify outliers, negative values suppress them. Setting $t = 0$ recovers the original ERM objective (1).

(a) Point estimation: As a first example, consider determining a point estimate from a set of samples that contain some outliers. We plot an example 2D dataset in Figure 1a, with data centered at (1,1). Using traditional ERM (i.e., TERM with $t = 0$) recovers the sample mean, which can be biased towards outlier data. By setting $t < 0$, TERM can suppress outliers by reducing the relative impact of the largest losses (i.e., points that are far from the estimate) in (2). A specific value of $t$ can in fact approximately recover the geometric median, as the objective in (2) can be viewed as approximately optimizing specific loss quantiles (a connection which we make explicit in Section 2). In contrast, if these ‘outlier’ points are important to estimate, setting $t > 0$ will push the solution towards a point that aims to minimize variance, as we prove more rigorously in Section 2, Theorem 4.

(b) Linear regression: A similar interpretation holds for the case of linear regression (Figure 1b). As $t \to -\infty$, TERM is able to find a solution that captures the underlying data while ignoring outliers. However, this solution may not be preferred if we have reason to believe that the outlier values should not be ignored. As $t \to +\infty$, TERM recovers the minimax solution, which aims to minimize the worst loss, thus ensuring the model is a reasonable fit for all samples (at the expense of possibly being a worse fit for many). Similar criteria have been used, e.g., in defining notions of fairness [56, 42]. We explore several use-cases involving robust regression and fairness in more detail in Section 5.

(c) Logistic regression: Finally, we consider a binary classification problem using logistic regression (Figure 1c). For $t \in \mathbb{R}$, the TERM solution varies from the nearest cluster center ($t \to -\infty$), to the logistic regression classifier ($t = 0$), towards a classifier that magnifies the misclassified data ($t \to +\infty$). We note that it is common to modify logistic regression classifiers by adjusting the decision threshold away from its default value, which is equivalent to moving the intercept of the decision boundary. This is fundamentally different from what is offered by TERM (where the slope is changing). As we show in Section 5, this added flexibility gives TERM competitive performance on a number of classification problems, such as those involving noisy data, class imbalance, or a combination of the two.

Contributions. In this work, we propose TERM as a simple, unified framework to flexibly address various challenges with empirical risk minimization. We rigorously analyze the objective in order to understand its behavior with varying $t$, and develop efficient methods for solving TERM. Empirically, we report multiple case studies demonstrating that TERM is competitive with existing, problem-specific state-of-the-art solutions. Finally, we extend TERM to handle compound issues, such as the simultaneous existence of noisy samples and imbalanced classes. We make connections to closely related work throughout the text, and defer a more general discussion of related work to Section 6.

2 Tilted Empirical Risk Minimization: Properties & Interpretations

To better understand the performance of the $t$-tilted losses in (2), we provide several interpretations of the TERM solutions, leaving the full statements of theorems and proofs to the appendix. We make no distributional assumptions on the data, and study properties of TERM under the assumption that the loss function forms a generalized linear model, e.g., $L_2$ loss and logistic loss (Appendix A). However, we also obtain favorable empirical results using TERM with other objectives such as deep neural networks and PCA in Section 5, motivating the extension of our theory beyond GLMs in future work.

Figure 2: TERM objectives for a toy squared loss problem. As $t$ moves from $-\infty$ to $+\infty$, $t$-tilted losses recover min-loss ($t \to -\infty$), avg-loss ($t = 0$), and max-loss ($t \to +\infty$), and approximate median-loss (for some $t$). TERM is smooth for all finite $t$ and convex for positive $t$.

General properties. We begin by noting several general properties of the TERM objective (2). Given a smooth $f(x; \theta)$, the $t$-tilted loss is smooth for all finite $t$ (Lemma 4). If $f(x; \theta)$ is strongly convex, the $t$-tilted loss is strongly convex for $t > 0$ (Lemma 3). We visualize the solutions to TERM for a toy problem in Figure 2, which allows us to illustrate several special cases of the general framework. As discussed in Section 1, TERM can recover traditional ERM ($t = 0$), the max-loss ($t \to +\infty$), and the min-loss ($t \to -\infty$). As we demonstrate in Section 5, providing a smooth tradeoff between these specific losses can be beneficial for a number of practical use-cases, both in terms of the resulting solution and the difficulty of solving the problem itself. Interestingly, we additionally show that the TERM objective can be viewed as a smooth approximation to a superquantile method, which aims to minimize quantiles of losses such as the median loss. Figure 2 makes clear why this may be beneficial, as the median loss (orange) can be highly non-smooth in practice. We make these rough connections more explicit via the interpretations below.

(Interpretation 1) Re-weighting samples to magnify/suppress outliers. As discussed via the toy examples in Section 1, the TERM objective can be tuned (using $t$) to magnify or suppress the influence of outliers. We make this notion rigorous by exploring the gradient of the $t$-tilted loss in order to reason about the solutions to the objective defined in (2).

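Lemma 1 (Tilted gradient, proof in Appendix A).

For a smooth loss function $f(x; \theta)$,

$$\nabla_\theta \widetilde{R}(t; \theta) = \sum_{i \in [N]} w_i(t; \theta) \, \nabla_\theta f(x_i; \theta), \quad \text{where} \quad w_i(t; \theta) := \frac{e^{t f(x_i; \theta)}}{\sum_{j \in [N]} e^{t f(x_j; \theta)}}. \tag{3}$$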

From this, we can observe that the tilted gradient is a weighted average of the gradients of the original individual losses, where each data point is weighted exponentially proportional to the value of its loss. Note that $t = 0$ recovers the uniform weighting associated with ERM, i.e., every sample receives weight $1/N$. For positive $t$ it magnifies the outliers (samples with large losses) by assigning more weight to them, and for negative $t$ it suppresses the outliers by assigning less weight to them.
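The re-weighting described above can be sketched in a few lines. This is a minimal illustration, assuming the weights are the normalized values of exp(t · loss) as the text describes; the helper names `tilted_weights` and `tilted_gradient` and the example numbers are hypothetical and not taken from the paper's code.

```python
import math


def tilted_weights(losses, t):
    """Normalized per-sample weights proportional to exp(t * loss_i)."""
    # Subtract the largest exponent before exponentiating, for numerical stability.
    m = max(t * l for l in losses)
    unnorm = [math.exp(t * l - m) for l in losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def tilted_gradient(losses, grads, t):
    """Weighted average of per-sample gradients (each gradient is a list of floats)."""
    w = tilted_weights(losses, t)
    dim = len(grads[0])
    return [sum(w[i] * grads[i][d] for i in range(len(grads))) for d in range(dim)]


# Illustrative losses; the last sample has a much larger loss than the others.
example_losses = [0.1, 0.2, 5.0]
for t in (-2.0, 0.0, 2.0):
    rounded = [round(w, 4) for w in tilted_weights(example_losses, t)]
    print(f"t={t:+.1f}  weights={rounded}")
```

With `t = 0` the weights are uniform, with positive `t` nearly all of the weight moves to the high-loss sample, and with negative `t` that sample is almost ignored.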

(Interpretation 2) Tradeoff between average-loss and min/max-loss. To put Interpretation 1 in context and understand the limits of TERM, we note that the framework offers a continuum of solutions between the min and max losses. Indeed, for positive values of $t$, TERM enables a smooth tradeoff between the average-loss and max-loss (as we demonstrate in Figure 8, Appendix D). Hence, TERM can selectively improve the worst-performing losses by paying a penalty on average performance, thus promoting a notion of uniformity or fairness (Theorem 2). On the other hand, for negative $t$, the solutions achieve a smooth tradeoff between average-loss and min-loss, which can have the benefit of focusing on the ‘best’ losses, or ignoring outliers (Theorem 3).

(Interpretation 3) Bias-variance tradeoff. Another key property of the TERM solutions is that the variance of the loss across all samples decreases as $t$ increases (Theorem 4). Hence, by increasing $t$, it is possible to trade off between optimizing the average loss vs. reducing variance, allowing the solutions to potentially achieve a better bias-variance tradeoff for generalization [39, 4, 22] (Figure 8, Appendix D). We use this property to achieve better generalization in classification in Section 5. We also prove that the cosine similarity between the loss vector and the all-ones vector monotonically increases with $t$ (Theorem 5), which shows that larger $t$ promotes a more uniform performance across all losses and can have implications in terms of fairness (Section 5.2).
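The variance-reduction behavior can be checked numerically on a tiny problem. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses a scalar point-estimation problem with squared loss, assumes the tilted loss has the exponential-smoothing form (1/t) log of the average of exp(t · loss), and solves each problem by ternary search, which is valid here because the objective is convex in the estimate for non-negative `t`. The data values and function names are hypothetical.

```python
import math


def tilted_loss(losses, t):
    """(1/t) * log(mean(exp(t * loss))), with the t = 0 case defined as the mean."""
    if t == 0:
        return sum(losses) / len(losses)
    shift = max(losses) if t > 0 else min(losses)
    mean_exp = sum(math.exp(t * (l - shift)) for l in losses) / len(losses)
    return shift + math.log(mean_exp) / t


def solve_point_estimate(xs, t, iters=200):
    """Minimize the t-tilted squared loss over a scalar theta by ternary search (t >= 0)."""
    lo, hi = min(xs), max(xs)

    def objective(theta):
        return tilted_loss([(x - theta) ** 2 for x in xs], t)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0


def loss_variance(xs, theta):
    """Variance of the per-sample squared losses at a given estimate."""
    losses = [(x - theta) ** 2 for x in xs]
    mean = sum(losses) / len(losses)
    return sum((l - mean) ** 2 for l in losses) / len(losses)


xs = [0.0, 1.0, 10.0]  # illustrative data; 10.0 sits far from the other points
results = []
for t in (0.0, 0.01, 0.1, 1.0):
    theta = solve_point_estimate(xs, t)
    results.append((t, theta, loss_variance(xs, theta)))
    print(f"t={t:<5}  theta={theta:.4f}  variance of losses={results[-1][2]:.3f}")
```

At `t = 0` the estimate is the sample mean; as `t` grows the estimate moves toward the midpoint of the extreme points and the spread of the per-sample losses shrinks.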

(Interpretation 4) Approximate Value-at-Risk (VaR) or superquantile method. Finally, we show that TERM is related to superquantile-based objectives, which aim to minimize specific quantiles of the individual losses, also known as Value-at-Risk (VaR) in optimization and finance literature [53, 52]. For example, optimizing for 90% of the individual losses, ignoring the worst-performing 10%, could be a more reasonable practical objective than the pessimistic min-max objective. Another common application of this is to use the median in contrast to the mean in the presence of noisy outliers. As we discuss in Appendix B, superquantile methods can be reinterpreted as minimizing the $k$-loss, defined as the $k$-th smallest loss of $N$ (i.e., the $1$-loss is the min-loss, the $N$-loss is the max-loss, and the $k$-loss with $k$ near $N/2$ is the median-loss). While minimizing the $k$-loss is more desirable than ERM in many applications, the $k$-loss (or the VaR) is non-smooth (and generally non-convex), and hence requires the use of non-smooth or difference-of-convex optimization techniques [27, 44, 45]. In Appendix B, we show that the $t$-tilted loss provides a naturally smooth and efficiently solvable approximation of the $k$-loss, and derive relationships between respective values of $t$ and $k$.

3 TERM Extended: Hierarchical Multi-Objective Tilting

Here we consider an extension of TERM that can be used to address practical applications requiring multiple objectives, e.g., simultaneously achieving robustness to noisy data and ensuring fair performance across subgroups. Existing approaches typically aim to address such problems in isolation. To handle multiple objectives with TERM, let each sample $x$ be associated with a group $g \in [G]$, i.e., $x \in g$. These groups could be related to the labels (e.g., classes in a classification task), or may depend only on features. For any $t, \tau \in \mathbb{R}$, we define multi-objective TERM as:

$$\widetilde{J}(t, \tau; \theta) := \frac{1}{t} \log\left( \frac{1}{N} \sum_{g \in [G]} |g| \, e^{t \widetilde{R}_g(\tau; \theta)} \right), \quad \text{where} \quad \widetilde{R}_g(\tau; \theta) := \frac{1}{\tau} \log\left( \frac{1}{|g|} \sum_{x \in g} e^{\tau f(x; \theta)} \right), \tag{4}$$

and $|g|$ is the size of group $g$. Multi-objective TERM recovers TERM as a special case for $\tau = t$ (Appendix, Lemma 6). Similar to the tilted gradient (3), the multi-objective tilted gradient is a weighted sum of the gradients (Appendix, Lemma 5), making it similarly efficient to solve.
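A two-level tilt of this kind can be sketched as follows. This is a minimal illustration under an assumption: samples are tilted within each group using a sample-level tilt `tau`, and the resulting group values are then tilted with a group-level tilt `t`, with each group weighted by its share of the samples. The function names and the example losses are hypothetical.

```python
import math


def tilted_aggregate(values, weights, t):
    """(1/t) * log(sum_i weights[i] * exp(t * values[i])); weights must sum to 1."""
    if t == 0:
        # The t -> 0 limit is the weighted average.
        return sum(w * v for w, v in zip(weights, values))
    # Shift so that every exponent is <= 0, for numerical stability.
    shift = max(values) if t > 0 else min(values)
    s = sum(w * math.exp(t * (v - shift)) for w, v in zip(weights, values))
    return shift + math.log(s) / t


def hierarchical_tilted_loss(groups, t, tau):
    """Tilt per-sample losses inside each group by tau, then tilt group values by t."""
    n = sum(len(g) for g in groups)
    group_values = [tilted_aggregate(g, [1.0 / len(g)] * len(g), tau) for g in groups]
    group_weights = [len(g) / n for g in groups]
    return tilted_aggregate(group_values, group_weights, t)


# Illustrative per-sample losses, split into two groups of different sizes.
groups = [[0.1, 0.2, 4.0], [1.0, 1.2]]
all_losses = [l for g in groups for l in g]
flat = tilted_aggregate(all_losses, [1.0 / len(all_losses)] * len(all_losses), 1.5)
print("flat tilt (t=1.5):        ", round(flat, 6))
print("two-level (t=tau=1.5):    ", round(hierarchical_tilted_loss(groups, 1.5, 1.5), 6))
print("group-level only (tau=0): ", round(hierarchical_tilted_loss(groups, 50.0, 0.0), 6))
```

Setting both tilts to the same value reproduces the single-level tilted loss over all samples, and setting `tau = 0` tilts only the group averages, which for a large positive `t` approaches the worst group's average loss.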

In a subset of our experiments in Section 5, we perform a pure group-level tilting without sample-level tilting, which corresponds to $\tau = 0$. In Section 5.1, we consider grouping based on the identity of the annotator who provides the label associated with each sample to mitigate the different annotation qualities across individual annotators. In the classification experiments of Section 5.2, we perform group-level tilting based on the target class label associated with the classification problem. In the fair principal component analysis (PCA) experiment in Section 5.2, we perform grouping based on a sensitive attribute (education level in this experiment) so that we can ensure a fair performance across all groups. Finally, we validate the effectiveness of hierarchical tilting empirically in Section 5.3 for a hierarchy of depth two, where we show that TERM can significantly outperform baselines to handle class imbalance and noisy outliers simultaneously. Note that hierarchical tilting could be extended to hierarchies of greater depths to simultaneously handle more than two objectives at the cost of one extra hyperparameter for each additional optimization objective.

4 Solving TERM

While the main focus of this work is in understanding properties of the TERM objective and its minimizers, we also provide first-order optimization methods for solving TERM (explained in detail in Appendix C), and explore the effect that $t$ has on the convergence of these methods.

First-order methods. To solve TERM, we suggest batch and stochastic variants of traditional gradient-based methods (Appendix C, Algorithms 1 and 2), which are presented in the context of solving multi-objective hierarchical TERM (4) for full generality. At a high level, in the stochastic case, at each iteration, group-level tilting is addressed by choosing a group based on the corresponding group-level tilted weight vector. Sample-level tilting is then incorporated by re-weighting the samples in a uniformly drawn mini-batch based on their sample-level weights, where we track these weights via stochastic dynamics. We find that these methods perform well empirically on a variety of tasks (Section 5), and comment below on general properties of TERM (smoothness, convexity) that may vary with $t$ and affect the convergence of such methods.

Figure 3: As $t \to +\infty$, the objective becomes less smooth in the vicinity of the final solution, hence suffering from slower convergence. For negative values of $t$, TERM converges quickly due to the smoothness in the vicinity of solutions despite its non-convexity.

Convergence with $t$. First, we note that $t$-tilted losses are $\beta(t)$-smooth for all $t$. In a small neighborhood around the tilted solution, $\beta(t)$ is bounded for all negative $t$ and moderately positive $t$, whereas it scales linearly with $t$ as $t \to +\infty$, which has been previously studied in the context of exponential smoothing of the max [31, 49]. We prove this formally in Appendix A, Lemma 4, but it can also be observed visually via the toy example in Figure 2. Hence, solving TERM to a local optimum using gradient-based methods will tend to be as efficient as traditional ERM for small-to-moderate values of $t$ [26], which we corroborate via experiments on multiple real-world datasets in Section 5. This is in contrast to solving for the minimax solution, which would be similar to solving TERM as $t \to +\infty$ [31, 49, 47].

Second, recall that the $t$-tilted loss remains strongly convex for $t > 0$, so long as the original loss function is strongly convex. On the other hand, for sufficiently large negative $t$, the $t$-tilted loss becomes non-convex. Hence, while the $t$-tilted solutions for positive $t$ are unique, the objective may have multiple (spurious) local minima for negative $t$ even if the original loss function is strongly convex. For negative $t$, we seek the solution for which the parametric set of $t$-tilted solutions obtained by sweeping $t$ remains continuous (as in Figure 1a-c). To this end, for negative $t$, we solve TERM by smoothly decreasing $t$ from $0$, ensuring that the solutions form a continuum in the parameter space. Despite the non-convexity of TERM with $t < 0$, we find that this approach produces effective solutions to multiple real-world problems in Section 5. Additionally, as the objective remains smooth, it is still relatively efficient to solve.
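The idea of sweeping the tilt toward negative values while warm-starting from the previous solution can be sketched on a scalar problem. This is a simplified batch illustration, not the algorithms of Appendix C: it assumes a squared loss, weights proportional to exp(t · loss), and a hypothetical schedule, step size, and dataset chosen only for the example.

```python
import math


def tilted_weights(losses, t):
    """Normalized per-sample weights proportional to exp(t * loss_i)."""
    m = max(t * l for l in losses)
    unnorm = [math.exp(t * l - m) for l in losses]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def descend(xs, theta, t, lr=0.25, steps=200):
    """Batch gradient descent on the t-tilted squared loss for a scalar theta."""
    for _ in range(steps):
        losses = [(x - theta) ** 2 for x in xs]
        w = tilted_weights(losses, t)
        # Tilted gradient: weighted average of the per-sample gradients 2 * (theta - x).
        grad = sum(wi * 2.0 * (theta - x) for wi, x in zip(w, xs))
        theta -= lr * grad
    return theta


def solve_by_continuation(xs, t_schedule, theta0=0.0):
    """Solve for each t in turn, warm-starting from the previous solution."""
    theta, path = theta0, []
    for t in t_schedule:
        theta = descend(xs, theta, t)
        path.append((t, theta))
    return path


# Illustrative data: five inliers around 1.0 and one outlier at 10.0.
xs = [0.9, 0.95, 1.0, 1.05, 1.1, 10.0]
path = solve_by_continuation(xs, [0.0, -0.5, -1.0, -1.5, -2.0])
for t, theta in path:
    print(f"t={t:+.1f}  theta={theta:.4f}")
```

The first entry (at `t = 0`) is the sample mean, which the outlier pulls upward; as the tilt decreases, the estimate settles near the center of the inliers.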

5 TERM in Practice: Use Cases

In this section, we showcase the flexibility, wide applicability, and competitive performance of the TERM framework through empirical results on a variety of real-world problems such as handling outliers (Section 5.1), ensuring fairness and improving generalization (Section 5.2), and addressing compound issues (Section 5.3). Despite the relatively straightforward modification TERM makes to traditional ERM, we show that $t$-tilted losses not only outperform ERM, but either outperform or are competitive with state-of-the-art, problem-specific tailored baselines on a wide range of applications.

We provide implementation details in Appendix E. All code, datasets, and experiments are publicly available at github.com/litian96/TERM. For experiments with positive $t$ (Section 5.2), we tune $t$ on the validation set. For experiments involving negative $t$ (Section 5.1 and Section 5.3), we use a single fixed value of $t$ across all experiments since we assume that a validation set with clean data is not available. For all values of $t$ tested, we observe that the number of iterations required to solve TERM is within $2\times$ that of standard ERM. In the tables provided throughout, we highlight the top methods by marking all solutions that are within standard error of the best solution in bold.

5.1 Mitigating Noisy Outliers ($t < 0$)

We begin by investigating TERM’s ability to find robust solutions that reduce the effect of noisy outliers. We note that we specifically focus on the setting of ‘robustness’ involving random additive noise; the applicability of TERM to more adversarial forms of robustness would be an interesting direction of future work. For a fair comparison, we do not compare with approaches that require additional clean validation data [e.g., 54, 63, 21, 50], as such data can be costly to obtain in practice.

Robust regression. We first consider a regression task with noise-corrupted targets, where we aim to minimize the root mean square error (RMSE) on samples from the Drug Discovery dataset [46, 13]. The task is to predict the bioactivities given a set of chemical compounds. We compare against linear regression with an $L_2$ loss, which we view as the ‘standard’ ERM solution for regression, as well as with losses that are commonly used to mitigate outliers: the $L_1$ loss and Huber loss [23]. We also compare with consistent robust regression (CRR) [6], a recent state-of-the-art method designed for the problem of robust regression. We apply TERM at the sample level with an $L_2$ loss, and generate noisy outliers by assigning random targets to a fraction of the samples. In Table 1, we report RMSE on clean test data for each objective and under different noise levels. We also present the performance of an oracle method (Genie ERM) which has access to all of the clean data samples with the noisy samples removed. Note that Genie ERM is not a practical algorithm and is solely presented to set the expected performance limit in the noisy setting. The results indicate that TERM is competitive with baselines on the 20% noise level, and achieves better robustness with moderate-to-extreme noise. We observe similar trends in scenarios involving both noisy features and targets (Appendix D.2). In terms of runtime, solving TERM is roughly as efficient as ERM, while CRR tends to run slowly as it scales cubically with the number of dimensions [6].

Table 1: TERM is competitive with robust regression baselines, and is superior in high noise regimes. Entries are test RMSE on Drug Discovery.

| objectives | 20% noise | 40% noise | 80% noise |
|---|---|---|---|
| ERM | 1.87 (.05) | 2.83 (.06) | 4.74 (.06) |
| $L_1$ | 1.15 (.07) | 1.70 (.12) | 4.78 (.08) |
| Huber [23] | 1.16 (.07) | 1.78 (.11) | 4.74 (.07) |
| CRR [6] | 1.10 (.07) | 1.51 (.08) | 4.07 (.06) |
| TERM | 1.08 (.05) | 1.10 (.04) | 1.68 (.03) |
| Genie ERM | 1.02 (.04) | 1.07 (.04) | 1.04 (.03) |

Table 2: TERM is competitive with robust classification baselines, and is superior in high noise regimes. Entries are test accuracy on CIFAR-10 (Inception).

| objectives | 20% noise | 40% noise | 80% noise |
|---|---|---|---|
| ERM | 0.775 (.004) | 0.719 (.004) | 0.284 (.004) |
| RandomRect [50] | 0.744 (.004) | 0.699 (.005) | 0.384 (.005) |
| SelfPaced [33] | 0.784 (.004) | 0.733 (.004) | 0.272 (.004) |
| MentorNet-PD [25] | 0.798 (.004) | 0.731 (.004) | 0.312 (.005) |
| GCE [74] | 0.805 (.004) | 0.750 (.004) | 0.433 (.005) |
| TERM | 0.795 (.004) | 0.768 (.004) | 0.455 (.005) |
| Genie ERM | 0.828 (.004) | 0.820 (.004) | 0.792 (.004) |

Robust classification. It is well-known that deep neural networks can easily overfit to corrupted labels [e.g., 73]. While the theoretical properties we study for TERM (Section 2) do not directly cover objectives with neural network function approximations, we show that TERM can be applied empirically to DNNs to achieve robustness to noisy training labels. MentorNet [25] is a popular method in this setting, which learns to assign weights to samples based on feedback from a student net. Following the setup in [25], we explore classification on CIFAR-10 [32] when a fraction of the training labels are corrupted with uniform noise—comparing TERM with ERM and several state-of-the-art approaches [33, 50, 74, 32]. As shown in Table 2, TERM performs competitively with 20% noise, and outperforms all baselines in the high noise regimes. Here we use MentorNet-PD as a baseline since it does not require clean validation data. However, in Appendix D.2, we show that TERM can in fact match the performance of MentorNet-DD, which requires clean validation data.

Low-quality annotators. It is not uncommon for practitioners to obtain human-labeled data for their learning tasks from crowd-sourcing platforms. However, these labels are usually noisy in part due to the varying quality of the human annotators. Given a collection of labeled samples from crowd-workers, we aim to learn statistical models that are robust to the potentially low-quality annotators. As a case study, following the setup of Khetan et al. [30], we take the CIFAR-10 dataset and simulate 100 annotators where 20 of them are hammers (i.e., always correct) and 80 of them are spammers (i.e., assigning labels uniformly at random). We apply TERM at the annotator group level in (4), which is equivalent to assigning annotator-level weights based on the aggregate value of their loss. As shown in Figure 4, TERM is able to achieve the test accuracy limit set by Genie ERM, i.e., the ideal performance obtained by completely removing the known outliers. We note in particular that the accuracy reported by Khetan et al. [30] (0.777) is lower than TERM (0.825) in the same setup, even though their approach is a two-pass algorithm requiring at least double the training time. We provide full empirical details and investigate additional noisy annotator scenarios in Appendix D.2.

Figure 4: TERM completely removes the impact of noisy annotators, reaching the performance limit set by Genie ERM.

Figure 5: TERM-PCA flexibly trades the performance on the high (H) edu group for the performance on the low (L) edu group.

Figure 6: TERM is competitive with state-of-the-art methods for classification with imbalanced classes.

5.2 Fairness and Generalization ($t > 0$)

In this section, we show that positive values of $t$ in TERM can help promote fairness (e.g., via learning fair representations), and offer variance reduction for better generalization.

Fair principal component analysis (PCA). We explore the flexibility of TERM in learning fair representations using PCA. In fair PCA, the goal is to learn low-dimensional representations which are fair to all considered subgroups (e.g., yielding similar reconstruction errors) [56, 62, 28]. Despite the non-convexity of the fair PCA problem, we apply TERM to this task, referring to the resulting objective as TERM-PCA. We tilt the same loss function as in Samadi et al. [56]:

$$f(X; U) = \frac{1}{|X|} \left( \| X - X U U^\top \|_F^2 - \| X - \hat{X} \|_F^2 \right),$$

where $X$ is a subset (group) of data, $U$ is the current projection, and $\hat{X}$ is the optimal rank-$r$ approximation of $X$. Instead of solving a more complex min-max problem using semi-definite programming as in Samadi et al. [56], which scales poorly with problem dimension, we apply gradient-based methods, re-weighting the gradients at each iteration based on the loss on each group. In Figure 5, we plot the aggregate loss for two groups (high vs. low education) in the Default Credit dataset [70] for different target dimensions $r$. By varying $t$ in TERM, we achieve varying degrees of performance improvement on different groups. TERM with a large $t$ effectively recovers the min-max results of Samadi et al. [56] by forcing the losses on both groups to be (almost) identical, while TERM with a more moderate $t$ offers the flexibility of reducing the performance gap less aggressively.

Handling class imbalance. Next, we show that TERM can reduce the performance variance across classes with extremely imbalanced data when training deep neural networks. We compare TERM with several baselines which re-weight samples during training, including focal loss [37], HardMine [38], and LearnReweight [50]. Following Ren et al. [50], the datasets are composed of two imbalanced digit classes from MNIST [35]. From Figure 6, we see that TERM obtains similar (or higher) final accuracy on the clean test data as the state-of-the-art methods. We also note that compared with LearnReweight, which optimizes the model over an additional balanced validation set and requires three gradient calculations for each update, TERM neither requires such balanced validation data nor increases the per-iteration complexity.

Improving generalization via variance reduction. A common alternative to ERM is to consider a distributionally robust objective, which optimizes for the worst-case training loss over a set of distributions, and has been shown to offer variance-reduction properties that benefit generalization [e.g., 43, 59]. While not directly developed for distributional robustness, TERM also enables variance reduction for positive values of $t$ (Theorem 4), which can be used to strike a better bias-variance tradeoff for generalization. We compare TERM (applied at the class-level as in (4), with logistic loss) with robustly regularized risk (RobustRegRisk) as in [43] on the HIV-1 [55, 15] dataset originally investigated by Namkoong and Duchi [43]. We examine the accuracy on the rare class, the accuracy on the common class, and the overall accuracy.

Table 3: TERM with a smaller $t$ is competitive with strong baselines in generalization. TERM with a larger $t$ outperforms ERM (with the decision threshold changed to provide fairness) and is competitive with RobustRegRisk with no need for extra hyperparameter tuning. Entries are test accuracy on HIV-1.

| objectives | rare class | common class | overall |
|---|---|---|---|
| ERM | 0.822 (.009) | 0.966 (.002) | 0.934 (.003) |
| Linear SVM | 0.838 (.013) | 0.964 (.002) | 0.937 (.004) |
| LearnReweight [50] | 0.841 (.014) | 0.961 (.004) | 0.934 (.004) |
| FocalLoss [37] | 0.834 (.013) | 0.966 (.003) | 0.937 (.004) |
| RobustRegRisk [43] | 0.844 (.010) | 0.966 (.003) | 0.939 (.004) |
| TERM (smaller $t$) | 0.844 (.011) | 0.964 (.003) | 0.937 (.003) |
| ERM (thresh = 0.26) | 0.916 (.008) | 0.917 (.003) | 0.917 (.002) |
| RobustRegRisk (thresh = 0.49) | 0.917 (.005) | 0.928 (.002) | 0.924 (.001) |
| TERM (larger $t$) | 0.919 (.004) | 0.926 (.003) | 0.924 (.002) |

The mean and standard error of accuracies are reported in Table 3. RobustRegRisk and TERM offer similar performance improvements compared with other baselines, such as linear SVM, FocalLoss [37], and LearnReweight [50]. For larger $t$, TERM achieves similar accuracy in both classes, while RobustRegRisk does not show similar trends by sweeping its hyperparameters. It is common to adjust the decision threshold to boost the accuracy on the rare class. We do this for ERM and RobustRegRisk and optimize the threshold so that ERM and RobustRegRisk result in the same validation accuracy on the rare class as TERM with the larger $t$. TERM achieves similar performance to RobustRegRisk without the need for an extra tuned hyperparameter.

5.3 Solving Compound Issues: Hierarchical Multi-Objective Tilting

In this section, we focus on settings where multiple issues, e.g., class imbalance and label noise, exist in the data simultaneously. We discuss two possible instances of hierarchical multi-objective TERM to tackle such problems. One can think of other variants in this hierarchical tilting space which could be useful depending on applications at hand. However, we are not aware of other prior work that aims to simultaneously handle multiple goals, e.g., suppressing noisy samples and addressing class imbalance, in a unified framework without additional validation data.

We explore the HIV-1 dataset [55], as in Section 5.2. We report both overall accuracy and accuracy on the rare class in four separate scenarios: (a) clean and 1:4, which is the original dataset that is naturally slightly imbalanced with rare samples represented 1:4 with respect to the common class; (b) clean and 1:20, where we subsample to introduce a 1:20 imbalance ratio; (c) noisy and 1:4, which is the original dataset with labels associated with 30% of the samples randomly reshuffled; and (d) noisy and 1:20, where 30% of the labels of the 1:20 imbalanced dataset are reshuffled.

Table 4: Hierarchical TERM can address both class imbalance and noisy samples. Entries are test accuracy on HIV-1, reported on the rare class and overall.

| objectives | clean, 1:4, rare | clean, 1:4, overall | clean, 1:20, rare | clean, 1:20, overall | 30% noise, 1:4, rare | 30% noise, 1:4, overall | 30% noise, 1:20, rare | 30% noise, 1:20, overall |
|---|---|---|---|---|---|---|---|---|
| ERM | 0.822 (.009) | 0.934 (.003) | 0.503 (.013) | 0.888 (.006) | 0.656 (.014) | 0.911 (.006) | 0.240 (.018) | 0.831 (.011) |
| GCE [74] | 0.822 (.009) | 0.934 (.003) | 0.503 (.013) | 0.888 (.006) | 0.732 (.021) | 0.925 (.005) | 0.324 (.017) | 0.849 (.008) |
| LearnReweight [50] | 0.841 (.014) | 0.934 (.004) | 0.800 (.022) | 0.904 (.003) | 0.721 (.034) | 0.856 (.008) | 0.532 (.054) | 0.856 (.013) |
| RobustRegRisk [43] | 0.844 (.010) | 0.939 (.004) | 0.622 (.011) | 0.906 (.005) | 0.634 (.014) | 0.907 (.006) | 0.051 (.014) | 0.792 (.012) |
| FocalLoss [37] | 0.834 (.013) | 0.937 (.004) | 0.806 (.020) | 0.918 (.003) | 0.638 (.008) | 0.908 (.005) | 0.565 (.027) | 0.890 (.009) |
| TERM (sample + class) | 0.844 (.011) | 0.937 (.003) | 0.837 (.017) | 0.922 (.003) | 0.847 (.010) | 0.920 (.004) | 0.740 (.010) | 0.907 (.004) |
| TERM (class + annotator) | 0.843 (.012) | 0.937 (.004) | 0.831 (.021) | 0.920 (.002) | 0.846 (.017) | 0.934 (.005) | 0.804 (.016) | 0.916 (.003) |

In Table 4, hierarchical TERM is applied at the sample level and class level (the first TERM row), where we use a negative sample-level tilt for noisy data and a positive class-level tilt chosen separately for the 1:4 case and the 1:20 case. We compare against baselines for robust classification and class imbalance (discussed previously in Sections 5.1 and 5.2), where we tune them for best performance (Appendix E). Similar to the experiments in Section 5.1, we avoid using baselines that require clean validation data [e.g., 54]. While different baselines perform well in their respective problem settings, TERM is far superior to all baselines when considering noisy samples and class imbalance simultaneously (rightmost column in Table 4). Finally, in the last row of Table 4, we simulate the noisy annotator setting of Section 5.1 assuming that the data is coming from 10 annotators, i.e., in the 30% noise case we have 7 hammers and 3 spammers. In this case, we apply hierarchical TERM at both class and annotator levels, where we perform the higher level tilt at the annotator (group) level and the lower level tilt at the class level (with no sample-level tilting). We show that this approach can benefit noisy/imbalanced data even further (far right, Table 4), while suffering only a small performance drop on the clean and noiseless data (far left, Table 4).

6 Related Work

Alternate aggregation schemes: exponential smoothing/superquantile methods.

A common alternative to the standard average loss in empirical risk minimization is to consider a minimax objective, which aims to minimize the max-loss. Minimax objectives are commonplace in machine learning, and have been used for a wide range of applications, such as ensuring fairness across subgroups [42, 60, 56, 62, 20], enabling robustness under small perturbations [59], or generalizing to unseen domains [64]. As discussed in Section 2, the TERM objective can be viewed as a minimax smoothing [31, 49] with the added flexibility of a tunable $t$, which allows the user to optimize utility for different quantiles of the loss, similar to superquantile approaches [53, 52, 34, 44], directly trading off between robustness/fairness and utility for positive and negative values of $t$ (see Appendix B for these connections). However, the TERM objective remains smooth (and efficiently solvable) for moderate values of $t$, resulting in faster convergence even when the resulting solutions are effectively the same as the min-max solution or other desired quantiles of the loss (as we demonstrate in the experiments of Section 5). Interestingly, Cohen et al. [10, 11] introduce Simnets, with a similar exponential smoothing operator, though for the different purpose of flexibly achieving layer operations between sum and max in deep neural networks.

Alternate loss functions. Rather than modifying the way the losses are aggregated, as in (smoothed) minimax or superquantile methods, it is also quite common to modify the losses themselves. For example, in robust regression, it is common to consider losses such as the $L_1$ loss, Huber loss, or general $M$-estimators as a way to mitigate the effect of outliers [5]. Losses can also be modified to address outliers by favoring small losses [71, 74] or gradient clipping [41]. On the other extreme, the largest losses can be magnified in order to encourage focus on hard samples [37, 67, 36], which is a popular approach for curriculum learning. Constraints could also be imposed to promote fairness [19, 14, 51, 72, 2]. Ignoring the log portion of the objective in (2), TERM can in fact be viewed as an alternate loss function exponentially shaping the loss to achieve both of these goals with a single objective, i.e., magnifying hard examples with $t > 0$ and suppressing outliers with $t < 0$. In addition, we show that TERM can even achieve both goals simultaneously with hierarchical multi-objective optimization (Section 5.3).

Sample re-weighting schemes. Finally, there exist approaches that implicitly modify the underlying ERM objective by re-weighting the influence of the samples themselves. These re-weighting schemes can be enforced in many ways. A simple and widely used example is to subsample training points in different classes. Alternatively one can re-weight examples according to their loss function when using a stochastic optimizer, which can be used to put more emphasis on “hard” examples [57, 24, 29]. Re-weighting can also be implicitly enforced via the inclusion of a regularization parameter [1], loss clipping [69], or modelling of crowd-worker qualities [30], which can make the objective more robust to rare instances. Such an explicit re-weighting has also been explored for other applications [e.g., 37, 25, 58, 9, 17, 50], though in contrast to these methods, TERM is applicable to a general class of loss functions, with theoretical guarantees. TERM is equivalent to a dynamic re-weighting of the samples based on the values of the objectives (Lemma 1), which could be viewed as a convexified version of loss clipping. We compare to several sample re-weighting schemes empirically in Section 5.

7 Conclusion

In this paper, we introduced tilted empirical risk minimization (TERM) as a flexible alternative to ERM. We explored, both theoretically and empirically, TERM’s ability to handle various known issues with ERM, such as robustness to noise in regression/classification, class imbalance, fairness, and generalization. Our theoretical analyses provide insight into the behavior and applicability of TERM for various values of $t$. We additionally extended TERM to address compound issues like the simultaneous existence of class imbalance and noisy outliers. Despite the straightforward modification TERM makes to traditional ERM objectives, the framework consistently outperforms ERM and delivers competitive performance with state-of-the-art, problem-specific methods on a wide range of applications. The simplicity of TERM also affords many practical benefits: for example, training times for TERM ran within 2x of the ERM baseline in all of our experiments, and in contrast to many state-of-the-art methods, TERM does not require clean validation data, which can be costly to obtain. In future work, it would be interesting to gain a deeper theoretical understanding of TERM on objectives beyond GLMs, and to explore applications of TERM on additional learning problems.

Acknowledgements

We are grateful to Arun Sai Suggala and Adarsh Prasad (CMU) for their helpful comments on robust regression; to Zhiguang Wang, Dario Garcia Garcia, Alborz Geramifard, and other members of Facebook AI for productive discussions and feedback and pointers to prior work [10, 11, 67, 53]; and to Meisam Razaviyayn (USC) for helpful discussions and pointers to exponential smoothing [31, 49], Value-at-Risk [52, 44], and general properties of gradient-based methods in non-convex optimization problems [26, 27, 18, 47]. The work of TL and VS was supported in part by the National Science Foundation grant IIS1838017, a Google Faculty Award, a Carnegie Bosch Institute Research Award, and the CONIX Research Center. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the National Science Foundation or any other funding agency.

References

  • Abdelkarim et al. [2020] S. Abdelkarim, P. Achlioptas, J. Huang, B. Li, K. Church, and M. Elhoseiny. Long-tail visual relationship recognition with a visiolinguistic hubless loss. arXiv preprint arXiv:2004.00436, 2020.
  • Baharlouei et al. [2020] S. Baharlouei, M. Nouiehed, A. Beirami, and M. Razaviyayn. Rényi fair inference. International Conference on Learning Representations, 2020.
  • Beirami et al. [2019] A. Beirami, R. Calderbank, M. M. Christiansen, K. R. Duffy, and M. Médard. A characterization of guesswork on swiftly tilting curves. IEEE Transactions on Information Theory, 2019.
  • Bennett [1962] G. Bennett. Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association, 1962.
  • Bhatia et al. [2015] K. Bhatia, P. Jain, and P. Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, 2015.
  • Bhatia et al. [2017] K. Bhatia, P. Jain, P. Kamalaruban, and P. Kar. Consistent robust regression. In Advances in Neural Information Processing Systems, 2017.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
  • Bubeck [2015] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 2015.
  • Chang et al. [2017] H.-S. Chang, E. Learned-Miller, and A. McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, 2017.
  • Cohen and Shashua [2014] N. Cohen and A. Shashua. Simnets: A generalization of convolutional networks. arXiv preprint arXiv:1410.0781, 2014.
  • Cohen et al. [2016] N. Cohen, O. Sharir, and A. Shashua. Deep simnets. In Conference on Computer Vision and Pattern Recognition, 2016.
  • Dembo and Zeitouni [2009] A. Dembo and O. Zeitouni. Large deviations techniques and applications. Springer Science & Business Media, 2009.
  • Diakonikolas et al. [2019] I. Diakonikolas, G. Kamath, D. Kane, J. Li, J. Steinhardt, and A. Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, 2019.
  • Donini et al. [2018] M. Donini, L. Oneto, S. Ben-David, J. S. Shawe-Taylor, and M. Pontil. Empirical risk minimization under fairness constraints. In Advances in Neural Information Processing Systems, 2018.
  • Dua and Graff [2019] D. Dua and C. Graff. UCI machine learning repository [http://archive.ics.uci.edu/ml]. https://archive.ics.uci.edu/ml/datasets, 2019.
  • Duarte and Hu [2004] M. F. Duarte and Y. H. Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 2004.
  • Gao et al. [2015] J. Gao, H. Jagadish, and B. C. Ooi. Active sampler: Light-weight accelerator for complex data analytics at scale. arXiv preprint arXiv:1512.03880, 2015.
  • Ge et al. [2015] R. Ge, F. Huang, C. Jin, and Y. Yuan. Escaping from saddle points: online stochastic gradient for tensor decomposition. In Conference on Learning Theory, 2015.
  • Hardt et al. [2016] M. Hardt, E. Price, and N. Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, 2016.
  • Hashimoto et al. [2018] T. Hashimoto, M. Srivastava, H. Namkoong, and P. Liang. Fairness without demographics in repeated loss minimization. In International Conference on Machine Learning, 2018.
  • Hendrycks et al. [2018] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In Advances in Neural Information Processing Systems, 2018.
  • Hoeffding [1994] W. Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding. 1994.
  • Huber [1964] P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 1964.
  • Jiang et al. [2019] A. H. Jiang, D. L.-K. Wong, G. Zhou, D. G. Andersen, J. Dean, G. R. Ganger, G. Joshi, M. Kaminsky, M. Kozuch, Z. C. Lipton, et al. Accelerating deep learning by focusing on the biggest losers. arXiv preprint arXiv:1910.00762, 2019.
  • Jiang et al. [2018] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, 2018.
  • Jin et al. [2017] C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, 2017.
  • Jin et al. [2019] C. Jin, P. Netrapalli, and M. I. Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.
  • Kamani et al. [2019] M. M. Kamani, F. Haddadpour, R. Forsati, and M. Mahdavi. Efficient fair principal component analysis. arXiv preprint arXiv:1911.04931, 2019.
  • Katharopoulos and Fleuret [2017] A. Katharopoulos and F. Fleuret. Biased importance sampling for deep neural network training. arXiv preprint arXiv:1706.00043, 2017.
  • Khetan et al. [2018] A. Khetan, Z. C. Lipton, and A. Anandkumar. Learning from noisy singly-labeled data. In International Conference on Learning Representations, 2018.
  • Kort and Bertsekas [1972] B. W. Kort and D. P. Bertsekas. A new penalty function method for constrained minimization. In IEEE Conference on Decision and Control and 11th Symposium on Adaptive Processes, 1972.
  • Krizhevsky et al. [2009] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Kumar et al. [2010] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, 2010.
  • Laguel et al. [2020] Y. Laguel, K. Pillutla, J. Malick, and Z. Harchaoui. Device heterogeneity in federated learning: A superquantile approach. arXiv preprint arXiv:2002.11223, 2020.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
  • Li et al. [2020] T. Li, M. Sanjabi, A. Beirami, and V. Smith. Fair resource allocation in federated learning. In International Conference on Learning Representations, 2020.
  • Lin et al. [2017] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In International Conference on Computer Vision, 2017.
  • Malisiewicz et al. [2011] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In International Conference on Computer Vision, 2011.
  • Maurer and Pontil [2009] A. Maurer and M. Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
  • McMahan et al. [2017] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas. Communication-efficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics, 2017.
  • Menon et al. [2020] A. K. Menon, A. S. Rawat, S. J. Reddi, and S. Kumar. Can gradient clipping mitigate label noise? In International Conference on Learning Representations, 2020.
  • Mohri et al. [2019] M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In International Conference on Machine Learning, 2019.
  • Namkoong and Duchi [2017] H. Namkoong and J. C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, 2017.
  • Nouiehed et al. [2019a] M. Nouiehed, J.-S. Pang, and M. Razaviyayn. On the pervasiveness of difference-convexity in optimization and statistics. Mathematical Programming, 2019a.
  • Nouiehed et al. [2019b] M. Nouiehed, M. Sanjabi, T. Huang, J. D. Lee, and M. Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, 2019b.
  • Olier et al. [2018] I. Olier, N. Sadawi, G. R. Bickerton, J. Vanschoren, C. Grosan, L. Soldatova, and R. D. King. Meta-qsar: a large-scale application of meta-learning to drug design and discovery. Machine Learning, 2018.
  • Ostrovskii et al. [2020] D. M. Ostrovskii, A. Lowy, and M. Razaviyayn. Efficient search of first-order Nash equilibria in nonconvex-concave smooth min-max problems. arXiv preprint arXiv:2002.07919, 2020.
  • Pace and Barry [1997] R. K. Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 1997.
  • Pee and Royset [2011] E. Pee and J. O. Royset. On solving large-scale finite minimax problems using exponential smoothing. Journal of Optimization Theory and Applications, 2011.
  • Ren et al. [2018] M. Ren, W. Zeng, B. Yang, and R. Urtasun. Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, 2018.
  • Rezaei et al. [2019] A. Rezaei, R. Fathony, O. Memarrast, and B. Ziebart. Fair logistic regression: An adversarial perspective. arXiv preprint arXiv:1903.03910, 2019.
  • Rockafellar and Uryasev [2002] R. T. Rockafellar and S. Uryasev. Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 2002.
  • Rockafellar et al. [2000] R. T. Rockafellar, S. Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2000.
  • Roh et al. [2020] Y. Roh, K. Lee, S. E. Whang, and C. Suh. Fr-train: A mutual information-based approach to fair and robust training. In International Conference on Machine Learning, 2020.
  • Rögnvaldsson [2013] T. Rögnvaldsson. UCI repository of machine learning databases. https://archive.ics.uci.edu/ml/datasets/HIV-1+protease+cleavage, 2013.
  • Samadi et al. [2018] S. Samadi, U. Tantipongpipat, J. H. Morgenstern, M. Singh, and S. Vempala. The price of fair PCA: One extra dimension. In Advances in Neural Information Processing Systems, 2018.
  • Shrivastava et al. [2016] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Conference on Computer Vision and Pattern Recognition, 2016.
  • Shu et al. [2019] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. In Advances in Neural Information Processing Systems, 2019.
  • Sinha et al. [2018] A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.
  • Stelmakh et al. [2019] I. Stelmakh, N. B. Shah, and A. Singh. Peerreview4all: Fair and accurate reviewer assignment in peer review. In Algorithmic Learning Theory, 2019.
  • Szegedy et al. [2016] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition, 2016.
  • Tantipongpipat et al. [2019] U. Tantipongpipat, S. Samadi, M. Singh, J. H. Morgenstern, and S. Vempala. Multi-criteria dimensionality reduction with applications to fairness. In Advances in Neural Information Processing Systems, 2019.
  • Veit et al. [2017] A. Veit, N. Alldrin, G. Chechik, I. Krasin, A. Gupta, and S. Belongie. Learning from noisy large-scale datasets with minimal supervision. In Conference on Computer Vision and Pattern Recognition, 2017.
  • Volpi et al. [2018] R. Volpi, H. Namkoong, O. Sener, J. C. Duchi, V. Murino, and S. Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, 2018.
  • Wainwright and Jordan [2008] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 2008.
  • Wainwright et al. [2005] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 2005.
  • Wang et al. [2016] Z. Wang, T. Oates, and J. Lo. Adaptive normalized risk-averting training for deep neural networks. In AAAI Conference on Artificial Intelligence, 2016.
  • Weyl [1912] H. Weyl. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen, 1912.
  • Yang et al. [2010] M. Yang, L. Xu, M. White, D. Schuurmans, and Y.-l. Yu. Relaxed clipping: A global training method for robust regression and classification. In Advances in Neural Information Processing Systems, 2010.
  • Yeh and Lien [2009] I.-C. Yeh and C.-h. Lien. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications, 2009.
  • Yu et al. [2012] Y.-l. Yu, Ö. Aslan, and D. Schuurmans. A polynomial-time form of robust regression. In Advances in Neural Information Processing Systems, 2012.
  • Zafar et al. [2017] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Conference on World Wide Web, 2017.
  • Zhang et al. [2017] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
  • Zhang and Sabuncu [2018] Z. Zhang and M. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In Advances in Neural Information Processing Systems, 2018.

Appendix A Properties of TERM: Full Statements and Proofs

In this section we first provide assumptions that are used throughout our theoretical analyses (Appendix A.1). We then state general properties of the TERM objective (Appendix A.2) and properties of hierarchical multi-objective TERM (Appendix A.3). Finally, we present our main results that concern the properties of the solutions of TERM for generalized linear models (Appendix A.4).

A.1 Assumptions

The results in this paper are derived under one of the following four assumptions:

Assumption 1 (Smoothness condition).

We assume that for $i \in [N]$, the loss function $f(x_i; \theta)$ is in differentiability class $C^1$ (i.e., continuously differentiable) with respect to $\theta \in \Theta \subseteq \mathbb{R}^d$.

Assumption 2 (Strong convexity condition).

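We assume that Assumption 1 is satisfied. In addition, we assume that for any $i \in [N]$, $f(x_i; \theta)$ is in differentiability class $C^2$ (i.e., twice differentiable with continuous Hessian) with respect to $\theta$. We further assume that there exist $\beta_{\min}, \beta_{\max} \in \mathbb{R}^+$ such that for $i \in [N]$ and any $\theta \in \Theta \subseteq \mathbb{R}^d$,

$$\beta_{\min} \mathbf{I} \preceq \nabla^2_{\theta\theta^\top} f(x_i; \theta) \preceq \beta_{\max} \mathbf{I}, \tag{5}$$

where $\mathbf{I}$ is the identity matrix of appropriate size (in this case $d \times d$). We further assume that there does not exist any $\theta \in \Theta$ such that $\nabla_\theta f(x_i; \theta) = 0$ for all $i \in [N]$.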

Assumption 3 (Generalized linear model condition [65]).

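We assume that Assumption 2 is satisfied. We further assume that the loss function $f(x; \theta)$ is given by

$$f(x; \theta) = A(\theta) - \theta^\top T(x), \tag{6}$$

where $A(\cdot)$ is a convex function such that there exists $\beta_{\max}$ such that for any $\theta \in \Theta \subseteq \mathbb{R}^d$,

$$\beta_{\min} \mathbf{I} \preceq \nabla^2_{\theta\theta^\top} A(\theta) \preceq \beta_{\max} \mathbf{I}. \tag{7}$$

We also assume that

$$\sum_{i \in [N]} T(x_i)\, T(x_i)^\top \succ 0. \tag{8}$$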

This nested set of assumptions becomes most restrictive with Assumption 3, which essentially requires that the loss be the negative log-likelihood of an exponential family. While the assumption is stated using the natural parameter of an exponential family for ease of presentation, the results hold for a bijective and smooth reparameterization of the exponential family. Assumption 3 is satisfied by the commonly used $L_2$ loss for regression and logistic loss for classification (see toy examples (b) and (c) in Figure 1). While the assumption is not satisfied when we use neural network function approximators in Section 5.1, we observe favorable numerical results, which motivates extending these results beyond the cases that are theoretically studied in this paper.

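In the sequel, many of the results are concerned with characterizing the $t$-tilted solutions, defined as the parametric set of solutions of the $t$-tilted losses obtained by sweeping $t \in \mathbb{R}$,

$$\breve{\theta}(t) \in \arg\min_{\theta \in \Theta} \widetilde{R}(t; \theta), \tag{9}$$

where $\Theta \subseteq \mathbb{R}^d$ is an open subset of $\mathbb{R}^d$. We state an assumption on this set below.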

Assumption 4 (Strict saddle property (Definition 4 in [18])).

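We assume that the set $\arg\min_{\theta \in \Theta} \widetilde{R}(t; \theta)$ is non-empty for all $t \in \mathbb{R}$. Further, we assume that for all $t \in \mathbb{R}$, $\widetilde{R}(t; \theta)$ is a “strict saddle” as a function of $\theta$, i.e., for all local minima, $\lambda_{\min}\big(\nabla^2_{\theta\theta^\top} \widetilde{R}(t; \theta)\big) > 0$, and for all other stationary solutions, $\lambda_{\min}\big(\nabla^2_{\theta\theta^\top} \widetilde{R}(t; \theta)\big) < 0$, where $\lambda_{\min}(\cdot)$ is the minimum eigenvalue of the matrix.

We use the strict saddle property in order to reason about the properties of the $t$-tilted solutions. In particular, since we are solely interested in the local minima of $\widetilde{R}(t; \theta)$, the strict saddle property implies that for every $t \in \mathbb{R}$ and a sufficiently small $r$, for all $\theta \in \mathcal{B}(\breve{\theta}(t), r)$,

$$\nabla^2_{\theta\theta^\top} \widetilde{R}(t; \theta) \succ 0, \tag{10}$$

where $\mathcal{B}(\breve{\theta}(t), r)$ denotes a $d$-ball of radius $r$ around $\breve{\theta}(t)$, a local minimizer of $\widetilde{R}(t; \theta)$.

We will show later that the strict saddle property is readily verified for $t \in \mathbb{R}^+$ under Assumption 2.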

A.2 General properties of the TERM objective

Proof of Lemma 1.

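Lemma 1, which provides the gradient of the tilted objective, has been studied previously in the context of exponential smoothing (see [49, Proposition 2.1]). We provide a brief derivation here under Assumption 1 for completeness. We have:

$$\nabla_\theta \widetilde{R}(t; \theta) = \nabla_\theta \left\{ \frac{1}{t} \log\left( \frac{1}{N} \sum_{i \in [N]} e^{t f(x_i; \theta)} \right) \right\} \tag{11}$$
$$= \frac{\sum_{i \in [N]} \nabla_\theta f(x_i; \theta)\, e^{t f(x_i; \theta)}}{\sum_{i \in [N]} e^{t f(x_i; \theta)}}. \tag{12}$$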
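As a quick sanity check of this derivation, the sketch below compares the weighted-gradient form against central finite differences of the tilted objective. The squared-error losses, the data `A` and `b`, and the helper names `tilted_risk` and `tilted_weights` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def tilted_risk(losses, t):
    """(1/t) * log(mean(exp(t * losses))), computed stably; t must be nonzero."""
    z = t * np.asarray(losses, dtype=float)
    m = z.max()
    return (m + np.log(np.mean(np.exp(z - m)))) / t

def tilted_weights(losses, t):
    """Weights proportional to exp(t * loss_i), normalized to sum to one."""
    z = t * np.asarray(losses, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical toy problem: f_i(theta) = 0.5 * (a_i @ theta - b_i)^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(8, 3)), rng.normal(size=8)
theta, t = rng.normal(size=3), 1.5

def losses_at(th):
    return 0.5 * (A @ th - b) ** 2

grads = (A @ theta - b)[:, None] * A        # row i is the gradient of f_i
w = tilted_weights(losses_at(theta), t)
tilted_grad = w @ grads                     # weighted sum of per-sample gradients

# Central finite differences of the tilted objective, for comparison.
eps = 1e-6
fd_grad = np.array([
    (tilted_risk(losses_at(theta + eps * e), t)
     - tilted_risk(losses_at(theta - eps * e), t)) / (2 * eps)
    for e in np.eye(3)
])
```

Subtracting the maximum exponent before exponentiating is a standard log-sum-exp stabilization; it leaves the weights unchanged because the common factor cancels in the ratio.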

Lemma 2.

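Under Assumption 1,

$$\widetilde{R}(-\infty; \theta) := \lim_{t \to -\infty} \widetilde{R}(t; \theta) = \check{R}(\theta), \tag{13}$$
$$\widetilde{R}(0; \theta) := \lim_{t \to 0} \widetilde{R}(t; \theta) = \overline{R}(\theta), \tag{14}$$
$$\widetilde{R}(+\infty; \theta) := \lim_{t \to +\infty} \widetilde{R}(t; \theta) = \widehat{R}(\theta), \tag{15}$$

where $\overline{R}(\theta)$ is the average loss, $\widehat{R}(\theta)$ is the max-loss, and $\check{R}(\theta)$ is the min-loss:

$$\widehat{R}(\theta) := \max_{i \in [N]} f(x_i; \theta), \qquad \check{R}(\theta) := \min_{i \in [N]} f(x_i; \theta). \tag{16}$$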
Proof.

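For $t \to 0$,

$$\lim_{t \to 0} \widetilde{R}(t; \theta) = \lim_{t \to 0} \frac{1}{t} \log\left( \frac{1}{N} \sum_{i \in [N]} e^{t f(x_i; \theta)} \right) = \lim_{t \to 0} \frac{\sum_{i \in [N]} f(x_i; \theta)\, e^{t f(x_i; \theta)}}{\sum_{i \in [N]} e^{t f(x_i; \theta)}} \tag{17}$$
$$= \frac{1}{N} \sum_{i \in [N]} f(x_i; \theta), \tag{18}$$

where (17) is due to L’Hôpital’s rule.

For $t \to +\infty$, we proceed as follows:

$$\lim_{t \to +\infty} \widetilde{R}(t; \theta) = \lim_{t \to +\infty} \frac{1}{t} \log\left( \frac{1}{N} \sum_{i \in [N]} e^{t f(x_i; \theta)} \right) \le \lim_{t \to +\infty} \frac{1}{t} \log\left( \frac{1}{N} \sum_{i \in [N]} e^{t \max_{j \in [N]} f(x_j; \theta)} \right) \tag{19}$$
$$= \max_{i \in [N]} f(x_i; \theta). \tag{20}$$

On the other hand,

$$\lim_{t \to +\infty} \widetilde{R}(t; \theta) \ge \lim_{t \to +\infty} \frac{1}{t} \log\left( \frac{1}{N} e^{t \max_{i \in [N]} f(x_i; \theta)} \right) \tag{21}$$
$$= \lim_{t \to +\infty} \left\{ \max_{i \in [N]} f(x_i; \theta) - \frac{1}{t} \log N \right\} \tag{22}$$
$$= \max_{i \in [N]} f(x_i; \theta). \tag{23}$$

Hence, the proof follows by putting together (20) and (23).

The proof for $t \to -\infty$ proceeds similarly to that for $t \to +\infty$ and is omitted for brevity. ∎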

Note that Lemma 2 has been previously observed in [10]. This lemma also implies that $\breve{\theta}(0)$ is the ERM solution, $\breve{\theta}(+\infty)$ is the min-max solution, and $\breve{\theta}(-\infty)$ is the min-min solution.
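The three limits in Lemma 2 can be illustrated numerically. The sketch below evaluates the tilted objective on a fixed vector of hypothetical per-sample losses for a tilt near zero and for large positive and negative tilts; the loss values and the helper name `tilted_risk` are assumptions made for illustration.

```python
import numpy as np

def tilted_risk(losses, t):
    """Tilted aggregate (1/t) * log(mean(exp(t * losses))); returns the mean at t == 0."""
    losses = np.asarray(losses, dtype=float)
    if t == 0:
        return losses.mean()
    z = t * losses
    m = z.max()                                  # log-sum-exp shift for stability
    return (m + np.log(np.mean(np.exp(z - m)))) / t

losses = np.array([0.1, 0.4, 0.5, 2.0, 7.0])     # hypothetical per-sample losses

near_zero = tilted_risk(losses, 1e-8)            # should approach the average loss
large_pos = tilted_risk(losses, 1e4)             # should approach the max-loss
large_neg = tilted_risk(losses, -1e4)            # should approach the min-loss
```

For finite $|t|$ the gap to the max-loss or min-loss is at most $\log(N)/|t|$, which is why moderate tilts already give a close, smooth approximation.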

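Lemma 3 (Tilted Hessian and strong convexity for $t \in \mathbb{R}^+$).

Under Assumption 2, for any $t \in \mathbb{R}$,

$$\nabla^2_{\theta\theta^\top} \widetilde{R}(t; \theta) = t \sum_{i \in [N]} w_i(t; \theta) \big( \nabla_\theta f(x_i; \theta) - \nabla_\theta \widetilde{R}(t; \theta) \big) \big( \nabla_\theta f(x_i; \theta) - \nabla_\theta \widetilde{R}(t; \theta) \big)^\top \tag{24}$$
$$+ \sum_{i \in [N]} w_i(t; \theta)\, \nabla^2_{\theta\theta^\top} f(x_i; \theta), \tag{25}$$

where $w_i(t; \theta) := e^{t f(x_i; \theta)} / \sum_{j \in [N]} e^{t f(x_j; \theta)}$ are the tilted weights of Lemma 1. In particular, for all $\theta \in \Theta$ and all $t \in \mathbb{R}^+$, the $t$-tilted objective is strongly convex. That is,

$$\nabla^2_{\theta\theta^\top} \widetilde{R}(t; \theta) \succeq \beta_{\min} \mathbf{I}. \tag{26}$$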
Proof.

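Recall that

$$\nabla_\theta \widetilde{R}(t; \theta) = \frac{\sum_{i \in [N]} \nabla_\theta f(x_i; \theta)\, e^{t f(x_i; \theta)}}{\sum_{i \in [N]} e^{t f(x_i; \theta)}} \tag{27}$$
$$= \sum_{i \in [N]} w_i(t; \theta)\, \nabla_\theta f(x_i; \theta), \qquad w_i(t; \theta) := \frac{e^{t f(x_i; \theta)}}{\sum_{j \in [N]} e^{t f(x_j; \theta)}}. \tag{28}$$

The proof of the first part is completed by differentiating again with respect to $\theta$, followed by algebraic manipulation.

To prove the second part, notice that for $t \in \mathbb{R}^+$ the term in (24) is positive semi-definite, whereas the term in (25) is positive definite and lower bounded by $\beta_{\min} \mathbf{I}$ (see Assumption 2, Eq. (5)). Hence, the proof is completed by invoking Weyl’s inequality [68] on the smallest eigenvalue of the sum of two Hermitian matrices. ∎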

Note that Pee and Royset [49, Lemma 3.1] directly implies Lemma 3, and the proof is provided here for completeness. Further note that the convexity of the tilted objective would follow directly from the vector composition theorem (cf. [7, Page 111]). However, the second part of the lemma, on the strong convexity parameter, is not implied by the vector composition theorem.

Further notice that Lemma 3 also implies that, under Assumption 2, the strict saddle property (Assumption 4) is readily verified for $t \in \mathbb{R}^+$.
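The strong convexity claim of Lemma 3 can be checked numerically on a small example. The sketch below assumes ridge-regularized squared-error losses, whose per-sample Hessians are bounded below by `lam` times the identity, so `lam` plays the role of the strong convexity constant in Assumption 2. The tilted Hessian is assembled as a weighted sum of per-sample Hessians plus $t$ times the weighted covariance of per-sample gradients, which is what differentiating the tilted gradient of Lemma 1 yields. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, lam = 10, 4, 0.3                      # lam acts as the lower Hessian bound
A, b = rng.normal(size=(N, d)), rng.normal(size=N)

def per_sample(theta):
    """Losses, gradients, and Hessians of f_i = 0.5*(a_i@theta - b_i)^2 + 0.5*lam*|theta|^2."""
    r = A @ theta - b
    losses = 0.5 * r**2 + 0.5 * lam * theta @ theta
    grads = r[:, None] * A + lam * theta                          # shape (N, d)
    hessians = A[:, :, None] * A[:, None, :] + lam * np.eye(d)    # shape (N, d, d)
    return losses, grads, hessians

def tilted_hessian(theta, t):
    losses, grads, hessians = per_sample(theta)
    z = t * losses
    w = np.exp(z - z.max())
    w /= w.sum()                                                  # tilted weights
    centered = grads - w @ grads                                  # gradients minus tilted gradient
    spread = np.einsum("i,ij,ik->jk", w, centered, centered)      # weighted gradient covariance
    return t * spread + np.einsum("i,ijk->jk", w, hessians)

theta = rng.normal(size=d)
min_eigs = {t: np.linalg.eigvalsh(tilted_hessian(theta, t)).min()
            for t in (0.1, 1.0, 10.0)}
```

For positive tilts the covariance term can only add curvature, so every entry of `min_eigs` should be at least `lam`; for negative tilts the same term subtracts curvature and no such guarantee holds.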

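Lemma 4 (Smoothness of $\widetilde{R}(t; \theta)$ in the vicinity of the final solution $\breve{\theta}(t)$).

For any $t \in \mathbb{R}$, let $\beta(t)$ be the smoothness parameter in the vicinity of the final solution:

$$\beta(t) := \sup_{\theta \in \mathcal{B}(\breve{\theta}(t), r)} \lambda_{\max}\big( \nabla^2_{\theta\theta^\top} \widetilde{R}(t; \theta) \big), \tag{29}$$

where $\nabla^2_{\theta\theta^\top} \widetilde{R}(t; \theta)$ is the Hessian of $\widetilde{R}(t; \theta)$ at $\theta$, $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue, and $\mathcal{B}(\theta, r)$ denotes a $d$-ball of radius $r$ around $\theta$. Under Assumption 2, for any $t \in \mathbb{R}$, $\widetilde{R}(t; \theta)$ is a $\beta(t)$-smooth function of $\theta$. Further, for $t \in \mathbb{R}^-$, at the vicinity of $\breve{\theta}(t)$,

$$\beta(t) < \beta_{\max}, \tag{30}$$

and for $t \in \mathbb{R}^+$,

$$0 < \lim_{t \to +\infty} \frac{\beta(t)}{t} < +\infty. \tag{31}$$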
Proof.

Let us first provide a proof for $t \in \mathbb{R}^-$. Invoking Lemma 3 and Weyl’s inequality [68], we have

(32)
(33)
(34)

where we have used the fact that the term in (24) is negative semi-definite for $t < 0$, and that the term in (25) is positive definite for all $t$ with smoothness bounded by $\beta_{\max}$ (see Assumption 2, Eq. (5)).

For $t \in \mathbb{R}^+$, following Lemma 3 and Weyl’s inequality [68], we have

(35)
(36)

Consequently,

(37)

On the other hand, following Weyl’s inequality [68],

(38)

and hence,

(39)

where we have used the fact that no solution $\theta$ exists that would make all gradients $\nabla_\theta f(x_i; \theta)$ vanish (Assumption 2). ∎

Under the strict saddle property (Assumption 4), it is known that gradient-based methods converge to a local minimum [18], i.e., $\breve{\theta}(t)$ can be obtained using gradient descent (GD). The rate of convergence of GD scales linearly with the smoothness parameter of the optimization landscape, which is characterized by Lemma 4 (cf. [8, Section 3]). As the smoothness parameter remains bounded for $t \in \mathbb{R}^-$, we expect that solving TERM for $t \in \mathbb{R}^-$ would be computationally similar to solving ERM. However, as $t \to +\infty$, the smoothness parameter scales linearly with $t$, implying that solving TERM becomes more difficult as $t$ increases. This is expected from the non-smoothness of TERM in the vicinity of the final min-max solution (see also Figure 2 for a visual verification).
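A one-dimensional toy case makes the linear growth of the smoothness parameter concrete. Assume two quadratic losses $f_1(\theta) = \tfrac{1}{2}(\theta - 1)^2$ and $f_2(\theta) = \tfrac{1}{2}(\theta + 1)^2$ (an illustrative choice, not taken from the paper). By symmetry $\theta = 0$ is a stationary point of the tilted objective for every tilt, and it coincides with the min-max solution. The sketch evaluates the tilted curvature there; the function name `tilted_curvature` is hypothetical.

```python
import numpy as np

# f_i(theta) = 0.5 * (theta - c_i)^2, so each per-sample second derivative equals 1.
centers = np.array([1.0, -1.0])

def tilted_curvature(theta, t):
    """Second derivative of the tilted objective for the two-quadratic toy problem."""
    losses = 0.5 * (theta - centers) ** 2
    grads = theta - centers
    z = t * losses
    w = np.exp(z - z.max())
    w /= w.sum()                              # tilted weights
    mean_grad = w @ grads                     # tilted gradient
    # weighted per-sample curvature (always 1) plus t times weighted gradient variance
    return 1.0 + t * (w @ (grads - mean_grad) ** 2)

curv = {t: tilted_curvature(0.0, t) for t in (-0.5, 1.0, 10.0, 100.0)}
```

At $\theta = 0$ both losses tie, the weights stay at one half each, and the curvature works out to $1 + t$: bounded by the per-sample curvature for the negative tilt and growing linearly for positive tilts, in line with the discussion above.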

A.3 Properties of hierarchical multi-objective tilting

Lemma 5 (Hierarchical multi-objective tilted gradient).

Under Assumption 1,

(40)

where

(41)
Proof.

We proceed as follows. First notice that by invoking Lemma 1,

(42)

where

(43)

where is defined in (4), and is reproduced here:

(44)

On the other hand, by invoking Lemma 1,

(45)

where

(46)

Hence, combining (42) and (45),

(47)

The proof is completed by algebraic manipulations to show that

(48)

Lemma 6 (Sample-level TERM is a special case of hierarchical multi-objective TERM).

Under Assumption 1, hierarchical multi-objective TERM recovers TERM as a special case for $t = \tau$. That is,

$$\widetilde{J}(t, t; \theta) = \widetilde{R}(t; \theta). \tag{49}$$
Proof.

The proof is completed by noticing that setting $t = \tau$ in (41) (Lemma 5) recovers the original sample-level tilted gradient. ∎
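The special case in Lemma 6 can also be checked at the level of objective values. The sketch below assumes the hierarchical objective takes the form of a group-size-weighted tilt (with tilt `t`) over within-group tilted risks (with tilt `tau`); this form, the group data, and the helper names are assumptions made for illustration. With equal tilts, the two-level aggregate should match the sample-level tilted risk over the pooled samples.

```python
import numpy as np

def log_mean_exp(values, t, weights=None):
    """(1/t) * log(sum_k weights_k * exp(t * values_k)); weights default to uniform."""
    values = np.asarray(values, dtype=float)
    if weights is None:
        weights = np.full(values.shape, 1.0 / values.size)
    z = t * values
    m = z.max()                                   # log-sum-exp shift for stability
    return (m + np.log(np.sum(weights * np.exp(z - m)))) / t

def hierarchical_tilted_risk(group_losses, t, tau):
    """Tilt by tau within each group, then by t across groups weighted by group size."""
    sizes = np.array([len(g) for g in group_losses], dtype=float)
    group_risks = np.array([log_mean_exp(g, tau) for g in group_losses])
    return log_mean_exp(group_risks, t, weights=sizes / sizes.sum())

# Hypothetical per-sample losses, split into three groups of unequal size.
groups = [np.array([0.2, 1.5, 0.7]), np.array([3.0, 0.1]), np.array([0.9, 0.4, 2.2, 1.1])]
pooled = np.concatenate(groups)

t = 0.8
same_tilt = hierarchical_tilted_risk(groups, t, t)       # both levels use the same tilt
sample_level = log_mean_exp(pooled, t)                   # ordinary sample-level tilted risk
mixed_tilt = hierarchical_tilted_risk(groups, 2.0, -1.0) # different tilts at the two levels
```

The weighting by group size is what makes the two levels collapse: the within-group averages, once exponentiated and multiplied by their group sizes, sum to the pooled exponential sum.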

A.4 General properties of the objective for GLMs

In this section, even if not explicitly stated, all results are derived under Assumption 3 with a generalized linear model and loss function of the form (6), effectively assuming that the loss function is the negative log-likelihood of an exponential family [65].

Definition 1 (Empirical cumulant generating function).

Let

(50)
Definition 2 (Empirical log-partition function [66]).

Let be

(51)

Thus, we have

(52)
Definition 3 (Empirical mean and empirical variance of the sufficient statistic).

Let and denote the mean and the variance of the sufficient statistic, and be given by

(53)
(54)
Lemma 7.

For all we have

Next we state a few key relationships that we will use in our characterizations. The proofs are straightforward and omitted for brevity.

Lemma 8 (Partial derivatives of ).

For all and all

(55)
(56)
Lemma 9 (Partial derivatives of ).

For all and all

(57)
(58)

The next few lemmas characterize the partial derivatives of the cumulant generating function.

Lemma 10.

(Derivative of with ) For all and all

(59)
Proof.

The proof is carried out by

(60)

Lemma 11 (Second derivative of with ).

For all and all

(61)
Lemma 12 (Gradient of with ).

For all and all

(62)
Lemma 13 (Hessian of with ).

For all and all

(63)
Lemma 14 (Gradient of with respect to and ).

For all and all

(64)

A.5 General properties of TERM solutions for GLMs

Next, we characterize some of the general properties of the solutions of TERM objectives. Note that these properties are established under Assumptions 3 and 4.

Lemma 15.

For all

(65)
Proof.

The proof follows from the definition and the assumption that $\Theta$ is an open set. ∎

Lemma 16.

For all

(66)
Proof.

The proof is completed by noting Lemma 15 and Lemma 12. ∎

Lemma 17 (Derivative of the solution with respect to tilt).

Under Assumption 4, for all