1 Introduction
Machine learning models, including deep neural networks, are difficult to optimize, particularly for real-world performance. One critical reason is that default loss functions are not always good approximations to evaluation metrics, a phenomenon we term the loss-metric mismatch (see examples in Figure 1). In fact, loss functions are often designed to be differentiable, and preferably convex and smooth, whereas many evaluation metrics are not. This problem is particularly evident in tasks like metric learning, where the evaluation metrics of interest include area under the ROC curve (AUC) and precision/recall. These evaluation metrics are non-decomposable (involving statistics of a set of examples), nonlinear, and non-continuous w.r.t. the sample-wise predictions and labels. One intuitive remedy is to experiment with many losses and identify the one that correlates best with the evaluation metric, but this is inefficient in terms of both computation and the human effort needed to design new losses.
The ideal choice of a loss function should tightly approximate the metric across a wide range of parameter values. During model training, the distribution of predictions tends to differ early in training compared to later, when close to convergence. For instance, the softmax outputs of a classification model usually change gradually from a near-uniform distribution to a sharp multinomial distribution. This poses yet another challenge to the ability of the default loss function to approximate the metric, and indicates the potential benefit of an adaptive loss function whose parameters are dynamically adjusted to serve as a better surrogate for the evaluation metric.
In addition to the loss-metric mismatch, another difficulty lies in the potential difference between the distribution of the training set and that of the test set, which can be due to differing levels of data noise or to sampling biases (Ren et al., 2018). Optimization on the training set can thus exhibit different characteristics from the test set, leading to a generalization gap.
We address the loss-metric mismatch by introducing Adaptive Loss Alignment (ALA), a technique that automatically adjusts loss function parameters to directly optimize the evaluation metric on a validation set. We also find that this empirically reduces the generalization gap. ALA learns to adjust the loss function using Reinforcement Learning (RL) while the model weights are simultaneously learned by gradient descent. This helps align the loss function to the evaluation metric cumulatively over successive training iterations.
We experiment with a variety of classification and metric learning problems. Results demonstrate significant gains of our approach over fixed loss functions and other meta-learning methods. Furthermore, ALA is generic and applies to a wide range of loss formulations and evaluation metrics. The learned loss control policy can be viewed as a data-driven curriculum that enjoys good transferability across tasks and data. In summary, the contributions of this work are as follows:

We introduce ALA, a sample-efficient RL approach that addresses the loss-metric mismatch directly by continuously adapting the loss.

We provide RL formulations for metric learning and classification, with state-of-the-art results relative to fixed-loss and other meta-learning baselines.

We show the versatility of ALA for a wide range of loss functions and evaluation metrics, and demonstrate its transferability across tasks and data.

We empirically show through ablation studies that ALA improves optimization as well as generalization.
2 Related work
Improving model training through efficient experimentation is an active area of research. This includes classic hyperparameter optimization (Bergstra & Bengio, 2012; Snoek et al., 2012) and more recent gradient-based approaches (Maclaurin et al., 2015; Franceschi et al., 2017). These techniques overlap with architecture meta-learning, such as learned activation functions (Ramachandran et al., 2017) and neural architecture search, e.g., (Xie et al., 2019). In addition, certain design choices can aid generalization, such as the use of batch normalization (Ioffe & Szegedy, 2015; Santurkar et al., 2018). ALA focuses on the mechanics of the loss function, and can be used independently or in conjunction with these methods to improve performance.
A different strategy is to focus on the optimization process itself and the associated learning dynamics. Curriculum learning aims to improve optimization by gradually increasing the difficulty of training (Bengio et al., 2009); it has been extended from the predefined case, e.g., focal loss (Lin et al., 2017), to learning a data-driven curriculum (Jiang et al., 2018) or a data re-weighting scheme (Ren et al., 2018) by optimizing a proxy loss. Similarly, optimizers can be learned in a data-driven fashion using gradient descent, e.g., (Andrychowicz et al., 2016; Wichrowska et al., 2017). In contrast, ALA is a more general method: it adapts the loss function itself, which allows it to find better optimization strategies that address the loss-metric mismatch.
Other methods focus on the loss function itself. These include approximate methods that bridge the gap between loss functions and evaluation metrics for special cases, e.g., ranking losses / area under the curve (Eban et al., 2017; Kar et al., 2014). Closest to our work are learning-to-teach methods that adapt loss functions to optimize target metrics. In (Xu et al., 2019), reinforcement learning is used to learn a discrete optimization schedule that alternates between different loss functions at different time points; this suits multi-objective problems but does not address the more general non-discrete case. In (Wu et al., 2018) and (Jenni & Favaro, 2018), gradient-based techniques are used to optimize differentiable proxies for the evaluation metric, whereas our method optimizes the metric directly.
ALA differs from the above techniques by optimizing the evaluation metric directly via a sample-efficient RL policy that iteratively adjusts the loss function. Our work also extends beyond classification to the metric learning case.
3 Adaptive Loss Alignment (ALA)
We start by formally defining our learning problem, where we would like to improve an evaluation metric $\mathcal{M}$ on a validation set $D_{val}$ for a parametric model $f(x; w)$ with weights $w$. The evaluation metric $\mathcal{M}(y, \hat{y})$ between the ground truth $y$ and model prediction $\hat{y} = f(x; w)$ given input $x$ can be either decomposable over samples (e.g., classification error) or non-decomposable, like area under the precision-recall curve (AUCPR) and Recall@k. We learn to optimize $\mathcal{M}$ on the validation set and expect it to be a good indicator of model performance on the test set $D_{test}$.
Optimizing directly for evaluation metrics is a challenging task. This is because the model weights $w$ are actually obtained by optimizing a loss function $L$ on the training set $D_{train}$, i.e., by solving $w^{*} = \arg\min_{w} L(w; D_{train})$. However, in many cases the loss $L$ is only a surrogate of the evaluation metric $\mathcal{M}$, which can be non-differentiable w.r.t. $w$. Moreover, the loss is optimized on the training set $D_{train}$ instead of $D_{val}$ or $D_{test}$.
To address the above loss-metric mismatch, we propose to learn an adaptive loss function $L_{\phi}$ with loss parameters $\phi$. The goal is to align the adaptive loss with the evaluation metric $\mathcal{M}$ on the held-out dataset $D_{val}$. This leads to an alternating-direction optimization problem: (1) find metric-minimizing loss parameters $\phi$, and (2) update the model weights $w$ under the resultant loss by, e.g., Stochastic Gradient Descent (SGD). We have:
$$\min_{\phi}\; \mathcal{M}\big(w^{*}(\phi);\, D_{val}\big) \quad \text{s.t.} \quad w^{*}(\phi) = \arg\min_{w}\; L_{\phi}(w;\, D_{train}) \tag{1}$$
where in practice both the outer loop and the inner loop are approximated by a few steps of iterative optimization (e.g., SGD updates). Hence, we denote by $\phi_t$ the loss-function parameters at time step $t$ and by $w_t$ the corresponding model parameters. The key here is to bridge the gap between the evaluation metric $\mathcal{M}$ and the loss function $L_{\phi_t}$ over time, conditioned on the found local optimum $w_t$.
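To make the alternating structure of Equation (1) concrete, here is a deliberately tiny, self-contained sketch (all names and the 1-D setup are illustrative, not the paper's implementation): an inner SGD loop fits $w$ under a $\phi$-parameterized training loss, and an outer loop greedily perturbs $\phi$ to reduce a non-differentiable evaluation metric.

```python
# Toy sketch of Eq. (1): inner loop of SGD on a phi-parameterized training
# loss; outer loop perturbs phi to reduce a non-differentiable metric.
# The 1-D quadratic setup below is purely illustrative.

def inner_sgd(w, phi, lr=0.1, steps=20):
    # training loss: phi*(w-1)^2 + (1-phi)*(w-3)^2, minimized at w = 3 - 2*phi
    for _ in range(steps):
        grad = 2 * phi * (w - 1.0) + 2 * (1 - phi) * (w - 3.0)
        w -= lr * grad
    return w

def metric(w):
    # non-differentiable "evaluation metric", best at w = 2
    return abs(w - 2.0)

def clip01(x):
    return min(max(x, 0.0), 1.0)

w, phi, delta = 0.0, 1.0, 0.1
for _ in range(30):                              # outer loop over phi
    w = inner_sgd(w, phi)                        # inner loop over w
    best = min((-delta, 0.0, delta),
               key=lambda a: metric(inner_sgd(w, clip01(phi + a))))
    phi = clip01(phi + best)
```

The outer loop settles on the loss parameters whose induced optimum minimizes the metric, mirroring how ALA searches loss-parameter space rather than differentiating through the metric.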
3.1 Reinforcement Learning of ALA
To capture the conditional relations between loss and evaluation metric, we formulate a reinforcement learning problem. The task is to predict the best change to the loss parameters $\phi$ such that optimizing the adjusted loss $L_{\phi}$ aligns better with the evaluation metric $\mathcal{M}$. In other words, taking an action that adjusts the loss function should produce a reward that reflects how much the metric will improve on the held-out data $D_{val}$. This is analogous to teaching the model how to better optimize on seen data $D_{train}$ and to better generalize (in terms of $\mathcal{M}$) on unseen data $D_{val}$.
The underlying model behind Reinforcement Learning (RL) is a Markov Decision Process (MDP) defined by states $s_t$ and actions $a_t$ at discrete time steps $t$ within an episode. In our case, an episode is naturally defined as a set of $T$ consecutive time steps, each consisting of $U$ training iterations. In other words, the RL policy collects episodes at a slower timescale than the training of $w$. Figure 2 illustrates the schematic of our RL framework. Our state $s_t$ records training-progress information (e.g., the induced loss value), and the loss-updating action $a_t$ is sampled from a stochastic policy $\pi_\theta(a_t \mid s_t)$. We implement the policy as a neural network parameterized by $\theta$. Training under the updated loss will transition to a new state $s_{t+1}$ and produce a reward signal $r_t$. We define the reward by the improvement in the evaluation metric $\mathcal{M}$.
We optimize the loss-controlling policy $\pi_\theta$ with a policy gradient approach, similar to REINFORCE (Williams, 1992). The objective is to maximize the expected total return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big] \tag{2}$$
where $R(\tau)$ is the total return of an episode $\tau$. The updates to the policy parameters $\theta$ are given by the gradient
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\textstyle\sum_{t}\big(R(\tau) - b\big)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big] \tag{3}$$
where $b$ is a variance-reducing baseline, implemented as the exponential moving average of previous rewards.
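As a hedged sketch of the REINFORCE update in Equations (2)-(3), the snippet below runs a one-step policy over three discrete actions with an exponential-moving-average baseline; the toy reward, learning rate, and iteration count are assumptions for illustration, not the paper's settings.

```python
import math
import random

random.seed(0)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# One-step REINFORCE with an exponential-moving-average baseline (Eq. 3).
# Toy setup: three actions, action 2 always yields the best reward; theta
# holds the logits of the policy pi_theta.
theta, baseline, lr = [0.0, 0.0, 0.0], 0.0, 0.5
for _ in range(300):
    p = softmax(theta)
    a = random.choices(range(3), weights=p)[0]
    r = 1.0 if a == 2 else -1.0           # quantized reward, as in Eq. (5)
    adv = r - baseline                    # variance-reducing baseline b
    baseline = 0.9 * baseline + 0.1 * r   # moving average of past rewards
    for i in range(3):                    # grad log pi = one_hot(a) - p
        theta[i] += lr * adv * ((1.0 if i == a else 0.0) - p[i])
```

After training, the policy concentrates probability mass on the rewarding action, which is the behavior the ALA controller needs when a loss adjustment consistently improves the validation metric.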
Local policy learning. As shown in Figure 2, an episode for RL is a set of $T$ consecutive time steps. Choosing one extreme, $T$ could cover the entire length of a training run. However, waiting for the model to fully train, and repeating this process enough times for a policy to converge, requires a great deal of computational resources. We choose the other extreme and use episodes of a single step, i.e., $T = 1$ (still composed of $U$ training iterations). One advantage of doing so is that a single training run of $w$ can contribute many episodes to the training of $\pi_\theta$, making it sample efficient. Although this ignores longer-term effects of chosen actions, we will show empirically that these one-step episodes are sufficient to learn competent policies, and that increasing $T$ does not convey much benefit in our experiments (see Supplementary Materials). The one-step RL setting is similar to a contextual bandit formulation, although here actions affect future states. Thus, unlike bandit formulations, the ALA controller can learn state-transition dynamics.
3.2 Learning Algorithm
In this section, we describe the concrete RL algorithm for simultaneous learning of the ALA policy parameters $\theta$ and model weights $w$.
Rewards: Our reward function measures the relative reduction in the validation metric $\mathcal{M}$ after $U$ gradient descent iterations with an updated loss function $L_{\phi_t}$. To represent the cumulative performance between model updates, we define:
$$\bar{\mathcal{M}}_t = \frac{\sum_{u=1}^{U} \gamma^{\,U-u}\, \mathcal{M}_t^{(u)}}{\sum_{u=1}^{U} \gamma^{\,U-u}} \tag{4}$$
where $\mathcal{M}_t^{(u)}$ denotes the validation metric after the $u$-th model update within time step $t$, and $\gamma$ is a discount factor that weighs recent metric values more heavily. The main model weights $w$ are updated for $U$ iterations. We then quantize the reward $r_t$ to $\{-1, +1\}$ as follows:
$$r_t = \begin{cases} +1, & \text{if } \bar{\mathcal{M}}_t < \bar{\mathcal{M}}_{t-1} \\ -1, & \text{otherwise} \end{cases} \tag{5}$$
which encourages continual decreases of the error metric regardless of their magnitude (up to the maximum training iteration).
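The reward computation of Equations (4)-(5) can be sketched as follows (a minimal illustration; the function names are ours):

```python
# Sketch of Eqs. (4)-(5): a discount-weighted average of the validation
# metric over the U SGD iterations within a timestep, followed by a
# sign-quantized comparison between consecutive timesteps.

def discounted_metric(metrics, gamma=0.9):
    # metrics: validation metric after each of the U iterations, oldest
    # first; gamma < 1 gives the most recent values the largest weights.
    U = len(metrics)
    weights = [gamma ** (U - 1 - u) for u in range(U)]
    return sum(w * m for w, m in zip(weights, metrics)) / sum(weights)

def quantized_reward(prev_avg, curr_avg):
    # +1 for any decrease of the (error-style) metric, -1 otherwise.
    return 1.0 if curr_avg < prev_avg else -1.0
```

Quantizing the reward makes the signal magnitude-independent, so the controller is rewarded equally for small late-training improvements and large early-training ones.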
Action space: For every element of the loss parameters $\phi$, we sample an action from the discretized space $\{-\delta, 0, +\delta\}$, with $\delta$ being a predefined step size. Actions update the loss parameters at each time step, $\phi_{t+1} = \phi_t + a_t$. Our policy network has three output neurons for each loss parameter, and $a_t$ is sampled from a softmax distribution over these neurons.
State space: Our policy network state $s_t$ consists of four components:

Some task-dependent validation statistics $v_t$, e.g., the log probabilities of different classes, observed at multiple time steps.
The relative change of validation statistics from their moving average.

The current loss parameters $\phi_t$.

The current iteration number, normalized by the total number of iterations in a full training run of $w$.
Here we use the validation statistics, among other signals, to capture the model training state. Recall that our goal is to find rewarding loss-updating actions that improve the evaluation metric on the validation set. A successful loss-control policy should be able to model the implicit relation between the validation statistics (the state of the RL problem) and the validation metric (the reward). In other words, ALA should learn to mimic the loss optimization process so as to decrease the validation metric cumulatively. We use validation statistics instead of training statistics in $s_t$ because the former is a natural proxy for the validation metric. Note that the validation statistics are normalized in our state representation. This allows for generic policy learning that is independent of the actual model predictions of different tasks or loss formulations.
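A minimal sketch of the state/action interface described above is shown below; the tiny linear "policy" and the specific feature choices are illustrative stand-ins for the paper's MLP controller.

```python
import math
import random

random.seed(1)

DELTA = 0.1   # predefined step size for loss-parameter updates

def make_state(val_stats, moving_avg, phi_t, step, total_steps):
    # state = validation statistics, their relative change from a moving
    # average, the current loss parameter, and the normalized iteration.
    rel_change = [(v - m) / (abs(m) + 1e-8)
                  for v, m in zip(val_stats, moving_avg)]
    return list(val_stats) + rel_change + [phi_t, step / total_steps]

def sample_action(state, weights):
    # weights: 3 rows of logit weights, one per move in {-DELTA, 0, +DELTA};
    # the action is sampled from a softmax over the three logits.
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    m = max(logits)
    exp = [math.exp(l - m) for l in logits]
    probs = [e / sum(exp) for e in exp]
    idx = random.choices((0, 1, 2), weights=probs)[0]
    return (-DELTA, 0.0, DELTA)[idx]
```

Because every feature is normalized, the same controller can be reused across tasks whose raw predictions live on very different scales.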
Algorithm: Algorithm 1 shows how our RL algorithm alternates between updating the loss-control policy $\pi_\theta$ and updating the model weights $w$. Model weights are updated via minibatch SGD on $D_{train}$, while the policy $\pi_\theta$ is updated every $U$ SGD iterations on $D_{val}$. We enhance policy learning by training multiple main networks in parallel, which we refer to as child models. Each child model is initialized with random weights but shares the loss controller. At each time step of the policy update, we collect episodes using $\pi_\theta$ from all the child models. This independent set of episodes, together with a replay memory that adds randomness to the learning trajectories, helps to alleviate the non-stationarity of online policy learning. As a result, more robust policies are learned, as verified in our experiments.
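Structurally, the Algorithm-1 loop can be sketched as below; every model, controller, and reward component is a placeholder stand-in, and only the control flow (parallel child models, a shared controller, a replay memory, and policy updates every U iterations) reflects the description above.

```python
import random

random.seed(0)

U, NUM_CHILDREN, TIMESTEPS = 4, 3, 5
replay_memory = []

def controller_action(state):        # stand-in for the shared ALA policy
    return random.choice((-0.1, 0.0, 0.1))

def train_for_U_iters(child):        # stand-in for U minibatch SGD updates
    child["iters"] += U

def reward(child):                   # stand-in for the quantized metric reward
    return random.choice((1.0, -1.0))

children = [{"phi": 0.0, "iters": 0} for _ in range(NUM_CHILDREN)]
for t in range(TIMESTEPS):
    for child in children:           # all children share one controller
        state = (child["phi"], child["iters"])
        action = controller_action(state)
        child["phi"] += action       # adjust this child's loss parameters
        train_for_U_iters(child)
        replay_memory.append((state, action, reward(child)))
    batch = random.sample(replay_memory, min(8, len(replay_memory)))
    # ... a REINFORCE update of the shared controller would use `batch` ...
```

Each policy update thus sees episodes from several independently initialized children, which is what mitigates the non-stationarity of learning a controller online.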
It is worth noting that the initial loss parameters $\phi_0$ are important for efficient policy learning. Proper initialization of $\phi_0$ must preserve the default loss-function properties, which depend on the particular form of loss parameterization for a given task, e.g., the identity class-correlation matrix in classification (see Section 4).
4 Instantiation in Typical Learning Problems
Classification: We learn to adapt the parametric classification loss function introduced in (Wu et al., 2018):
$$L_{\Phi}(y, \hat{y}) = -\, y^{\top}\, \Phi\, \log \hat{y} \tag{6}$$
where $\Phi \in \mathbb{R}^{C \times C}$ denotes the loss-function parameters, with $C$ being the number of classes, $y$ denotes the one-hot representation of the class label, and $\hat{y}$ denotes the multinomial model output. This adaptive loss function is a generalization of the cross-entropy loss, in which $\Phi$ is fixed as the identity matrix.
The matrix $\Phi$ encodes time-varying class correlations. A positive value of $\Phi_{ij}$ encourages the model to increase the predicted probability of class $j$ given ground-truth class $i$. A negative value of $\Phi_{ij}$, on the other hand, penalizes confusion between classes $i$ and $j$. Thus, as $\Phi$ changes while learning progresses, it is possible to implement a hierarchical curriculum for classification, where similar classes are grouped into a super-class earlier in training and discriminated later as training proceeds. To learn the curriculum automatically, we initialize $\Phi$ as the identity matrix (reducing to the standard cross-entropy loss) and update $\Phi$ over time with the ALA policy $\pi_\theta$.
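As a hedged sketch, one way to write such a class-correlation-weighted generalization of cross-entropy (reducing to the standard loss when the matrix is the identity) is below; the paper's exact Equation (6) parameterization may differ in detail.

```python
import math

def ala_classification_loss(y_onehot, y_pred, Phi, eps=1e-12):
    # loss = -sum_ij y_i * Phi[i][j] * log(yhat_j); with Phi = I this is
    # exactly cross-entropy, and a positive off-diagonal Phi[i][j] rewards
    # probability mass on class j when the ground truth is class i.
    C = len(y_onehot)
    return -sum(y_onehot[i] * Phi[i][j] * math.log(y_pred[j] + eps)
                for i in range(C) for j in range(C))
```

Adjusting off-diagonal entries therefore changes which confusions the gradient tolerates, which is the mechanism behind the hierarchical curriculum described above.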
To learn to update $\Phi$, we first construct a confusion matrix $c^{t}$ of the model predictions on the validation set $D_{val}$, and define
$$c_{ij}^{t} = \frac{\sum_{(x,\,y) \in D_{val}} \mathbb{1}[y = i]\;\hat{y}_j(x)}{\sum_{(x,\,y) \in D_{val}} \mathbb{1}[y = i]} \tag{7}$$
where $\mathbb{1}[y = i]$ is an indicator function that outputs 1 when $y$ equals class $i$ and 0 otherwise.
We take a parameter-efficient approach, updating each loss parameter $\Phi_{ij}$ based on the observed class confusions $c_{ij}^{t}$. In other words, the ALA loss controller collects the validation statistics only for class pair $(i, j)$ at each time step $t$, in order to construct the state for updating the corresponding loss parameter $\Phi_{ij}$ (see Figure S2 in Supplementary Materials). Different class pairs share the same controller, and we update $\Phi_{ij}$ and $\Phi_{ji}$ to the same value (normalized to a bounded range) to ensure class symmetry. This implementation is much more parameter-efficient than learning to update the whole matrix $\Phi$ at once. Furthermore, it does not depend on the number of classes of a given task, enabling us to transfer the learned policy to another classification task with an arbitrary number of classes (see Section 6.3).
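The confusion statistics of Equation (7) can be sketched as follows (a minimal illustration with probability vectors as model outputs):

```python
# Sketch of Eq. (7): the average predicted probability of class j over
# validation examples whose ground-truth label is class i.

def confusion_matrix(labels, probs, num_classes):
    counts = [0] * num_classes
    conf = [[0.0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, probs):
        counts[y] += 1
        for j in range(num_classes):
            conf[y][j] += p[j]
    for i in range(num_classes):
        if counts[i]:
            conf[i] = [v / counts[i] for v in conf[i]]
    return conf
```

Each off-diagonal entry is the per-pair validation statistic the shared controller consumes when deciding how to move the corresponding loss parameter.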
Metric Learning: Metric learning learns a distance metric that encodes semantic similarity. Typically, the resulting distance metric cannot directly cater to different, and sometimes contradictory, performance metrics of interest (e.g., verification vs. identification rate, or precision vs. recall). This gap is more pronounced in the presence of common techniques like hard mining, which have only indirect effects on final performance. Metric learning is therefore a strong testbed for learning methods that directly optimize evaluation metrics.
The standard triplet loss (Schroff et al., 2015) for metric learning can be formulated as:
$$L(a, p, n) = \max\big(0,\; D_{ap} - D_{an} + \alpha\big) \tag{8}$$
where $D$ is the squared Euclidean distance underlying both $D_{ap}$ (the distance between anchor instance $a$ and positive instance $p$) and $D_{an}$ (the distance between $a$ and negative instance $n$), and $\alpha$ is a margin parameter.
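For concreteness, the triplet loss of Equation (8) for a single (anchor, positive, negative) triple can be written as:

```python
# The standard triplet loss of Eq. (8), for one (anchor, positive, negative)
# triple with squared Euclidean distances.

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, alpha=0.2):
    d_ap = sq_dist(anchor, positive)   # anchor-positive distance D_ap
    d_an = sq_dist(anchor, negative)   # anchor-negative distance D_an
    return max(0.0, d_ap - d_an + alpha)
```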
As (Wu et al., 2017) pointed out, the shape of the distance function matters: the concave shape of the loss w.r.t. the negative distance leads to diminishing gradients as $D_{an}$ approaches zero. Here we propose to reshape the distance function adaptively, with two types of loss parameterization.
For the first parametric loss, called Distance Mixture, we adopt five differently-shaped distance functions for each of $D_{ap}$ and $D_{an}$, and learn a linear combination of them via $\phi$:
$$L_{\phi}(a, p, n) = \sum_{k=1}^{5} \phi_{k}^{+}\, f_{k}^{+}(D_{ap}) + \sum_{k=1}^{5} \phi_{k}^{-}\, f_{k}^{-}(D_{an}) \tag{9}$$
where $f_{k}^{+}$ and $f_{k}^{-}$ are increasing and decreasing distance functions that penalize large $D_{ap}$ and small $D_{an}$, respectively. In this case $\phi = (\phi^{+}, \phi^{-}) \in \mathbb{R}^{10}$, and $\phi$ is initialized as a binary vector that selects the default distance functions for $D_{ap}$ and $D_{an}$. For RL, the validation statistics in the state are simply the computed distances $D_{ap}$ and $D_{an}$, and our ALA controller updates $\phi$ accordingly. The Supplementary Materials specify the forms of $f_{k}^{+}$ and $f_{k}^{-}$ and show that final performance is not very sensitive to their design; it is more important to learn their dynamic weightings.
Similar to the focal loss (Lin et al., 2017), we also introduce a Focal Weighting loss formulation as follows:
$$L_{\phi}(a, p, n) = \sigma\big(\phi\,(D_{ap} - D_{an} + \beta)\big)\,\max\big(0,\; D_{ap} - D_{an} + \beta\big) \tag{10}$$
where $\sigma(\cdot)$ is the sigmoid function, and $D_{ap}$ and $D_{an}$ denote the distances between the anchor instance and the positive and negative instances in the batch. We use these distances as validation statistics for the ALA controller to update $\phi$, while $\beta$ is a distance offset.
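Hedged sketches of the two parameterizations above are given below; the basis functions and the exact focal form are illustrative assumptions (the paper specifies its five basis shapes in the Supplementary Materials, and its focal parameterization may differ in detail).

```python
import math

# Illustrative basis functions: F_POS grow with D_ap (penalize large
# anchor-positive distance), F_NEG shrink with D_an (penalize small
# anchor-negative distance). These are stand-ins, not the paper's five.
F_POS = (lambda d: d, lambda d: d ** 2, lambda d: math.log1p(d))
F_NEG = (lambda d: max(0.0, 1.0 - d), lambda d: math.exp(-d),
         lambda d: 1.0 / (1.0 + d))

def distance_mixture_loss(d_ap, d_an, phi_pos, phi_neg):
    # Eq. (9)-style learned linear combination of basis distance functions.
    return (sum(w * f(d_ap) for w, f in zip(phi_pos, F_POS)) +
            sum(w * f(d_an) for w, f in zip(phi_neg, F_NEG)))

def focal_triplet_loss(d_ap, d_an, phi=1.0, beta=0.2):
    # Eq. (10)-style focal weighting: a sigmoid factor with learnable
    # sharpness phi re-weights the margin violation offset by beta.
    violation = d_ap - d_an + beta
    weight = 1.0 / (1.0 + math.exp(-phi * violation))
    return weight * max(0.0, violation)
```

In both cases the controller only moves a handful of scalar weights, which is what keeps the action space small enough for one-step episodes.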
5 Implementation Details
The main model architecture and maximum number of training iterations are task-dependent. In the parallel training scenario of our experiments, 10 child models were trained sharing the same ALA controller. The controller is instantiated as an MLP consisting of 2 hidden layers, each with 32 ReLU units. Our state $s_t$ includes a sequence of validation statistics observed over the past 10 time steps. Table S4 in the Supplementary Materials quantifies these choices.
The ALA controller is learned using the REINFORCE policy-gradient method (which worked better than Q-learning variants in early experiments). We use a learning rate of 0.001 for policy learning. Training episodes are collected from all child networks every $U$ gradient descent iterations. We set fixed values for the discount factor $\gamma$ (Equation 4), the loss-parameter update step $\delta$, and the distance offset $\beta$ (Equation 10), and found that performance is robust to variations of these hyperparameters.
6 Results
We evaluate the proposed ALA method on classification and metric learning tasks. Our ALA controller is learned by multi-model training for both tasks unless otherwise stated.
6.1 Classification
We train and evaluate ALA using two different evaluation metrics to demonstrate generality: (1) classification error, and (2) area under the precision-recall curve (AUCPR). We experiment on CIFAR-10 (Krizhevsky, 2009), with 50k images for training and 10k for testing. For training the loss controller, we divide the training set randomly into a new training set of 40k images and a validation set of 10k images. We use Momentum SGD for training. The compared methods below use the full 50k training images and their optimal hyperparameters.
Optimizing classification error: Table 1 reports classification errors using three popular network architectures: ResNet32 (He et al., 2016), WideResNet (WRN) (Zagoruyko & Komodakis, 2016), and DenseNet (Huang et al., 2017). The self-paced method (Kumar et al., 2010) is a predefined curriculum-learning scheme based on example hardness. L-Softmax (Liu et al., 2016) is a strong hand-designed loss function. The recent state-of-the-art L2T-DLF (Wu et al., 2018) method adapts a loss function under the same formulation as Equation 6; however, the objective of L2T-DLF is to minimize a gradient-friendly surrogate of the evaluation metric. ALA outperforms L2T-DLF using single-network training, and improves further with multi-network training (our default). The competitive performance of our single-network RL training validates its sample and time efficiency. Other ALA baselines train with either random loss parameters (the identity matrix perturbed by 10% noise) or the prediction confusion matrix (Equation 7) as $\Phi$. These baselines do not perform well, whereas ALA learns meaningful loss parameters that align with the evaluation metric. The L2T method (Fan et al., 2018) similarly uses RL to optimize evaluation metrics, but requires 50 episodes of full training. In contrast, our ALA controller is trained while the main model is being trained, allowing for much faster learning with even stronger error reduction.
Optimizing AUCPR: Unlike classification error, AUCPR is a highly structured and non-decomposable evaluation metric. To demonstrate our ability to optimize different metrics, we change the reward for ALA to AUCPR as defined in (Eban et al., 2017). Table 2 shows that our method achieves an AUCPR of 94.9% (10-run average) on CIFAR-10, outperforming the SGD-based optimization method of (Eban et al., 2017). Our advantages are also evident over methods that do not optimize the metric of interest, e.g., the pairwise AUC-ROC surrogate (Rakotomamonjy, 2004) and the cross-entropy loss, which is a proxy for classification accuracy.
Table 1. Classification errors (%) on CIFAR-10.
Method  ResNet32  WRN  DenseNet
cross-entropy  7.51  3.80  3.54
Self-paced (Kumar et al., 2010)  7.47  3.84  3.50
L-Softmax (Liu et al., 2016)  7.01  3.69  3.37
L2T (Fan et al., 2018)  7.10  –  –
L2T-DLF (Wu et al., 2018)  6.95  3.42  3.08
ALA (random matrix)  8.23±0.41  4.69±0.28  4.15±0.33
ALA (confusion matrix)  7.42±0.04  3.74±0.02  3.55±0.02
ALA (single-network)  6.85±0.09  3.39±0.04  3.03±0.04
ALA (multi-network)  6.79±0.07  3.34±0.04  3.01±0.02
Table 2. AUCPR (%) on CIFAR-10.
Method  AUCPR
cross-entropy loss (optimizing accuracy)  84.6
Pairwise AUC-ROC loss (Rakotomamonjy, 2004)  94.2
AUCPR loss (Eban et al., 2017)  94.2
ALA  94.9±0.14
6.2 Metric Learning
To validate ALA on metric learning tasks, we perform image retrieval experiments on the Stanford Online Products (SOP) dataset (Song et al., 2016), and face recognition (FR) experiments on the LFW dataset (Huang et al., 2007). The SOP dataset contains 120,053 images of 22,634 categories. The first 10,000 and 1,318 categories are used for training and validation, respectively, and the remainder are used for testing. We optimize for the evaluation metric of average Recall@k on SOP. For LFW, we train for verification accuracy and test on the LFW verification benchmark containing 6,000 verification pairs. Note that both average Recall@k and verification accuracy are non-decomposable evaluation metrics.
Table 3 (top two cells) compares ALA to recent methods on the SOP dataset. ALA is applied to two representative metric learning frameworks, Triplet (Schroff et al., 2015) and Margin (Wu et al., 2017), using their respective network architectures. The two works differ in loss formulation and data sampling strategies. ALA leads to consistent, significant gains at all recall levels for both frameworks. Margin + ALA outperforms recent embedding ensembles based on boosting (BIER) and attention (ABE-8) mechanisms, as well as HTL, which uses a hand-designed hierarchical class tree. This indicates the advantages of our adaptive loss function and direct metric optimization. On LFW, ALA improves the triplet framework with distance mixtures and achieves a state-of-the-art verification accuracy of 99.57% under the small-training-data protocol. Please refer to the Supplementary Materials for comparisons to recent strong FR methods on LFW.
Table 3. Recall@k (%) on the SOP dataset.
Method  k=1  k=10  k=100  k=1000
Triplet (Schroff et al., 2015)  66.7  82.4  91.9  –
Margin (Wu et al., 2017)  72.7  86.2  93.8  98.0
BIER (Opitz et al., 2017)  72.7  86.5  94.0  98.0
HTL (Ge et al., 2018)  74.8  88.3  94.8  98.4
ABE-8 (Kim et al., 2018)  76.3  88.4  94.8  98.2
Triplet + ALA (Distance mixture)  75.7  89.4  95.3  98.6
Margin + ALA (Distance mixture)  78.9  90.7  96.5  98.9
Margin + ALA (Focal weighting)  77.9  90.1  95.8  98.7
Margin + ALA (FR policy transfer)  75.2  89.2  94.9  98.4

6.3 Transfer Learning
One potential disadvantage of hand-designed loss functions compared to learned ones is that the former may be specific to a task and/or dataset, whereas the latter are usually generic and widely applicable. Here we test the degree to which ALA extracts general knowledge about how to adapt the loss, by conducting policy-transfer experiments on classification and metric learning tasks.
For classification, we transfer the learned policy that updates pairwise class correlations based on their prediction confusions. Specifically, we train an ImageNet (Deng et al., 2009) classifier with the RMSProp optimizer, but using the fixed ALA loss policy learned from CIFAR-10 (with DenseNet). Despite the differences in the number of classes and input distribution, the policy transfer is straightforward thanks to the weight-sharing design of the policy network. We compare with PowerSign-cd (Bello et al., 2017), which transfers a learned optimization policy from CIFAR-10 to ImageNet. Table 4 compares all methods using identical NASNet-A network architectures. It shows that the transferred ALA policy outperforms the baselines without careful tuning, and is comparable to the ImageNet-tuned RMSProp + ALA. These results show that ALA is able to learn a policy on small, quick-to-train datasets to guide large-scale training, which is desirable for fast and efficient learning. Notably, we used a different optimizer, SGD, when training the ALA policy on CIFAR-10, which shows that ALA is robust to the optimizer paired with the loss policy.
We also show that policy transferability holds in the metric learning case. Here we transfer the ALA policy learned from FR experiments to guide the training on SOP for image retrieval. The bottom row of Table 3 shows that the transferred policy is competitive with specialized learning on the target data domain. This demonstrates the superior generalization ability of ALA.
Table 5. Training and testing classification errors (%) on CIFAR-10 (ResNet32).
Method  Training  Testing
cross-entropy  0.32  7.51
ALA  0.12±0.02  6.79±0.07
ALA 2nd run (using policy from 1st run)  0.14±0.01  6.72±0.04
ALA 2nd run (policy fine-tuning)  0.08±0.01  6.72±0.02
6.4 Analysis
Ablation studies: (i) Training vs. testing metric: Table 5 (top cell) compares ALA with the fixed cross-entropy loss in both training and testing classification error. The results suggest that, through dynamic loss-metric alignment, ALA improves both optimization on the training data and generalization to test data. (ii) Time efficiency: Table 5 (bottom cell) compares our default online policy learning against continuing to train with ALA for a second run (thus halving the learning efficiency). The compared baselines either reuse the policy learned in the 1st run or fine-tune the policy during the 2nd run. These baselines yield only marginal gains despite the additional learning episodes; online ALA suffices to learn competent policies. (iii) Robustness to different loss parameterizations: Table 3 (penultimate row) shows that ALA works similarly well with the focal-weighted loss function on the metric learning task. We conjecture that ALA can work with minimal tuning across a variety of parameterized loss formulations.
Insights on optimization vs. generalization: Figure 3 analyzes the effects of ALA policies on optimization and generalization behavior by switching the reward signal in a 2-factor design on CIFAR-10: training vs. validation data, and loss vs. evaluation metric. When the training loss (i.e., the default cross-entropy loss) or the training metric is used as the reward, we compute it on the whole training set $D_{train}$. For this and the following studies, we train with the ResNet32 architecture.
Figure 3(a) isolates the optimization effects by comparing ALA, with the cross-entropy loss as the reward, against minimizing the cross-entropy loss directly. ALA is shown to encourage better optimization, with consistently lower loss values. This is a remarkable result, as it shows that ALA facilitates optimization even when the objective is fully differentiable. Figure 3(b) examines both optimization and generalization when ALA uses the validation loss as the reward, where we monitor training and test (generalization) errors for ALA and the fixed cross-entropy loss. We observe that the generalization error on the test set is indeed improved, and optimization of the training error also sees small improvements. The optimization and generalization gains are larger when the validation metric is used as the reward, which further addresses the loss-metric mismatch. By comparison, the training-metric-based reward yields a faster decrease in training error, but smaller gains in test error, potentially due to the diminishing error/reward on the training data.
Reasoning behind the improved optimization by ALA: First, we speculate that ALA improves optimization by dynamically smoothing the loss landscape. We verify this hypothesis by measuring the convexity of the loss surface, calculating the Gaussian curvature of the loss surface around each model checkpoint following (Li et al., 2018). Figure 4 shows a smoother loss surface for the model trained with ALA. This confirms that ALA learns to manipulate the loss surface in a way that improves the convergence of SGD-based optimization, in agreement with the findings of (Li et al., 2018).
Second, we study how ALA improves performance by addressing the loss-metric mismatch. Figure 5 shows CIFAR-10 training with ALA using (a) classification error and (b) AUCPR as metrics. We use the validation-metric-based reward for ALA by default, and follow the aforementioned training settings for each metric. The fixed cross-entropy loss, without explicit loss-metric alignment, suffers from an undesirable mismatch between the monitored validation loss and the test metric in both cases (see the different curve shapes and variances). ALA reduces the loss-metric mismatch by sequentially aligning the loss to the evaluation metric, even though the classification error and AUCPR cases exhibit different patterns. In doing so, ALA lowers not only the validation loss but also the generalization (test) error.
7 Conclusion
We introduced ALA to make it easier to improve model performance on task-specific evaluation metrics. Performance is often hindered by the loss-metric mismatch, and we showed that ALA overcomes this by sequentially aligning the loss to the metric. We demonstrated significant gains over existing methods on classification and metric learning, using an efficient method that is simple to tailor, which makes ALA useful for automated machine learning. Intriguingly, ALA improves optimization and generalization simultaneously, in contrast to methods that focus on one or the other. We leave a theoretical understanding of the effectiveness of loss function adaptation to future work.
Acknowledgements
The authors want to thank (in alphabetical order) Leon Gatys, Kelsey Ho, Qi Shan, Feng Tang, Karla Vega, Russ Webb and many others at Apple for helpful discussions during the course of this project. In addition, we are grateful to Harry Guo, Myra Haggerty, Jerremy Holland and John Giannandrea for supporting the research effort. We also thank the ICML reviewers for providing useful feedback.
References
 Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gómez, S., Hoffman, M. W., Pfau, D., Schaul, T., Shillingford, B., and de Freitas, N. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems 29, pp. 3981–3989, 2016.
 Bello et al. (2017) Bello, I., Zoph, B., Vasudevan, V., and Le, Q. V. Neural optimizer search with reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pp. 459–468, 2017.
 Bengio et al. (2009) Bengio, Y., Louradour, J., Collobert, R., and Weston, J. Curriculum learning. In Proceedings of the 26th International Conference on Machine Learning, pp. 41–48, 2009.
 Bergstra & Bengio (2012) Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.
 Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
 Eban et al. (2017) Eban, E., Schain, M., Mackey, A., Gordon, A., Rifkin, R., and Elidan, G. Scalable learning of non-decomposable objectives. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 832–840, 2017.
 Fan et al. (2018) Fan, Y., Tian, F., Qin, T., Li, X.-Y., and Liu, T.-Y. Learning to teach. In International Conference on Learning Representations, 2018.
 Franceschi et al. (2017) Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. Forward and reverse gradient-based hyperparameter optimization. In Proceedings of the 34th International Conference on Machine Learning, pp. 1165–1173, 2017.
 Ge et al. (2018) Ge, W., Huang, W., Dong, D., and Scott, M. R. Deep metric learning with hierarchical triplet loss. In The European Conference on Computer Vision (ECCV), 2018.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 Huang et al. (2017) Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 Huang et al. (2007) Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, University of Massachusetts, Amherst, 2007.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456, 2015.
 Jaderberg et al. (2017) Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., Fernando, C., and Kavukcuoglu, K. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.
 Jenni & Favaro (2018) Jenni, S. and Favaro, P. Deep bilevel learning. In The European Conference on Computer Vision (ECCV), 2018.
 Jiang et al. (2018) Jiang, L., Zhou, Z., Leung, T., Li, L., and Fei-Fei, L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In Proceedings of the 35th International Conference on Machine Learning, pp. 2309–2318, 2018.
 Kar et al. (2014) Kar, P., Narasimhan, H., and Jain, P. Online and stochastic gradient methods for non-decomposable loss functions. In Advances in Neural Information Processing Systems 27, pp. 694–702, 2014.
 Kim et al. (2018) Kim, W., Goyal, B., Chawla, K., Lee, J., and Kwon, K. Attention-based ensemble for deep metric learning. In The European Conference on Computer Vision (ECCV), pp. 760–777, 2018.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Kumar et al. (2010) Kumar, M. P., Packer, B., and Koller, D. Selfpaced learning for latent variable models. In Advances in Neural Information Processing Systems 23, pp. 1189–1197. 2010.
 Li et al. (2018) Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Neural Information Processing Systems, 2018.
 Lin et al. (2017) Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. Focal loss for dense object detection. In International Conference on Computer Vision (ICCV), 2017.
 Liu et al. (2016) Liu, W., Wen, Y., Yu, Z., and Yang, M. Large-margin softmax loss for convolutional neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pp. 507–516, 2016.
 Liu et al. (2017) Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., and Song, L. SphereFace: Deep hypersphere embedding for face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 Maclaurin et al. (2015) Maclaurin, D., Duvenaud, D., and Adams, R. P. Gradientbased hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, pp. 2113–2122, 2015.
 Opitz et al. (2017) Opitz, M., Waltner, G., Possegger, H., and Bischof, H. BIER — boosting independent embeddings robustly. In International Conference on Computer Vision (ICCV), 2017.
 Rakotomamonjy (2004) Rakotomamonjy, A. Optimizing area under ROC curve with SVMs. In ROCAI, 2004.
 Ramachandran et al. (2017) Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. CoRR, abs/1710.05941, 2017.
 Ren et al. (2018) Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In Proceedings of the 35th International Conference on Machine Learning, pp. 4331–4340, 2018.
 Santurkar et al. (2018) Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How does batch normalization help optimization? In Advances in Neural Information Processing Systems 31, pp. 2488–2498, 2018.
 Schroff et al. (2015) Schroff, F., Kalenichenko, D., and Philbin, J. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 815–823, 2015.
 Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959, 2012.
 Song et al. (2016) Song, H. O., Xiang, Y., Jegelka, S., and Savarese, S. Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4004–4012, 2016.
 Sun et al. (2014) Sun, Y., Chen, Y., Wang, X., and Tang, X. Deep learning face representation by joint identification-verification. In Advances in Neural Information Processing Systems, 2014.
 Wang et al. (2018) Wang, H., Wang, Y., Zhou, Z., Ji, X., Li, Z., Gong, D., Zhou, J., and Liu, W. CosFace: Large margin Cosine loss for deep face recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 Wen et al. (2016) Wen, Y., Zhang, K., Li, Z., and Qiao, Y. A discriminative feature learning approach for deep face recognition. In The European Conference on Computer Vision (ECCV), 2016.
 Wichrowska et al. (2017) Wichrowska, O., Maheswaranathan, N., Hoffman, M. W., Colmenarejo, S. G., Denil, M., de Freitas, N., and Sohl-Dickstein, J. Learned optimizers that scale and generalize. In Proceedings of the 34th International Conference on Machine Learning, pp. 3751–3760, 2017.
 Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
 Wu et al. (2017) Wu, C.-Y., Manmatha, R., Smola, A. J., and Krähenbühl, P. Sampling matters in deep embedding learning. In International Conference on Computer Vision, 2017.
 Wu et al. (2018) Wu, L., Tian, F., Xia, Y., Fan, Y., Qin, T., Jian-Huang, L., and Liu, T.-Y. Learning to teach with dynamic loss functions. In Advances in Neural Information Processing Systems 31, pp. 6467–6478, 2018.
 Xie et al. (2019) Xie, S., Zheng, H., Liu, C., and Lin, L. SNAS: stochastic neural architecture search. In International Conference on Learning Representations, 2019.
 Xu et al. (2019) Xu, H., Zhang, H., Hu, Z., Liang, X., Salakhutdinov, R., and Xing, E. Autoloss: Learning discrete schedule for alternate optimization. In International Conference on Learning Representations, 2019.
 Yi et al. (2014) Yi, D., Lei, Z., Liao, S., and Li, S. Z. Learning face representation from scratch. CoRR, abs/1411.7923, 2014.
 Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016.
 Zoph & Le (2017) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.
Supplementary Material
Appendix A More Experiments
Image classification on CIFAR-100: We train and evaluate ALA for the metric of classification error on the CIFAR-100 dataset (Krizhevsky, 2009). As with CIFAR-10, we randomly divide the CIFAR-100 training set into a new training set of 40k images and a validation set of 10k images for loss controller learning. The 10k test images are used for evaluation. We compare with recent methods that use the full 50k training images and their optimal hyperparameters. For ALA, multi-network training is adopted by default for robust online policy learning. Each network is trained via SGD with momentum.
Table S1 reports classification errors using different ResNet (He et al., 2016) architectures. For all network architectures, ALA outperforms both hand-designed loss functions, e.g., L-Softmax (Liu et al., 2016), and the adaptive loss function that acts as a differentiable metric surrogate in L2T-DLF (Wu et al., 2018). This validates the benefits of directly optimizing the evaluation metric using ALA.
Face verification on LFW: We evaluate the performance of our ALA-based metric learning method on a face verification task using the LFW dataset (Huang et al., 2007). The LFW verification benchmark contains 6,000 verification pairs. For a fair comparison with recent approaches, we train ALA using the same 64-layer ResNet architecture proposed in (Liu et al., 2017; Wang et al., 2018) as our main model. We follow the small training data protocol (Huang et al., 2007) and train and validate on the popular CASIA-WebFace dataset (Yi et al., 2014), which contains 494,414 images of 10,575 people. Training images with identities appearing in the test set are removed. Our ALA controller is trained to optimize the verification accuracy metric on the validation set.
Table S2 compares ALA to recent face recognition methods on LFW. These methods often adopt a strong but hand-designed loss function to improve class discrimination. In contrast, ALA adaptively controls the triplet loss function (Schroff et al., 2015), achieving state-of-the-art performance even under different parameterizations; we specifically studied focal weighting (Lin et al., 2017) and distance mixture formulations. These results further verify the advantage of ALA in directly optimizing for the target metric, regardless of the specific formulation of the loss function being controlled.
Table S1. Classification errors (%) on CIFAR-100.

Method | ResNet-8 | ResNet-20 | ResNet-32
cross-entropy | 39.79 | 32.33 | 30.38
L-Softmax (Liu et al., 2016) | 38.93 | 31.65 | 29.56
L2T-DLF (Wu et al., 2018) | 38.27 | 30.97 | 29.25
ALA | 37.78±0.09 | 30.54±0.07 | 29.06±0.09
Table S2. Face verification accuracy (%) on LFW.

Method | Accuracy
Softmax loss | 97.88
Softmax+Contrastive (Sun et al., 2014) | 98.78
Triplet loss (Schroff et al., 2015) | 98.70
L-Softmax loss (Liu et al., 2016) | 99.10
Softmax+Center loss (Wen et al., 2016) | 99.05
SphereFace (A-Softmax) (Liu et al., 2017) | 99.42
CosFace (LMCL) (Wang et al., 2018) | 99.33
Triplet + ALA (Focal weighting) | 99.49
Triplet + ALA (Distance mixture) | 99.57
Table S3. Baseline comparisons for classification (error) and metric learning (recall).

Method (Classification) | Error | Method (Metric learning) | Recall
cross-entropy | 7.51 | Triplet (Schroff et al., 2015) | 66.7
L2T (Fan et al., 2018) | 7.10 | Margin (Wu et al., 2017) | 72.7
L2T-DLF (Wu et al., 2018) | 6.95 | ABE-8 (Kim et al., 2018) | 76.3
ALA | 6.79 | Margin + ALA | 78.9
Contextual bandit | 7.34 | Contextual bandit | 73.1
PBT (Jaderberg et al., 2017) | 7.29 | PBT (Jaderberg et al., 2017) | 73.6
Appendix B More Analyses
Baseline comparisons: Table S3 compares related baselines on both classification and metric learning tasks to further highlight the benefits of ALA. In particular, we compare with a contextual bandit method and population-based training (PBT) (Jaderberg et al., 2017). Both baselines follow the same experimental settings on the respective datasets, as detailed in the main paper.
The contextual bandit method changes loss parameters (i.e., actions) according to the current training state, similar to an online hyperparameter search scheme. Following the same loss parameterizations for classification and metric learning, the method increases weights for confusing class pairs and metric-improving distance functions, respectively, and otherwise down-weights them. This resembles our one-step RL setting, except that in ALA, actions affect future states, making it an RL problem. Table S3 shows that one-step RL-based ALA consistently outperforms the heuristic contextual bandit method. More advanced bandit algorithms may work better, but RL has the capacity to learn flexible state-transition dynamics. Moreover, our RL setting can be extended to use multi-step episodes (Figure S1), which models longer-term effects of actions, whereas contextual bandits always obtain an immediate reward from a single action.

Recall that by default we train 10 child models in parallel for robust ALA policy learning. We are thus interested in how this compares to PBT techniques (using the same 10 child models). Table S3 shows that PBT does not help as much as ALA, which suggests the learned ALA policy is more powerful than model ensembling or parameter tuning. We will show later (in Figure S1) that ALA achieves competitive performance even with a single child model, which enjoys higher learning efficiency.
Ablation study: Table S4 shows the results of ablation studies on the design choices of the ALA loss controller and state representation. As in Table S3, we experiment with the example tasks of classification and metric learning under the same settings. Looking at the top cell of Table S4, we find that switching from a 2-layer to a 1-layer loss controller leads to a consistent performance drop; on the other hand, a 3-layer loss controller does not help much. The bottom cell of Table S4 quantifies the effects of the four components of our policy state. We can see that it is relatively more important to keep the historical sequence of validation statistics (beyond those at the current timestep) and the current loss parameters in the state representation. The relative change of validation statistics (from their moving average) and the normalized iteration number also make marginal contributions.
Computational cost: For the classification and metric learning tasks considered in the paper, our simultaneous (single-model) training and ALA policy learning incur an extra wall-clock cost over regular model training. However, this overhead is often canceled out by the convergence speedup of the main model. Our multi-model training together with policy learning achieves stronger performance with modest additional computational overhead for policy learning, at the cost of using distributed training to collect replay episodes. This is much more efficient than meta-learning methods, e.g., (Fan et al., 2018; Zoph & Le, 2017), that learn the policy by training the main model to convergence multiple times (e.g., 50 times).
Table S4. Ablation results relative to the default ALA: change in classification error and in metric-learning recall.

Method (Classification) | Error | Method (Metric learning) | Recall
ALA (1-layer MLP) | +0.06 | Margin+ALA (1-layer MLP) | −0.5
ALA (3-layer MLP) | −0.03 | Margin+ALA (3-layer MLP) | −0.1
ALA (w/o history) | +0.11 | Margin+ALA (w/o history) | −1.4
ALA (w/o statistics) | +0.04 | Margin+ALA (w/o statistics) | −0.2
ALA (w/o ) | +0.05 | Margin+ALA (w/o ) | −0.6
ALA (w/o iter#) | +0.02 | Margin+ALA (w/o iter#) | −0.3
Sample efficiency: Figure S1 illustrates the sample efficiency of ALA's RL approach on the example task of CIFAR-10 classification. We train the ResNet-32 model and use the default reward based on the validation metric. Figure S1(a) shows that episodes consisting of a single training step suffice to learn competent loss policies with good performance. Figure S1(b) further shows improvements from parallel training with multiple child models, which provide more episodes for policy learning. We empirically choose to use 10 child models, which incurs only a small extra time cost for policy learning, thus striking a good performance trade-off.
Policy visualization for classification: Figure S2 illustrates the ALA policy learned for classification, which performs actions to dynamically adjust the loss parameters (i.e., class correlations). We observe that the ALA controller tends to first merge similar classes using positive correlation values, and then gradually discriminates between them using negative values. This indicates a learned curriculum that guides model learning to achieve both better optimization and generalization.
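A minimal sketch of what a class-correlation loss parameterization might look like is given below; the exact form used by ALA differs, so treat `C` and the target-softening scheme as assumptions for illustration only:

```python
import numpy as np

def correlated_cross_entropy(probs, labels, C):
    """Cross-entropy parameterized by a class-correlation matrix C
    (n_classes x n_classes). Positive C[y, k] credits probability mass
    placed on class k when the label is y (merging similar classes);
    C = 0 recovers the plain cross-entropy loss."""
    n = C.shape[0]
    target = np.eye(n)[labels] + C[labels]        # soften the one-hot targets
    target = np.clip(target, 0.0, None)
    target /= target.sum(axis=1, keepdims=True)   # keep each target a distribution
    return float(-np.mean(np.sum(target * np.log(probs + 1e-12), axis=1)))

labels = np.array([0, 1])
probs = np.array([[0.7, 0.2, 0.1], [0.2, 0.7, 0.1]])
plain = correlated_cross_entropy(probs, labels, np.zeros((3, 3)))
# With C = 0 this reduces to standard cross-entropy on the true class.
assert abs(plain + float(np.mean(np.log([0.7, 0.7])))) < 1e-6
```

Under this view, the controller's actions would raise entries of `C` between confusable classes early in training and lower them (toward negative values) later, matching the merge-then-discriminate curriculum observed in Figure S2.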
Policy visualization for metric learning: We visualize the learned ALA policy for metric learning under a parametric loss formulation that mixes different distance functions. Figure S3(a) first shows the distance functions we apply to the anchor-positive distance and the anchor-negative distance, respectively: 5 increasing functions that penalize large anchor-positive distances, and 5 decreasing functions that penalize small anchor-negative distances. We empirically found our performance to be relatively robust to the design choices of these distance functions (verification accuracy on LFW varied little among our early trials), as long as they differ. The ability to learn adaptive weightings over the distance functions plays a more important role.
Figure S3(b) demonstrates the evolution of the weights over our distance functions on the Stanford Online Products dataset. While the weights for our default distance functions are initialized to 1, the ALA controller learns to assign larger weights to the high-penalty distance functions over time. This implies an adaptive "hard mining" curriculum, learned from data, that is more flexible than hand-designed alternatives.
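The distance-mixture idea can be sketched as a weighted sum of basis penalties; the specific basis functions below are assumptions for illustration, not the paper's exact choices:

```python
import numpy as np

def mixture_triplet_loss(d_ap, d_an, w_pos, w_neg):
    """Parametric triplet-style loss: a weighted mixture of increasing
    penalties on the anchor-positive distance d_ap and decreasing
    penalties on the anchor-negative distance d_an. A controller would
    adapt w_pos and w_neg over the course of training."""
    pos_basis = np.array([d_ap, d_ap ** 2, np.sqrt(d_ap),
                          np.log1p(d_ap), d_ap ** 1.5])        # increasing in d_ap
    neg_basis = np.array([-d_an, np.exp(-d_an), 1.0 / (1.0 + d_an),
                          -np.log1p(d_an), -np.sqrt(d_an)])    # decreasing in d_an
    return float(w_pos @ pos_basis + w_neg @ neg_basis)

w = np.ones(5)  # uniform initial weights, mirroring the initialization described above
# Larger anchor-positive distance -> larger loss; larger anchor-negative -> smaller loss.
assert mixture_triplet_loss(2.0, 1.0, w, w) > mixture_triplet_loss(1.0, 1.0, w, w)
assert mixture_triplet_loss(1.0, 2.0, w, w) < mixture_triplet_loss(1.0, 1.0, w, w)
```

Shifting weight toward the steeper basis functions (e.g., the quadratic term) over time makes hard triplets dominate the gradient, which is one way to realize the adaptive "hard mining" behavior seen in Figure S3(b).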
Appendix C Limitations
In this work we studied multiple evaluation metric formulations (classification accuracy and AUCPR for the classification settings, and Recall@k and verification accuracy for metric learning). While this includes non-decomposable metrics, we did not extend to more complex scenarios that might reveal further benefits of ALA. In future work we plan to apply ALA to multiple simultaneous objectives, where the controller will need to weigh between these objectives dynamically. We would also like to examine cases where the output of a given model is an input into a more complex pipeline, which is common in production systems (e.g., detection-alignment-recognition pipelines). This requires further machinery to be developed for making reward evaluation efficient enough to learn the policy jointly with training the different modules.
Another direction for developing ALA further is to make it less dependent on specific task types and loss/metric formulations. Ideally, a controller could be trained through continual learning to handle different scenarios flexibly. This would enable the use of ALA in distributed crowd learning settings where model training improves over time.
Finally, an interesting area for further study is how ALA behaves in dynamically changing environments where the available training data can change over time (e.g., lifelong learning, online learning, meta-learning). We believe ALA is well suited to tackle these challenges, and we will continue to explore this in future work.