AutoLoss: Automated Loss Function Search in Recommendations

06/12/2021 ∙ by Xiangyu Zhao, et al. ∙ ByteDance Inc. ∙ Michigan State University

Designing an effective loss function plays a crucial role in training deep recommender systems. Most existing works leverage a predefined and fixed loss function, which can lead to suboptimal recommendation quality and training efficiency. Some recent efforts rely on exhaustively or manually searched weights to fuse a group of candidate loss functions, which is exceptionally costly in computation and time. They also neglect the various convergence behaviors of different data examples. In this work, we propose an AutoLoss framework that can automatically and adaptively search for the appropriate loss function from a set of candidates. To be specific, we develop a novel controller network, which can dynamically adjust the loss probabilities in a differentiable manner. Unlike existing algorithms, the proposed controller can adaptively generate the loss probabilities for different data examples according to their varied convergence behaviors. Such a design improves the model's generalizability and transferability between deep recommender systems and datasets. We evaluate the proposed framework on two benchmark datasets. The results show that AutoLoss outperforms representative baselines. Further experiments have been conducted to deepen our understanding of AutoLoss, including its transferability, components and training efficiency.




1. Introduction

In the era of information explosion, recommender systems play a pivotal role in alleviating information overload, vastly enhancing user experiences in many commercial applications, such as generating playlists in video and music services (Zhao et al., 2021, 2020d), recommending products in online stores (Zhao et al., 2020c; Zou et al., 2020; Fan et al., 2020; Zhao et al., 2018a, b), and suggesting locations for geo-social events (Liu et al., 2017; Zhao et al., 2016; Guo et al., 2016). With the recent growth of deep learning techniques, there has been increasing interest in developing deep recommender systems (DRS) (Nguyen et al., 2017; Wu et al., 2016a). DRS have improved recommendation quality since they can effectively learn feature representations and capture the nonlinear relationships between users and items via deep architectures (Zhang et al., 2019). Aside from developing sophisticated neural network architectures, well-designed loss functions have also been demonstrated to be effective in improving performance in different recommendation tasks, such as item rating prediction (regression) (Ravi and Vairavasundaram, 2016), click-through rate prediction (binary classification) (Guo et al., 2017; Ge et al., 2021), user behavior prediction (multi-class classification) (Zhao et al., 2019), and item retrieval (clustering) (Gao et al., 2020).

To optimize DRS frameworks, most existing works are based on a predefined and fixed loss function, such as mean-squared-error (MSE) or mean-absolute-error (MAE) loss for regression tasks. DRS frameworks are then optimized via back-propagation, which computes gradients effectively and efficiently to minimize the given loss on the training dataset. During this process, the key step is to calculate the gradients of the loss with respect to the network parameters. However, it is often unclear whether the gradients generated from a given loss function are optimal. For example, in regression tasks, the MSE loss penalizes large errors heavily and thus discourages outlier predictions with huge errors, while MAE performs better if we only want a well-rounded model that performs well on the majority of examples (Fahrmeir et al., 2007; Chatterjee and Hadi, 2015). Therefore, solely utilizing a predefined and fixed loss function for all data examples, i.e., user-item interactions, cannot guarantee the optimal gradients throughout, especially when the interactions have varied convergence behaviors in the non-stationary environment of online recommendation platforms. In addition, there is often a gap between the model training and evaluation performance in real-world recommender systems. For instance, we usually train a predictive model by minimizing cross-entropy loss in online advertising, and evaluate the model performance by click-through rate (CTR). This naturally raises a question: can we incorporate more loss functions in the training phase to enhance the model performance?

Figure 1. Overview of the AutoLoss framework.

Efforts have been made to develop strategies that fuse multiple loss functions in a weighted sum fashion. For example, Panoptic FPN (Kirillov et al., 2019) leverages a grid search to find better loss weights; and UPSNet (Xiong et al., 2019) carefully investigates the weighting scheme of loss functions. However, these works rely on an exhaustive or manual search for loss weights over a large candidate space, which is extremely costly in both computing power and time. Also, they aim to learn a set of unified and static weights over the loss functions, which entirely overlooks the different convergence behaviors of data examples. Finally, the loss weights must be searched again whenever switching among different DRS frameworks or recommendation datasets, which reduces their generalizability and transferability.

In order to obtain more accurate gradients and thereby improve the recommendation performance and the training efficiency, we propose an automated loss function search framework, AutoLoss, which can dynamically and adaptively select appropriate loss functions for training DRS frameworks. Unlike existing models that use predefined and fixed loss functions, or loss weights that are exhaustively or manually searched, AutoLoss automatically selects the loss function for each data example in a differentiable manner. The experiments on two datasets demonstrate the effectiveness of the proposed framework. We summarize our major contributions as follows:

  • We propose an end-to-end framework, AutoLoss, which can automatically select the proper loss functions for training DRS frameworks with better recommendation performance and training efficiency;

  • A novel controller network is developed to adaptively adjust the probabilities over multiple loss functions according to different data examples’ dynamic convergence behaviors during training, which enhances the model generalizability between different DRS frameworks and datasets;

  • We empirically demonstrate the effectiveness of the proposed framework on real-world benchmark datasets. Extensive studies verify the importance of model components and the transferability of AutoLoss.

The rest of this paper is organized as follows. In Section 2, we detail the framework of automatically searching the probabilities over multiple loss functions, the architecture of the main DRS network and controller network, and propose an AutoML-based optimization algorithm. Section 3 carries out experiments based on real-world datasets and presents experimental results. Section 4 briefly reviews related work. Finally, Section 5 concludes this work and discusses future work.

2. The Proposed Framework

In this section, we will present an end-to-end framework, AutoLoss, which effectively tackles the aforementioned challenges in Section 1 via automatically and adaptively searching the optimal loss function from several candidates according to data examples’ convergence behaviors. We will first provide an overview of the framework; next detail the architectures of the main DRS network; then introduce the loss function search method with a novel controller network; and finally provide an AutoML-based optimization algorithm.

2.1. An Overview

In this subsection, we will give an overview of the AutoLoss framework. AutoLoss aims to automatically select appropriate loss functions from a set of candidates for different data examples (i.e., user-item interactions). We demonstrate the framework in Figure 1. With a DRS network, a controller network and a set of pre-defined candidate loss functions, the learning process of AutoLoss mainly consists of two major steps.

The forward-propagation step. Given a mini-batch of data examples, the main DRS network first generates predictions $\hat{y}$ based on the input features $x$. Then, we can calculate the losses $\ell_1(y, \hat{y}), \dots, \ell_n(y, \hat{y})$ for each candidate loss function according to the ground truth labels $y$ and predictions $\hat{y}$. Meanwhile, the controller network takes $(y, \hat{y})$ and outputs the probabilities $p$ over loss functions for each data example. Finally, the overall loss $\mathcal{L}$ can be calculated according to the candidate losses and the probabilities $p$.

The backward-propagation step. We first fix the parameters of the controller network and update the main DRS network parameters on the training data examples. Then, we fix the DRS parameters and optimize the controller network parameters based on a mini-batch of validation data examples. This alternating updating approach enhances the generalizability, and prevents AutoLoss from selecting probabilities that overfit the training data examples (Pham et al., 2018; Liu et al., 2018). Next, we will introduce the details of AutoLoss.

Figure 2. Architectures of DeepFM and IPNN.

2.2. Deep Recommender System Network

AutoLoss is quite general for most existing deep recommender system frameworks (Rendle, 2010; Guo et al., 2017; Lian et al., 2018; Qu et al., 2016). As visualized in Figure 2, they typically have four components: embedding layer, interaction layer, MLP layer and output layer. We now briefly introduce these components.

2.2.1. Embedding Layer

The raw input features of users and items are usually categorical or numeric, and in the form of multiple fields. Most DRS works first transform the input features into binary vectors, and then embed them into continuous vectors using a field-wise embedding. In this way, a user-item interaction data example $x$ can be represented as the concatenation of binary vectors from all feature fields:

$$x = [x_1, x_2, \dots, x_m]$$

where $m$ is the number of feature fields and $x_i$ is the binary vector of the $i$-th field. The categorical data are transformed into binary vectors via one-hot encoding, e.g., $[1, 0]$ and $[0, 1]$ for the two values of a two-valued categorical field. The numeric data are first partitioned into buckets, and then we have a binary vector for each bucket, e.g., for an age feature we can use $[1, 0, 0, 0]$ for children, $[0, 1, 0, 0]$ for youth, $[0, 0, 1, 0]$ for adults, and $[0, 0, 0, 1]$ for seniors, each defined by an age range.

Since the vector $x$ is high-dimensional and very sparse, and different feature fields have various lengths, DRS models usually introduce an embedding layer to transform each binary vector $x_i$ into a low-dimensional continuous vector as:

$$e_i = M_i x_i$$

where $M_i \in \mathbb{R}^{d \times n_i}$ is the weight matrix, with $n_i$ the number of unique feature values in the $i$-th feature field, and $d$ is the pre-defined size of low-dimensional vectors. (For multi-valued features, e.g., “Interest=Movie, Sports”, the feature embedding is the sum or average of multiple embeddings (Covington et al., 2016).) Finally, the embedding layer will output the concatenation of embedding vectors from all feature fields:

$$E = [e_1, e_2, \dots, e_m]$$

2.2.2. Interaction Layer

After representing the input features as low-dimensional embeddings, DRS models usually develop an interaction layer to explicitly capture the interactions among feature fields. The most widely used method is the factorization machine (FM) (Rendle, 2010). In addition to the linear interactions among features, FM can explicitly model the pairwise (second-order) feature interactions via the inner product of feature embeddings:

$$[\langle e_1, e_2 \rangle, \langle e_1, e_3 \rangle, \dots, \langle e_{m-1}, e_m \rangle]$$

where $\langle e_i, e_j \rangle$ is the inner product of two embeddings, and the number of pairwise feature interactions is $m(m-1)/2$. Then, the interaction layer will output:

$$h_{\mathrm{FM}} = \langle w, x \rangle + \sum_{i=1}^{m} \sum_{j=i+1}^{m} \langle e_i, e_j \rangle$$

where $w$ is the weight over the binary vector $x$ of input features. The first term represents the impact of first-order feature interactions, and the second term reflects the impact of second-order feature interactions. FM can also explicitly model higher-order interactions, such as third-order ones, but this adds considerable computation.
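To make the embedding and interaction layers concrete, the following minimal sketch encodes three fields as one-hot vectors, looks up their embeddings, and computes the FM output as a first-order term plus the sum of pairwise inner products. The field sizes, the embedding size and the random weights are hypothetical values chosen for illustration, and plain Python lists stand in for tensors.

```python
import random

def one_hot(index, size):
    """Binary vector with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(matrix, binary_vec):
    """Compute e = M x for a d-by-n matrix (list of d rows) and an n-dim binary vector."""
    return [sum(m * x for m, x in zip(row, binary_vec)) for row in matrix]

def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def fm_output(w, fields, embeddings):
    """First-order term <w, x> plus the sum of pairwise inner products of embeddings."""
    x = [v for field in fields for v in field]  # concatenated binary vector
    first_order = inner(w, x)
    m = len(embeddings)
    second_order = sum(inner(embeddings[i], embeddings[j])
                       for i in range(m) for j in range(i + 1, m))
    return first_order + second_order

# Hypothetical toy setup: m = 3 fields with 2, 4 and 3 unique values, embedding size d = 2.
rng = random.Random(0)
field_sizes = [2, 4, 3]
d = 2
matrices = [[[rng.uniform(-1, 1) for _ in range(n)] for _ in range(d)]
            for n in field_sizes]
fields = [one_hot(1, 2), one_hot(2, 4), one_hot(0, 3)]
embeddings = [embed(M, x) for M, x in zip(matrices, fields)]
w = [rng.uniform(-1, 1) for _ in range(sum(field_sizes))]

score = fm_output(w, fields, embeddings)
num_pairs = len(field_sizes) * (len(field_sizes) - 1) // 2  # m(m-1)/2 pairwise terms
```

Because each binary vector is one-hot, multiplying it by the embedding matrix simply selects one column, which is why practical implementations use a table lookup instead of a matrix product.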

2.2.3. MLP Layer

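The MLP layer combines and transforms the features, e.g., the embeddings and the interaction-layer output, with several fully-connected layers and activations. The output of each layer is:

$$h_{l+1} = \sigma(W_l h_l + b_l) \tag{5}$$

where $W_l$ is the weight matrix and $b_l$ is the bias vector for the $l$-th hidden layer, and $\sigma(\cdot)$ is the activation function. $h_0$ is the input of the first fully-connected layer, and we denote the final output of the MLP layer as $h_{\mathrm{MLP}}$.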

2.2.4. Output Layer

Finally, the output layer, which follows the previous layers, generates the prediction for a user-item interaction data example. The input $h$ of the output layer can differ across DRS models, e.g., the combination of the interaction-layer output and the MLP output in DeepFM (Guo et al., 2017), and the MLP output alone in IPNN (Qu et al., 2016), as shown in Figure 2. The output layer yields the prediction of the user-item interaction as:

$$\hat{y} = \sigma(W_o h + b_o)$$

where $W_o$ and $b_o$ are the weight matrix and bias vector for the output layer. The activation function $\sigma(\cdot)$ is selected based on the recommendation task, such as sigmoid for binary classification (Guo et al., 2017), and softmax for multi-class classification (Tan et al., 2016). Finally, given a set of candidate loss functions $\ell_1, \dots, \ell_n$, such as mean-squared-error, categorical hinge and cross-entropy, we can compute the candidate losses:

$$\ell_i(y, \hat{y}), \quad i = 1, \dots, n$$

where $y$ is the ground truth label and $n$ is the number of candidate loss functions.
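The MLP layer, the output layer and the candidate losses can be sketched together as follows. This is a minimal illustration under stated assumptions: the layer sizes and weights are hypothetical, the categorical hinge follows the common definition max(0, 1 + max over wrong classes of the prediction minus the prediction for the true class), and batch normalization and dropout are omitted.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, b, h):
    """Compute W h + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def mse(y, y_hat):
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def categorical_hinge(y, y_hat):
    pos = sum(a * b for a, b in zip(y, y_hat))          # score of the true class
    neg = max((1.0 - a) * b for a, b in zip(y, y_hat))  # best score among wrong classes
    return max(0.0, neg - pos + 1.0)

def cross_entropy(y, y_hat, eps=1e-12):
    return -sum(a * math.log(b + eps) for a, b in zip(y, y_hat))

# Hypothetical tiny network: 3 inputs -> 2 hidden units (ReLU) -> 3 classes (softmax).
h0 = [1.0, -2.0, 0.5]
W1, b1 = [[0.5, 0.0, 1.0], [-1.0, 0.5, 0.0]], [0.0, 0.1]
Wo, bo = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]

h1 = relu(affine(W1, b1, h0))          # one hidden layer: sigma(W h + b)
y_hat = softmax(affine(Wo, bo, h1))    # output layer with softmax activation
y = [1.0, 0.0, 0.0]                    # one-hot ground truth label

# One loss value per candidate loss function for this data example.
candidate_losses = [f(y, y_hat) for f in (mse, categorical_hinge, cross_entropy)]
```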

2.3. Loss Function Search

AutoLoss aims to adaptively and automatically search for the optimal loss function, which can enhance the prediction quality and training efficiency of the DRS network. This is naturally challenging because of the complex relationship between the DRS parameters and the probabilities over candidate loss functions. To address this challenge, many existing works have focused on developing fusing strategies for multiple loss functions, which can take advantage of multiple loss functions in a weighted sum manner:

$$\mathcal{L}(y, \hat{y}) = \sum_{i=1}^{n} \alpha_i \cdot \ell_i(y, \hat{y}) \tag{8}$$

where $y$ is the ground truth, $\hat{y}$ is the prediction from the DRS network, and $\ell_i$ is the $i$-th candidate loss function. The continuous loss weights $\alpha_i$ measure the candidates’ contributions to the final loss function $\mathcal{L}$. However, this method relies on an exhaustive or manual search of loss weights from a large search space, which is extremely costly. Also, this soft fusing strategy cannot completely eliminate the impact of suboptimal candidate loss functions on the final loss function $\mathcal{L}$; thus, a hard selection method is desired. However, hard selection usually makes the training framework not end-to-end differentiable.

Reinforcement learning (RL) is a potential solution to tackle the hard selection problem. However, since RL is generally built upon the Markov decision process, it utilizes temporal-difference learning to make sequential actions. Consequently, the agent cannot receive the reward until the loss function is selected and the DRS is evaluated. In other words, the temporal-difference setting can suffer from delayed rewards. To address this issue, we introduce the Gumbel-softmax operation to simulate the hard selection over candidate loss functions, where the non-differentiable sampling from a categorical distribution is approximated by differentiable sampling from the Gumbel-softmax distribution (Jang et al., 2016).

Given the continuous loss weights $[\alpha_1, \dots, \alpha_n]$ over candidate loss functions, we can draw a hard selection $k$ through the Gumbel-max trick (Gumbel, 1948) as:

$$k = \arg\max_{i \in [1, n]} \left( \log \alpha_i + g_i \right)$$

where $g_i = -\log(-\log(u_i))$ and $u_i \sim \mathrm{Uniform}(0, 1)$. The independent and identically distributed (i.i.d.) Gumbel noises $g_i$ perturb the $\log \alpha_i$ terms. Also, they make the $\arg\max$ operation equivalent to drawing a sample from the loss weights $\alpha_1, \dots, \alpha_n$. However, because of the $\arg\max$ operation, this sampling method is non-differentiable. We tackle this problem by straight-through Gumbel-softmax (Jang et al., 2016), which leverages a softmax function as a differentiable approximation to the $\arg\max$ operation:

$$p_i = \frac{\exp\left( (\log \alpha_i + g_i) / \tau \right)}{\sum_{j=1}^{n} \exp\left( (\log \alpha_j + g_j) / \tau \right)} \tag{10}$$

where $p_i$ is the probability of selecting the $i$-th candidate loss function. The temperature parameter $\tau$ is introduced to manage the smoothness of the Gumbel-softmax operation’s output. Specifically, the output approaches a one-hot vector as $\tau$ approaches zero. Then the final loss function can be reformulated as:

$$\mathcal{L}(y, \hat{y}) = \sum_{i=1}^{n} p_i \cdot \ell_i(y, \hat{y})$$

In conclusion, the loss function search process becomes end-to-end differentiable by introducing the Gumbel-softmax operation, while behaving similarly to hard selection. Next, we will discuss how to generate data example-level loss weights $\alpha$.
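The two sampling operations above can be sketched in a few lines. The weights, loss values and temperatures below are hypothetical, and the sketch covers only the forward computation: the straight-through gradient handling needed during training is left out.

```python
import math
import random

def gumbel_noise(rng):
    """g = -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def gumbel_max(alpha, rng):
    """Hard selection: index of the largest perturbed log-weight."""
    scores = [math.log(a) + gumbel_noise(rng) for a in alpha]
    return max(range(len(alpha)), key=lambda i: scores[i])

def gumbel_softmax(alpha, tau, rng):
    """Differentiable relaxation: softmax of the perturbed log-weights divided by tau."""
    scores = [(math.log(a) + gumbel_noise(rng)) / tau for a in alpha]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
alpha = [0.1, 0.6, 0.3]  # hypothetical loss weights (they sum to 1)

# Repeated Gumbel-max draws select index i with frequency close to alpha[i].
counts = [0, 0, 0]
for _ in range(20000):
    counts[gumbel_max(alpha, rng)] += 1
freqs = [c / 20000 for c in counts]

soft = gumbel_softmax(alpha, tau=1.0, rng=rng)    # smooth probabilities
sharp = gumbel_softmax(alpha, tau=0.01, rng=rng)  # typically close to one-hot

# Final loss: probability-weighted sum of (hypothetical) candidate loss values.
losses = [0.9, 0.2, 0.5]
overall = sum(p * l for p, l in zip(sharp, losses))
```

Lowering the temperature moves the output toward a hard selection at the price of higher-variance gradients, which is the usual motivation for annealing it during training.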

2.4. The Controller Network

As in Eq. (8), we suppose that $[\alpha_1, \dots, \alpha_n]$ are the original (continuous) class probabilities over candidate loss functions before the Gumbel-softmax operation. This assumption aims to learn a set of unified and static probabilities over the candidate loss functions. However, the environment of real-world commercial recommendation platforms is always non-stationary, and different user-item interaction examples have varying convergence behaviors. This cannot be handled by unified and static probabilities, resulting in suboptimal model performance, generalizability and transferability.

We propose a controller network to address this challenge, which learns to generate original class probabilities for each data example. Motivated by curriculum learning (Bengio et al., 2009; Jiang et al., 2014), the original class probabilities should be generated according to the ground truth labels $y$ and the output of the DRS network $\hat{y}$. Therefore, the input of the controller network is a mini-batch $\{(y_b, \hat{y}_b)\}_{b=1}^{B}$, followed by an MLP layer with several fully-connected layers like Eq. (5). Afterwards, the controller’s output layer generates continuous class probabilities $[\alpha_1^b, \dots, \alpha_n^b]$ for each data example in the mini-batch via a standard softmax activation, where $B$ is the size of the mini-batch. In other words, each data example has individual probabilities. The controller can enhance the recommendation quality, model generalizability and transferability, which is validated by the extensive experiments.
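A minimal sketch of the controller's forward pass is given below, assuming a three-class task, two hidden layers and four candidate losses. All sizes and the random initialization are hypothetical, and batch normalization and dropout are omitted for brevity.

```python
import math
import random

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def affine(W, b, h):
    """Compute W h + b for a weight matrix W (list of rows) and bias vector b."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]

def init_layer(n_in, n_out, rng):
    """Randomly initialized weights and zero biases for one fully-connected layer."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def controller_forward(params, y, y_hat):
    """Map one example's (label, prediction) pair to probabilities over candidate losses."""
    h = list(y) + list(y_hat)  # controller input: ground truth and DRS prediction
    for W, b in params[:-1]:
        h = [max(0.0, v) for v in affine(W, b, h)]  # hidden layers with ReLU
    W, b = params[-1]
    return softmax(affine(W, b, h))  # standard softmax over candidate losses

rng = random.Random(0)
n_classes, hidden, n_losses = 3, 8, 4  # hypothetical sizes
params = [init_layer(2 * n_classes, hidden, rng),
          init_layer(hidden, hidden, rng),
          init_layer(hidden, n_losses, rng)]

batch = [([1.0, 0.0, 0.0], [0.7, 0.2, 0.1]),   # an example the DRS already fits well
         ([0.0, 0.0, 1.0], [0.6, 0.3, 0.1])]   # an example the DRS fits poorly
alphas = [controller_forward(params, y, y_hat) for y, y_hat in batch]
```

Each data example receives its own probability vector, so two examples in the same mini-batch can be steered toward different loss functions.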

2.5. An Optimization Method

In the above subsections, we formulated the loss function search as an architectural optimization problem and introduced the Gumbel-softmax operation, which makes the framework end-to-end differentiable. Now, we discuss the optimization of the AutoLoss framework.

In AutoLoss, the parameters to be optimized are from two networks. We denote the main DRS network’s parameters as $W$, and the controller network’s parameters as $V$. Note that the probabilities $p$ are directly generated by the Gumbel-softmax operation based on the controller’s output $\alpha$ as in Eq. (10). Inspired by automated machine learning techniques (Pham et al., 2018), $W$ and $V$ should not be updated on the same training data batch as in traditional supervised learning methods. This is because their optimizations are highly dependent on each other. As a result, updating $W$ and $V$ on the same training batch can lead to the model over-fitting on the training examples.

Input: features $x$ and ground-truth labels $y$
Output: well-learned DRS parameters $W^*$ and controller parameters $V^*$

1:  while not converged do
2:     Sample a mini-batch of validation data examples
3:     Estimate the approximation of $W^*(V)$ via Eq. (13)
4:     Update $V$ by descending $\nabla_{V} \mathcal{L}_{val}(W^*(V), V)$
5:     Sample a mini-batch of training data examples
6:     Update $W$ by descending $\nabla_{W} \mathcal{L}_{train}(W, V)$
7:  end while
Algorithm 1. An Optimization Algorithm for AutoLoss via DARTS.

Since AutoLoss is end-to-end differentiable, we update the DRS parameters $W$ and the controller parameters $V$ through gradient descent utilizing the differentiable architecture search (DARTS) techniques (Liu et al., 2018). To be specific, $W$ and $V$ are alternately updated on training and validation batches by minimizing the training loss $\mathcal{L}_{train}$ and validation loss $\mathcal{L}_{val}$, respectively. This forms a bi-level optimization problem (Pham et al., 2018), where controller parameters $V$ and DRS parameters $W$ are considered as the upper- and lower-level variables:

$$\min_{V} \; \mathcal{L}_{val}\left(W^*(V), V\right) \quad \text{s.t.} \quad W^*(V) = \arg\min_{W} \mathcal{L}_{train}(W, V) \tag{12}$$

where directly optimizing $V$ thoroughly via Eq. (12) is intractable since the inner optimization of $W$ is extremely costly. To tackle this issue, we use an approximation scheme for the inner optimization:

$$W^*(V) \approx W - \xi \nabla_{W} \mathcal{L}_{train}(W, V) \tag{13}$$

where $\xi$ is the predefined learning rate. This approximation scheme estimates $W^*(V)$ by descending only one step along the gradient $\nabla_{W} \mathcal{L}_{train}(W, V)$, rather than optimizing $W(V)$ thoroughly. To further enhance the computation efficiency, we can set $\xi = 0$, i.e., the first-order approximation.

We detail the AutoLoss optimization via DARTS in Algorithm 1. More specifically, in each iteration, we first sample a mini-batch of validation data examples of user-item interactions (line 2); next, we estimate (but do not update) $W^*(V)$ via the approximation scheme in Eq. (13) (line 3); then, we update the controller parameters $V$ by one step based on the estimation (line 4); afterward, we sample a mini-batch of training data examples (line 5); and finally, we update $W$ by descending $\nabla_{W} \mathcal{L}_{train}(W, V)$ by one step (line 6).
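The alternating updates of Algorithm 1 can be illustrated on a toy problem. The sketch below is deliberately simplified and rests on several assumptions that are not part of the paper: the "DRS" is a single scalar weight fitting y = 2x, the "controller" is a pair of logits shared by all examples rather than a per-example network, a plain softmax replaces the Gumbel-softmax, gradients are obtained by finite differences instead of back-propagation, and the first-order approximation is used for lines 3 and 4.

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_loss(w, v, batch):
    """Batch mean of p_0 * squared error + p_1 * absolute error, with p = softmax(v)."""
    p = softmax(v)
    total = 0.0
    for x, y in batch:
        err = w * x - y
        total += p[0] * err ** 2 + p[1] * abs(err)
    return total / len(batch)

def num_grad(f, theta, eps=1e-5):
    """Central finite-difference gradient of f at the parameter list `theta`."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

rng = random.Random(0)

def make_data(n):
    """Noisy samples of the hypothetical target y = 2x."""
    return [(x, 2.0 * x + rng.gauss(0.0, 0.1))
            for x in (rng.uniform(-1, 1) for _ in range(n))]

train, val = make_data(200), make_data(200)
w, v = [0.0], [0.0, 0.0]  # toy DRS parameter W and controller logits V
lr_w, lr_v = 0.1, 0.1
initial_val = mixed_loss(w[0], v, val)

for step in range(300):
    val_batch = rng.sample(val, 32)                                 # line 2
    # Lines 3-4, first-order approximation: W*(V) is taken to be the current W.
    g_v = num_grad(lambda t: mixed_loss(w[0], t, val_batch), v)
    v = [vi - lr_v * gi for vi, gi in zip(v, g_v)]
    train_batch = rng.sample(train, 32)                             # line 5
    g_w = num_grad(lambda t: mixed_loss(t[0], v, train_batch), w)   # line 6
    w = [wi - lr_w * gi for wi, gi in zip(w, g_w)]

final_val = mixed_loss(w[0], v, val)
```

The point of the sketch is the update order: the controller parameters only ever see validation batches and the model parameters only ever see training batches.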

3. Experiment

This section presents extensive experiments on various datasets to evaluate the effectiveness of AutoLoss. We first introduce the experimental settings, then compare AutoLoss with representative baselines, and finally conduct model component and transferability analysis.

3.1. Datasets

We evaluate our model on two datasets, Criteo and ML-20m. Below we introduce these datasets; more statistics about them can be found in Table 1.

Criteo: It is a real-world commercial dataset to assess click-through rate prediction models for online ads. It consists of 45 million data examples, i.e., users’ click records on displayed ads. Each example contains 39 anonymous feature fields, where 13 fields are numerical and 26 fields are categorical. The 13 numerical fields are converted into categorical features through bucketing.

ML-20m: This is a benchmark dataset to evaluate recommendation algorithms, which contains 20 million 5-star ratings of movies by users. The dataset includes 27,278 movies and 138,493 users, i.e., 2 feature fields, where each user has at least 20 ratings.

3.2. Evaluation Metrics

AutoLoss is general for many recommendation tasks. To evaluate its effectiveness, we conduct binary classification (i.e., click-through rate prediction) on Criteo, and multi-class classification (i.e., 5-star ratings) on ML-20m. The two classification experiments are evaluated by AUC and Logloss, where a higher AUC or a lower Logloss means better performance (we evaluate the AUC for multi-class classification in a one-vs-rest manner). It is worth noting that slightly higher AUC and lower Logloss at the 0.001 level are considered significant in recommendations (Guo et al., 2017).
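For reference, the two metrics can be computed as in the sketch below. The labels and predicted probabilities are made-up toy values, the pairwise AUC is quadratic in the number of examples and meant only for illustration, and averaging the per-class AUCs without weights is an assumption, since the text only states that a one-vs-rest scheme is used.

```python
import math

def logloss(labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class of each example."""
    total = 0.0
    for y, p in zip(labels, probs):
        total -= math.log(min(max(p[y], eps), 1.0))
    return total / len(labels)

def binary_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ovr_auc(labels, probs, n_classes):
    """One-vs-rest AUC: unweighted mean of the per-class binary AUCs."""
    aucs = []
    for c in range(n_classes):
        binary = [1 if y == c else 0 for y in labels]
        aucs.append(binary_auc(binary, [p[c] for p in probs]))
    return sum(aucs) / n_classes

# Toy three-class example.
labels = [0, 1, 2, 1]
probs = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.4, 0.4, 0.2]]
auc = ovr_auc(labels, probs, 3)
ll = logloss(labels, probs)

# Toy binary example: three of the four positive-negative pairs are ranked correctly.
binary_example = binary_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])  # 0.75
```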

| Data | Criteo | ML-20m |
|---|---|---|
| # Interactions | 45,840,617 | 20,000,263 |
| # Feature Fields | 39 | 2 |
| # Feature Values | 1,086,810 | 165,771 |
| # Behavior | click or not | rating 1-5 |

Table 1. Statistics of the datasets.

3.3. Implementation

We implement AutoLoss based on a public library, which contains 16 representative recommendation models. We develop AutoLoss as an independent class, so we can easily apply our framework to all these models. In this paper, we only show the results on DeepFM (Guo et al., 2017) and IPNN (Qu et al., 2016) due to the page limitation. To be specific, the AutoLoss framework mainly contains two networks, i.e., the DRS network and the controller network.

| Dataset | Model | Metric | Focal | KL | Hinge | CE | MeLU | BOHB | DARTS | SLF | AutoLoss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Criteo | DeepFM | AUC ↑ | 0.8046 | 0.8042 | 0.8049 | 0.8056 | 0.8063 | 0.8065 | 0.8067 | 0.8081 | 0.8092* |
| Criteo | DeepFM | Logloss ↓ | 0.4466 | 0.4469 | 0.4463 | 0.4457 | 0.4436 | 0.4435 | 0.4433 | 0.4426 | 0.4416* |
| Criteo | IPNN | AUC ↑ | 0.8077 | 0.8072 | 0.8079 | 0.8085 | 0.8090 | 0.8092 | 0.8093 | 0.8098 | 0.8108* |
| Criteo | IPNN | Logloss ↓ | 0.4435 | 0.4437 | 0.4432 | 0.4428 | 0.4423 | 0.4422 | 0.4423 | 0.4418 | 0.4409* |
| ML-20m | DeepFM | AUC ↑ | 0.7681 | 0.7682 | 0.7685 | 0.7692 | 0.7695 | 0.7695 | 0.7696 | 0.7705 | 0.7717* |
| ML-20m | DeepFM | Logloss ↓ | 1.2320 | 1.2317 | 1.2316 | 1.2310 | 1.2307 | 1.2305 | 1.2305 | 1.2299 | 1.2288* |
| ML-20m | IPNN | AUC ↑ | 0.7721 | 0.7722 | 0.7725 | 0.7733 | 0.7735 | 0.7734 | 0.7736 | 0.7745 | 0.7756* |
| ML-20m | IPNN | Logloss ↓ | 1.2270 | 1.2269 | 1.2266 | 1.2260 | 1.2256 | 1.2257 | 1.2255 | 1.2249 | 1.2236* |

“*” indicates a statistically significant improvement over the best baseline under a two-sided t-test.
↑: the higher the better; ↓: the lower the better.

Table 2. Performance comparison of different loss function search methods.

For the DRS network, (a) Embedding layer: we set the embedding size as 16 following existing works (Zhu et al., 2020). (b) Interaction layer: we leverage the factorization machine and the inner product network to capture the interactions among feature fields for DeepFM and IPNN, respectively. (c) MLP layer: we have two fully-connected layers, and the layer size is 128. We also employ batch normalization, dropout and ReLU activation for both layers. (d) Output layer: the original DeepFM and IPNN are designed for click-through rate prediction and use sigmoid activation with the negative log-likelihood function. To fit the 5-class classification task on ML-20m, we modify the output layer correspondingly, i.e., the output layer is 5-dimensional with softmax activation.

For the controller network, (a) Input layer: the inputs are the ground truth labels and the predictions from the DRS network. (b) MLP layer: we also use two fully-connected layers with layer size 128, batch normalization, dropout and ReLU activation. (c) Output layer: the controller network outputs continuous loss probabilities with softmax activation, whose dimension equals the number of candidate loss functions.

For other hyper-parameters, (a) Gumbel-softmax: we use an annealing scheme for the temperature $\tau$ that depends on the training step. (b) Optimization: we use the same learning rate for updating both the DRS network and the controller network, with the Adam optimizer and a batch size of 2000. (c) We select the hyper-parameters of the AutoLoss framework via cross-validation, and we also tune the parameters of the baselines correspondingly for a fair comparison.

3.4. Overall Performance Comparison

AutoLoss is compared with the following loss function design and search methods:

  • Fixed loss function: the first group of baselines leverages a predefined and fixed loss function. We utilize Focal loss, KL divergence, Hinge loss and cross-entropy (CE) loss for both classification tasks.

  • Fixed weights over loss functions: this group of baselines aims to learn fixed weights over the loss functions in the first group, without considering the difference among data examples. In this group, we use the meta-learning method MeLU (Lee et al., 2019), as well as automated machine learning methods BOHB (Falkner et al., 2018) and DARTS (Liu et al., 2018).

  • Data example-wise loss weights: this group learns to assign different loss weights for different data examples according to their convergence behaviors. One existing work, stochastic loss function (SLF) (Liu and Lai, 2020), belongs to this group.

The overall performance is shown in Table 2. It can be observed that: (i) The first group of baselines achieves the worst recommendation performance in both recommendation tasks. Their optimizations are based on predefined and fixed loss functions during the training stage. This result demonstrates that leveraging a predefined and fixed loss function throughout can downgrade the recommendation quality. (ii) The methods in the second group outperform those in the first group. These methods try to learn weights over candidate loss functions according to their contributions to the optimization, and then combine them in a weighted sum manner. This validates that incorporating multiple loss functions in optimization can enhance the performance of deep recommender systems. (iii) The second group performs worse than the SLF, since the weights they learned are unified and static, which completely overlooks the various behaviors among different data examples. Therefore, SLF performs better by taking this factor into account. (iv) The decision network of SLF is optimized on the same training batch with the main DRS network via back-propagation, which can lead to over-fitting on the training batch. AutoLoss updates the DRS network on the training batch while updating the controller on the validation batch, which improves the model generalizability and results in better recommendation performance.

To summarize, AutoLoss achieves significantly better performance than state-of-the-art baselines on both datasets and tasks, which demonstrates its effectiveness.
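As a quick arithmetic check on the size of these gains, the snippet below computes the difference between AutoLoss and the best baseline (SLF) for every dataset and model, using only the numbers reported in Table 2.

```python
# (AUC, Logloss) of the best baseline (SLF) and of AutoLoss, copied from Table 2.
results = {
    ("Criteo", "DeepFM"): {"SLF": (0.8081, 0.4426), "AutoLoss": (0.8092, 0.4416)},
    ("Criteo", "IPNN"): {"SLF": (0.8098, 0.4418), "AutoLoss": (0.8108, 0.4409)},
    ("ML-20m", "DeepFM"): {"SLF": (0.7705, 1.2299), "AutoLoss": (0.7717, 1.2288)},
    ("ML-20m", "IPNN"): {"SLF": (0.7745, 1.2249), "AutoLoss": (0.7756, 1.2236)},
}

gains = {}
for key, r in results.items():
    auc_gain = round(r["AutoLoss"][0] - r["SLF"][0], 4)      # higher AUC is better
    logloss_drop = round(r["SLF"][1] - r["AutoLoss"][1], 4)  # lower Logloss is better
    gains[key] = (auc_gain, logloss_drop)

for (dataset, model), (auc_gain, logloss_drop) in gains.items():
    print(f"{dataset:7s} {model:7s} AUC +{auc_gain:.4f}  Logloss -{logloss_drop:.4f}")
```

The AUC gains range from 0.0010 to 0.0012 and the Logloss reductions from 0.0009 to 0.0013, i.e., around the 0.001 level that Section 3.2 describes as significant in recommendations.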

3.5. Transferability Study

In this subsection, we study the transferability of the controller. Specifically, we want to investigate (i) whether the controller trained with one DRS model can be applied to other DRS models; and (ii) whether the controller learned on one dataset can be directly used on other datasets.

Figure 3. Transferability study results.

To study the transferability across different DRS models, we take the controller trained via DeepFM and AutoLoss on Criteo, fix its parameters and apply it to train NFM (He and Chua, 2017) and AutoInt (Song et al., 2019) on Criteo. The results are shown in Figure 3 (a)-(d), which compare three settings: (i) the CE setting directly trains the new DRS model by minimizing the cross-entropy (CE) loss, which is the best single and fixed loss function in Table 2; (ii) the SLF setting uses the controller trained with DeepFM and SLF, which is the best baseline in Table 2; and (iii) the AutoLoss setting uses the controller trained with DeepFM and AutoLoss. From the figures, we can observe that the SLF setting is superior to the CE setting, which indicates that a pre-trained controller can improve other DRS models’ training performance. More importantly, the AutoLoss setting outperforms the SLF setting, which validates AutoLoss’s better transferability across different DRS models.

To study the transferability between different datasets, we train a controller on the Criteo dataset with DeepFM and AutoLoss, fix its parameters and apply it to train a new DeepFM on the Avazu dataset (Avazu is another benchmark dataset for CTR prediction, which contains 40 million user clicking behaviors in 11 days with categorical feature fields); this is the AutoLoss setting. We also consider (i) the CE setting: DeepFM is directly optimized by minimizing the cross-entropy (CE) loss on the Avazu dataset; and (ii) the SLF setting: DeepFM is optimized on the new dataset with the assistance of a controller pre-trained with DeepFM+SLF on Criteo. In Figure 3 (e)-(f), the AutoLoss setting shows superior performance over the CE and SLF settings, which demonstrates its better transferability between different datasets.

In summary, AutoLoss has better transferability across different DRS models and different recommendation datasets, which improves its usability in real-world recommender systems.

3.6. Impact of Model Components

In this subsection, in order to understand the contributions of important model components of AutoLoss, we systematically eliminate each component and define the following variants:

  • AL-1: This variant aims to assess the contribution of the controller. Thus, we assign equal weights to the four candidate loss functions.

  • AL-2: In this variant, we eliminate the Gumbel-softmax operation, and directly use the controller’s output, i.e., the continuous loss probabilities from standard softmax activation, which aims to evaluate the impact of Gumbel-softmax.

| Dataset | Model | Metric | AL-1 | AL-2 | AutoLoss |
|---|---|---|---|---|---|
| Criteo | DeepFM | AUC ↑ | 0.8052 | 0.8083 | 0.8092* |
| Criteo | DeepFM | Logloss ↓ | 0.4460 | 0.4422 | 0.4416* |
| Criteo | IPNN | AUC ↑ | 0.8081 | 0.8102 | 0.8108* |
| Criteo | IPNN | Logloss ↓ | 0.4431 | 0.4416 | 0.4409* |

“*” indicates a statistically significant improvement over the best baseline under a two-sided t-test.
↑: the higher the better; ↓: the lower the better.

Table 3. Impact of model components.

The results on the Criteo dataset are shown in Table 3. First, AL-1 has worse performance than AutoLoss, which validates the necessity to introduce the controller network. It is noteworthy that, AL-1 performs worse than all loss function search methods, and even the fixed cross-entropy (CE) loss in Table 2, which indicates that equally incorporating all candidate loss functions cannot guarantee better performance. Second, AutoLoss outperforms AL-2. The main reason is that AL-2 always generates gradients based on all the loss functions, which introduces some noisy gradients from the suboptimal candidate loss functions. In contrast, AutoLoss can obtain appropriate gradients by filtering out suboptimal loss functions via Gumbel-softmax, which enhances the model robustness.

Figure 4. Efficiency study results.

3.7. Efficiency Study

This section compares AutoLoss’s training efficiency with that of other loss function search methods. Training efficiency is an important consideration when deploying a DRS model in real-world applications. Our experiments are run on one GeForce GTX 1060 GPU.

The results of DeepFM on the Criteo dataset are illustrated in Figure 4 (a). We can observe that AutoLoss achieves the fastest training speed. The reasons are two-fold. First, AutoLoss can generate the most appropriate gradients to update the DRS, which increases the optimization efficiency. Second, we update the controller once after every 7 DRS updates, i.e., the controller updating frequency is 7. This trick not only reduces the training time by requiring fewer computations, but also enhances the performance. In Figure 4 (b)-(c), where the x-axis is the controller updating frequency, we find that updating the controller too frequently or too infrequently leads to suboptimal AUC/Logloss for DeepFM.
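The alternating schedule, in which the controller is updated once after every 7 DRS updates, can be sketched as below. The generator `alternating_schedule` and its parameter `controller_every` are hypothetical names introduced here, and the actual gradient steps are replaced by labels, so this only illustrates the ordering and the number of updates, not the training itself.

```python
def alternating_schedule(n_steps, controller_every=7):
    """Yield ("drs", step) for every step and ("controller", step)
    once after every `controller_every` DRS updates."""
    for step in range(1, n_steps + 1):
        yield ("drs", step)                 # the DRS is updated on every step
        if step % controller_every == 0:
            yield ("controller", step)      # the controller is updated less often

schedule = list(alternating_schedule(21))
n_drs = sum(1 for who, _ in schedule if who == "drs")
n_controller = sum(1 for who, _ in schedule if who == "controller")
print(n_drs, n_controller)
```

Compared with updating the controller after every DRS update, this schedule performs one seventh as many controller updates, which is where the saved computation comes from.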

To summarize, AutoLoss can efficiently achieve better performance, making it easier to deploy in real-world recommender systems.

4. Related Work

In this section, we briefly introduce the works related to our study. We first go over the latest studies in loss function search and then review works about AutoML for recommendations.

4.1. Loss Function Search

The loss function plays an essential part in a deep learning framework. The choice of the loss function significantly affects the performance of the learned model. Many efforts have been made to design desirable loss functions for specific tasks. For example, in the field of image processing, Rahman and Wang (2016) argued that the typical cross-entropy loss for semantic segmentation shows great limitations in aligning with evaluation metrics other than global accuracy. Ronneberger et al. (2015); Wu et al. (2016b) designed loss functions by taking class frequency into consideration to cater to the mIoU metric. Caliva et al. (2019); Qin et al. (2019) designed losses with larger weights at boundary regions to improve the boundary F1 score. Liu et al. (2016) proposed to replace the traditional Softmax loss with the large-margin Softmax (L-Softmax) loss to improve feature discrimination in classification tasks. Fan et al. (2019) used the sphere Softmax loss for the person re-identification task and obtained state-of-the-art results. The loss functions mentioned above are all designed manually, requiring ample expert knowledge, non-trivial time, and considerable human effort.

Recently, automated loss function search has drawn increasing interest from researchers in various machine learning (ML) fields. Xu et al. (2018) investigated how to automatically schedule iterative and alternate optimization processes for ML models. Liu and Lai (2020) proposed to optimize the stochastic loss function (SLF), where the loss function of an ML model is dynamically selected. During training, the model parameters and the loss parameters are learned jointly. Li et al. (2020) proposed automatically searching for specific surrogate losses to improve different evaluation metrics in the image semantic segmentation task. Jin et al. (2021) composed multiple self-supervised learning tasks to jointly encode multiple sources of information and produce more generalizable representations, and developed two automated frameworks to search the task weights. Besides, Li et al. (2019); Wang et al. (2020) designed search spaces for a series of existing loss functions and developed algorithms to search for the best parameters of the probability distribution for sampling loss functions.

4.2. AutoML for Recommendation

AutoML techniques are now widely used to automatically design deep recommendation systems. Previous works mainly focused on the design of the embedding layer and the selection of feature interaction patterns.

In terms of the embedding layer, Joglekar et al. (2020); Zhao et al. (2020a); Liu et al. (2021) proposed novel methods to automatically select the best embedding size for different feature fields in a recommendation system. Zhao et al. (2020b); Liu et al. (2020b) proposed to dynamically search embedding sizes for users and items based on their popularity in the streaming setting. Similarly, Ginart et al. (2019) proposed to use mixed dimension embeddings for users and items based on their query frequency. Kang et al. (2020) proposed a multi-granular quantized embeddings (MGQE) technique to learn compact embeddings for infrequent items. Cheng et al. (2020) proposed to perform embedding dimension selection with a soft selection layer, making the dimension selection more flexible. Guo et al. (2020) focused on the embeddings of numerical features. They proposed AutoDis, which automatically discretizes features in numerical fields and maps the resulting categorical features into embeddings.

As for feature interaction, Luo et al. (2019) proposed AutoCross that produces high-order cross features by performing beam search in a tree-structure feature space. Song et al. (2020); Khawar et al. (2020); Liu et al. (2020a); Xue et al. (2020) proposed to automatically discover feature interaction architectures for click-through rate (CTR) prediction. Tsang et al. (2020) proposed a method to interpret the feature interactions from a source recommendation model and apply them in a target recommendation model.

To the best of our knowledge, we are the first to investigate the automated loss function search for deep recommendation systems.

5. Conclusion

We propose a novel end-to-end framework, AutoLoss, to enhance recommendation performance and deep recommender systems’ training efficiency by selecting appropriate loss functions in a data-driven manner. AutoLoss can automatically select the proper loss function for each data example according to their varied convergence behaviors. To be specific, we first develop a novel controller network, which generates continuous loss weights based on the ground truth labels and the DRS’ predictions. Then, we introduce a Gumbel-softmax operation to simulate the hard selection over candidate loss functions, which filters out the noisy gradients from suboptimal candidates. Finally, we can select the optimal candidate according to the output from Gumbel-softmax. We conduct extensive experiments to validate the effectiveness of AutoLoss on two widely used benchmark datasets. The results show that our framework can improve recommendation performance and training efficiency with excellent transferability.


This work is supported by National Science Foundation (NSF) under grant numbers IIS1907704, IIS1928278, IIS1714741, IIS1715940, IIS1845081, CNS1815636, and an internal research fund from the Hong Kong Polytechnic University (project no. P0036200).


  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §2.4.
  • F. Caliva, C. Iriondo, A. M. Martinez, S. Majumdar, and V. Pedoia (2019) Distance map loss penalty term for semantic segmentation. arXiv preprint arXiv:1908.03679. Cited by: §4.1.
  • S. Chatterjee and A. S. Hadi (2015) Regression analysis by example. John Wiley & Sons. Cited by: §1.
  • W. Cheng, Y. Shen, and L. Huang (2020) Differentiable neural input search for recommender systems. arXiv preprint arXiv:2006.04466. Cited by: §4.2.
  • P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198. Cited by: footnote 1.
  • L. Fahrmeir, T. Kneib, S. Lang, and B. Marx (2007) Regression. Springer. Cited by: §1.
  • S. Falkner, A. Klein, and F. Hutter (2018) BOHB: robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pp. 1437–1446. Cited by: 2nd item.
  • W. Fan, T. Derr, X. Zhao, Y. Ma, H. Liu, J. Wang, J. Tang, and Q. Li (2020) Attacking black-box recommendations via copying cross-domain user profiles. arXiv preprint arXiv:2005.08147. Cited by: §1.
  • X. Fan, W. Jiang, H. Luo, and M. Fei (2019) Spherereid: deep hypersphere manifold embedding for person re-identification. Journal of Visual Communication and Image Representation 60, pp. 51–58. Cited by: §4.1.
  • W. Gao, X. Fan, J. Sun, K. Jia, W. Xiao, C. Wang, and X. Liu (2020) Deep retrieval: an end-to-end learnable structure model for large-scale recommendations. arXiv preprint arXiv:2007.07203. Cited by: §1.
  • Y. Ge, S. Liu, R. Gao, Y. Xian, Y. Li, X. Zhao, C. Pei, F. Sun, J. Ge, W. Ou, et al. (2021) Towards long-term fairness in recommendation. arXiv preprint arXiv:2101.03584. Cited by: §1.
  • A. Ginart, M. Naumov, D. Mudigere, J. Yang, and J. Zou (2019) Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810. Cited by: §4.2.
  • E. J. Gumbel (1948) Statistical theory of extreme values and some practical applications: a series of lectures. Vol. 33, US Government Printing Office. Cited by: §2.3.
  • H. Guo, X. Li, M. He, X. Zhao, G. Liu, and G. Xu (2016) CoSoLoRec: joint factor model with content, social, location for heterogeneous point-of-interest recommendation. In International Conference on Knowledge Science, Engineering and Management, pp. 613–627. Cited by: §1.
  • H. Guo, B. Chen, R. Tang, Z. Li, and X. He (2020) AutoDis: automatic discretization for embedding numerical features in ctr prediction. arXiv preprint arXiv:2012.08986. Cited by: §4.2.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for ctr prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1725–1731. Cited by: §1, §2.2.4, §2.2, §3.2, §3.3.
  • X. He and T. Chua (2017) Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 355–364. Cited by: §3.5.
  • E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §2.3, §2.3.
  • L. Jiang, D. Meng, S. Yu, Z. Lan, S. Shan, and A. Hauptmann (2014) Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pp. 2078–2086. Cited by: §2.4.
  • W. Jin, X. Liu, X. Zhao, Y. Ma, N. Shah, and J. Tang (2021) Automated self-supervised learning for graphs. External Links: 2106.05470 Cited by: §4.1.
  • M. R. Joglekar, C. Li, M. Chen, T. Xu, X. Wang, J. K. Adams, P. Khaitan, J. Liu, and Q. V. Le (2020) Neural input search for large scale recommendation models. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2387–2397. Cited by: §4.2.
  • W. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, L. Hong, and E. H. Chi (2020) Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems. In Companion Proceedings of the Web Conference 2020, pp. 562–566. Cited by: §4.2.
  • F. Khawar, X. Hang, R. Tang, B. Liu, Z. Li, and X. He (2020) AutoFeature: searching for feature interactions and their architectures for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 625–634. Cited by: §4.2.
  • A. Kirillov, R. Girshick, K. He, and P. Dollár (2019) Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6399–6408. Cited by: §1.
  • H. Lee, J. Im, S. Jang, H. Cho, and S. Chung (2019) MeLU: meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1073–1082. Cited by: 2nd item.
  • C. Li, X. Yuan, C. Lin, M. Guo, W. Wu, J. Yan, and W. Ouyang (2019) Am-lfs: automl for loss function search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8410–8419. Cited by: §4.1.
  • H. Li, C. Tao, X. Zhu, X. Wang, G. Huang, and J. Dai (2020) Auto seg-loss: searching metric surrogates for semantic segmentation. arXiv preprint arXiv:2010.07930. Cited by: §4.1.
  • J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun (2018) Xdeepfm: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Cited by: §2.2.
  • B. Liu, C. Zhu, G. Li, W. Zhang, J. Lai, R. Tang, X. He, Z. Li, and Y. Yu (2020a) Autofis: automatic feature interaction selection in factorization models for click-through rate prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2636–2645. Cited by: §4.2.
  • H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2.1, §2.5, 2nd item.
  • H. Liu, X. Zhao, C. Wang, X. Liu, and J. Tang (2020b) Automated embedding size search in deep recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2307–2316. Cited by: §4.2.
  • Q. Liu and J. Lai (2020) Stochastic loss function. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 4884–4891. Cited by: 3rd item, §4.1.
  • S. Liu, C. Gao, Y. Chen, D. Jin, and Y. Li (2021) Learnable embedding sizes for recommender systems. arXiv preprint arXiv:2101.07577. Cited by: §4.2.
  • W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICML, Vol. 2, pp. 7. Cited by: §4.1.
  • Y. Liu, T. N. Pham, G. Cong, and Q. Yuan (2017) An experimental evaluation of point-of-interest recommendation in location-based social networks. Proceedings of the VLDB Endowment 10 (10), pp. 1010–1021. Cited by: §1.
  • Y. Luo, M. Wang, H. Zhou, Q. Yao, W. Tu, Y. Chen, W. Dai, and Q. Yang (2019) Autocross: automatic feature crossing for tabular data in real-world applications. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1936–1945. Cited by: §4.2.
  • H. T. Nguyen, M. Wistuba, J. Grabocka, L. R. Drumond, and L. Schmidt-Thieme (2017) Personalized deep learning for tag recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, Cited by: §1.
  • H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pp. 4095–4104. Cited by: §2.1, §2.5, §2.5.
  • X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand (2019) Basnet: boundary-aware salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7479–7489. Cited by: §4.1.
  • Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, and J. Wang (2016) Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1149–1154. Cited by: §2.2.4, §2.2, §3.3.
  • M. A. Rahman and Y. Wang (2016) Optimizing intersection-over-union in deep neural networks for image segmentation. In International symposium on visual computing, pp. 234–244. Cited by: §4.1.
  • L. Ravi and S. Vairavasundaram (2016) A collaborative location based travel recommendation system through enhanced rating prediction for the group of users. Computational intelligence and neuroscience 2016. Cited by: §1.
  • S. Rendle (2010) Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995–1000. Cited by: §2.2.2, §2.2.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.1.
  • Q. Song, D. Cheng, H. Zhou, J. Yang, Y. Tian, and X. Hu (2020) Towards automated neural interaction discovery for click-through rate prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 945–955. Cited by: §4.2.
  • W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, and J. Tang (2019) Autoint: automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170. Cited by: §3.5.
  • Y. K. Tan, X. Xu, and Y. Liu (2016) Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 17–22. Cited by: §2.2.4.
  • M. Tsang, D. Cheng, H. Liu, X. Feng, E. Zhou, and Y. Liu (2020) Feature interaction interpretability: a case for explaining ad-recommendation systems via neural interaction detection. arXiv preprint arXiv:2006.10966. Cited by: §4.2.
  • X. Wang, S. Wang, C. Chi, S. Zhang, and T. Mei (2020) Loss function search for face recognition. In International Conference on Machine Learning, pp. 10029–10038. Cited by: §4.1.
  • S. Wu, W. Ren, C. Yu, G. Chen, D. Zhang, and J. Zhu (2016a) Personal recommendation using deep recurrent neural networks in netease. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pp. 1218–1229. Cited by: §1.
  • Z. Wu, C. Shen, and A. v. d. Hengel (2016b) Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885. Cited by: §4.1.
  • Y. Xiong, R. Liao, H. Zhao, R. Hu, M. Bai, E. Yumer, and R. Urtasun (2019) Upsnet: a unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8818–8826. Cited by: §1.
  • H. Xu, H. Zhang, Z. Hu, X. Liang, R. Salakhutdinov, and E. Xing (2018) Autoloss: learning discrete schedules for alternate optimization. arXiv preprint arXiv:1810.02442. Cited by: §4.1.
  • N. Xue, B. Liu, H. Guo, R. Tang, F. Zhou, S. P. Zafeiriou, Y. Zhang, J. Wang, and Z. Li (2020) AutoHash: learning higher-order feature interactions for deep ctr prediction. IEEE Transactions on Knowledge and Data Engineering. Cited by: §4.2.
  • S. Zhang, L. Yao, A. Sun, and Y. Tay (2019) Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys (CSUR) 52 (1), pp. 1–38. Cited by: §1.
  • X. Zhao, C. Gu, H. Zhang, X. Yang, X. Liu, H. Liu, and J. Tang (2021) DEAR: deep reinforcement learning for online advertising impression in recommender systems. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 750–758. Cited by: §1.
  • X. Zhao, H. Liu, H. Liu, J. Tang, W. Guo, J. Shi, S. Wang, H. Gao, and B. Long (2020a) Memory-efficient embedding for recommendations. arXiv preprint arXiv:2006.14827. Cited by: §4.2.
  • X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, and J. Tang (2020b) Autoemb: automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252. Cited by: §4.2.
  • X. Zhao, L. Xia, Z. Ding, D. Yin, and J. Tang (2019) Toward simulating environments in reinforcement learning based recommendations. arXiv preprint arXiv:1906.11462. Cited by: §1.
  • X. Zhao, L. Xia, L. Zhang, Z. Ding, D. Yin, and J. Tang (2018a) Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Recommender Systems Conference, pp. 95–103. Cited by: §1.
  • X. Zhao, L. Xia, L. Zou, H. Liu, D. Yin, and J. Tang (2020c) Whole-chain recommendations. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1883–1891. Cited by: §1.
  • X. Zhao, T. Xu, Q. Liu, and H. Guo (2016) Exploring the choice under conflict for social event participation. In International Conference on Database Systems for Advanced Applications, pp. 396–411. Cited by: §1.
  • X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin (2018b) Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1040–1048. Cited by: §1.
  • X. Zhao, X. Zheng, X. Yang, X. Liu, and J. Tang (2020d) Jointly learning to recommend and advertise. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3319–3327. Cited by: §1.
  • J. Zhu, J. Liu, S. Yang, Q. Zhang, and X. He (2020) FuxiCTR: an open benchmark for click-through rate prediction. arXiv preprint arXiv:2009.05794. Cited by: §3.3.
  • L. Zou, L. Xia, Y. Gu, X. Zhao, W. Liu, J. X. Huang, and D. Yin (2020) Neural interactive collaborative filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 749–758. Cited by: §1.