1. Introduction
In the era of information explosion, recommender systems play a pivotal role in alleviating information overload and vastly enhance user experiences in many commercial applications, such as generating playlists in video and music services (Zhao et al., 2021, 2020d), recommending products in online stores (Zhao et al., 2020c; Zou et al., 2020; Fan et al., 2020; Zhao et al., 2018a, b), and suggesting locations for geo-social events (Liu et al., 2017; Zhao et al., 2016; Guo et al., 2016). With the recent growth of deep learning techniques, there has been increasing interest in developing deep recommender systems (DRS) (Nguyen et al., 2017; Wu et al., 2016a). DRS have improved recommendation quality since they can effectively learn feature representations and capture the nonlinear relationships between users and items via deep architectures (Zhang et al., 2019). Aside from developing sophisticated neural network architectures, well-designed loss functions have also been demonstrated to be effective in improving the performance in different recommendation tasks, such as item rating prediction (regression) (Ravi and Vairavasundaram, 2016), click-through rate prediction (binary classification) (Guo et al., 2017; Ge et al., 2021), user behavior prediction (multi-class classification) (Zhao et al., 2019), and item retrieval (clustering) (Gao et al., 2020).

To optimize DRS frameworks, most existing works are based on a predefined and fixed loss function, such as mean-squared-error (MSE) or mean-absolute-error (MAE) loss for regression tasks. DRS frameworks are then optimized via back-propagation, which computes gradients effectively and efficiently to minimize the given loss on the training dataset. During this process, the key step is to calculate gradients of network parameters for minimizing loss functions. However, it is often unclear whether the gradients generated from a given loss function are optimal. For example, in regression tasks, the MSE loss can ensure that the trained model has no outlier predictions with huge errors, while MAE performs better if we only want a well-rounded model that performs well on the majority (Fahrmeir et al., 2007; Chatterjee and Hadi, 2015). Therefore, solely utilizing a predefined and fixed loss function for all data examples, i.e., user-item interactions, cannot guarantee the optimal gradients throughout, especially when the interactions have varied convergence behaviors in the non-stationary environment of online recommendation platforms. In addition, there is often a gap between the model training and evaluation performance in real-world recommender systems. For instance, we usually train a predictive model by minimizing cross-entropy loss in online advertising, and evaluate the model performance by click-through rate (CTR). Consequently, it naturally raises a question: can we incorporate more loss functions in the training phase to enhance the model performance?

Efforts have been made to develop strategies to fuse multiple loss functions, which can take advantage of multiple loss functions in a weighted sum fashion. For example, Panoptic FPN (Kirillov et al., 2019) leverages a grid search to find better loss weights; and UPSNet (Xiong et al., 2019) carefully investigates the weighting scheme of loss functions. However, these works rely on exhaustively or manually searching for loss weights in a large candidate space, which is extremely costly in both computing power and time. Also, they aim to learn a set of unified and static weights over the loss functions, which entirely overlooks the different convergence behaviors of data examples. Finally, the loss weights must be retrained whenever the DRS framework or the recommendation dataset changes, which reduces their generalizability and transferability.

In order to obtain more accurate gradients to improve the recommendation performance and the training efficiency, we propose an automated loss function search framework, AutoLoss, which can dynamically and adaptively select appropriate loss functions for training DRS frameworks. Different from existing models with predefined and fixed loss functions, or with loss weights that are exhaustively or manually searched, the optimal loss function in AutoLoss is automatically selected for each data example in a differentiable manner. The experiments on two datasets demonstrate the effectiveness of the proposed framework. We summarize our major contributions as follows:

• We propose an end-to-end framework, AutoLoss, which can automatically select the proper loss functions for training DRS frameworks with better recommendation performance and training efficiency;

• A novel controller network is developed to adaptively adjust the probabilities over multiple loss functions according to different data examples’ dynamic convergence behaviors during training, which enhances the model generalizability between different DRS frameworks and datasets;

• We empirically demonstrate the effectiveness of the proposed framework on real-world benchmark datasets. Extensive studies verify the importance of model components and the transferability of AutoLoss.

The rest of this paper is organized as follows. In Section 2, we detail the framework of automatically searching the probabilities over multiple loss functions, the architecture of the main DRS network and controller network, and propose an AutoML-based optimization algorithm. Section 3 carries out experiments based on real-world datasets and presents experimental results. Section 4 briefly reviews related work. Finally, Section 5 concludes this work and discusses future work.
2. The Proposed Framework
In this section, we will present an end-to-end framework, AutoLoss, which effectively tackles the aforementioned challenges in Section 1 via automatically and adaptively searching the optimal loss function from several candidates according to data examples’ convergence behaviors. We will first provide an overview of the framework; next detail the architectures of the main DRS network; then introduce the loss function search method with a novel controller network; and finally provide an AutoML-based optimization algorithm.
2.1. An Overview
In this subsection, we will give an overview of the AutoLoss framework. AutoLoss aims to automatically select appropriate loss functions from a set of candidates for different data examples (i.e., user-item interactions). We illustrate the framework in Figure 1. With a DRS network, a controller network and a set of predefined candidate loss functions, the learning process of AutoLoss mainly consists of two major steps.

The forward-propagation step. Given a mini-batch of data examples, the main DRS network first generates predictions $\hat{y}$ based on the input features $x$. Then, we can calculate the losses $\ell_1, \dots, \ell_n$ for each candidate loss function according to the ground truth labels $y$ and predictions $\hat{y}$. Meanwhile, the controller network takes $(y, \hat{y})$ and outputs the probabilities $p$ over loss functions for each data example. Finally, the overall loss $\mathcal{L}$ can be calculated according to the candidate losses $\ell_1, \dots, \ell_n$ and the probabilities $p$.

The backward-propagation step. We first fix the parameters of the controller network and update the main DRS network parameters upon the training data examples. Then, we fix the DRS parameters and optimize the controller network parameters based on a mini-batch of validation data examples. This alternating updating approach enhances the generalizability, and prevents AutoLoss from selecting probabilities that overfit the training data examples (Pham et al., 2018; Liu et al., 2018). Next, we will introduce the details of AutoLoss.
2.2. Deep Recommender System Network
AutoLoss is quite general for most existing deep recommender system frameworks (Rendle, 2010; Guo et al., 2017; Lian et al., 2018; Qu et al., 2016). As visualized in Figure 2, they typically have four components: embedding layer, interaction layer, MLP layer and output layer. We now briefly introduce these components.
2.2.1. Embedding Layer
The raw input features of users and items are usually categorical or numeric, and in the form of multiple fields. Most DRS works first transform the input features into binary vectors, and then embed them into continuous vectors using a field-wise embedding. In this way, a user-item interaction data example $x$ can be represented as the concatenation of binary vectors from all feature fields:

$$x = [x_1, x_2, \dots, x_M]$$

where $M$ is the number of feature fields and $x_m$ is the binary vector of the $m$-th field. The categorical data are transformed into binary vectors via one-hot encoding, e.g., $[1, 0]$ and $[0, 1]$ for the two values of a two-valued field. The numeric data are first partitioned into buckets, and then we have a binary vector for each bucket, e.g., for an age field split into four ranges we can use $[1, 0, 0, 0]$ for child, $[0, 1, 0, 0]$ for youth, $[0, 0, 1, 0]$ for adult, and $[0, 0, 0, 1]$ for seniors.

Since the vector $x$ is high-dimensional and very sparse, and different feature fields have various lengths, DRS models usually introduce an embedding layer to transform each binary vector $x_m$ into a low-dimensional continuous vector as:

$$e_m = U_m x_m \tag{1}$$

where $U_m \in \mathbb{R}^{d \times k_m}$ is the weight matrix with $k_m$ the number of unique feature values in the $m$-th feature field, and $d$ is the predefined size of low-dimensional vectors. For multi-valued features (e.g., “Interest=Movie, Sports”), the feature embedding is the sum or average of multiple embeddings (Covington et al., 2016). Finally, the embedding layer will output the concatenation of embedding vectors from all feature fields:

$$E = [e_1, e_2, \dots, e_M] \tag{2}$$
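The encoding and embedding steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the bucket edges, field sizes, embedding size and random initialization are assumptions made for the example, not values from the paper.

```python
import random

def one_hot(index, size):
    """Binary vector with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def bucketize(value, boundaries):
    """Index of the bucket containing `value` (each boundary is an exclusive upper edge)."""
    for i, upper in enumerate(boundaries):
        if value < upper:
            return i
    return len(boundaries)

def embed(binary_vec, weight):
    """Multiply a d x k weight matrix by a k-dim binary vector.
    For a one-hot input this simply selects one column of `weight`."""
    return [sum(w * x for w, x in zip(row, binary_vec)) for row in weight]

random.seed(0)
d = 4                          # embedding size (hypothetical)
field_sizes = [2, 4]           # a two-valued field and a four-bucket numeric field
weights = [[[random.gauss(0, 0.1) for _ in range(k)] for _ in range(d)]
           for k in field_sizes]

age_boundaries = [15, 25, 65]  # hypothetical bucket edges, for illustration only
x = [one_hot(1, 2), one_hot(bucketize(30, age_boundaries), 4)]

# Concatenate the per-field embeddings, as in the output of the embedding layer.
E = [value for vec, weight in zip(x, weights) for value in embed(vec, weight)]
```

In practice the matrix product is replaced by an index lookup, since a one-hot input only ever selects a single column.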
2.2.2. Interaction Layer
After representing the input features as low-dimensional embeddings, DRS models usually develop an interaction layer to explicitly capture the interactions among feature fields. The most widely used method is the factorization machine (FM) (Rendle, 2010). In addition to the linear interactions among features, FM can explicitly model the pairwise (second-order) feature interactions via the inner product of feature embeddings:

$$[\langle e_1, e_2 \rangle, \langle e_1, e_3 \rangle, \dots, \langle e_{M-1}, e_M \rangle] \tag{3}$$

where $\langle e_i, e_j \rangle$ is the inner product of two embeddings, and the number of pairwise feature interactions is $\binom{M}{2} = M(M-1)/2$. Then, the interaction layer will output:

$$l_{fm} = \langle w, x \rangle + \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} \langle e_i, e_j \rangle \tag{4}$$

where $w$ is the weight over the binary vector $x$ of input features. The first term represents the impact of first-order feature interactions, and the second term reflects the impact of second-order feature interactions. FM can explicitly model even higher-order interactions, such as third-order ones, but this adds substantial computation.
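As a rough illustration of the FM interaction described above, the following sketch computes the first-order term and all pairwise inner products for a toy example with three fields. The function name and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def fm_interaction(x, w, embeddings):
    """Return (FM output, list of pairwise inner products).

    x          : concatenated binary input vector
    w          : first-order weights, same length as x
    embeddings : one embedding vector per feature field
    """
    first_order = dot(w, x)
    pairwise = [dot(e_i, e_j) for e_i, e_j in combinations(embeddings, 2)]
    return first_order + sum(pairwise), pairwise

# Three two-valued fields, one-hot encoded and concatenated.
x = [1, 0, 0, 1, 1, 0]
w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

fm_out, pairwise = fm_interaction(x, w, embeddings)
```

With three fields there are 3 * 2 / 2 = 3 pairwise terms; the count grows quadratically with the number of fields, which is why higher-order terms become expensive.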
2.2.3. MLP Layer
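The MLP layer combines and transforms the features, e.g., the embeddings and their pairwise interactions, with several fully-connected layers and activations. The output of each layer is:

$$h_{l+1} = \sigma(W_l h_l + b_l) \tag{5}$$

where $\sigma$ is the activation function, $W_l$ is the weight matrix and $b_l$ is the bias vector for the $l$-th hidden layer. $h_0$ is the input of the first fully-connected layer, and we denote the final output of the MLP layer as $h_{MLP}$.

2.2.4. Output Layer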
Finally, the output layer, which follows the previous layers, generates the prediction of a user-item interaction data example. The input $h$ of the output layer can differ across DRS models, e.g., the concatenation of the interaction-layer output and the MLP output in DeepFM (Guo et al., 2017), and the MLP output alone in IPNN (Qu et al., 2016), as shown in Figure 2. The output layer will yield the prediction of the user-item interaction as:

$$\hat{y} = \sigma(W_o h + b_o) \tag{6}$$

where $W_o$ and $b_o$ are the weight matrix and bias vector for the output layer. The activation function $\sigma(\cdot)$ is selected based on different recommendation tasks, such as sigmoid for binary classification (Guo et al., 2017), and softmax for multi-class classification (Tan et al., 2016). Finally, given a set of candidate loss functions $\ell_1, \dots, \ell_n$, such as mean-squared-error, categorical hinge and cross-entropy, we can compute the candidate losses:

$$\mathcal{L}_i = \ell_i(y, \hat{y}), \quad i = 1, \dots, n \tag{7}$$

where $y$ is the ground truth label and $n$ is the number of candidate loss functions.
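To make the notion of candidate losses concrete, the sketch below evaluates several common losses on a single binary prediction. The particular formulations (in particular how the hinge loss maps a probability to a margin, and the focal-loss exponent) are assumptions for illustration and may differ from the implementations used in the experiments.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy for label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mse(y, p):
    """Squared error between label and predicted probability."""
    return (y - p) ** 2

def hinge(y, p):
    """Hinge loss after mapping the label to {-1, +1} and p to a score in [-1, 1]
    (this mapping is an assumption of the sketch)."""
    return max(0.0, 1.0 - (2 * y - 1) * (2 * p - 1))

def focal(y, p, gamma=2.0, eps=1e-12):
    """Focal loss: cross-entropy down-weighted for well-classified examples."""
    p_true = p if y == 1 else 1 - p
    p_true = min(max(p_true, eps), 1.0)
    return -((1 - p_true) ** gamma) * math.log(p_true)

CANDIDATES = [cross_entropy, mse, hinge, focal]

def candidate_losses(y, p):
    """Evaluate every candidate loss on one (label, prediction) pair."""
    return [loss(y, p) for loss in CANDIDATES]

losses = candidate_losses(1, 0.5)
```

The candidates live on different scales for the same prediction, which is one reason a fixed weighting across all examples can be hard to tune.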
2.3. Loss Function Search
AutoLoss aims to adaptively and automatically search the optimal loss function, which can enhance the prediction quality and training efficiency of the DRS network. This is naturally challenging because of the complex relationship between the DRS parameters and the probabilities over candidate loss functions. To address this challenge, many existing works have focused on developing fusing strategies for multiple loss functions, which can take advantage of multiple loss functions in a weighted sum manner:

$$\mathcal{L}(y, \hat{y}) = \sum_{i=1}^{n} \alpha_i \cdot \ell_i(y, \hat{y}) \tag{8}$$

$$\text{s.t.} \quad \sum_{i=1}^{n} \alpha_i = 1, \quad \alpha_i > 0 \;\; \forall i \in [1, n]$$

where $y$ is the ground truth, $\hat{y}$ is the prediction from the DRS network, and $\ell_i$ is the $i$-th candidate loss function. The continuous loss weights $[\alpha_1, \dots, \alpha_n]$ measure the candidates’ contributions in the final loss function $\mathcal{L}$. However, this method relies on an exhaustive or manual search of loss weights over a large search space, which is extremely costly. Also, this soft fusing strategy cannot completely eliminate the impact of suboptimal candidate loss functions on the final loss function $\mathcal{L}$; thus, a hard selection method is desired. However, hard selection usually makes the training framework no longer end-to-end differentiable.
Reinforcement learning (RL) is a potential solution to tackle the hard selection problem. However, since RL is generally built upon the Markov decision process, it utilizes temporal-difference learning to make sequential actions. Consequently, the agent cannot receive the reward until the optimal loss function is selected and the DRS is evaluated. In other words, the temporal-difference setting can suffer from delayed rewards. To address this issue, we introduce the Gumbel-softmax operation to simulate the hard selection over candidate loss functions, where the non-differentiable sampling from a categorical distribution is approximated by a differentiable sampling from the Gumbel-softmax distribution (Jang et al., 2016).
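Given the continuous loss weights $[\alpha_1, \dots, \alpha_n]$ over candidate loss functions, we can draw a hard selection $k$ through the Gumbel-max trick (Gumbel, 1948) as:

$$k = \arg\max_{i \in [1, n]} \left( \log \alpha_i + G_i \right) \tag{9}$$

where $G_i = -\log(-\log u_i)$ and $u_i \sim \mathrm{Uniform}(0, 1)$. The independent and identically distributed (i.i.d.) Gumbel noises $G_i$ perturb the $\log \alpha_i$ terms. Also, they make the $\arg\max$ operation equivalent to drawing a sample from the loss weights $[\alpha_1, \dots, \alpha_n]$. However, because of the $\arg\max$ operation, this sampling method is non-differentiable. We tackle this problem with the straight-through Gumbel-softmax (Jang et al., 2016), which leverages a softmax function as a differentiable approximation to the $\arg\max$ operation:

$$p_i = \frac{\exp\left( (\log \alpha_i + G_i) / \tau \right)}{\sum_{j=1}^{n} \exp\left( (\log \alpha_j + G_j) / \tau \right)} \tag{10}$$

where $p_i$ is the probability of selecting the $i$-th candidate loss function. The temperature parameter $\tau$ is introduced to manage the smoothness of the Gumbel-softmax operation’s output. Specifically, the output approaches a one-hot vector as $\tau$ approaches zero. Then the final loss function $\mathcal{L}$ can be reformulated as:

$$\mathcal{L}(y, \hat{y}) = \sum_{i=1}^{n} p_i \cdot \ell_i(y, \hat{y}) \tag{11}$$

In conclusion, the loss function search process becomes end-to-end differentiable by introducing the Gumbel-softmax operation, while retaining a selection behavior close to hard selection. Next, we will discuss how to generate data example-level loss weights $[\alpha_1, \dots, \alpha_n]$.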
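The two sampling schemes discussed in this subsection can be sketched with the standard library alone. The sketch assumes the usual formulation of the Gumbel-max trick and the Gumbel-softmax relaxation (Jang et al., 2016); the function names, the example weights and the loss values are hypothetical, and no gradients are computed here.

```python
import math
import random

def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample as -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = min(max(rng.random(), 1e-12), 1 - 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))

def gumbel_max(alpha, rng):
    """Hard selection: index of the largest perturbed log-weight."""
    scores = [math.log(a) + sample_gumbel(rng) for a in alpha]
    return max(range(len(alpha)), key=scores.__getitem__)

def gumbel_softmax(alpha, tau, rng):
    """Soft selection: temperature-scaled softmax over perturbed log-weights."""
    scores = [(math.log(a) + sample_gumbel(rng)) / tau for a in alpha]
    top = max(scores)                              # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_loss(probs, losses):
    """Probability-weighted sum of candidate losses."""
    return sum(p * l for p, l in zip(probs, losses))

rng = random.Random(0)
alpha = [0.7, 0.2, 0.1]                  # hypothetical loss weights
probs = gumbel_softmax(alpha, tau=0.5, rng=rng)
loss = mixed_loss(probs, [0.69, 0.25, 1.0])  # hypothetical candidate losses
```

Repeated calls to `gumbel_max` select index `i` with frequency close to `alpha[i]`, while lowering `tau` in `gumbel_softmax` pushes `probs` toward a one-hot vector.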
2.4. The Controller Network
As in Eq. (8), we suppose that $[\alpha_1, \dots, \alpha_n]$ are the original (continuous) class probabilities over candidate loss functions before the Gumbel-softmax operation. This assumption aims to learn a set of unified and static probabilities over the candidate loss functions. However, the environment of real-world commercial recommendation platforms is always non-stationary, and different user-item interaction examples have varying convergence behaviors. This cannot be handled by unified and static probabilities, resulting in suboptimal model performance, generalizability and transferability.

We propose a controller network to address this challenge, which learns to generate the original class probabilities for each data example. Motivated by curriculum learning (Bengio et al., 2009; Jiang et al., 2014), the original class probabilities should be generated according to the ground truth labels $y$ and the output of the DRS network $\hat{y}$. Therefore, the input of the controller network is a mini-batch $\{(y_b, \hat{y}_b)\}_{b=1}^{B}$, followed by an MLP with several fully-connected layers as in Eq. (5). Afterwards, the controller’s output layer generates continuous class probabilities $[\alpha_1^b, \dots, \alpha_n^b]$ for each data example in the mini-batch via a standard softmax activation, where $B$ is the size of the mini-batch. In other words, each data example has its own probabilities. The controller can enhance the recommendation quality, model generalizability and transferability, which is validated by the extensive experiments.
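A minimal forward pass of such a controller might look as follows. This is a sketch under simplifying assumptions: a single hidden layer, a hypothetical hidden size and initialization, scalar labels and predictions, and no batch normalization or dropout; the class and function names are illustrative only.

```python
import math
import random

def linear(x, weight, bias):
    """Fully-connected layer: weight is a list of rows, one per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weight = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

class Controller:
    """Maps each (label, prediction) pair to probabilities over candidate losses."""

    def __init__(self, n_losses, hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1, self.b1 = init_layer(2, hidden, rng)
        self.w2, self.b2 = init_layer(hidden, n_losses, rng)

    def forward(self, batch):
        """batch: list of (y, y_hat) pairs -> one probability vector per example."""
        outputs = []
        for y, y_hat in batch:
            hidden = relu(linear([y, y_hat], self.w1, self.b1))
            outputs.append(softmax(linear(hidden, self.w2, self.b2)))
        return outputs

controller = Controller(n_losses=4)
alphas = controller.forward([(1, 0.9), (0, 0.2), (1, 0.1)])
```

Each row of `alphas` is a separate probability vector, so an example that is already well predicted can receive different loss weights than one that is badly mispredicted.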
2.5. An Optimization Method
In the above subsections, we formulate the loss function search as an architectural optimization problem and introduce the Gumbel-softmax operation that makes the framework end-to-end differentiable. Now, we discuss the optimization of the AutoLoss framework.

In AutoLoss, the parameters to be optimized are from two networks. We denote the main DRS network’s parameters as $W$, and the controller network’s parameters as $V$. Note that the probabilities $p$ are directly generated by the Gumbel-softmax operation based on the controller’s output as in Eq. (10). Inspired by automated machine learning techniques (Pham et al., 2018), $W$ and $V$ should not be updated on the same training data batch as in traditional supervised learning methods. This is because their optimizations are highly dependent on each other. As a result, updating $W$ and $V$ on the same training batch can lead to the model overfitting the training examples.

According to the end-to-end differentiable property of AutoLoss, we update $W$ and $V$ through gradient descent utilizing the differentiable architecture search (DARTS) techniques (Liu et al., 2018). To be specific, $W$ and $V$ are alternately updated on training and validation batches by minimizing the training loss $\mathcal{L}_{train}$ and validation loss $\mathcal{L}_{val}$, respectively. This forms a bilevel optimization problem (Pham et al., 2018), where controller parameters $V$ and DRS parameters $W$ are considered as the upper- and lower-level variables:
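$$\min_{V} \; \mathcal{L}_{val}\left( W^*(V), V \right) \quad \text{s.t.} \quad W^*(V) = \arg\min_{W} \mathcal{L}_{train}(W, V) \tag{12}$$

where directly optimizing the controller parameters $V$ thoroughly via Eq. (12) is intractable since the inner optimization of the DRS parameters $W$ is extremely costly. To tackle this issue, we use an approximation scheme for the inner optimization:

$$W^*(V) \approx W - \xi \nabla_{W} \mathcal{L}_{train}(W, V) \tag{13}$$

where $\xi$ is the predefined learning rate. This approximation scheme estimates $W^*(V)$ by descending only one step along the gradient $\nabla_{W} \mathcal{L}_{train}(W, V)$, rather than optimizing $W$ thoroughly. To further enhance the computation efficiency, we can set $\xi = 0$, i.e., the first-order approximation.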
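For readers who want to see what the one-step approximation implies for the controller update, the following derivation follows the standard DARTS treatment (Liu et al., 2018). The notation is assumed here: $W$ denotes the DRS parameters, $V$ the controller parameters, $W' = W - \xi \nabla_{W} \mathcal{L}_{train}(W, V)$ the one-step estimate, and $\epsilon$ a small scalar. This is a sketch of the general technique, not a statement of the exact update used in Algorithm 1.

```latex
% Chain rule applied to the one-step estimate W' = W - \xi \nabla_W L_train(W, V)
\nabla_{V} \mathcal{L}_{val}(W', V)
  \;-\; \xi \, \nabla^{2}_{V, W} \mathcal{L}_{train}(W, V) \,
        \nabla_{W'} \mathcal{L}_{val}(W', V)

% Finite-difference approximation of the second-order term
\nabla^{2}_{V, W} \mathcal{L}_{train}(W, V) \, \nabla_{W'} \mathcal{L}_{val}(W', V)
  \;\approx\;
  \frac{\nabla_{V} \mathcal{L}_{train}(W^{+}, V) - \nabla_{V} \mathcal{L}_{train}(W^{-}, V)}
       {2 \epsilon},
\qquad
W^{\pm} = W \pm \epsilon \, \nabla_{W'} \mathcal{L}_{val}(W', V)
```

Setting $\xi = 0$ removes the second term, so the controller gradient reduces to $\nabla_{V} \mathcal{L}_{val}(W, V)$ evaluated at the current DRS parameters.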
We detail the AutoLoss optimization via DARTS in Algorithm 1. More specifically, in each iteration, we first sample a mini-batch of validation data examples of user-item interactions (line 2); next, we estimate (but do not update) $W^*(V)$ via the approximation scheme in Eq. (13) (line 3); then, we update the controller parameters $V$ by one step based on the estimate $W^*(V)$ (line 4); afterward, we sample a mini-batch of training data examples (line 5); and finally, we update $W$ by descending $\nabla_{W} \mathcal{L}_{train}(W, V)$ by one step (line 6).
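The alternating schedule can be illustrated on a toy problem. The sketch below is deliberately simplified and is not the paper's Algorithm 1: it assumes the first-order approximation, replaces back-propagation with numerical gradients, uses a plain softmax instead of Gumbel-softmax so the run is deterministic, and stands in a one-parameter linear model for the DRS and a four-parameter linear map for the controller. All names and hyperparameters are hypothetical.

```python
import math
import random

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def candidate_losses(y, y_hat):
    err = y - y_hat
    return [err * err, abs(err)]          # squared and absolute error as candidates

def controller(v, y, y_hat):
    """Per-example probabilities over the two candidates (linear logits)."""
    return softmax([v[0] * y + v[1] * y_hat, v[2] * y + v[3] * y_hat])

def batch_loss(w, v, batch):
    """Mean probability-weighted loss of the model y_hat = w[0] * x."""
    total = 0.0
    for x, y in batch:
        y_hat = w[0] * x
        probs = controller(v, y, y_hat)
        total += sum(p * l for p, l in zip(probs, candidate_losses(y, y_hat)))
    return total / len(batch)

def numeric_grad(f, params, eps=1e-5):
    """Central-difference gradient of f at params."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def alternate(train_set, val_set, steps=200, lr=0.05, batch_size=8, seed=0):
    rng = random.Random(seed)
    w, v = [0.0], [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        # Controller step on a validation batch, model parameters held fixed.
        val_batch = rng.sample(val_set, batch_size)
        grad_v = numeric_grad(lambda vv: batch_loss(w, vv, val_batch), v)
        v = [vi - lr * g for vi, g in zip(v, grad_v)]
        # Model step on a training batch, controller parameters held fixed.
        train_batch = rng.sample(train_set, batch_size)
        grad_w = numeric_grad(lambda ww: batch_loss(ww, v, train_batch), w)
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
    return w, v

data_rng = random.Random(1)
points = [(x, 2.0 * x) for x in (data_rng.uniform(-1, 1) for _ in range(64))]
train_set, val_set = points[:48], points[48:]
w, v = alternate(train_set, val_set)
```

The key design point carried over from the text is that the controller never sees the batch used to update the model in the same iteration, which is what the alternating schedule is meant to guarantee.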
3. Experiment
This section will conduct extensive experiments using various datasets to evaluate the effectiveness of AutoLoss. We first introduce the experimental settings, then compare AutoLoss with representative baselines, and finally conduct model component and transferability analysis.
3.1. Datasets
We evaluate our model on two datasets, Criteo and ML-20m. Below we introduce these datasets; more statistics about them can be found in Table 1.

Criteo (https://www.kaggle.com/c/criteo-display-ad-challenge/): It is a real-world commercial dataset to assess click-through rate prediction models for online ads. It consists of 45 million data examples, i.e., users’ click records on displayed ads. Each example contains 39 anonymous feature fields, where 13 fields are numerical and 26 fields are categorical. The 13 numerical fields are converted into categorical features through bucketing.

ML-20m (https://grouplens.org/datasets/movielens/20m/): This is a benchmark dataset to evaluate recommendation algorithms, which contains 20 million users’ 5-star ratings on movies. The dataset includes 27,278 movies and 138,493 users, i.e., 2 feature fields, where each user has at least 20 ratings.
3.2. Evaluation Metrics
AutoLoss is general for many recommendation tasks. To evaluate its effectiveness, we conduct binary classification (i.e., click-through rate prediction) on Criteo, and multi-class classification (i.e., 5-star ratings) on ML-20m. The two classification experiments are evaluated by AUC and Logloss, where higher AUC or lower Logloss means better performance. For multi-class classification, we evaluate the AUC in a one-vs-rest manner. It is worth noting that a slightly higher AUC or lower Logloss at the 0.001 level is considered significant in recommendations (Guo et al., 2017).
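For reference, both metrics can be computed with a few lines of plain Python. This is a small illustrative sketch: the pairwise AUC below is quadratic in the number of examples and is only suitable for tiny inputs, and taking the unweighted mean over classes in the one-vs-rest variant is an assumption, since the averaging scheme is not specified above.

```python
import math

def logloss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; labels in {0, 1}, probs in [0, 1]."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5).
    Requires at least one positive and one negative example."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(labels, prob_rows, n_classes):
    """Unweighted mean of per-class AUCs, each class scored against all others."""
    per_class = []
    for c in range(n_classes):
        binary = [1 if y == c else 0 for y in labels]
        per_class.append(auc(binary, [row[c] for row in prob_rows]))
    return sum(per_class) / len(per_class)

binary_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
binary_logloss = logloss([1, 0], [0.5, 0.5])
multi_auc = one_vs_rest_auc(
    [0, 1, 2],
    [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]],
    n_classes=3,
)
```

In the binary example, three of the four positive-negative pairs are ranked correctly, so the AUC is 0.75.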
Table 1: Statistics of the datasets.

| Data | Criteo | ML-20m |
|---|---|---|
| # Interactions | 45,840,617 | 20,000,263 |
| # Feature Fields | 39 | 2 |
| # Feature Values | 1,086,810 | 165,771 |
| # Behavior | click or not | rating 1-5 |
3.3. Implementation
We implement AutoLoss based on a public library (https://github.com/rixwew/pytorch-fm), which contains 16 representative recommendation models. We develop AutoLoss as an independent class, so we can easily apply our framework to all these models. In this paper, we only show the results on DeepFM (Guo et al., 2017) and IPNN (Qu et al., 2016) due to the page limitation. To be specific, the AutoLoss framework mainly contains two networks, i.e., the DRS network and the controller network.
Table 2: Overall performance comparison.

| Dataset | Model | Metric | Focal | KL | Hinge | CE | MeLU | BOHB | DARTS | SLF | AutoLoss |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Criteo | DeepFM | AUC | 0.8046 | 0.8042 | 0.8049 | 0.8056 | 0.8063 | 0.8065 | 0.8067 | 0.8081 | 0.8092* |
| Criteo | DeepFM | Logloss | 0.4466 | 0.4469 | 0.4463 | 0.4457 | 0.4436 | 0.4435 | 0.4433 | 0.4426 | 0.4416* |
| Criteo | IPNN | AUC | 0.8077 | 0.8072 | 0.8079 | 0.8085 | 0.8090 | 0.8092 | 0.8093 | 0.8098 | 0.8108* |
| Criteo | IPNN | Logloss | 0.4435 | 0.4437 | 0.4432 | 0.4428 | 0.4423 | 0.4422 | 0.4423 | 0.4418 | 0.4409* |
| ML-20m | DeepFM | AUC | 0.7681 | 0.7682 | 0.7685 | 0.7692 | 0.7695 | 0.7695 | 0.7696 | 0.7705 | 0.7717* |
| ML-20m | DeepFM | Logloss | 1.2320 | 1.2317 | 1.2316 | 1.2310 | 1.2307 | 1.2305 | 1.2305 | 1.2299 | 1.2288* |
| ML-20m | IPNN | AUC | 0.7721 | 0.7722 | 0.7725 | 0.7733 | 0.7735 | 0.7734 | 0.7736 | 0.7745 | 0.7756* |
| ML-20m | IPNN | Logloss | 1.2270 | 1.2269 | 1.2266 | 1.2260 | 1.2256 | 1.2257 | 1.2255 | 1.2249 | 1.2236* |

“*” indicates statistically significant improvements (two-sided t-test) over the best baseline. AUC: the higher the better; Logloss: the lower the better.
For the DRS network, (a) Embedding layer: we set the embedding size as 16 following the existing works (Zhu et al., 2020). (b) Interaction layer: we leverage factorization machine and inner product network to capture the interactions among feature fields for DeepFM and IPNN, respectively. (c) MLP layer: we have two fully-connected layers, and the layer size is 128. We also employ batch normalization, dropout and ReLU activation for both layers. (d) Output layer: the original DeepFM and IPNN are designed for click-through rate prediction, and use sigmoid activation with the negative log-likelihood function. To fit the 5-class classification task on ML-20m, we modify the output layer correspondingly, i.e., the output layer is 5-dimensional with softmax activation.

For the controller network, (a) Input layer: the inputs are the ground truth labels and the predictions from the DRS network. (b) MLP layer: we also use two fully-connected layers with the layer size 128, batch normalization, dropout and ReLU activation. (c) Output layer: the controller network will output continuous loss probabilities with softmax activation, whose dimension equals the number of candidate loss functions.

For other hyperparameters, (a) Gumbel-softmax: we use an annealing scheme for the temperature $\tau$ over the training step $t$. (b) Optimization: we use the same learning rate for updating both the DRS network and the controller network, with the Adam optimizer and a batch size of 2000. (c) We select the hyperparameters of the AutoLoss framework via cross-validation, and we also tune the parameters of the baselines correspondingly for a fair comparison.
3.4. Overall Performance Comparison
AutoLoss is compared with the following loss function design and search methods:

• Fixed loss function: the first group of baselines leverages a predefined and fixed loss function. We utilize Focal loss, KL divergence, Hinge loss and cross-entropy (CE) loss for both classification tasks.

• Fixed weights over loss functions: this group of baselines aims to learn fixed weights over the loss functions in the first group, without considering the difference among data examples. In this group, we use the meta-learning method MeLU (Lee et al., 2019), as well as automated machine learning methods BOHB (Falkner et al., 2018) and DARTS (Liu et al., 2018).

• Data example-wise loss weights: this group learns to assign different loss weights for different data examples according to their convergence behaviors. One existing work, stochastic loss function (SLF) (Liu and Lai, 2020), belongs to this group.

The overall performance is shown in Table 2. It can be observed that: (i) The first group of baselines achieves the worst recommendation performance in both recommendation tasks. Their optimizations are based on predefined and fixed loss functions during the training stage. This result demonstrates that leveraging a predefined and fixed loss function throughout can degrade the recommendation quality. (ii) The methods in the second group outperform those in the first group. These methods try to learn weights over candidate loss functions according to their contributions to the optimization, and then combine them in a weighted sum manner. This validates that incorporating multiple loss functions in optimization can enhance the performance of deep recommender systems. (iii) The second group performs worse than SLF, since the weights they learn are unified and static, which completely overlooks the various behaviors among different data examples. SLF performs better by taking this factor into account. (iv) The decision network of SLF is optimized on the same training batch as the main DRS network via back-propagation, which can lead to overfitting on the training batch. AutoLoss updates the DRS network on the training batch while updating the controller on the validation batch, which improves the model generalizability and results in better recommendation performance.

To summarize, AutoLoss achieves significantly better performance than state-of-the-art baselines on both datasets and tasks, which demonstrates its effectiveness.
3.5. Transferability Study
In this subsection, we study the transferability of the controller. Specifically, we want to investigate (i) whether the controller trained with one DRS model can be applied to other DRS models; and (ii) whether the controller learned on one dataset can be directly used on other datasets.
To study the transferability across different DRS models, we take the controller trained via DeepFM and AutoLoss on Criteo, fix its parameters, and apply it to train NFM (He and Chua, 2017) and AutoInt (Song et al., 2019) on Criteo. The results are shown in Figure 3 (a)-(d), where (i) “CE” means that we directly train the new DRS model by minimizing the cross-entropy (CE) loss, which is the best single and fixed loss function in Table 2; (ii) “SLF” means that we use the controller trained with DeepFM and SLF, which is the best baseline in Table 2; and (iii) “AutoLoss” denotes that we use the controller trained with DeepFM and AutoLoss. From the figures, we can observe that SLF performs better than CE, which indicates that a pre-trained controller can improve other DRS models’ training performance. More importantly, AutoLoss outperforms SLF, which validates AutoLoss’s better transferability across different DRS models.

To study the transferability between different datasets, we train a controller on the Criteo dataset with DeepFM and AutoLoss, fix its parameters, and apply it to train a new DeepFM on the Avazu dataset (another benchmark dataset for CTR prediction, which contains 40 million user clicking behaviors in 11 days with categorical feature fields; https://www.kaggle.com/c/avazu-ctr-prediction/), i.e., “AutoLoss”. Also, we denote (i) “CE”: DeepFM is directly optimized by minimizing the cross-entropy (CE) loss on the Avazu dataset; and (ii) “SLF”: DeepFM is optimized on the new dataset with the assistance of a controller pre-trained with DeepFM+SLF on Criteo. In Figure 3 (e)-(f), AutoLoss shows superior performance over CE and SLF, which supports its better transferability between different datasets.

In summary, AutoLoss has better transferability across different DRS models and different recommendation datasets, which improves its usability in real-world recommender systems.
3.6. Impact of Model Components
In this subsection, in order to understand the contributions of important model components of AutoLoss, we systematically eliminate each component and define the following variants:

• AL1: This variant aims to assess the contribution of the controller. Thus, we assign equal weights to the four candidate loss functions, i.e., each weight is 0.25.

• AL2: In this variant, we eliminate the Gumbel-softmax operation, and directly use the controller’s output, i.e., the continuous loss probabilities from standard softmax activation, which aims to evaluate the impact of Gumbel-softmax.

Table 3: Impact of model components.

| Dataset | Model | Metric | AL1 | AL2 | AutoLoss |
|---|---|---|---|---|---|
| Criteo | DeepFM | AUC | 0.8052 | 0.8083 | 0.8092* |
| Criteo | DeepFM | Logloss | 0.4460 | 0.4422 | 0.4416* |
| Criteo | IPNN | AUC | 0.8081 | 0.8102 | 0.8108* |
| Criteo | IPNN | Logloss | 0.4431 | 0.4416 | 0.4409* |

“*” indicates statistically significant improvements (two-sided t-test) over the best baseline. AUC: the higher the better; Logloss: the lower the better.
The results on the Criteo dataset are shown in Table 3. First, AL1 has worse performance than AutoLoss, which validates the necessity of introducing the controller network. It is noteworthy that AL1 performs worse than all loss function search methods, and even the fixed cross-entropy (CE) loss in Table 2, which indicates that equally incorporating all candidate loss functions cannot guarantee better performance. Second, AutoLoss outperforms AL2. The main reason is that AL2 always generates gradients based on all the loss functions, which introduces some noisy gradients from the suboptimal candidate loss functions. In contrast, AutoLoss can obtain appropriate gradients by filtering out suboptimal loss functions via Gumbel-softmax, which enhances the model robustness.
3.7. Efficiency Study
This section compares AutoLoss’s training efficiency with other loss function search methods, which is an important consideration when deploying a DRS model in real-world applications. Our experiments are based on one GeForce GTX 1060 GPU.

The results of DeepFM on the Criteo dataset are illustrated in Figure 4 (a). We can observe that AutoLoss achieves the fastest training speed. The reasons are two-fold. First, AutoLoss can generate the most appropriate gradients to update the DRS, which increases the optimization efficiency. Second, we update the controller once after every 7 DRS updates, i.e., the controller updating frequency is $f = 7$. This trick not only reduces the training time with fewer computations, but also enhances the performance. In Figure 4 (b)-(c), where the x-axis is the controller updating frequency, we find that DeepFM performs best at an intermediate frequency, while updating the controller too frequently or too infrequently leads to suboptimal AUC/Logloss.

To summarize, AutoLoss can efficiently achieve better performance, making it easier to deploy in real-world recommender systems.
4. Related Work
In this section, we briefly introduce the works related to our study. We first go over the latest studies in loss function search and then review works about AutoML for recommendations.
4.1. Loss Function Search
The loss function is an essential component of a deep learning framework, and its choice significantly affects the performance of the learned model. Considerable effort has been devoted to designing loss functions for specific tasks. For example, in the field of image processing, Rahman and Wang (2016) argued that the typical cross-entropy loss for semantic segmentation aligns poorly with evaluation metrics other than global accuracy. Ronneberger et al. (2015); Wu et al. (2016b) designed loss functions that take class frequency into consideration to cater to the mIoU metric. Caliva et al. (2019); Qin et al. (2019) designed losses with larger weights at boundary regions to improve the boundary F1 score. Liu et al. (2016) proposed replacing the traditional Softmax loss with the large-margin Softmax (L-Softmax) loss to improve feature discrimination in classification tasks. Fan et al. (2019) used the sphere Softmax loss for the person re-identification task and obtained state-of-the-art results. The loss functions mentioned above are all designed manually, requiring ample expert knowledge, non-trivial time, and considerable human effort.
Recently, automated loss function search has drawn increasing interest from researchers in various machine learning (ML) fields. Xu et al. (2018) investigated how to automatically schedule iterative and alternate optimization processes for ML models. Liu and Lai (2020) proposed to optimize the stochastic loss function (SLF), where the loss function of an ML model is dynamically selected; during training, the model parameters and the loss parameters are learned jointly. Li et al. (2020) proposed automatically searching for specific surrogate losses to improve different evaluation metrics in the image semantic segmentation task. Jin et al. (2021) composed multiple self-supervised learning tasks to jointly encode multiple sources of information and produce more generalizable representations, and developed two automated frameworks to search the task weights. Besides, Li et al. (2019); Wang et al. (2020) designed search spaces for a series of existing loss functions and developed algorithms to search for the best parameters of the probability distribution for sampling loss functions.
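To make the idea of a manually designed, class-frequency-aware loss more concrete, the following is a minimal sketch of an inverse-frequency weighted cross-entropy. It is a generic illustration under our own assumptions and not the specific formulation of any of the works cited above; the function names `inverse_frequency_weights` and `weighted_cross_entropy` are hypothetical.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, inversely proportional to its frequency.

    Assumption: weights follow n_samples / (n_classes * class_count), a common
    heuristic; the cited works may use different weighting schemes.
    """
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against classes absent from the batch
    return counts.sum() / (num_classes * counts)


def weighted_cross_entropy(probs, labels, class_weights, eps=1e-12):
    """Weighted mean of -log p(true class), so rare classes contribute more."""
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    example_weights = class_weights[labels]
    per_example = -example_weights * np.log(picked + eps)
    return float(per_example.sum() / example_weights.sum())


if __name__ == "__main__":
    labels = np.array([0, 0, 0, 1])            # class 1 is the rare class
    probs = np.full((4, 2), 0.5)               # an uninformative classifier
    weights = inverse_frequency_weights(labels, num_classes=2)
    print(weights, weighted_cross_entropy(probs, labels, weights))
```

The weighting itself is fixed by hand before training, which is exactly the kind of manual design choice that automated loss function search aims to replace.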
4.2. AutoML for Recommendation
AutoML techniques are now widely used to automatically design deep recommendation systems. Previous works mainly focused on the design of the embedding layer and the selection of feature interaction patterns.
In terms of the embedding layer, Joglekar et al. (2020); Zhao et al. (2020a); Liu et al. (2021) proposed novel methods to automatically select the best embedding size for different feature fields in a recommendation system. Zhao et al. (2020b); Liu et al. (2020b) proposed to dynamically search embedding sizes for users and items based on their popularity in the streaming setting. Similarly, Ginart et al. (2019) proposed to use mixed dimension embeddings for users and items based on their query frequency. Kang et al. (2020) proposed a multi-granular quantized embeddings (MGQE) technique to learn compact embeddings for infrequent items. Cheng et al. (2020) proposed to perform embedding dimension selection with a soft selection layer, making the dimension selection more flexible. Guo et al. (2020) focused on the embeddings of numerical features. They proposed AutoDis, which automatically discretizes features in numerical fields and maps the resulting categorical features into embeddings.
As for feature interaction, Luo et al. (2019) proposed AutoCross, which produces high-order cross features by performing beam search in a tree-structured feature space. Song et al. (2020); Khawar et al. (2020); Liu et al. (2020a); Xue et al. (2020) proposed to automatically discover feature interaction architectures for click-through rate (CTR) prediction. Tsang et al. (2020) proposed a method to interpret the feature interactions from a source recommendation model and apply them in a target recommendation model.
To the best of our knowledge, we are the first to investigate automated loss function search for deep recommendation systems.
5. Conclusion
We propose a novel end-to-end framework, AutoLoss, to enhance the recommendation performance and training efficiency of deep recommender systems by selecting appropriate loss functions in a data-driven manner. AutoLoss can automatically select the proper loss function for each data example according to its convergence behavior. Specifically, we first develop a novel controller network, which generates continuous loss weights based on the ground truth labels and the DRS’ predictions. Then, we introduce a Gumbel-softmax operation to simulate the hard selection over candidate loss functions, which filters out the noisy gradients from suboptimal candidates. Finally, we select the optimal candidate according to the output of the Gumbel-softmax. We conduct extensive experiments to validate the effectiveness of AutoLoss on two widely used benchmark datasets. The results show that our framework can improve recommendation performance and training efficiency with excellent transferability.
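The selection step summarized above can be illustrated with a small NumPy sketch of Gumbel-softmax weighting over candidate losses. This is only an illustration under stated assumptions, not the implementation used in our experiments: the controller is replaced by hypothetical, externally supplied `logits`, the candidates are MSE and MAE purely for illustration, and the names `gumbel_softmax`, `candidate_losses`, and `weighted_loss` as well as the temperature value are our own choices for this example.

```python
import numpy as np


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample over the last axis of `logits`."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), with u kept strictly inside (0, 1).
    u = np.clip(rng.uniform(size=logits.shape), 1e-10, 1.0 - 1e-10)
    g = -np.log(-np.log(u))
    z = (logits + g) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def candidate_losses(y_true, y_pred):
    """Per-example values of the candidate losses, shape (n_examples, 2)."""
    squared_error = (y_true - y_pred) ** 2   # illustrative candidate 1 (MSE term)
    absolute_error = np.abs(y_true - y_pred)  # illustrative candidate 2 (MAE term)
    return np.stack([squared_error, absolute_error], axis=-1)


def weighted_loss(y_true, y_pred, logits, temperature=0.1, seed=0):
    """Combine candidate losses with per-example Gumbel-softmax weights.

    `logits` stands in for the controller output (an assumption of this sketch);
    a low temperature pushes each weight vector towards a hard, one-hot selection.
    """
    rng = np.random.default_rng(seed)
    weights = gumbel_softmax(logits, temperature, rng)
    losses = candidate_losses(y_true, y_pred)
    return float((weights * losses).sum(axis=-1).mean()), weights


if __name__ == "__main__":
    y_true = np.array([1.0, 0.0, 3.0])
    y_pred = np.array([0.5, 0.2, 1.0])
    logits = np.zeros((3, 2))  # a controller with no preference between candidates
    loss, weights = weighted_loss(y_true, y_pred, logits)
    print(loss)
    print(weights)
```

Because each weight vector sums to one, the combined loss of every example lies between its smallest and largest candidate loss; lowering the temperature makes the weights closer to a hard choice of a single candidate.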
Acknowledgements
This work is supported by the National Science Foundation (NSF) under grant numbers IIS1907704, IIS1928278, IIS1714741, IIS1715940, IIS1845081, CNS1815636, and an internal research fund from the Hong Kong Polytechnic University (project no. P0036200).
References
 Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §2.4.
 Distance map loss penalty term for semantic segmentation. arXiv preprint arXiv:1908.03679. Cited by: §4.1.
 Regression analysis by example. John Wiley & Sons. Cited by: §1.
 Differentiable neural input search for recommender systems. arXiv preprint arXiv:2006.04466. Cited by: §4.2.
 Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198. Cited by: footnote 1.
 Regression. Springer. Cited by: §1.
 BOHB: robust and efficient hyperparameter optimization at scale. In International Conference on Machine Learning, pp. 1437–1446. Cited by: 2nd item.
 Attacking black-box recommendations via copying cross-domain user profiles. arXiv preprint arXiv:2005.08147. Cited by: §1.
 Spherereid: deep hypersphere manifold embedding for person re-identification. Journal of Visual Communication and Image Representation 60, pp. 51–58. Cited by: §4.1.
 Deep retrieval: an end-to-end learnable structure model for large-scale recommendations. arXiv preprint arXiv:2007.07203. Cited by: §1.
 Towards long-term fairness in recommendation. arXiv preprint arXiv:2101.03584. Cited by: §1.
 Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810. Cited by: §4.2.
 Statistical theory of extreme values and some practical applications: a series of lectures. Vol. 33, US Government Printing Office. Cited by: §2.3.
 CoSoLoRec: joint factor model with content, social, location for heterogeneous point-of-interest recommendation. In International Conference on Knowledge Science, Engineering and Management, pp. 613–627. Cited by: §1.
 AutoDis: automatic discretization for embedding numerical features in ctr prediction. arXiv preprint arXiv:2012.08986. Cited by: §4.2.
 DeepFM: a factorization-machine based neural network for ctr prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1725–1731. Cited by: §1, §2.2.4, §2.2, §3.2, §3.3.
 Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 355–364. Cited by: §3.5.
 Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §2.3.
 Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pp. 2078–2086. Cited by: §2.4.
 Automated self-supervised learning for graphs. arXiv preprint arXiv:2106.05470. Cited by: §4.1.
 Neural input search for large scale recommendation models. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2387–2397. Cited by: §4.2.
 Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems. In Companion Proceedings of the Web Conference 2020, pp. 562–566. Cited by: §4.2.
 AutoFeature: searching for feature interactions and their architectures for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 625–634. Cited by: §4.2.
 Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6399–6408. Cited by: §1.
 MeLU: meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1073–1082. Cited by: 2nd item.
 Amlfs: automl for loss function search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8410–8419. Cited by: §4.1.
 Auto segloss: searching metric surrogates for semantic segmentation. arXiv preprint arXiv:2010.07930. Cited by: §4.1.
 Xdeepfm: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Cited by: §2.2.
 Autofis: automatic feature interaction selection in factorization models for click-through rate prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2636–2645. Cited by: §4.2.
 Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §2.1, §2.5, 2nd item.
 Automated embedding size search in deep recommender systems. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2307–2316. Cited by: §4.2.
 Stochastic loss function. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 4884–4891. Cited by: 3rd item, §4.1.
 Learnable embedding sizes for recommender systems. arXiv preprint arXiv:2101.07577. Cited by: §4.2.
 Large-margin softmax loss for convolutional neural networks. In ICML, Vol. 2, pp. 7. Cited by: §4.1.
 An experimental evaluation of point-of-interest recommendation in location-based social networks. Proceedings of the VLDB Endowment 10 (10), pp. 1010–1021. Cited by: §1.
 Autocross: automatic feature crossing for tabular data in real-world applications. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1936–1945. Cited by: §4.2.
 Personalized deep learning for tag recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Cited by: §1.
 Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pp. 4095–4104. Cited by: §2.1, §2.5, §2.5.
 Basnet: boundaryaware salient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7479–7489. Cited by: §4.1.
 Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1149–1154. Cited by: §2.2.4, §2.2, §3.3.
 Optimizing intersection-over-union in deep neural networks for image segmentation. In International symposium on visual computing, pp. 234–244. Cited by: §4.1.
 A collaborative location based travel recommendation system through enhanced rating prediction for the group of users. Computational intelligence and neuroscience 2016. Cited by: §1.
 Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995–1000. Cited by: §2.2.2, §2.2.
 U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §4.1.
 Towards automated neural interaction discovery for click-through rate prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 945–955. Cited by: §4.2.
 Autoint: automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170. Cited by: §3.5.
 Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 17–22. Cited by: §2.2.4.
 Feature interaction interpretability: a case for explaining ad-recommendation systems via neural interaction detection. arXiv preprint arXiv:2006.10966. Cited by: §4.2.
 Loss function search for face recognition. In International Conference on Machine Learning, pp. 10029–10038. Cited by: §4.1.
 Personal recommendation using deep recurrent neural networks in netease. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pp. 1218–1229. Cited by: §1.
 Bridging category-level and instance-level semantic image segmentation. arXiv preprint arXiv:1605.06885. Cited by: §4.1.
 Upsnet: a unified panoptic segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8818–8826. Cited by: §1.
 Autoloss: learning discrete schedules for alternate optimization. arXiv preprint arXiv:1810.02442. Cited by: §4.1.
 AutoHash: learning higher-order feature interactions for deep ctr prediction. IEEE Transactions on Knowledge and Data Engineering. Cited by: §4.2.
 Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys (CSUR) 52 (1), pp. 1–38. Cited by: §1.
 DEAR: deep reinforcement learning for online advertising impression in recommender systems. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 750–758. Cited by: §1.
 Memory-efficient embedding for recommendations. arXiv preprint arXiv:2006.14827. Cited by: §4.2.
 Autoemb: automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252. Cited by: §4.2.
 Toward simulating environments in reinforcement learning based recommendations. arXiv preprint arXiv:1906.11462. Cited by: §1.
 Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Recommender Systems Conference, pp. 95–103. Cited by: §1.
 Whole-chain recommendations. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 1883–1891. Cited by: §1.
 Exploring the choice under conflict for social event participation. In International Conference on Database Systems for Advanced Applications, pp. 396–411. Cited by: §1.
 Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1040–1048. Cited by: §1.
 Jointly learning to recommend and advertise. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3319–3327. Cited by: §1.
 FuxiCTR: an open benchmark for click-through rate prediction. arXiv preprint arXiv:2009.05794. Cited by: §3.3.
 Neural interactive collaborative filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 749–758. Cited by: §1.