1. Introduction
Driven by recent advances in deep learning, there has been increasing interest in developing deep learning based recommender systems (DLRSs) (Zhang et al., 2017; Nguyen et al., 2017; Wu et al., 2016). DLRSs have boosted recommendation performance because of their capacity to effectively capture nonlinear user-item relationships and to learn complex abstractions as data representations (Zhang et al., 2019). A DLRS architecture typically consists of three key components: (i) embedding layers that map raw user/item features in a high-dimensional space to dense vectors in a low-dimensional embedding space, (ii) hidden layers that perform nonlinear transformations of the input features, and (iii) output layers that make predictions for specific recommendation tasks (e.g., regression and classification) based on the representations from the hidden layers. The majority of existing research has focused on designing sophisticated neural network architectures for the hidden and output layers, while the embedding layers have received little attention. However, in large-scale real-world recommender systems with numerous users and items, embedding layers play a crucial role in accurate recommendations. The most typical use of an embedding is to transform an identifier, i.e., a user-id or item-id, into a real-valued vector. Each embedding can be considered the latent representation of a specific user or item. Compared to hand-crafted features, well-learned embeddings have been demonstrated to significantly enhance recommendation performance (Cheng et al., 2016; Guo et al., 2017; Pan et al., 2018; Qu et al., 2016; Zhou et al., 2018). This is because embeddings can reduce the dimensionality of categorical variables (e.g., one-hot identifiers) and meaningfully represent users/items in the latent space. Furthermore, nearest neighbors in the embedding space can be viewed as similar users/items, whereas the one-hot space is completely uninformed: similar users/items are not projected closer to each other.

The majority of existing DLRSs adopt a unified and fixed dimensionality in their embedding layers. In other words, all users (or items) share the same fixed embedding size. This naturally raises a question: do we need different embedding dimensions for different users/items? To investigate this question, we conduct a preliminary study on the Movielens-20m dataset (https://grouplens.org/datasets/movielens/20m/). For each user, we first select a fixed part of his/her ratings (labeled interactions with items) as the test set, and then choose $x$ ratings as the training set. Figure 1 illustrates how the recommendation performance of a typical DLRS (Cheng et al., 2016) with three different embedding dimensions, in terms of mean-squared-error (MSE) and accuracy, changes when we vary $x$. Lower MSE (or higher accuracy) means better performance. Note that we refer to the number of interactions a user/item has as its popularity in this work. From the figure, as the popularity $x$ increases, (i) the performance of models with different embedding dimensions improves, but larger embedding dimensions gain more; and (ii) smaller embedding dimensions work better at first and are then outperformed by larger embedding dimensions. These observations are expected, since the embedding size often determines the number of model parameters to learn and the capacity of the embedding to encode information. On the one hand, smaller embedding dimensions often mean fewer model parameters and lower capacity. Thus, they can work well when the popularity is small. However, their limited capacity restricts performance as popularity increases and the embedding needs to encode more information. On the other hand, larger embedding dimensions usually indicate more model parameters and higher capacity. They typically need sufficient data to be well trained. Therefore, they cannot work well when the popularity is small, but they have the potential to capture more information as the popularity increases. Given that users/items have very different popularity in a recommender system, DLRSs should allow different embedding dimensions. This property is highly desired in practice, since real-world recommender systems are streaming and popularity is highly dynamic. For example, new interactions occur rapidly and new users/items are continuously added.
In this paper, we aim to enable different embedding dimensions for different users/items at the embedding layer under the streaming setting. We face two major challenges. First, the number of users/items in real-world recommender systems is very large and their popularity is highly dynamic, so it is hard, if not impossible, to manually select different dimensions for different users/items. Second, the input dimension of the first hidden layer in existing DLRSs is often unified and fixed, so it is difficult for them to accept different dimensions from the embedding layers. Our attempt to solve these challenges leads to an end-to-end differentiable AutoML based framework (AutoEmb), which can make use of various embedding dimensions in an automated and dynamic manner. Our experiments on real-world e-commerce data demonstrate the effectiveness of the proposed framework.
2. Framework
As discussed in Section 1, assigning different embedding sizes to different users/items in the streaming recommendation setting faces two challenges. To address these challenges, we propose an AutoML based end-to-end framework, which can automatically and dynamically leverage embeddings with various dimensions. In the following, we first illustrate a widely used DLRS as our basic architecture; we then enhance it to enable various embedding dimensions; next, we discuss how to automatically and dynamically select among the various dimensions; and finally, we provide an AutoML based optimization algorithm for streaming recommendation.
2.1. A Basic DLRS Architecture
We illustrate a basic DLRS architecture in Figure 2, which contains three components: (i) the embedding layers that map user/item IDs ($u_i$, $v_j$) into dense and continuous-valued embedding vectors ($\mathbf{e}_i$, $\mathbf{e}_j$), (ii) the hidden layers, which are fully-connected layers that nonlinearly transform the embedding vectors ($\mathbf{e}_i$, $\mathbf{e}_j$) into hierarchical feature representations, and (iii) the output layer that generates the prediction for recommendation. Given a user-item interaction, the DLRS first performs embedding lookups according to the user-id and item-id and concatenates the two embeddings; it then feeds the concatenated embedding into the hidden and output layers to make a prediction. This typical DLRS architecture is widely used in recent recommender systems (Cheng et al., 2016). However, it has a fixed neural network architecture, which cannot handle different embedding dimensions. Next, we will enhance this basic DLRS architecture to enable various embedding dimensions.

2.2. The Enhanced DLRS Model
As discussed in Section 1, shorter embeddings with fewer model parameters can generate better recommendations when the popularity is small; while with the increase of popularity, longer embeddings with more model parameters and higher capacity achieve better recommendation performance. Motivated by this observation, assigning different embedding dimensions for users/items with different popularity is highly desired. However, the basic DLRS architecture in Section 2.1 is not able to handle various embedding dimensions because of its fixed neural network architecture.
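The basic idea to address this challenge is to transform the various embedding dimensions into the same dimension, so that the DLRS can select one of the transformed embeddings according to the current user/item popularity. Figure 3 illustrates the embedding transformation and selection process. Suppose we have $N$ embedding spaces $\{\mathbf{E}^1, \dots, \mathbf{E}^N\}$, and the dimension of an embedding in each space is $d_1, \dots, d_N$, where $d_1 < d_2 < \dots < d_N$. We define $\{\mathbf{e}_i^1, \dots, \mathbf{e}_i^N\}$ as the set of embeddings for a given user $u_i$ from all embedding spaces. To unify the embedding vectors $\{\mathbf{e}_i^1, \dots, \mathbf{e}_i^N\}$, we introduce a component with $N$ fully-connected layers, which transform them into the same dimension $d_N$:

$$\tilde{\mathbf{e}}_i^k = \mathbf{W}_k \mathbf{e}_i^k + \mathbf{b}_k, \qquad k = 1, \dots, N \tag{1}$$

where $\mathbf{W}_k \in \mathbb{R}^{d_N \times d_k}$ are weight matrices and $\mathbf{b}_k \in \mathbb{R}^{d_N}$ are bias vectors. After the linear transformations, we have mapped the original embedding vectors $\{\mathbf{e}_i^1, \dots, \mathbf{e}_i^N\}$ into the same dimension $d_N$. In practice, we observe that the magnitudes of the transformed embeddings vary significantly, which makes them incomparable. To tackle this challenge, we apply BatchNorm (Ioffe and Szegedy, 2015) with the Tanh activation (Karlik and Olgac, 2011) to the transformed embeddings as follows:

$$\hat{\mathbf{e}}_i^k = \tanh\left(\frac{\tilde{\mathbf{e}}_i^k - \mu_{\mathcal{B}}^k}{\sqrt{(\sigma_{\mathcal{B}}^k)^2 + \epsilon}}\right), \qquad k = 1, \dots, N \tag{2}$$

where $\mu_{\mathcal{B}}^k$ is the mini-batch mean and $(\sigma_{\mathcal{B}}^k)^2$ is the mini-batch variance for $\tilde{\mathbf{e}}_i^k$, and $\epsilon$ is a constant added to the mini-batch variance for numerical stability. The Tanh function maps the normalized embedding into $(-1, 1)$. After BatchNorm and activation, the linearly transformed embeddings become magnitude-comparable embedding vectors $\{\hat{\mathbf{e}}_i^1, \dots, \hat{\mathbf{e}}_i^N\}$ with the same dimension. Given an item $v_j$, we conduct the same transformations as in Equations (1) and (2) on its embeddings $\{\mathbf{e}_j^1, \dots, \mathbf{e}_j^N\}$ and obtain magnitude-comparable ones $\{\hat{\mathbf{e}}_j^1, \dots, \hat{\mathbf{e}}_j^N\}$ that share the same dimension.

According to the popularity, the DLRS will select a pair of transformed embeddings $\{\hat{\mathbf{e}}_i^{*}, \hat{\mathbf{e}}_j^{*}\}$ as the representations of the user $u_i$ and item $v_j$:

$$\mathbf{p}_i = \hat{\mathbf{e}}_i^{*}, \qquad \mathbf{q}_j = \hat{\mathbf{e}}_j^{*} \tag{3}$$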
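As an illustration of the transformation in Equations (1) and (2), the following minimal Python sketch maps embeddings of different sizes to one common size and then applies mini-batch normalization followed by Tanh. The embedding sizes, batch size, random initialization, and function names are assumptions made for the example, not the paper's settings, and the normalization here has no learnable scale or shift.

```python
import math
import random


def linear(W, b, e):
    """Eq. (1): map an embedding e of size d_k to size d_N via W e + b."""
    return [sum(w * x for w, x in zip(row, e)) + bias for row, bias in zip(W, b)]


def batchnorm_tanh(batch, eps=1e-5):
    """Eq. (2): per-dimension mini-batch normalization followed by tanh."""
    n, dim = len(batch), len(batch[0])
    mean = [sum(v[j] for v in batch) / n for j in range(dim)]
    var = [sum((v[j] - mean[j]) ** 2 for v in batch) / n for j in range(dim)]
    return [
        [math.tanh((v[j] - mean[j]) / math.sqrt(var[j] + eps)) for j in range(dim)]
        for v in batch
    ]


random.seed(0)
dims = [2, 4, 8]   # hypothetical d_1 < d_2 < d_N (illustrative sizes only)
d_N = dims[-1]
batch_size = 5

transformed = []   # transformed[k][i] plays the role of \hat{e}_i^k
for d_k in dims:
    # One fully-connected layer per embedding space (random weights here).
    W = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(d_N)]
    b = [0.0] * d_N
    raw = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(batch_size)]
    transformed.append(batchnorm_tanh([linear(W, b, e) for e in raw]))
```

After this step every candidate embedding has length `d_N` and entries in $(-1, 1)$, which is what makes the candidates comparable before selection.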
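The embedding size is selected by a controller that will be detailed in the next subsection. Then, we concatenate the user's and item's representations $\mathbf{p}_i$ and $\mathbf{q}_j$ into $\mathbf{h}_0 = [\mathbf{p}_i, \mathbf{q}_j]$ and feed $\mathbf{h}_0$ as the input into $M$ fully-connected hidden layers:

$$\mathbf{h}_m = f(\mathbf{W}_m \mathbf{h}_{m-1} + \mathbf{b}_m), \qquad m = 1, \dots, M \tag{4}$$

where $\mathbf{W}_m$ is the weight matrix and $\mathbf{b}_m$ is the bias vector for the $m$-th hidden layer, and $f(\cdot)$ denotes the nonlinear activation of the hidden layers. Finally, the output layer, which follows the hidden layers, generates the prediction of user $u_i$'s satisfaction with item $v_j$ as:

$$\hat{y}_{ij} = \sigma(\mathbf{W}_o \mathbf{h}_M + \mathbf{b}_o) \tag{5}$$

where $\mathbf{W}_o$ and $\mathbf{b}_o$ are the output layer's weight matrix and bias vector. The activation function $\sigma(\cdot)$ varies according to the prediction task, such as the Sigmoid function for app installation prediction (regression) (Cheng et al., 2016) and Softmax for buy-or-not prediction (classification) (Tan et al., 2016). By minimizing the loss between predictions and labels, the DLRS updates the parameters of all embeddings as well as the neural networks through back-propagation.

2.3. The Controller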
As observed in Figure 1, shorter embeddings perform better when popularity is small and longer embeddings perform better when popularity is large. Therefore, it is highly desirable to use different embedding sizes for different users and items. Given the large number of users/items and the dynamic nature of their popularity, it is hard, if not impossible, to determine embedding sizes manually.
To address this challenge, we propose an AutoML based approach to automatically determine the embedding sizes. Specifically, we design two controller networks that decide the embedding sizes for users and items separately. Figure 4 illustrates the controller network architecture. For a specific user/item, the controller's input consists of two parts: (i) the current popularity of the user/item; and (ii) contextual information such as the previous hyperparameters and the corresponding loss the user/item obtained. This contextual information can be viewed as a signal that measures whether the hyperparameters previously assigned to the user/item work well. In other words, if they work well, the new hyperparameters generated this time should be somewhat similar. The controller takes the above-mentioned inputs and transforms them via several fully-connected layers to generate hierarchical feature representations. The output layer is a Softmax layer with $N$ output units. In this work, we use $\{\alpha^1, \dots, \alpha^N\}$ to denote the output units of the controller for users and $\{\beta^1, \dots, \beta^N\}$ for items. The $n$-th unit denotes the probability of selecting the $n$-th embedding space. The embedding space is automatically selected as the one with the largest probability. It is formally defined as:

$$\mathbf{p}_i = \hat{\mathbf{e}}_i^{k}, \quad k = \arg\max_{n \in [1, N]} \alpha^n; \qquad \mathbf{q}_j = \hat{\mathbf{e}}_j^{k'}, \quad k' = \arg\max_{n \in [1, N]} \beta^n \tag{6}$$

With the controller, the task of embedding dimensionality search reduces to optimizing the controllers' parameters, so as to automatically generate suitable $\{\alpha^n\}$ or $\{\beta^n\}$ according to the popularity of a user/item.
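The controller and the hard selection of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions: the feature layout (log-popularity, previous weights, previous loss), the hidden width, the ReLU hidden activation, and the random initialization are all hypothetical choices for the example and are not taken from the paper.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [x / total for x in exps]


def dense(W, b, x, relu=False):
    """Fully-connected layer W x + b, optionally followed by ReLU."""
    out = [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]
    return [max(0.0, o) for o in out] if relu else out


def init(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    W = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
N = 3  # number of candidate embedding spaces

# Hypothetical controller input: popularity (log-scaled here by assumption),
# the N weights produced last time, and the loss obtained last time.
features = [math.log1p(42), 0.2, 0.5, 0.3, 0.18]

W1, b1 = init(8, len(features), rng)   # hidden width 8 is illustrative
W2, b2 = init(N, 8, rng)

hidden = dense(W1, b1, features, relu=True)
alpha = softmax(dense(W2, b2, hidden))          # probabilities over N spaces
k = max(range(N), key=lambda n: alpha[n])       # Eq. (6): argmax selection
```

The `argmax` in the last line is the step that blocks gradients from reaching the controller, which motivates the soft selection discussed next.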
2.4. Soft Selection
Eq. (6) performs a hard selection on the embedding spaces. In other words, each time we select only the one embedding space with the largest probability from the controller. This hard selection makes the whole framework not end-to-end differentiable. To tackle this challenge, in this work we choose a soft selection instead. For a user $u_i$, the representation is a weighted sum of $\{\hat{\mathbf{e}}_i^1, \dots, \hat{\mathbf{e}}_i^N\}$, where the weight of $\hat{\mathbf{e}}_i^n$ is the corresponding probability $\alpha^n$ from the controller; an item $v_j$ is handled in the same way with $\beta^n$. Therefore, the representations of the user and item can be reformulated as:

$$\mathbf{p}_i = \sum_{n=1}^{N} \alpha^n \, \hat{\mathbf{e}}_i^n, \qquad \mathbf{q}_j = \sum_{n=1}^{N} \beta^n \, \hat{\mathbf{e}}_j^n \tag{7}$$

With soft selection, the enhanced DLRS is end-to-end differentiable. We illustrate the whole augmented DLRS architecture in Figure 5, where we add the transformed embedding layer that performs soft selection over the embedding spaces; the selection is determined by two controllers, one for users and one for items.
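A minimal sketch of the soft selection in Eq. (7), written as a probability-weighted sum of same-sized candidate embeddings. The weights and embedding values below are made-up numbers for illustration, and `soft_select` is a hypothetical helper name.

```python
def soft_select(weights, candidates):
    """Eq. (7): weighted sum of N transformed embeddings of equal size."""
    dim = len(candidates[0])
    return [sum(w * e[j] for w, e in zip(weights, candidates)) for j in range(dim)]


# Illustrative controller output and transformed embeddings for one user.
alpha = [0.7, 0.2, 0.1]
e_hat = [
    [0.5, -0.5, 0.0, 1.0],
    [0.0, 0.5, 0.5, -1.0],
    [1.0, 1.0, -1.0, 0.0],
]

p_i = soft_select(alpha, e_hat)

# With one-hot weights, soft selection reduces to the hard selection of Eq. (6).
p_hard = soft_select([0.0, 1.0, 0.0], e_hat)
```

Because the output is a smooth function of the weights, gradients of the loss can flow back into the controller that produced them.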
2.5. An Optimization Method
In this subsection, we investigate the optimization of the proposed framework. With soft selection, the optimization task is to jointly optimize the parameters of the DLRS, say $\mathbf{W}$, and the parameters of the controllers, say $\mathbf{\Phi}$. Since our framework is end-to-end differentiable, inspired by differentiable architecture search (DARTS) (Liu et al., 2018a), we adapt a DARTS based optimization for the AutoEmb framework, which updates $\mathbf{W}$ and $\mathbf{\Phi}$ by optimizing the training loss $\mathcal{L}_{train}$ and the validation loss $\mathcal{L}_{val}$ through gradient descent, respectively. Note that both the training and validation losses are determined not only by the parameters of the DLRS, but also by the parameters of the controllers.

The goal of embedding dimensionality search is to find optimal parameters $\mathbf{\Phi}^*$ that minimize the validation loss $\mathcal{L}_{val}(\mathbf{W}^*, \mathbf{\Phi}^*)$, where the parameters $\mathbf{W}^*$ of the DLRS are obtained by minimizing the training loss $\mathcal{L}_{train}(\mathbf{W}, \mathbf{\Phi}^*)$. This is a bilevel optimization problem (Maclaurin et al., 2015; Pedregosa, 2016; Pham et al., 2018), where $\mathbf{\Phi}$ is the upper-level variable and $\mathbf{W}$ is the lower-level variable:

$$\min_{\mathbf{\Phi}} \; \mathcal{L}_{val}\big(\mathbf{W}^*(\mathbf{\Phi}), \mathbf{\Phi}\big) \qquad \text{s.t.} \quad \mathbf{W}^*(\mathbf{\Phi}) = \arg\min_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \mathbf{\Phi}) \tag{8}$$

Optimizing $\mathbf{\Phi}$ is time-consuming due to the expensive inner optimization of $\mathbf{W}$. Therefore, we leverage the same approximation scheme as DARTS:

$$\nabla_{\mathbf{\Phi}} \mathcal{L}_{val}\big(\mathbf{W}^*(\mathbf{\Phi}), \mathbf{\Phi}\big) \approx \nabla_{\mathbf{\Phi}} \mathcal{L}_{val}\big(\mathbf{W} - \xi \nabla_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \mathbf{\Phi}), \mathbf{\Phi}\big) \tag{9}$$

where $\xi$ is the learning rate for updating $\mathbf{W}$. The approximation scheme estimates $\mathbf{W}^*(\mathbf{\Phi})$ by updating $\mathbf{W}$ for one training step, which avoids optimizing $\mathbf{W}$ to convergence. The first-order approximation with $\xi = 0$ can lead to some further speed-up, but empirically worse performance. It is worth noting that, unlike DARTS on computer vision tasks, there is no stage for deriving a discrete architecture, during which DARTS generates a discrete neural network architecture by selecting the most likely operation according to the softmax probabilities. This is because the popularity of users/items is highly dynamic as new user-item interactions occur, which prohibits us from selecting one particular embedding dimension for a user/item.

Table 1. Statistics of the datasets.

| Object | Movielens-20m | Movielens-latest | Netflix Prize |
|---|---|---|---|
| # user | 138,493 | 283,228 | 480,189 |
| # item | 27,278 | 58,098 | 17,770 |
| # interaction | 20,000,263 | 27,753,444 | 100,480,507 |
| # rating | 1-5 | 1-5 | 1-5 |
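
Table 2. Overall results of the online stage.

| Dataset | Task | Metric | FSE | SAM | DARTS | AutoEmb |
|---|---|---|---|---|---|---|
| Movielens-20m | Regression | MSE Loss | 0.1840 ± 0.0003 | 0.1819 ± 0.0002 | 0.1812 ± 0.0003 | 0.1803 ± 0.0002 |
| Movielens-20m | Regression | Accuracy | 0.7211 ± 0.0003 | 0.7239 ± 0.0002 | 0.7245 ± 0.0002 | 0.7259 ± 0.0003 |
| Movielens-20m | Classification | CE Loss | 1.1464 ± 0.0006 | 1.1423 ± 0.0002 | 1.1416 ± 0.0003 | 1.1395 ± 0.0005 |
| Movielens-20m | Classification | Accuracy | 0.4943 ± 0.0003 | 0.4958 ± 0.0005 | 0.4970 ± 0.0002 | 0.4982 ± 0.0002 |
| Movielens-latest | Regression | MSE Loss | 0.1809 ± 0.0002 | 0.1803 ± 0.0003 | 0.1797 ± 0.0002 | 0.1790 ± 0.0001 |
| Movielens-latest | Regression | Accuracy | 0.7260 ± 0.0003 | 0.7275 ± 0.0001 | 0.7280 ± 0.0002 | 0.7287 ± 0.0002 |
| Movielens-latest | Classification | CE Loss | 1.1257 ± 0.0003 | 1.1249 ± 0.0002 | 1.1242 ± 0.0002 | 1.1233 ± 0.0002 |
| Movielens-latest | Classification | Accuracy | 0.5049 ± 0.0002 | 0.5062 ± 0.0002 | 0.5071 ± 0.0001 | 0.5079 ± 0.0002 |
| Netflix Prize | Regression | MSE Loss | 0.1821 ± 0.0003 | 0.1812 ± 0.0001 | 0.1807 ± 0.0001 | 0.1779 ± 0.0002 |
| Netflix Prize | Regression | Accuracy | 0.7290 ± 0.0002 | 0.7302 ± 0.0002 | 0.7309 ± 0.0003 | 0.7316 ± 0.0001 |
| Netflix Prize | Classification | CE Loss | 1.1102 ± 0.0003 | 1.1092 ± 0.0001 | 1.1085 ± 0.0003 | 1.1076 ± 0.0002 |
| Netflix Prize | Classification | Accuracy | 0.5096 ± 0.0003 | 0.5109 ± 0.0003 | 0.5119 ± 0.0001 | 0.5127 ± 0.0002 |
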
We present our DARTS based optimization algorithm in Algorithm 1. In each iteration, we first update the controllers' parameters on a validation set collected from previous user-item interactions (lines 2-3); then we collect a new mini-batch of user-item interactions as training data (line 4); next we produce the hyperparameters $\{\alpha^n\}$ and $\{\beta^n\}$ for the training examples via the controllers with their current parameters (line 5); then we make predictions via the DLRS with its current parameters and the assistance of the hyperparameters (line 6); next we evaluate the prediction performance and record it (line 7); and finally, we update the DLRS's parameters.
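To make the one-step approximation of Eq. (9) concrete, the sketch below applies it to a toy bilevel problem with a scalar DLRS parameter `w` and a scalar controller parameter `phi`. The two quadratic losses are assumptions invented for this example; they are chosen only so that the gradient can be written by hand and checked against finite differences.

```python
def train_grad_w(w, phi):
    """Gradient in w of the toy training loss L_train(w, phi) = (w - phi)^2."""
    return 2.0 * (w - phi)


def val_loss(w, phi):
    """Toy validation loss L_val(w, phi) = (w - 1)^2 + 0.1 * phi^2."""
    return (w - 1.0) ** 2 + 0.1 * phi ** 2


def unrolled_val_loss(w, phi, xi):
    """Validation loss after one virtual training step on w (Eq. (9), right side)."""
    w_next = w - xi * train_grad_w(w, phi)
    return val_loss(w_next, phi)


def approx_grad_phi(w, phi, xi):
    """Gradient of the unrolled validation loss with respect to phi."""
    w_next = w - xi * train_grad_w(w, phi)
    # Chain rule: dL_val/dw' * dw'/dphi + direct dL_val/dphi,
    # where dw'/dphi = -xi * d/dphi[2 (w - phi)] = 2 * xi.
    return 2.0 * (w_next - 1.0) * (2.0 * xi) + 0.2 * phi


w, phi, xi = 0.0, 0.5, 0.1
g_second = approx_grad_phi(w, phi, xi)    # one-step unrolled gradient
g_first = approx_grad_phi(w, phi, 0.0)    # first-order variant (xi = 0)

# Central finite difference of the unrolled loss as a sanity check.
h = 1e-6
g_numeric = (unrolled_val_loss(w, phi + h, xi)
             - unrolled_val_loss(w, phi - h, xi)) / (2 * h)
```

In this toy case the unrolled gradient is negative while the first-order gradient is positive, which shows how dropping the dependence of `w` on `phi` can change the update direction for the controller.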
It is worth noting that, in the batch-based streaming recommendation setting, the optimization process follows an “evaluate, train, evaluate, train, ...” fashion (Chang et al., 2017). In other words, we continuously collect new user-item interaction data; when we have a full mini-batch of examples, we first make predictions with the AutoEmb framework using its current parameters, then evaluate and record the performance of these predictions; next we update the parameters of AutoEmb by minimizing the loss between the predictions and the ground-truth labels; finally we collect another mini-batch of user-item interactions and repeat the process. Therefore, there is no pre-split validation set or test set. Specifically, (i) to calculate $\mathcal{L}_{val}$, we sample a mini-batch of previous user-item interactions as the validation set; (ii) there is no independent test stage in which we fix all the parameters and evaluate the proposed framework on examples from a pre-split test set; and (iii) following the streaming recommendation setting in (Chang et al., 2017), we have an offline parameter estimation stage and an online inference stage: we use historical user-item interactions to pre-train AutoEmb's parameters in the offline parameter estimation stage, and then we launch AutoEmb online and continuously update its parameters in the online inference stage. In other words, AutoEmb's parameters are updated in both stages following Algorithm 1.
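The “evaluate, train” protocol described above can be sketched with a deliberately simple stand-in model. `RunningMeanModel`, the synthetic interaction stream, and the batch size are hypothetical placeholders for this example and do not represent the AutoEmb networks; the point is only the order of operations in each iteration.

```python
import random


class RunningMeanModel:
    """Toy stand-in for AutoEmb: predicts the mean of the labels seen so far."""

    def __init__(self):
        self.total, self.count = 0.0, 0

    def predict(self, _example):
        return self.total / self.count if self.count else 0.5

    def update(self, batch):
        for _, label in batch:
            self.total += label
            self.count += 1


def stream_batches(interactions, batch_size):
    """Yield consecutive mini-batches in arrival order."""
    for start in range(0, len(interactions), batch_size):
        yield interactions[start:start + batch_size]


rng = random.Random(0)
interactions = []
for _ in range(200):
    user, item = rng.randrange(10), rng.randrange(20)
    interactions.append(((user, item), 1.0 if rng.random() < 0.7 else 0.0))

model = RunningMeanModel()
history, recorded_mse = [], []
for batch in stream_batches(interactions, batch_size=50):
    # Validation data is sampled from previous interactions; a full
    # implementation would update the controllers on it at this point.
    val_batch = rng.sample(history, min(len(history), 50))
    # Evaluate first, on examples the model has not been trained on yet.
    mse = sum((model.predict(x) - y) ** 2 for x, y in batch) / len(batch)
    recorded_mse.append(mse)
    # Then train on the same mini-batch and keep it for later validation.
    model.update(batch)
    history.extend(batch)
```

Every recorded score comes from predictions made before the corresponding batch was used for training, so the recorded curve plays the role of a test metric without a pre-split test set.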
3. Experiments
In this section, we conduct extensive experiments to evaluate the effectiveness of the proposed AutoEmb framework. We mainly focus on two questions: (i) how the proposed framework performs compared to representative baselines; and (ii) how the controller contributes to the performance. We first introduce experimental settings. Then we seek answers to the above two questions. Finally, we study the impact of important parameters on the performance of the proposed framework.
3.1. Datasets
We evaluate our method on three widely used datasets: Movielens-20m (https://grouplens.org/datasets/movielens/20m/), Movielens-latest (https://grouplens.org/datasets/movielens/latest/) and the Netflix Prize data (https://www.kaggle.com/netflixinc/netflixprizedata). Some key statistics of the datasets are shown in Table 1. For each dataset, we use 70% of the user-item interactions for offline parameter estimation and the other 30% for online learning. To demonstrate the effectiveness of our framework on the embedding selection task, we eliminate other contextual features, e.g., users' age and items' category, to exclude their influence; however, it is straightforward to incorporate them into the framework for better recommendations.
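The 70%/30% split between the offline and online stages can be sketched as below. The assumption that interactions are already ordered by arrival time, as well as the tuple layout and the helper name, are illustrative choices for this example rather than details given in the text.

```python
def split_offline_online(interactions, offline_fraction=0.7):
    """Split an ordered interaction list into offline and online parts."""
    cut = round(len(interactions) * offline_fraction)
    return interactions[:cut], interactions[cut:]


# Ten synthetic (user, item, time step) records, assumed to be time-ordered.
interactions = [("u%d" % (t % 4), "i%d" % (t % 7), t) for t in range(10)]
offline, online = split_offline_online(interactions)
```

The offline part would be used to pre-train the parameters and the online part would then be consumed mini-batch by mini-batch.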
3.2. Implementation Details
Next we detail the architecture of the DLRS and the controllers. For the DLRS, (i) embedding layer: we select $N$ sizes of embedding dimension $\{d_1, \dots, d_N\}$, so the dimension of the transformed embeddings is $d_N$. We concatenate the three embeddings of each user/item, which significantly improves the embedding lookup speed; (ii) hidden layers: we have two hidden layers; (iii) output layer: we consider two types of tasks. For the rating regression task, the output layer produces a single value, and for the rating classification task, the output layer has 5 units with Softmax activation, because there are 5 classes of ratings. For the controllers, (i) input layer: the input feature size is 38; (ii) hidden layers: we have two hidden layers; (iii) output layer: it has $N$ units with Softmax activation to generate the weights over the $N$ embedding sizes. The batch size is 500. The DLRS and the controllers use separate learning rates. We select the parameters of the proposed framework via cross-validation. Correspondingly, we also tune the parameters of the baselines for a fair comparison. We will discuss more details about parameter selection for the proposed framework in the following subsections.
3.3. Evaluation Metrics
We conduct two types of tasks to demonstrate the effectiveness of the AutoEmb framework. For the regression task, we first binarize the ratings to $\{0, 1\}$, and then train the framework by minimizing the mean-squared-error (MSE) loss. The performance is evaluated by MSE loss and accuracy (we use 0.5 as the threshold to assign the labels). For the classification task, the ratings 1-5 are viewed as 5 classes, and the framework is trained by minimizing the cross-entropy loss (CE Loss). The performance is measured by cross-entropy and accuracy.

3.4. Overall Performance in the Online Stage
We compare the proposed framework with the following representative baseline methods:

Fixed-size Embedding (FSE): In this baseline, we assign a fixed embedding size to all users/items. For a fair comparison, we set the embedding size to $\sum_{n=1}^{N} d_n$. In other words, it occupies the same embedding space memory as AutoEmb.

Supervised Attention Model (SAM): This baseline has exactly the same architecture as AutoEmb, but we update the parameters of the DLRS and the controllers simultaneously on the same batch of training data, in an end-to-end supervised learning manner.

Differentiable architecture search (DARTS): This baseline is a standard DARTS method, which trains real-valued weights for the three types of embedding dimensions.

It is worth noting that the Neural Input Search model (Joglekar et al., 2019) and the Mixed Dimension Embedding model (Ginart et al., 2019) cannot be applied in the streaming recommendation setting, because they assume that the popularities of users/items are known in advance and fixed, and then assign large embedding dimensions to highly popular users/items. However, in real-world streaming recommender systems, the popularities are not known in advance but are highly dynamic.
The overall results of the online stage are shown in Table 2. We make the following observations: (i) SAM performs better than FSE, since SAM assigns attention weights to embeddings with different dimensions according to popularity, while FSE has a fixed embedding dimension for all users/items. These results demonstrate that recommendation quality is indeed related to the popularity of users/items, and that introducing different embedding dimensions and adjusting their weights according to popularity can boost recommendation performance. (ii) DARTS outperforms SAM, because AutoML models like DARTS update the controller's parameters on the validation set, which can enhance generalization, while end-to-end models like SAM update the parameters of the controllers simultaneously with the DLRS on the same batch of training data, which may lead to overfitting. These results validate the effectiveness of AutoML techniques over conventional supervised learning in recommendations. (iii) Our proposed model AutoEmb performs better than the standard DARTS model. DARTS separately trains real-valued weights for each user/item on the three types of embedding dimensions. These weights of a specific user/item may not be well trained because of the limited interactions of this user/item. The controller of AutoEmb can incorporate huge amounts of user-item interactions and capture the important characteristics from them. Also, the controller has an explicit popularity input, which may help it learn the dependency between popularity and embedding dimensions, which DARTS cannot. These results demonstrate the necessity of developing a controller rather than only real-valued weights. (iv) After the offline parameter estimation stage, most users/items in the online stage have already become very popular. In other words, AutoEmb yields a stable improvement for popular users/items. AutoEmb yields an even more significant improvement in the early training stage, which we will discuss in the following sections.
To sum up, we can answer the first question: the proposed framework outperforms representative baselines on different datasets with different metrics. These results demonstrate the effectiveness of the AutoEmb framework.
3.5. Performance with Popularity
Now, we investigate whether the proposed controller can generate proper weights according to various popularity. To this end, we compare FSE without a controller, SAM with a supervised attentive controller, and AutoEmb with an AutoML based controller. The results on the Movielens-20m dataset are shown in Figure 6, where the $x$-axis is popularity and the $y$-axis corresponds to performance. We omit the similar results on the other datasets due to limited space.
We make the following observations: (i) When popularity is small, FSE performs worse than SAM and AutoEmb. This is because larger embeddings need sufficient data to be well learned. Smaller embeddings with fewer parameters can quickly capture some high-level characteristics, which can help cold-start predictions. (ii) As popularity increases, FSE outperforms SAM. This result is interesting and instructive; the reason may be that SAM's controller overfits to a small number of training examples, which leads to suboptimal performance. In contrast, AutoEmb's controller is trained on the validation set, which improves its generalization. This explanation is also validated in the following subsection. (iii) AutoEmb always outperforms FSE and SAM, which means the proposed framework is able to automatically and dynamically adjust the weights on embeddings with different dimensions according to the popularity. (iv) To further probe the weights generated by AutoEmb's controller according to popularity, we draw the distribution of weights for various popularity in Figure 7. We can observe that the distribution concentrates on the small embeddings for small popularity, and shifts to larger embeddings as popularity increases. This observation validates our analysis above. In summary, we can answer the second question: the controller of AutoEmb can produce reasonable weights for different popularity in an automated and dynamic manner.
3.6. Performance with Data
Training deep learning based recommender systems typically requires a large amount of user-item interaction data. Our proposed AutoEmb framework introduces an additional controller network as well as some additional parameters in the DLRS, which may make it hard to train well. We show the optimization process in Figure 8, where the $x$-axis is the number of training examples and the $y$-axis corresponds to the performance.
We can observe the following: (i) In the early training stage, SAM performs worst, since its controller overfits to the insufficient training examples, which is also validated in Section 3.5. (ii) The overfitting problem of SAM is gradually mitigated as more data arrives, and SAM then outperforms FSE, which validates the necessity of weights on the various embedding dimensions. (iii) AutoEmb outperforms SAM and FSE throughout the whole training process. In particular, it significantly boosts performance in the early training stage, when training examples are insufficient.
4. Related Work
In this section, we briefly review works related to our study. In general, the related work can be grouped into the following categories.
The first category related to this paper is deep learning based recommender systems, which are able to effectively capture nonlinear and nontrivial user-item relationships and enable the codification of more complex abstractions as data representations in the higher layers (Zhang et al., 2017). In recent years, a series of neural recommendation models based on deep learning techniques have been proposed with evident performance improvements. He et al. (2017) first proposed Neural Collaborative Filtering (NCF), which utilizes a dual neural network to represent a two-way interaction between user preferences and item features. Guo et al. (2017) proposed an end-to-end model named DeepFM to integrate factorization machines and a Multilayer Perceptron (MLP) seamlessly. Xu et al. (2016; 2017) fused tag annotations into personalized recommendation and proposed Deep Semantic Similarity based Personalized Recommendation (DSPR). Elkahky et al. (2015) designed the Multi-View Deep Neural Network (MVDNN), which models the interactions among users and items across multiple domains. Besides models built on simple expressions or transformations of MLPs, there are also other typical deep learning models. Sedhain et al. (2015) introduced AutoRec, with item-based and user-based variants, to learn lower-dimensional feature representations of users and items. Moreover, Convolutional Neural Networks (CNNs) can provide strong support for extracting features for recommendation. Lei et al. (2016) utilized CNNs in image recommendation. Zheng et al. (2017) used two parallel CNNs to model user and item features from review texts. Hidasi et al. (2015) first proposed a session-based recommendation model named GRU4Rec to model the sequential influence of item transitions. Lee et al. (2016) proposed a hybrid model that integrates RNNs with CNNs for quote recommendation. However, most of these works focus on designing sophisticated neural network architectures and have not paid much attention to the embedding layers.

The second category is AutoML for Neural Architecture Search (NAS), which has attracted much attention since Zoph and Le (2016) adopted a reinforcement learning approach with a recurrent neural network (RNN) that trains a large number of candidate models to convergence. Due to the high training cost, much research focuses on proposing novel NAS models that require fewer hardware resources. One direction is to sample a subset of all model components so that the optimal set can be learned within limited training steps. For instance, ENAS (Pham et al., 2018) leverages a controller to sample subsets of models, and SMASH (Brock et al., 2017) uses a hypernetwork to generate weights for sampled networks. DARTS (Liu et al., 2018a) and SNAS (Xie et al., 2018) treat connections as weights that are optimized via back-propagation. Luo et al. (2018) map neural architectures into an embedding space, so that the optimal embedding can be learned and given as feedback to the final architecture. Another direction is to reduce the size of the search space. Several works (Real et al., 2019; Zhong et al., 2018; Liu et al., 2018b; Cai et al., 2018) propose searching for convolution cells that can be stacked repeatedly. Zoph et al. (2018) developed the NASNet architecture, which uses a transfer learning setting in which cells searched on a smaller dataset are transferred to a larger dataset. MNAS (Tan et al., 2019) proposed a hierarchical convolution cell block that can learn different structures. NAS has been widely used in different tasks, such as image classification (Real et al., 2017) and natural language understanding (Chen et al., 2020). Joglekar et al. (2019) first applied NAS to large-scale recommendation models and proposed a novel type of embedding named Multi-size Embedding (ME). However, it cannot be applied in the streaming recommendation setting, where the popularity is not known in advance but is highly dynamic.

5. Conclusion
In this paper, we propose a novel framework, AutoEmb, which aims to select different embedding dimensions according to a user's/item's popularity in an automated and dynamic manner. In practical streaming recommender systems, due to the huge number of users/items and the dynamic nature of their popularity, it is hard, if not impossible, to manually select different dimensions for different users/items. Therefore, we propose an AutoML based framework to automatically select from different embedding dimensions. Specifically, we first augment a widely used DLRS architecture to enable it to accept various embedding dimensions; then we propose a controller network, which automatically selects embedding dimensions for a specific user/item according to its current popularity. We evaluate our framework with extensive experiments on widely used benchmark datasets. The results show that (i) our framework can significantly improve recommendation performance under highly dynamic popularity; and (ii) the controller trained in an AutoML manner can dramatically enhance training efficiency, especially when interaction data is insufficient.
There are several interesting research directions. First, in addition to automatically determining the embedding dimensions, we would like to investigate methods to automatically design the whole DLRS architecture. Second, we select the embedding dimensions in a soft manner, which makes the framework end-to-end differentiable but requires a larger embedding space. In the future, we would like to develop an end-to-end framework with hard selection. Third, the framework is general enough to handle categorical features, so we would like to investigate more applications of the proposed framework. Finally, we would like to develop a framework that can incorporate more types of features in addition to categorical ones.
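The trade-off between soft and hard selection noted above can be made concrete with a small sketch. It assumes that soft selection must keep every candidate embedding for a user/item, while hard selection keeps only the highest-scoring one; the candidate dimensions and controller scores are hypothetical and serve only as an illustration.

```python
def soft_storage(dims):
    """Soft selection mixes all candidates, so every candidate is stored:
    the per-user/item cost is the sum of all candidate dimensions."""
    return sum(dims)

def hard_select(dims, controller_scores):
    """Hard selection keeps only the highest-scoring candidate (argmax)."""
    best = max(range(len(dims)), key=lambda i: controller_scores[i])
    return best, dims[best]

dims = [2, 16, 128]                   # hypothetical candidate dimensions
controller_scores = [0.3, 1.7, 0.9]   # hypothetical controller output
soft_cost = soft_storage(dims)
best_index, hard_cost = hard_select(dims, controller_scores)
print(soft_cost, hard_cost)
```

The argmax in `hard_select` is not differentiable, so a hard-selection framework would need some other way to pass a training signal to the controller; this sketch does not attempt to solve that problem.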
References
Brock et al. (2017) Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. 2017. SMASH: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344 (2017).
Cai et al. (2018) Han Cai, Jiacheng Yang, Weinan Zhang, Song Han, and Yong Yu. 2018. Path-level network transformation for efficient architecture search. arXiv preprint arXiv:1806.02639 (2018).
Chang et al. (2017) Shiyu Chang, Yang Zhang, Jiliang Tang, Dawei Yin, Yi Chang, Mark A Hasegawa-Johnson, and Thomas S Huang. 2017. Streaming recommender systems. In Proceedings of the 26th International Conference on World Wide Web. 381–389.
Chen et al. (2020) Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, and Jingren Zhou. 2020. AdaBERT: Task-Adaptive BERT Compression with Differentiable Neural Architecture Search. arXiv preprint arXiv:2001.04246 (2020).
Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 7–10.
Elkahky et al. (2015) Ali Mamdouh Elkahky, Yang Song, and Xiaodong He. 2015. A multi-view deep learning approach for cross domain user modeling in recommendation systems. In Proceedings of the 24th International Conference on World Wide Web. 278–288.
Ginart et al. (2019) Antonio Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James Zou. 2019. Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems. arXiv preprint arXiv:1909.11810 (2019).
Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).

Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning. 448–456.
Joglekar et al. (2019) Manas R Joglekar, Cong Li, Jay K Adams, Pranav Khaitan, and Quoc V Le. 2019. Neural input search for large scale recommendation models. arXiv preprint arXiv:1907.04471 (2019).

Karlik and Olgac (2011) Bekir Karlik and A Vehbi Olgac. 2011. Performance analysis of various activation functions in generalized MLP architectures of neural networks. International Journal of Artificial Intelligence and Expert Systems 1, 4 (2011), 111–122.
Lee et al. (2016) Hanbit Lee, Yeonchan Ahn, Haejun Lee, Seungdo Ha, and Sang-goo Lee. 2016. Quote recommendation in dialogue using deep neural network. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 957–960.

Lei et al. (2016) Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. 2016. Comparative deep learning of hybrid representations for image recommendations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2545–2553.
Liu et al. (2018b) Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. 2018b. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV). 19–34.
Liu et al. (2018a) Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018a. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055 (2018).
Luo et al. (2018) Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2018. Neural architecture optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 7827–7838.
Maclaurin et al. (2015) Dougal Maclaurin, David Duvenaud, and Ryan Adams. 2015. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning. 2113–2122.
Nguyen et al. (2017) Hanh TH Nguyen, Martin Wistuba, Josif Grabocka, Lucas Rego Drumond, and Lars Schmidt-Thieme. 2017. Personalized Deep Learning for Tag Recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 186–197.
Pan et al. (2018) Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, and Quan Lu. 2018. Field-weighted factorization machines for click-through rate prediction in display advertising. In Proceedings of the 2018 World Wide Web Conference. 1349–1357.
 Pedregosa (2016) Fabian Pedregosa. 2016. Hyperparameter optimization with approximate gradient. arXiv preprint arXiv:1602.02355 (2016).
 Pham et al. (2018) Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268 (2018).
Qu et al. (2016) Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. 2016. Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1149–1154.

Real et al. (2019) Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. 2019. Regularized Evolution for Image Classifier Architecture Search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4780–4789.
Real et al. (2017) Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Jie Tan, Quoc V Le, and Alexey Kurakin. 2017. Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2902–2911.

Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web. 111–112.
Tan et al. (2019) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. 2019. MnasNet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2820–2828.
Tan et al. (2016) Yong Kiam Tan, Xinxing Xu, and Yong Liu. 2016. Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 17–22.
 Wu et al. (2016) Sai Wu, Weichao Ren, Chengchao Yu, Gang Chen, Dongxiang Zhang, and Jingbo Zhu. 2016. Personal recommendation using deep recurrent neural networks in NetEase. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE, 1218–1229.
 Xie et al. (2018) Sirui Xie, Hehui Zheng, Chunxiao Liu, and Liang Lin. 2018. SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926 (2018).
Xu et al. (2016) Zhenghua Xu, Cheng Chen, Thomas Lukasiewicz, Yishu Miao, and Xiangwu Meng. 2016. Tag-aware personalized recommendation using a deep-semantic similarity model with negative sampling. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 1921–1924.
Xu et al. (2017) Zhenghua Xu, Thomas Lukasiewicz, Cheng Chen, Yishu Miao, and Xiangwu Meng. 2017. Tag-aware personalized recommendation using a hybrid deep model. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 3196–3202.
 Zhang et al. (2017) Shuai Zhang, Lina Yao, and Aixin Sun. 2017. Deep Learning based Recommender System: A Survey and New Perspectives. arXiv preprint arXiv:1707.07435 (2017).
 Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52, 1 (2019), 1–38.
 Zheng et al. (2017) Lei Zheng, Vahid Noroozi, and Philip S Yu. 2017. Joint deep modeling of users and items using reviews for recommendation. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 425–434.
Zhong et al. (2018) Zhao Zhong, Junjie Yan, Wei Wu, Jing Shao, and Cheng-Lin Liu. 2018. Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2423–2432.
Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.
 Zoph and Le (2016) Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578 (2016).
 Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. 2018. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8697–8710.