AutoEmb: Automated Embedding Dimensionality Search in Streaming Recommendations

Deep learning based recommender systems (DLRSs) often have embedding layers, which are used to reduce the dimensionality of categorical variables (e.g., user/item identifiers) and meaningfully transform them into a low-dimensional space. The majority of existing DLRSs empirically pre-define a fixed and unified dimension for all user/item embeddings. Recent research indicates that different embedding sizes are highly desirable for different users/items according to their popularity. However, manually selecting embedding sizes in recommender systems can be very challenging due to the large number of users/items and the dynamic nature of their popularity. Thus, in this paper, we propose an AutoML based end-to-end framework (AutoEmb), which can enable various embedding dimensions according to the popularity in an automated and dynamic manner. To be specific, we first enhance a typical DLRS to allow various embedding dimensions; then we propose an end-to-end differentiable framework that can automatically select different embedding dimensions according to user/item popularity; finally we propose an AutoML based optimization algorithm in a streaming recommendation setting. The experimental results based on widely used benchmark datasets demonstrate the effectiveness of the AutoEmb framework.



1. Introduction

Driven by recent advances in deep learning, there has been increasing interest in developing deep learning based recommender systems (DLRSs) (Zhang et al., 2017; Nguyen et al., 2017; Wu et al., 2016). DLRSs have boosted recommendation performance because of their capacity to effectively capture non-linear user-item relationships and to learn complex abstractions as data representations (Zhang et al., 2019). Architectures of DLRSs mainly consist of three key components: (i) embedding layers that map raw user/item features in a high-dimensional space to dense vectors in a low-dimensional embedding space, (ii) hidden layers that perform nonlinear transformations on the input features, and (iii) output layers that make predictions for specific recommendation tasks (e.g., regression and classification) based on the representations from the hidden layers. The majority of existing research has focused on designing sophisticated neural network architectures for the hidden layers and output layers, while the embedding layers have not gained much attention. However, in large-scale real-world recommender systems with numerous users and items, embedding layers play a crucial role in accurate recommendations. The most typical use of an embedding is to transform an identifier, i.e., a user-id or item-id, into a real-valued vector. Each embedding can be considered as the latent representation of a specific user or item. Compared to hand-crafted features, well-learned embeddings have been demonstrated to significantly enhance recommendation performance (Cheng et al., 2016; Guo et al., 2017; Pan et al., 2018; Qu et al., 2016; Zhou et al., 2018). This is because embeddings can reduce the dimensionality of categorical variables (e.g., one-hot identifiers) and meaningfully represent users/items in the latent space. Furthermore, nearest neighbors in the embedding space can be viewed as similar users/items, whereas the one-hot space is completely uninformed: similar users/items are not projected closer to each other.

The majority of existing DLRSs adopt a unified and fixed dimensionality in their embedding layers. In other words, all users (or items) share the same fixed embedding size. This naturally raises a question: do we need different embedding dimensions for different users/items? To investigate this question, we conduct a preliminary study on the Movielens-20m dataset. For each user, we first select a fixed part of his/her ratings (labeled interactions with items) as test data, and then we choose a varying number of ratings as training data. Figure 1 illustrates how the recommendation performance of a typical DLRS (Cheng et al., 2016) with three different embedding dimensions, in terms of mean-squared-error (MSE) and accuracy, changes when we vary the number of training ratings. Lower MSE (or higher accuracy) means better performance. Note that we refer to the number of interactions users/items have as popularity in this work. From the figure, as the popularity increases, (i) the performance of models with different embedding dimensions improves, but larger embedding dimensions gain more; and (ii) smaller embedding dimensions first work better and then are outperformed by larger embedding dimensions. These observations are expected, since the embedding size often determines the number of model parameters to learn and the capacity of the embedding to encode information. On the one hand, smaller embedding dimensions often mean fewer model parameters and lower capacity. Thus, they can work well when the popularity is small. However, the limited capacity restricts the performance as the popularity increases and the embedding needs to encode more information. On the other hand, larger embedding dimensions usually indicate more model parameters and higher capacity. They typically need sufficient data to be well trained. Therefore they cannot work well when the popularity is small, but they have the potential to capture more information as the popularity increases.
Given that users/items have very different popularity in a recommender system, DLRSs should allow different embedding dimensions. This property is highly desirable in practice, since real-world recommender systems are streaming systems in which the popularity is highly dynamic. For example, new interactions occur rapidly and new users/items are continuously added.

In this paper, we aim to enable different embedding dimensions for different users/items at the embedding layer under the streaming setting. We face two major challenges. First, the number of users/items in real-world recommender systems is very large and their popularity is highly dynamic, so it is hard, if not impossible, to manually select different dimensions for different users/items. Second, the input dimension of the first hidden layer in existing DLRSs is often unified and fixed, so it is difficult for them to accept different dimensions from the embedding layers. Our attempt to solve these challenges leads to an end-to-end differentiable AutoML based framework (AutoEmb), which can make use of various embedding dimensions in an automated and dynamic manner. Our experiments on widely used benchmark datasets demonstrate the effectiveness of the proposed framework.

Figure 2. The basic DLRS architecture.

2. Framework

As discussed in Section 1, assigning different embedding sizes to different users/items in the streaming recommendation setting faces two challenges. To address these challenges, we propose an AutoML based end-to-end framework, which can automatically and dynamically leverage embeddings with various dimensions. In the following, we first illustrate a widely used DLRS as our basic architecture; then we enhance it to enable various embedding dimensions; next we discuss how to automatically and dynamically select among the various dimensions; and finally, we provide an AutoML based optimization algorithm for streaming recommendation.

2.1. A Basic DLRS Architecture

We illustrate a basic DLRS architecture in Figure 2, which contains three components: (i) the embedding layers that map user/item IDs into dense and continuous valued embedding vectors, (ii) the hidden layers, which are fully connected layers that non-linearly transform the embedding vectors into hierarchical feature representations, and (iii) the output layer that generates the prediction for recommendation. Given a user-item interaction, the DLRS first performs embedding lookups according to the user-id and item-id, and concatenates the two embeddings; then the DLRS feeds the concatenated embedding into the hidden layers and makes predictions via the hidden and output layers. This typical DLRS architecture is widely used in recent recommender systems (Cheng et al., 2016). However, it has a fixed neural network architecture, which cannot handle different embedding dimensions. Next, we will enhance this basic DLRS architecture to enable various embedding dimensions.

2.2. The Enhanced DLRS Model

As discussed in Section 1, shorter embeddings with fewer model parameters can generate better recommendations when the popularity is small; while with the increase of popularity, longer embeddings with more model parameters and higher capacity achieve better recommendation performance. Motivated by this observation, assigning different embedding dimensions for users/items with different popularity is highly desired. However, the basic DLRS architecture in Section 2.1 is not able to handle various embedding dimensions because of its fixed neural network architecture.

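The basic idea to address this challenge is to transform the various embedding dimensions into the same dimension, so that the DLRS can select one of the transformed embeddings according to the current user/item popularity. Figure 3 illustrates the embedding transformation and selection process. Suppose we have $N$ embedding spaces $\{\mathbf{E}^1, \cdots, \mathbf{E}^N\}$, and the dimension of an embedding in each space is $d_1, \cdots, d_N$, where $d_1 < d_2 < \cdots < d_N$. We define $\{\mathbf{e}_i^1, \cdots, \mathbf{e}_i^N\}$ as the set of embeddings for a given user $u_i$ from all embedding spaces. To unify the embedding vectors $\{\mathbf{e}_i^1, \cdots, \mathbf{e}_i^N\}$, we introduce a component with $N$ fully-connected layers, which transform them into the same dimension $d_N$:

$$\tilde{\mathbf{e}}_i^k \leftarrow \mathbf{W}_k \mathbf{e}_i^k + \mathbf{b}_k, \quad k = 1, \cdots, N \tag{1}$$

where $\mathbf{W}_k \in \mathbb{R}^{d_N \times d_k}$ are weight matrices and $\mathbf{b}_k \in \mathbb{R}^{d_N}$ are bias vectors. After the linear transformations, we have mapped the original embedding vectors $\{\mathbf{e}_i^1, \cdots, \mathbf{e}_i^N\}$ into the same dimension $d_N$. In practice, we observe that the magnitudes of the transformed embeddings vary significantly, which makes them incomparable. To tackle this challenge, we apply BatchNorm (Ioffe and Szegedy, 2015) with the Tanh activation (Karlik and Olgac, 2011) to the transformed embeddings as follows:

$$\hat{\mathbf{e}}_i^k \leftarrow \tanh\left(\frac{\tilde{\mathbf{e}}_i^k - \mu_{\mathcal{B}}^k}{\sqrt{(\sigma_{\mathcal{B}}^k)^2 + \epsilon}}\right), \quad k = 1, \cdots, N \tag{2}$$

where $\mu_{\mathcal{B}}^k$ is the mini-batch mean and $(\sigma_{\mathcal{B}}^k)^2$ is the mini-batch variance for $k = 1, \cdots, N$. $\epsilon$ is a constant added to the mini-batch variance for numerical stability. The Tanh function maps the normalized embedding into $(-1, 1)$. After BatchNorm and activation, the linearly transformed embeddings become magnitude-comparable embedding vectors $\{\hat{\mathbf{e}}_i^1, \cdots, \hat{\mathbf{e}}_i^N\}$ with the same dimension. Given an item $v_j$, we conduct the same transformations as those in Equations (1) and (2) on its embeddings $\{\mathbf{f}_j^1, \cdots, \mathbf{f}_j^N\}$, and obtain magnitude-comparable ones $\{\hat{\mathbf{f}}_j^1, \cdots, \hat{\mathbf{f}}_j^N\}$ that share the same dimension.

Figure 3. The embedding transformation and selection.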
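The transformation in Equations (1) and (2) can be illustrated with a minimal sketch in plain Python. The embedding sizes, batch size, random weights, and helper names (`linear`, `batchnorm_tanh`, `random_matrix`) below are assumptions made for illustration, not values or code from this work; the raw embeddings are random stand-ins whose scale grows with the space index, so that the effect of normalization is visible.

```python
import math
import random

def linear(x, W, b):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def batchnorm_tanh(batch, eps=1e-5):
    """Normalize each coordinate over the mini-batch, then apply tanh."""
    n, dim = len(batch), len(batch[0])
    mean = [sum(vec[c] for vec in batch) / n for c in range(dim)]
    var = [sum((vec[c] - mean[c]) ** 2 for vec in batch) / n for c in range(dim)]
    return [[math.tanh((vec[c] - mean[c]) / math.sqrt(var[c] + eps))
             for c in range(dim)] for vec in batch]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dims = [2, 4, 8]       # hypothetical embedding sizes d_1 < d_2 < d_3
target = dims[-1]      # common dimension after the transformation
batch_size = 5

# One linear layer (W_k, b_k) per embedding space.
layers = [(random_matrix(target, d, rng), [0.0] * target) for d in dims]

# A mini-batch of raw embeddings per space (random stand-ins for lookups),
# deliberately drawn at very different scales.
raw = [[[rng.gauss(0.0, 10.0 ** k) for _ in range(d)] for _ in range(batch_size)]
       for k, d in enumerate(dims)]

unified = []
for (W, b), batch in zip(layers, raw):
    transformed = [linear(x, W, b) for x in batch]   # Equation (1)
    unified.append(batchnorm_tanh(transformed))      # Equation (2)

for k, batch in enumerate(unified):
    largest = max(abs(v) for vec in batch for v in vec)
    print(f"space {k}: dim={len(batch[0])}, max |value|={largest:.3f}")
```

Regardless of the original scale, every transformed vector ends up with the same length and with entries bounded by the tanh range, which is what makes the candidates comparable before selection.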

Figure 4. The controller architecture.

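According to the popularity, the DLRS will select a pair of transformed embeddings $\mathbf{e}_i^*$ and $\mathbf{f}_j^*$ as the representations of the user $u_i$ and item $v_j$:

$$\mathbf{e}_i^* \leftarrow \hat{\mathbf{e}}_i^{k_u}, \qquad \mathbf{f}_j^* \leftarrow \hat{\mathbf{f}}_j^{k_v} \tag{3}$$

where $\hat{\mathbf{e}}_i^{k}$ and $\hat{\mathbf{f}}_j^{k}$ denote the transformed user and item embeddings from the $k$-th embedding space, and $k_u$ and $k_v$ index the embedding spaces selected for the user and the item. The embedding size is selected by a controller that will be detailed in the next subsection. Then, we concatenate the user's and item's representations and feed them as the input into $M$ fully-connected hidden layers:

$$\mathbf{h}_0 = [\mathbf{e}_i^*, \mathbf{f}_j^*], \qquad \mathbf{h}_m = \phi\left(\mathbf{W}_m \mathbf{h}_{m-1} + \mathbf{b}_m\right), \quad m = 1, \cdots, M \tag{4}$$

where $\mathbf{W}_m$ is the weight matrix and $\mathbf{b}_m$ is the bias vector for the $m$-th hidden layer, and $\phi(\cdot)$ is a nonlinear activation function. Finally, the output layer, which is subsequent to the hidden layers, generates the prediction of user $u_i$'s satisfaction with item $v_j$ as:

$$\hat{y}_{ij} = \sigma\left(\mathbf{W}_o \mathbf{h}_M + \mathbf{b}_o\right) \tag{5}$$

where $\mathbf{W}_o$ and $\mathbf{b}_o$ are the output layer's weight matrix and bias vector. The activation function $\sigma(\cdot)$ varies according to the prediction task, such as the Sigmoid function for app installation prediction (regression) (Cheng et al., 2016), and Softmax for buy-or-not prediction (classification) (Tan et al., 2016). By minimizing the loss between predictions and labels, the DLRS updates the parameters of all embeddings as well as the neural networks through back-propagation.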
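The prediction path described above (concatenate the user and item representations, pass them through the hidden layers, then apply a task-dependent output activation) can be sketched as follows. This is only an illustration: the layer sizes, the constant weights, the ReLU hidden activation, and the function names are assumptions and are not specified by the text.

```python
import math

def dense(x, W, b):
    """Fully-connected layer: W x + b with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(user_vec, item_vec, hidden, out_W, out_b, task):
    h = user_vec + item_vec                 # concatenation of the two representations
    for W, b in hidden:                     # hidden layers (ReLU is an assumption)
        h = relu(dense(h, W, b))
    logits = dense(h, out_W, out_b)
    if task == "regression":
        return sigmoid(logits[0])           # single score in (0, 1)
    return softmax(logits)                  # class probabilities

# Hypothetical toy parameters: 2-d user/item vectors, one hidden layer of 3 units.
user_vec, item_vec = [0.5, -0.5], [0.1, 0.2]
hidden = [([[0.1] * 4 for _ in range(3)], [0.0] * 3)]

score = predict(user_vec, item_vec, hidden, [[1.0, 1.0, 1.0]], [0.0], "regression")
probs = predict(user_vec, item_vec, hidden,
                [[0.1 * c] * 3 for c in range(5)], [0.0] * 5, "classification")
print(score, probs)
```

Only the output layer changes between the two tasks; the embedding and hidden layers are shared in structure.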

Figure 5. The Enhanced DLRS architecture.

2.3. The Controller

As observed in Figure 1, shorter embeddings perform better when the popularity is small, and longer embeddings perform better when the popularity is large. Therefore, it is highly desirable to use different embedding sizes for different users and items. Given the large number of users/items and the dynamic nature of their popularity, it is hard, if not impossible, to determine embedding sizes manually.

To address this challenge, we propose an AutoML based approach to automatically determine the embedding sizes. To be specific, we design two controller networks that decide the embedding sizes for users and items, separately. Figure 4 illustrates the controller network architecture. For a specific user/item, the controller's input consists of two parts: (i) the current popularity of the user/item; and (ii) contextual information such as the previous hyperparameters and the corresponding loss the user/item obtained. This contextual information can be viewed as a signal that measures whether the hyperparameters previously assigned to the user/item work well. In other words, if they work well, the new hyperparameters generated this time should be somewhat similar. The controller takes the above-mentioned inputs, transforms them via several fully-connected layers, and generates hierarchical feature representations. The output layer is a Softmax layer with $N$ output units. In this work, we use $\{\alpha_1, \cdots, \alpha_N\}$ to denote the output units of the controller for users and $\{\beta_1, \cdots, \beta_N\}$ for items. The $k$-th unit denotes the probability of selecting the $k$-th embedding space. The embedding space is automatically selected as the one corresponding to the largest probability. It is formally defined as:

$$k_u = \arg\max_{k} \alpha_k, \qquad k_v = \arg\max_{k} \beta_k \tag{6}$$

With the controller, the task of embedding dimensionality search reduces to optimizing the controllers' parameters, so as to automatically generate suitable $\alpha$ or $\beta$ according to the popularity of a user/item.
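A controller of this kind can be sketched as a small feed-forward network followed by a Softmax and an argmax. Everything concrete below is an assumption for illustration: the feature encoding (log-scaled popularity, previous weights, previous loss), the hidden size, the tanh hidden activation, and the random weights. The sketch only shows how a probability vector over embedding spaces leads to a hard selection as in Eq. (6).

```python
import math
import random

def dense(x, W, b):
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def controller(features, layers):
    """Hidden layers (tanh is an assumption) followed by a softmax output layer."""
    h = features
    for W, b in layers[:-1]:
        h = [math.tanh(v) for v in dense(h, W, b)]
    W, b = layers[-1]
    return softmax(dense(h, W, b))

def init_layer(n_out, n_in, rng):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

rng = random.Random(0)
n_spaces = 3                       # number of candidate embedding spaces

# Hypothetical input: current popularity plus contextual information.
popularity = 12
prev_weights = [0.6, 0.3, 0.1]     # hyperparameters produced last time
prev_loss = 0.21                   # loss obtained with those hyperparameters
features = [math.log1p(popularity)] + prev_weights + [prev_loss]

layers = [init_layer(8, len(features), rng), init_layer(n_spaces, 8, rng)]
alpha = controller(features, layers)

# Hard selection: pick the embedding space with the largest probability.
selected = max(range(n_spaces), key=lambda k: alpha[k])
print(alpha, selected)
```

With untrained random weights the selected index is arbitrary; the point of training the controller is to make this distribution track popularity.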

2.4. Soft Selection

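Eq. (6) performs a hard selection on the embedding spaces. In other words, each time, we only select the one embedding space with the largest probability from the controller. This hard selection makes the whole framework not end-to-end differentiable. To tackle this challenge, in this work, we choose a soft selection. For user $u_i$, its embedding is a weighted sum of the transformed embeddings $\{\hat{\mathbf{e}}_i^1, \cdots, \hat{\mathbf{e}}_i^N\}$, where the weight of $\hat{\mathbf{e}}_i^k$ is the corresponding probability $\alpha_k$ from the controller; the item embedding is formed in the same way from $\{\hat{\mathbf{f}}_j^1, \cdots, \hat{\mathbf{f}}_j^N\}$ and $\beta_k$. Therefore, the representations of the user and item can be reformulated as:

$$\mathbf{e}_i^* \leftarrow \sum_{k=1}^{N} \alpha_k \hat{\mathbf{e}}_i^k, \qquad \mathbf{f}_j^* \leftarrow \sum_{k=1}^{N} \beta_k \hat{\mathbf{f}}_j^k \tag{7}$$

With soft selection, the enhanced DLRS is end-to-end differentiable. We illustrate the whole augmented DLRS architecture in Figure 5, where we add the transformed embedding layer, which performs soft selection over the embedding spaces; the selection process is determined by two controllers, for users and items respectively.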
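The difference between hard and soft selection can be seen in a small sketch. The weights and candidate vectors below are made-up numbers, and the function names are hypothetical; the candidates stand in for transformed embeddings that already share one dimension.

```python
def hard_select(weights, candidates):
    """Return the candidate with the largest weight (not differentiable in the weights)."""
    best = max(range(len(weights)), key=lambda k: weights[k])
    return candidates[best]

def soft_select(weights, candidates):
    """Return the weighted sum of all candidates (differentiable in the weights)."""
    dim = len(candidates[0])
    return [sum(w * vec[c] for w, vec in zip(weights, candidates)) for c in range(dim)]

# Made-up controller probabilities and same-dimension candidate embeddings.
alpha = [0.7, 0.2, 0.1]
candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

hard = hard_select(alpha, candidates)   # the first candidate
soft = soft_select(alpha, candidates)   # approximately [0.6, 0.1]
print(hard, soft)
```

A small change in `alpha` changes the soft result smoothly, whereas the hard result only changes when the argmax flips, which is why gradients can reach the controller only under soft selection.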

2.5. An Optimization Method

In this subsection, we investigate the optimization of the proposed framework. With the soft selection, the optimization task is to jointly optimize the parameters of the DLRS, say $\mathbf{W}$, and the parameters of the controllers, say $\mathbf{V}$. Since our framework is end-to-end differentiable, inspired by the concept of differentiable architecture search (DARTS) techniques (Liu et al., 2018a), we adapt a DARTS based optimization for the AutoEmb framework, which updates $\mathbf{W}$ and $\mathbf{V}$ by optimizing the training loss $\mathcal{L}_{train}$ and the validation loss $\mathcal{L}_{val}$ through gradient descent, respectively. Note that both the training and validation losses are determined not only by the parameters of the DLRS, but also by the parameters of the controllers.

The goal of embedding dimensionality search is to find optimal parameters $\mathbf{V}^*$ that minimize the validation loss $\mathcal{L}_{val}(\mathbf{W}^*, \mathbf{V}^*)$, where the parameters of the DLRS $\mathbf{W}^*$ are obtained by minimizing the training loss $\mathcal{L}_{train}(\mathbf{W}, \mathbf{V}^*)$. This is a bilevel optimization problem (Maclaurin et al., 2015; Pedregosa, 2016; Pham et al., 2018), where $\mathbf{V}$ is the upper-level variable and $\mathbf{W}$ is the lower-level variable:

$$\min_{\mathbf{V}} \; \mathcal{L}_{val}\left(\mathbf{W}^*(\mathbf{V}), \mathbf{V}\right) \quad \text{s.t.} \quad \mathbf{W}^*(\mathbf{V}) = \arg\min_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \mathbf{V}) \tag{8}$$

Optimizing $\mathbf{V}$ is time-consuming due to the expensive inner optimization of $\mathbf{W}$. Therefore, we leverage the same approximation scheme as DARTS:

$$\nabla_{\mathbf{V}} \mathcal{L}_{val}\left(\mathbf{W}^*(\mathbf{V}), \mathbf{V}\right) \approx \nabla_{\mathbf{V}} \mathcal{L}_{val}\left(\mathbf{W} - \xi \nabla_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \mathbf{V}), \mathbf{V}\right) \tag{9}$$

where $\xi$ is the learning rate for updating $\mathbf{W}$. The approximation scheme estimates $\mathbf{W}^*(\mathbf{V})$ by updating $\mathbf{W}$ for one training step, which avoids completely optimizing $\mathbf{W}$ to convergence. The first-order approximation with $\xi = 0$ can even lead to some speed-up, but empirically worse performance. It is worth noting that, different from DARTS on computer vision tasks, there is no stage for deriving a discrete architecture, during which DARTS generates a discrete neural network architecture by selecting the most likely operation according to the softmax probabilities. This is because the popularity of users/items is highly dynamic as new user-item interactions occur, which prevents us from selecting one particular embedding dimension for a user/item.

Input: the user-item interactions and the corresponding ground-truth labels
Output: well-learned DLRS parameters $\mathbf{W}^*$; well-learned controller parameters $\mathbf{V}^*$

1:  while not converged do
2:     Sample a mini-batch of validation data from previous user-item interactions
3:     Update $\mathbf{V}$ by descending $\nabla_{\mathbf{V}} \mathcal{L}_{val}(\mathbf{W} - \xi \nabla_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \mathbf{V}), \mathbf{V})$ ($\xi = 0$ for first-order approximation)
4:     Collect a mini-batch of training data
5:     Generate $\alpha$, $\beta$ via controllers with current parameters $\mathbf{V}$
6:     Generate predictions via DLRS with current parameters $\mathbf{W}$ as well as $\alpha$ and $\beta$
7:     Evaluate the predictions and record the performance
8:     Update $\mathbf{W}$ by descending $\nabla_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \mathbf{V})$
9:  end while
Algorithm 1 DARTS based Optimization for AutoEmb.
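The alternating update with the first-order approximation can be illustrated on a deliberately tiny problem. The sketch below is not the AutoEmb model: it assumes a toy predictor `v * w * x` with one scalar "model" parameter `w` and one scalar "controller" parameter `v`, noiseless data with target `2x`, hand-derived gradients, and arbitrary learning rates. It only shows the control flow of updating the upper-level variable on validation data and the lower-level variable on training data.

```python
import random

def loss(w, v, batch):
    """Mean squared error of the toy predictor v * w * x."""
    return sum((v * w * x - y) ** 2 for x, y in batch) / len(batch)

def grad_w(w, v, batch):
    """Gradient of the loss with respect to the lower-level parameter w."""
    return sum(2.0 * (v * w * x - y) * v * x for x, y in batch) / len(batch)

def grad_v(w, v, batch):
    """Gradient of the loss with respect to the upper-level parameter v."""
    return sum(2.0 * (v * w * x - y) * w * x for x, y in batch) / len(batch)

rng = random.Random(0)
history = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 11)]   # toy data, y = 2x

w, v = 0.5, 0.5              # toy stand-ins for DLRS and controller parameters
lr_w, lr_v = 0.1, 0.1        # assumed learning rates
initial_loss = loss(w, v, history)

for step in range(500):
    # Upper level: update v on a validation mini-batch with w held fixed,
    # i.e. the first-order approximation (xi = 0).
    val_batch = rng.sample(history, 4)
    v -= lr_v * grad_v(w, v, val_batch)

    # Lower level: update w on a training mini-batch with v held fixed.
    train_batch = rng.sample(history, 4)
    w -= lr_w * grad_w(w, v, train_batch)

final_loss = loss(w, v, history)
print(initial_loss, final_loss, v * w)
```

With a nonzero `xi`, the validation gradient would instead be evaluated at a look-ahead copy of `w` after one training step, which requires differentiating through that step; the toy above omits this.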

Object          Movielens-20m   Movielens-latest   Netflix Prize
--------------  --------------  -----------------  --------------
# user          138,493         283,228            480,189
# item          27,278          58,098             17,770
# interaction   20,000,263      27,753,444         100,480,507
# rating        1–5             1–5                1–5
Table 1. Statistics of the datasets.

Dataset           Task            Metric    FSE              SAM              DARTS            AutoEmb
----------------  --------------  --------  ---------------  ---------------  ---------------  ---------------
Movielens-20m     Regression      MSE Loss  0.1840 ± 0.0003  0.1819 ± 0.0002  0.1812 ± 0.0003  0.1803 ± 0.0002
                                  Accuracy  0.7211 ± 0.0003  0.7239 ± 0.0002  0.7245 ± 0.0002  0.7259 ± 0.0003
                  Classification  CE Loss   1.1464 ± 0.0006  1.1423 ± 0.0002  1.1416 ± 0.0003  1.1395 ± 0.0005
                                  Accuracy  0.4943 ± 0.0003  0.4958 ± 0.0005  0.4970 ± 0.0002  0.4982 ± 0.0002
Movielens-latest  Regression      MSE Loss  0.1809 ± 0.0002  0.1803 ± 0.0003  0.1797 ± 0.0002  0.1790 ± 0.0001
                                  Accuracy  0.7260 ± 0.0003  0.7275 ± 0.0001  0.7280 ± 0.0002  0.7287 ± 0.0002
                  Classification  CE Loss   1.1257 ± 0.0003  1.1249 ± 0.0002  1.1242 ± 0.0002  1.1233 ± 0.0002
                                  Accuracy  0.5049 ± 0.0002  0.5062 ± 0.0002  0.5071 ± 0.0001  0.5079 ± 0.0002
Netflix Prize     Regression      MSE Loss  0.1821 ± 0.0003  0.1812 ± 0.0001  0.1807 ± 0.0001  0.1779 ± 0.0002
                                  Accuracy  0.7290 ± 0.0002  0.7302 ± 0.0002  0.7309 ± 0.0003  0.7316 ± 0.0001
                  Classification  CE Loss   1.1102 ± 0.0003  1.1092 ± 0.0001  1.1085 ± 0.0003  1.1076 ± 0.0002
                                  Accuracy  0.5096 ± 0.0003  0.5109 ± 0.0003  0.5119 ± 0.0001  0.5127 ± 0.0002
Table 2. Performance comparison of different embedding selection methods

We present our DARTS based optimization algorithm in Algorithm 1. In each iteration, we first update the controllers' parameters on a validation set sampled from previous user-item interactions (lines 2-3); then we collect a new mini-batch of user-item interactions as training data (line 4); next we produce the hyperparameters $\alpha$ and $\beta$ (the embedding-space weights for users and items) via the controllers with their current parameters for the training examples (line 5); then we make predictions via the DLRS with its current parameters and the assistance of the hyperparameters (line 6); next we evaluate the prediction performance and record it (line 7); and finally, we update the DLRS's parameters (line 8).

It is worth noting that, in the batch-based streaming recommendation setting, the optimization process follows an "evaluate, train, evaluate, train…" fashion (Chang et al., 2017). In other words, we continuously collect new user-item interaction data; when we have a full mini-batch of examples, we first make predictions with the AutoEmb framework using its current parameters, evaluate the performance of the predictions, and record it; then we update the parameters of AutoEmb by minimizing the loss between the predictions and the ground-truth labels; next we collect another mini-batch of user-item interactions and repeat the same process. Therefore, there is no pre-split validation set or test set. In other words, (i) to calculate the validation loss $\mathcal{L}_{val}$, we sample a mini-batch of previous user-item interactions as the validation set; (ii) there is no independent test stage, during which we would fix all the parameters and evaluate the proposed framework on examples in a pre-split test set; and (iii) following the streaming recommendation setting in (Chang et al., 2017), we also have offline parameter estimation and online inference stages, where we use historical user-item interactions to pre-train AutoEmb's parameters in the offline parameter estimation stage, and then launch AutoEmb online and continuously update its parameters in the online inference stage. In other words, AutoEmb's parameters are updated in both stages following Algorithm 1.
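The "evaluate, train, evaluate, train…" protocol can be sketched with a stand-in model. The `RunningMean` predictor below is a hypothetical placeholder (it just predicts the mean of all labels seen so far) and the label stream is made up; the point is only the ordering: each mini-batch is scored with the current parameters before it is used for an update.

```python
class RunningMean:
    """Placeholder model: predicts the mean of all labels seen so far."""

    def __init__(self, default=0.5):
        self.total = 0.0
        self.count = 0
        self.default = default

    def predict(self, n):
        guess = self.total / self.count if self.count else self.default
        return [guess] * n

    def update(self, labels):
        self.total += sum(labels)
        self.count += len(labels)

def evaluate_then_train(stream, batch_size, model):
    """Score each full mini-batch with the current model, then train on it."""
    history = []
    for start in range(0, len(stream) - batch_size + 1, batch_size):
        labels = stream[start:start + batch_size]
        preds = model.predict(len(labels))                 # evaluate first
        mse = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)
        history.append(mse)                                # record performance
        model.update(labels)                               # then train
    return history

stream = [1.0, 1.0, 0.0, 1.0] * 3      # made-up binarized labels arriving in order
history = evaluate_then_train(stream, batch_size=4, model=RunningMean())
print(history)  # [0.25, 0.1875, 0.1875]
```

Because every example is scored before the model has seen it, the recorded numbers play the role of test performance without a separate held-out split.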

3. Experiments

In this section, we conduct extensive experiments to evaluate the effectiveness of the proposed AutoEmb framework. We mainly focus on two questions: (i) how the proposed framework performs compared to representative baselines; and (ii) how the controller contributes to the performance. We first introduce experimental settings. Then we seek answers to the above two questions. Finally, we study the impact of important parameters on the performance of the proposed framework.

3.1. Datasets

We evaluate our method on widely used datasets: Movielens-20m, Movielens-latest and Netflix Prize. Some key statistics of the datasets are shown in Table 1. For each dataset, we use 70% of the user-item interactions for offline parameter estimation and the other 30% for online learning. To demonstrate the effectiveness of our framework in the embedding selection task, we eliminate other contextual features, e.g., users' age and items' category, to exclude their influence; however, it is straightforward to incorporate them into the framework for better recommendations.

3.2. Implementation Details

Next we detail the architecture of the DLRS and the controllers. For the DLRS, (i) embedding layer: we select three sizes of embedding dimension, and all embeddings are transformed into a common dimension. We concatenate the three embeddings of each user/item, which significantly improves the embedding lookup speed; (ii) hidden layer: we have two hidden layers; (iii) output layer: we perform two types of tasks; for the rating regression task, the output layer produces a single value, and for the rating classification task, the output layer has five units with Softmax activation, because there are 5 classes of ratings. For the controllers, (i) input layer: the input feature size is 38; (ii) hidden layer: we have two hidden layers; (iii) output layer: it uses Softmax activation to generate the weights over the three sizes of embeddings. The batch size is 500. The DLRS and the controllers use separate learning rates. We select the parameters of the proposed framework via cross-validation. Correspondingly, we also tune the parameters of the baselines for a fair comparison. We will discuss more details about parameter selection for the proposed framework in the following subsections.

3.3. Evaluation Metrics

We conduct two types of tasks to demonstrate the effectiveness of the AutoEmb framework. For the regression task, we first binarize ratings to $\{0, 1\}$, and then train the framework by minimizing the mean-squared-error (MSE) loss. The performance is evaluated by MSE loss and accuracy (we use 0.5 as the threshold to assign the labels). For the classification task, the ratings 1–5 are viewed as 5 classes, and the framework is trained by minimizing the cross-entropy loss (CE Loss). The performance is measured by cross-entropy and accuracy.
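The four quantities named above can be computed as in the following sketch. The predictions and labels are made-up numbers, the function names are hypothetical, and treating a prediction exactly at the threshold as positive is an assumption.

```python
import math

def mse(preds, labels):
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

def binary_accuracy(preds, labels, threshold=0.5):
    """Assign label 1 when the prediction reaches the threshold, else 0."""
    hits = sum((1 if p >= threshold else 0) == y for p, y in zip(preds, labels))
    return hits / len(labels)

def cross_entropy(prob_rows, labels):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(row[y]) for row, y in zip(prob_rows, labels)) / len(labels)

def class_accuracy(prob_rows, labels):
    hits = sum(max(range(len(row)), key=lambda c: row[c]) == y
               for row, y in zip(prob_rows, labels))
    return hits / len(labels)

# Regression task on binarized ratings (made-up predictions).
reg_preds, reg_labels = [0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]
reg_mse = mse(reg_preds, reg_labels)
reg_acc = binary_accuracy(reg_preds, reg_labels)

# Classification task over 5 rating classes, indexed 0..4 (made-up probabilities).
cls_probs = [[0.1, 0.1, 0.1, 0.1, 0.6], [0.4, 0.3, 0.1, 0.1, 0.1]]
cls_labels = [4, 1]
cls_ce = cross_entropy(cls_probs, cls_labels)
cls_acc = class_accuracy(cls_probs, cls_labels)

print(reg_mse, reg_acc, cls_ce, cls_acc)
```

Lower MSE and cross-entropy indicate better predictions, while higher accuracy is better for both tasks.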

Figure 6. Performance with different popularities.

3.4. Overall Performance in the Online Stage

We compare the proposed framework with the following representative baseline methods:

  • Fixed-size Embedding (FSE): In this baseline, we assign a fixed embedding size to all users/items. For a fair comparison, we set the embedding size so that FSE occupies the same embedding space memory as AutoEmb.

  • Supervised Attention Model (SAM): This baseline has exactly the same architecture as AutoEmb, but we simultaneously update the parameters of the DLRS and the controllers on the same batch of training data, in an end-to-end supervised learning manner.

  • Differentiable architecture search (DARTS): This baseline is a standard DARTS method, which trains real-valued weights for the three types of embedding dimensions.

It is worth noting that the Neural Input Search model (Joglekar et al., 2019) and the Mixed Dimension Embedding model (Ginart et al., 2019) cannot be applied in the streaming recommendation setting, because they assume that the popularities of users/items are known in advance and fixed, and then assign large embedding dimensions to highly popular users/items. However, in real-world streaming recommender systems, the popularities are not known in advance and are highly dynamic.

The overall results of the online stage are shown in Table 2. We make the following observations: (i) SAM performs better than FSE, since SAM assigns attention weights to embeddings with different dimensions according to popularity, while FSE has a fixed embedding dimension for all users/items. These results demonstrate that recommendation quality is indeed related to the popularity of users/items, and that introducing different embedding dimensions and adjusting the weights on them according to popularity can boost recommendation performance. (ii) DARTS outperforms SAM, because AutoML models like DARTS update the controller's parameters on the validation set, which can enhance generalization, while end-to-end models like SAM update the parameters of the controllers simultaneously with the DLRS on the same batch of training data, which may lead to overfitting. These results validate the effectiveness of AutoML techniques over conventional supervised learning in recommendations. (iii) Our proposed model AutoEmb performs better than the standard DARTS model. DARTS separately trains real-valued weights for each user/item on the three types of embedding dimensions. The weights of a specific user/item may not be well trained because of the limited interactions of this user/item. The controller of AutoEmb can incorporate huge amounts of user-item interactions and capture the important characteristics from them. Also, the controller has an explicit popularity input, which may help it learn the dependency between popularity and embedding dimensions, which DARTS cannot. These results demonstrate the necessity of developing a controller rather than only real-valued weights. (iv) After the offline parameter estimation stage, most users/items in the online stage have already become very popular. In other words, AutoEmb achieves a stable improvement for popular users/items. AutoEmb achieves an even more significant improvement in the early training stage, which we discuss in the following sections.

To sum up, we can draw an answer to the first question: the proposed framework outperforms representative baselines on different datasets with different metrics. These results prove the effectiveness of the AutoEmb framework.

3.5. Performance with Popularity

Now, we investigate whether the proposed controller can generate proper weights according to various popularity levels. Thus, we compare FSE without a controller, SAM with a supervised-attentive controller, and AutoEmb with an AutoML based controller. The results on the Movielens-20m dataset are shown in Figure 6, where the x-axis is popularity and the y-axis corresponds to performance. We omit the similar results on the other datasets due to limited space.

We make the following observations: (i) When the popularity is small, FSE performs worse than SAM and AutoEmb. This is because larger embeddings need sufficient data to be well learned. Smaller embeddings with fewer parameters can quickly capture some high-level characteristics, which can help cold-start predictions. (ii) As the popularity increases, FSE outperforms SAM. This result is interesting and instructive; the reason may be that SAM's controller overfits to a small number of training examples, which leads to suboptimal performance. In contrast, AutoEmb's controller is trained on the validation set, which improves its generalization. This explanation is also validated in the following subsection. (iii) AutoEmb always outperforms FSE and SAM, which means the proposed framework is able to automatically and dynamically adjust the weights on embeddings with different dimensions according to the popularity. (iv) To further probe the weights generated by AutoEmb's controller according to popularity, we plot the distribution of weights for various popularity levels in Figure 7. We observe that the distribution concentrates on the small embeddings when the popularity is small, and shifts toward larger embeddings as the popularity increases. This observation validates our analysis above. In summary, we can answer the second question: the controller of AutoEmb can produce reasonable weights for different popularity levels in an automated and dynamic manner.

Figure 7. The weights on different embedding dimensions with different popularities (ppl).
Figure 8. The offline parameter estimation process.

3.6. Performance with Data

Training deep learning based recommender systems typically requires a large amount of user-item interaction data. Our proposed AutoEmb framework introduces an additional controller network as well as some additional parameters in the DLRS, which may make it harder to train well. We show the optimization process in Figure 8, where the x-axis is the number of training examples and the y-axis corresponds to the performance.

It can be observed that: (i) In the early training stage, SAM performs worst, since its controller overfits to the insufficient training examples, which is also consistent with the results in Section 3.5. (ii) The overfitting problem of SAM is gradually mitigated as more data arrive, and SAM then outperforms FSE, which validates the necessity of weights on various embedding dimensions. (iii) AutoEmb outperforms SAM and FSE throughout the whole training process. In particular, it significantly boosts performance in the early training stage with insufficient training examples.

4. Related Work

In this section, we briefly review works related to our study. In general, the related work can be grouped into the following categories.

The first category related to this paper is deep learning based recommender systems, which are able to effectively capture non-linear and non-trivial user-item relationships, and enable the codification of more complex abstractions as data representations in the higher layers (Zhang et al., 2017). In recent years, a series of neural recommendation models based on deep learning techniques have been proposed with evident performance gains. He et al. (He et al., 2017) first proposed Neural Collaborative Filtering (NCF), which utilizes a dual neural network to represent a two-way interaction between user preferences and item features. Guo et al. (Guo et al., 2017) proposed an end-to-end model named DeepFM to integrate a factorization machine and a Multilayer Perceptron (MLP) seamlessly. Xu et al. (Xu et al., 2016; Xu et al., 2017) fused tag annotations into personalized recommendation and proposed Deep Semantic Similarity based Personalized Recommendation (DSPR). Elkahky et al. (Elkahky et al., 2015) designed the Multi-View Deep Neural Network (MV-DNN), which can model the interactions among users and items across multiple domains. Besides simple expressions or transformations of MLPs, there are also other typical models based on deep learning methods. Sedhain et al. (Sedhain et al., 2015) introduced AutoRec, with item-based and user-based variants, to learn lower-dimensional feature representations of users and items. Moreover, Convolutional Neural Networks (CNNs) can provide strong support in extracting features for recommendation. Lei et al. (Lei et al., 2016) utilized CNNs in image recommendation. Zheng et al. (Zheng et al., 2017) used two parallel CNNs to model user and item features from review texts. Hidasi et al. (Hidasi et al., 2015) first proposed a session-based recommendation model named GRU4Rec to model the sequential influence of item transitions. Lee et al. (Lee et al., 2016) proposed a hybrid model that integrates RNNs with CNNs for quote recommendation. However, most of these works focus on designing sophisticated neural network architectures, and have not paid much attention to the embedding layers.

The second category is about AutoML for Neural Architecture Search (NAS), which has attracted much attention since (Zoph and Le, 2016), which adopts a reinforcement learning approach with a recurrent neural network (RNN) and trains a large number of candidate models to convergence. Due to the high training cost, much research focuses on proposing novel NAS models that require fewer hardware resources. One direction is to sample a subset of all model components so that the optimal set can be learned within limited training steps. For instance, ENAS (Pham et al., 2018) leverages a controller to sample subsets of models, and SMASH (Brock et al., 2017) uses a hyper-network to generate weights for sampled networks. DARTS (Liu et al., 2018a) and SNAS (Xie et al., 2018) treat the connections as weights that are optimized via back-propagation. Luo et al. (Luo et al., 2018) map the neural architectures into an embedding space so that the optimal embedding can be learned and given as feedback to the final architecture. Another direction is to reduce the size of the search space. (Real et al., 2019; Zhong et al., 2018; Liu et al., 2018b; Cai et al., 2018) propose searching for convolution cells that can be stacked repeatedly. Zoph et al. (Zoph et al., 2018) developed the NASNet architecture, which uses a transfer learning setting in which cells searched on a smaller dataset are transferred to larger datasets. MNAS (Tan et al., 2019) proposed a hierarchical convolution cell block that can learn different structures. NAS has been widely used in different tasks, such as image classification (Real et al., 2017) and natural language understanding (Chen et al., 2020). Joglekar et al. (Joglekar et al., 2019) first applied NAS to large-scale recommendation models and proposed a novel type of embedding named Multi-size Embedding (ME). However, it cannot be applied in the streaming recommendation setting, where popularity is not known in advance and is highly dynamic.

5. Conclusion

In this paper, we propose a novel framework, AutoEmb, which aims to select different embedding dimensions according to user/item popularity in an automated and dynamic manner. In practical streaming recommender systems, due to the huge number of users/items and the dynamic nature of their popularity, it is hard, if not impossible, to manually select different dimensions for different users/items. Therefore, we propose an AutoML based framework to automatically select from different embedding dimensions. To be specific, we first augment a widely used DLRS architecture to enable it to accept various embedding dimensions; then we propose a controller network, which can automatically select embedding dimensions for a specific user/item according to its current popularity. We evaluate our framework with extensive experiments based on widely used benchmark datasets. The results show that (i) our framework can significantly improve the recommendation performance under highly dynamic popularity; and (ii) the controller trained in an AutoML manner can dramatically enhance the training efficiency, especially when interaction data is insufficient.

There are several interesting research directions. First, in addition to automatically determining the embedding dimensions, we would like to investigate methods to automatically design the whole DLRS architecture. Second, we select the embedding dimensions in a soft manner, which makes the framework end-to-end differentiable but requires a larger embedding space. In the future, we would like to develop an end-to-end framework with hard selection. Third, the framework is general enough to handle categorical features, so we would like to investigate more applications of the proposed framework. Finally, we would like to develop a framework that can incorporate more types of features in addition to categorical ones.
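
The contrast between soft and hard selection mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the candidate sizes, the random projection matrices that map each candidate to a shared dimension, the controller logits, and the function names `soft_select` and `hard_select` are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def soft_select(candidates, projections, logits):
    # Soft selection: project every candidate embedding to the shared size
    # and combine them with softmax weights. All candidates are kept, and
    # the result is a smooth (differentiable) function of the logits.
    weights = softmax(logits)
    projected = [W @ e for W, e in zip(projections, candidates)]
    combined = sum(w * p for w, p in zip(weights, projected))
    return combined, weights

def hard_select(candidates, projections, logits):
    # Hard selection: keep only the highest-scoring candidate.
    # argmax is not differentiable with respect to the logits.
    k = int(np.argmax(logits))
    return projections[k] @ candidates[k], k

rng = np.random.default_rng(0)
dims = [2, 4, 8]                     # hypothetical candidate embedding sizes
shared = dims[-1]                    # assumed shared size after projection
candidates = [rng.normal(size=d) for d in dims]
projections = [rng.normal(size=(shared, d)) for d in dims]
logits = np.array([0.1, 2.0, -1.0])  # hypothetical controller outputs

soft_vec, weights = soft_select(candidates, projections, logits)
hard_vec, chosen = hard_select(candidates, projections, logits)
```

In this sketch, soft selection stores and uses all three candidate embeddings for each user/item, which corresponds to the larger embedding space noted above, whereas hard selection uses only the single chosen candidate.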