I Introduction
With the explosive growth of the World Wide Web, huge amounts of data have been generated, resulting in an increasingly severe information overload problem that can overwhelm users [5]. Recommender systems mitigate this problem by suggesting personalized items that best match users’ preferences [20, 2, 25, 32, 33, 1]. Recent years have witnessed the rapid development and growing popularity of deep learning based recommender systems (DLRSs) [45, 26, 41], which outperform traditional recommendation techniques, such as collaborative filtering and learning-to-rank, because of their strong capability of feature representation and deep inference [44].
Real-world recommender systems typically involve a massive number of categorical feature fields from users (e.g., occupation and userID), items (e.g., category and itemID), contextual information (e.g., time and location), and their interactions (e.g., a user’s purchase history of items). DLRSs first map these categorical features into real-valued dense vectors via an embedding-component [27, 29, 49], i.e., the embedding-lookup process, which leads to a huge number of embedding parameters. For instance, the YouTube recommender system involves 1 million unique videoIDs and assigns each videoID a specific 256-dimensional embedding vector; in other words, the videoID feature field alone occupies 256 million parameters [8]. Then, the DLRSs nonlinearly transform the input embeddings from all feature fields and generate the outputs (predictions) via the MLP-component (Multi-Layer Perceptron), which usually involves only several fully connected layers in practice. Therefore, compared to the MLP-component, the embedding-component dominates the number of parameters in practical recommender systems and naturally plays a tremendously impactful role in the recommendations.
The majority of existing recommender systems, such as the well-known Wide&Deep model [7], assign a fixed and unified embedding dimension to all feature fields, which may lead to memory inefficiency. First, the embedding dimension often determines the capacity to encode information. Thus, allocating the same dimension to all feature fields may lose the information of highly predictive features while wasting memory on non-predictive features. Therefore, we should assign large dimensions to highly informative and predictive features, for instance, the “location” feature in location-based recommender systems [1]. Second, different feature fields have different cardinalities (i.e., numbers of unique values). For example, the gender feature has only two values (i.e., male and female), while the itemID feature usually involves millions of unique values. Intuitively, we should allocate larger dimensions to feature fields with more unique feature values to encode their complex relationships with other features, and assign smaller dimensions to feature fields with smaller cardinality to avoid overfitting due to over-parameterization [16, 9, 47, 17]. For these reasons, it is highly desirable to assign different embedding dimensions to different feature fields in a capacity- and memory-efficient manner.
In this paper, we aim to enable different embedding dimensions for different feature fields in recommendations. We face tremendous challenges. First, the relationship among embedding dimensions, feature distributions and neural network architectures is highly intricate, which makes it hard to manually assign embedding dimensions to each feature field [9]. Second, real-world recommender systems often involve hundreds or thousands of feature fields. It is difficult, if not impossible, to manually select different dimensions for each feature field via traditional techniques (e.g., auto-encoder [18]), due to the huge computation cost from the numerous feature-dimension combinations. Our attempt to address these challenges results in an end-to-end differentiable AutoML based framework (AutoDim), which can efficiently allocate embedding dimensions to different feature fields in an automated and data-driven manner. Our experiments on benchmark datasets demonstrate the effectiveness of the proposed framework. We summarize our major contributions as follows: (i) we propose an end-to-end AutoML based framework, AutoDim, which can automatically select various embedding dimensions for different feature fields; (ii) we develop two embedding lookup methods and two embedding transformation approaches, and compare the impact of their combinations on the embedding dimension allocation decision; and (iii) we demonstrate the effectiveness of the proposed framework on real-world benchmark datasets.
The rest of this paper is organized as follows. In Section II, we introduce details about how to assign various embedding dimensions to different feature fields in an automated and data-driven fashion, and propose an AutoML based optimization algorithm. Section III carries out experiments based on real-world datasets and presents experimental results. Section IV briefly reviews related work. Finally, Section V concludes this work and discusses our future work.
II Framework
In order to achieve the automated allocation of different embedding dimensions to different feature fields, we propose an AutoML based framework, which effectively addresses the challenges we discussed in Section I. In this section, we first give an overview of the whole framework; then we propose an end-to-end differentiable model with two embedding-lookup methods and two embedding dimension search methods, which can compute the weights of different dimensions for feature fields in a soft and continuous fashion, and we provide an AutoML based optimization algorithm; finally, we derive a discrete embedding architecture based on the maximal weights, and retrain the whole DLRS framework.
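Table I: Key notations.

| Notation | Definition |
|---|---|
| $\mathbf{x}_m$ | feature from the $m$-th feature field |
| $\mathbf{X}_m^n$ | $n$-th embedding space of the $m$-th feature field |
| $d_n$ | dimension of the $n$-th embedding space |
| $\mathbf{x}_m^n$ | $n$-th candidate embedding of feature $\mathbf{x}_m$ |
| $\mathbf{x}_m'$ | embedding in the weight-sharing method |
| $\mu_{\mathcal{B}}^n$ | mini-batch mean |
| $(\sigma_{\mathcal{B}}^n)^2$ | mini-batch variance |
| $\epsilon$ | constant for numerical stability |
| $p_m^n$ | probability to select the $n$-th candidate dimension |
| $\mathbf{e}_m$ | embedding of feature $\mathbf{x}_m$ to be fed into MLP |
| $\mathbf{W}$ | parameters of DLRS |
| $\boldsymbol{\alpha}$ | weights on different embedding spaces |
| Linear Transformation | |
| $\widetilde{\mathbf{x}}_m^n$ | embedding vector after linear transformation |
| $\mathbf{W}_n$ | weight matrix of linear transformation |
| $\mathbf{b}_n$ | bias vector of linear transformation |
| $\widehat{\mathbf{x}}_m^n$ | embedding vector after batch-normalization |
| Zero Padding | |
| $\widetilde{\mathbf{x}}_m^n$ | embedding vector after batch-normalization |
| $\widehat{\mathbf{x}}_m^n$ | embedding vector after zero padding |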
II-A Overview
Our goal is to assign various embedding dimensions to different feature fields in an automated and data-driven manner, so as to enhance the memory efficiency and the performance of the recommender system. We illustrate the overall framework in Figure 2, which consists of two major stages:
II-A1 Dimensionality Search Stage
This stage aims to find the optimal embedding dimension for each feature field. To be more specific, we first assign a set of candidate embeddings with different dimensions to a specific categorical feature via an embedding-lookup step; then, we unify the dimensions of these candidate embeddings through a transformation step, which is necessary because the input dimension of the first MLP layer is fixed; next, we obtain the formal embedding for this categorical feature by computing the weighted sum of all its transformed candidate embeddings, and feed it into the MLP-component. The DLRS parameters, including the embeddings and MLP layers, are learned on the training set, while the architectural weights over the unified candidate embeddings are optimized on the validation set, which prevents the framework from selecting embedding dimensions that overfit the training set [28, 23].
II-A2 Parameter Re-training Stage
According to the architectural weights learned in the dimensionality search, we select the embedding dimension for each feature field, and retrain the DLRS parameters (i.e., embeddings and MLPs) on the training dataset in an end-to-end fashion.
Table I summarizes the key notations of this work. Note that numerical features are converted into categorical features through bucketing, and we omit this process in the following sections for simplicity. Next, we introduce the details of each stage.
II-B Dimensionality Search
As discussed in Section I, different feature fields have different cardinalities and contribute differently to the final prediction. Motivated by this observation, it is highly desirable to enable various embedding dimensions for different feature fields. However, due to the large number of feature fields and the complex relationship among embedding dimensions, feature distributions and neural network architectures, it is difficult to manually select embedding dimensions via conventional dimension reduction methods. An intuitive solution to this challenge is to assign several embedding spaces with various dimensions to each feature field, and then let the DLRS automatically select the optimal embedding dimension for each feature field.
II-B1 Embedding Lookup Tricks
Suppose that for each user-item interaction instance we have $M$ input features $(\mathbf{x}_1, \dots, \mathbf{x}_M)$, and each feature $\mathbf{x}_m$ belongs to a specific feature field, such as gender or age. For the $m$-th feature field, we assign $N$ embedding spaces $\{\mathbf{X}_m^1, \dots, \mathbf{X}_m^N\}$. The dimensions of an embedding in these spaces are $d_1, \dots, d_N$, where $d_1 < \dots < d_N$; the cardinality of each embedding space is the number of unique feature values in this feature field. Correspondingly, we define $\{\mathbf{x}_m^1, \dots, \mathbf{x}_m^N\}$ as the set of candidate embeddings for a given feature $\mathbf{x}_m$ from all embedding spaces, as shown in Figure 3 (a). Note that we assign the same candidate dimensions to all feature fields for simplicity, but it is straightforward to introduce different candidate sets. Therefore, the total space assigned to the feature $\mathbf{x}_m$ is $\sum_{n=1}^{N} d_n$. However, in real-world recommender systems with thousands of feature fields, this design faces two challenges: (i) it needs huge space to store all candidate embeddings, and (ii) the training efficiency is reduced since a large number of parameters need to be learned.
To address these challenges, we propose an alternative solution for large-scale recommendations, named the weight-sharing embedding architecture. As illustrated in Figure 3 (b), we only allocate a $d_N$-dimensional embedding to a given feature $\mathbf{x}_m$, referred to as $\mathbf{x}_m'$; the $n$-th candidate embedding $\mathbf{x}_m^n$ then corresponds to the first $d_n$ digits of $\mathbf{x}_m'$. The advantages of the weight-sharing embedding method are two-fold: (i) it reduces the storage space and increases the training efficiency, and (ii) since the relatively front digits of $\mathbf{x}_m'$ have more chances to be retrieved and then trained (e.g., the “red part” of $\mathbf{x}_m'$ is leveraged by all candidates in Figure 3 (b)), we intuitively expect them to capture more essential information of the feature $\mathbf{x}_m$.
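The weight-sharing lookup can be illustrated with a minimal sketch. It assumes plain Python lists as the embedding table; the candidate dimensions `[2, 4, 8]`, the table size, and the helper names `init_shared_table` and `lookup_candidates` are hypothetical choices for illustration, not the authors' implementation.

```python
import random

CANDIDATE_DIMS = [2, 4, 8]  # hypothetical candidate dimensions, sorted ascending


def init_shared_table(num_values, max_dim, seed=0):
    """One max_dim-dimensional row per unique feature value (weight sharing)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(max_dim)] for _ in range(num_values)]


def lookup_candidates(table, value_id, candidate_dims):
    """Return the candidate embeddings as prefixes of the single shared row."""
    row = table[value_id]
    return [row[:d] for d in candidate_dims]


table = init_shared_table(num_values=5, max_dim=max(CANDIDATE_DIMS))
candidates = lookup_candidates(table, value_id=3, candidate_dims=CANDIDATE_DIMS)

# Per-value storage: separate embeddings store every candidate,
# weight sharing stores only the longest one.
separate_cost = sum(CANDIDATE_DIMS)
shared_cost = max(CANDIDATE_DIMS)
```

Because every candidate is a prefix of the same row, the leading entries are touched by all candidates, which matches the intuition that they receive the most training signal.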
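II-B2 Unifying Various Dimensions
Since the input dimension of the first MLP layer in existing DLRSs is often fixed, it is difficult for them to handle various candidate dimensions. Thus, we need to unify the embeddings $\{\mathbf{x}_m^1, \dots, \mathbf{x}_m^N\}$ into the same dimension, and we develop the following two methods:
Method 1: Linear Transformation
Figure 4 (a) illustrates the linear transformation method to handle the various embedding dimensions (the difference between the two embedding lookup methods is omitted here). We introduce $N$ fully-connected layers, which transform the embedding vectors $\{\mathbf{x}_m^1, \dots, \mathbf{x}_m^N\}$ into the same dimension $d_N$:
$$\widetilde{\mathbf{x}}_m^n \leftarrow \mathbf{W}_n^\top \mathbf{x}_m^n + \mathbf{b}_n, \quad n = 1, \dots, N \tag{1}$$
where $\mathbf{W}_n \in \mathbb{R}^{d_n \times d_N}$ is a weight matrix and $\mathbf{b}_n \in \mathbb{R}^{d_N}$ is a bias vector. With the linear transformations, we map the original embedding vectors into the same dimensional space, i.e., $\widetilde{\mathbf{x}}_m^n \in \mathbb{R}^{d_N}$. In practice, we observe that the magnitudes of the transformed embeddings vary significantly, which makes them incomparable. To tackle this challenge, we apply BatchNorm [14] to the transformed embeddings:
$$\widehat{\mathbf{x}}_m^n \leftarrow \frac{\widetilde{\mathbf{x}}_m^n - \mu_{\mathcal{B}}^n}{\sqrt{(\sigma_{\mathcal{B}}^n)^2 + \epsilon}}, \quad n = 1, \dots, N \tag{2}$$
where $\mu_{\mathcal{B}}^n$ is the mini-batch mean and $(\sigma_{\mathcal{B}}^n)^2$ is the mini-batch variance of $\widetilde{\mathbf{x}}_m^n$, and $\epsilon$ is a constant added to the mini-batch variance for numerical stability. After BatchNorm, the linearly transformed embeddings become magnitude-comparable embedding vectors $\{\widehat{\mathbf{x}}_m^1, \dots, \widehat{\mathbf{x}}_m^N\}$ with the same dimension $d_N$.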
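A minimal sketch of Method 1, assuming pure-Python lists, a weight matrix stored with shape (input dimension, output dimension), and a BatchNorm without learned scale and shift; the dimensions, the batch size of 8, and the function names are illustrative assumptions only.

```python
import math
import random


def linear(x, weight, bias):
    """Compute W^T x + b, where weight has shape (len(x), len(bias))."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def batch_norm(batch, eps=1e-5):
    """Normalize each coordinate over the mini-batch (no learned scale/shift)."""
    size, dim = len(batch), len(batch[0])
    mean = [sum(v[j] for v in batch) / size for j in range(dim)]
    var = [sum((v[j] - mean[j]) ** 2 for v in batch) / size for j in range(dim)]
    return [[(v[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(dim)]
            for v in batch]


rng = random.Random(0)
d_n, d_max = 2, 4  # hypothetical candidate dimension and unified dimension
weight = [[rng.gauss(0.0, 1.0) for _ in range(d_max)] for _ in range(d_n)]
bias = [0.0] * d_max
batch = [[rng.gauss(0.0, 1.0) for _ in range(d_n)] for _ in range(8)]

# Linear transformation to d_max dimensions, then BatchNorm over the batch.
transformed = batch_norm([linear(x, weight, bias) for x in batch])
```

After normalization every coordinate has approximately zero mean and unit variance across the batch, which is what makes candidates of different original sizes comparable in magnitude.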
Method 2: Zero Padding
Inspired by zero padding techniques from the computer vision community, which pad the input volume with zeros around the border, we address the problem of various embedding dimensions by padding shorter embedding vectors with zeros to the same length as the longest embedding dimension $d_N$, as illustrated in Figure 4 (b). For the embedding vectors $\{\mathbf{x}_m^1, \dots, \mathbf{x}_m^N\}$ with different dimensions, we first apply BatchNorm, which turns the original embeddings into magnitude-comparable embeddings:
$$\widetilde{\mathbf{x}}_m^n \leftarrow \frac{\mathbf{x}_m^n - \mu_{\mathcal{B}}^n}{\sqrt{(\sigma_{\mathcal{B}}^n)^2 + \epsilon}}, \quad n = 1, \dots, N \tag{3}$$
where $\mu_{\mathcal{B}}^n$ and $(\sigma_{\mathcal{B}}^n)^2$ are the mini-batch mean and variance, and $\epsilon$ is the constant for numerical stability. The transformed $\{\widetilde{\mathbf{x}}_m^1, \dots, \widetilde{\mathbf{x}}_m^N\}$ are magnitude-comparable embeddings. Then we pad $\widetilde{\mathbf{x}}_m^1, \dots, \widetilde{\mathbf{x}}_m^{N-1}$ with zeros to the same length $d_N$:
$$\widehat{\mathbf{x}}_m^n \leftarrow \mathrm{pad}\big(\widetilde{\mathbf{x}}_m^n,\ d_N - d_n\big), \quad n = 1, \dots, N \tag{4}$$
where the second term of each padding formula is the number of zeros appended to the embedding vector given as the first term. Then the embeddings $\{\widehat{\mathbf{x}}_m^1, \dots, \widehat{\mathbf{x}}_m^N\}$ share the same dimension $d_N$. Compared with the linear transformation (Method 1), the zero padding method avoids the linear-transformation computations and the corresponding parameters. The possible drawback is that the final embeddings become spatially unbalanced, since the tail parts of some final embeddings are zeros. Next, we introduce the embedding dimension selection process.
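Method 2 can be sketched in a few lines. This is an illustrative example under the same assumptions as a plain BatchNorm without learned parameters; the toy batch and the target dimension of 4 are made up for the example.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each coordinate over the mini-batch (no learned scale/shift)."""
    size, dim = len(batch), len(batch[0])
    mean = [sum(v[j] for v in batch) / size for j in range(dim)]
    var = [sum((v[j] - mean[j]) ** 2 for v in batch) / size for j in range(dim)]
    return [[(v[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(dim)]
            for v in batch]


def zero_pad(vector, target_dim):
    """Append zeros so that the vector has length target_dim."""
    return vector + [0.0] * (target_dim - len(vector))


# A mini-batch of two 2-dimensional candidate embeddings (toy values).
batch = [[1.0, 3.0], [3.0, 7.0]]

# BatchNorm first, then pad each vector up to the longest candidate dimension.
padded = [zero_pad(v, target_dim=4) for v in batch_norm(batch)]
```

No parameters are introduced by this step, which is the efficiency argument made above; the trailing zeros are the "spatially unbalanced" part of the final embedding.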
II-B3 Dimension Selection
In this paper, we aim to select the optimal embedding dimension for each feature field in an automated and data-driven manner. This is a hard (categorical) selection over the embedding spaces, which makes the whole framework not end-to-end differentiable. To tackle this challenge, we approximate the hard selection over different dimensions by introducing the Gumbel-softmax operation [15], which simulates the non-differentiable sampling from a categorical distribution by a differentiable sampling from the Gumbel-softmax distribution.
To be specific, suppose the weights $\{\alpha_m^1, \dots, \alpha_m^N\}$ are the class probabilities over the different dimensions. Then a hard selection $d$ can be drawn via the Gumbel-max trick [10] as:
$$d = \mathrm{one\_hot}\Big(\arg\max_{n \in [1, N]} \big[\log \alpha_m^n + g_n\big]\Big), \quad g_n = -\log(-\log(u_n)), \quad u_n \sim \mathrm{Uniform}(0, 1) \tag{5}$$
The Gumbel noises $g_1, \dots, g_N$ are i.i.d. samples, which perturb the $\log \alpha_m^n$ terms and make the $\arg\max$ operation equivalent to drawing a sample according to the weights $\alpha_m^1, \dots, \alpha_m^N$. However, this trick is non-differentiable due to the $\arg\max$ operation. To deal with this problem, we use the softmax function as a continuous, differentiable approximation to the $\arg\max$ operation, i.e., straight-through Gumbel-softmax [15]:
$$p_m^n = \frac{\exp\big((\log \alpha_m^n + g_n)/\tau\big)}{\sum_{i=1}^{N} \exp\big((\log \alpha_m^i + g_i)/\tau\big)} \tag{6}$$
where $\tau$ is the temperature parameter, which controls the smoothness of the output of the Gumbel-softmax operation. When $\tau$ approaches zero, the output of the Gumbel-softmax becomes closer to a one-hot vector. Then $p_m^n$ is the probability of selecting the $n$-th candidate embedding dimension for the feature $\mathbf{x}_m$, and its embedding $\mathbf{e}_m$ can be formulated as the weighted sum of $\{\widehat{\mathbf{x}}_m^1, \dots, \widehat{\mathbf{x}}_m^N\}$:
$$\mathbf{e}_m = \sum_{n=1}^{N} p_m^n \cdot \widehat{\mathbf{x}}_m^n \tag{7}$$
We illustrate the weighted sum operations in Figures 4 and 5. With the Gumbel-softmax operation, the dimensionality search process is end-to-end differentiable. The discrete embedding dimension selection based on the weights $\{\alpha_m^n\}$ is detailed in the following subsections.
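The soft selection can be sketched as follows. This is a simplified illustration that only computes the soft Gumbel-softmax probabilities and the weighted sum; it does not implement the straight-through gradient estimator, and the class probabilities, temperatures, and one-hot toy candidates are assumptions chosen for the example.

```python
import math
import random


def gumbel_softmax(alpha, tau, rng):
    """Sample soft selection probabilities from class probabilities alpha."""
    # Gumbel noise g = -log(-log(u)) with u ~ Uniform(0, 1), kept away from 0 and 1.
    noise = [-math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12))) for _ in alpha]
    logits = [(math.log(a) + g) / tau for a, g in zip(alpha, noise)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(probs, candidates):
    """Combine unified candidate embeddings with the selection probabilities."""
    dim = len(candidates[0])
    return [sum(p * c[j] for p, c in zip(probs, candidates)) for j in range(dim)]


rng = random.Random(0)
alpha = [0.2, 0.5, 0.3]  # hypothetical class probabilities over 3 candidates
candidates = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy unified embeddings

soft = gumbel_softmax(alpha, tau=1.0, rng=rng)    # smooth mixture
sharp = gumbel_softmax(alpha, tau=0.01, rng=rng)  # typically close to one-hot
embedding = weighted_sum(soft, candidates)
```

Lowering `tau` sharpens the distribution toward a single candidate, which is why an annealed temperature moves the search from exploration toward a near-discrete choice.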
Then, we concatenate the embeddings $\mathbf{h}_0 = [\mathbf{e}_1, \dots, \mathbf{e}_M]$ and feed the input $\mathbf{h}_0$ into $L$ multi-layer perceptron layers:
$$\mathbf{h}_l = \sigma\big(\mathbf{W}_l^\top \mathbf{h}_{l-1} + \mathbf{b}_l\big), \quad l = 1, \dots, L \tag{8}$$
where $\mathbf{W}_l$ and $\mathbf{b}_l$ are the weight matrix and the bias vector of the $l$-th MLP layer, and $\sigma(\cdot)$ is the activation function, such as ReLU or Tanh. Finally, the output layer, which follows the last MLP layer, produces the prediction for the current user-item interaction instance as:
$$\hat{y} = \sigma_o\big(\mathbf{W}_o^\top \mathbf{h}_L + \mathbf{b}_o\big) \tag{9}$$
where $\mathbf{W}_o$ and $\mathbf{b}_o$ are the weight matrix and bias vector of the output layer. The activation function $\sigma_o(\cdot)$ is selected based on the recommendation task, such as the Sigmoid function for regression [7] and Softmax for multi-class classification [37]. Correspondingly, the objective function between the prediction $\hat{y}$ and the ground-truth label $y$ also varies with the recommendation task. In this work, we leverage the negative log-likelihood function:
$$\mathcal{L}(\hat{y}, y) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y}) \tag{10}$$
where $y$ is the ground truth (1 for like or click, 0 for dislike or non-click). By minimizing the objective function $\mathcal{L}(\hat{y}, y)$, the dimensionality search framework updates the parameters of all embeddings, hidden layers, and weights $\{\alpha_m^n\}$ through back-propagation. The high-level idea of the dimensionality search is illustrated in Figure 2 (a), where we omit some details of the embedding lookup, transformations and Gumbel-softmax for the sake of simplicity.
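A minimal forward-pass sketch of the MLP-component with a Sigmoid output and the negative log-likelihood loss. The layer sizes, the hand-picked weights, and the toy embeddings are assumptions for illustration; a real system would learn these parameters.

```python
import math


def dense(x, weight, bias, activation):
    """One fully connected layer: activation(W^T x + b), weight shape (in, out)."""
    return [activation(sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j])
            for j in range(len(bias))]


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y_hat, y, eps=1e-12):
    """Negative log-likelihood for a binary label y in {0, 1}."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clip to avoid log(0)
    return -y * math.log(y_hat) - (1 - y) * math.log(1.0 - y_hat)


# Two feature fields with 2-dimensional embeddings, concatenated (toy values).
h0 = [0.5, -1.0] + [2.0, 0.0]
h1 = dense(h0, [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]], [0.0, 0.0], relu)
y_hat = dense(h1, [[1.0], [-1.0]], [0.0], sigmoid)[0]
loss = log_loss(y_hat, y=1)
```

The input width of the first layer (4 here) is fixed by the concatenated embedding sizes, which is the reason the candidate embeddings must be unified to one dimension during search.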
II-C Optimization
In this subsection, we detail the optimization method of the proposed AutoDim framework. In AutoDim, we formulate the selection over different embedding dimensions as an architectural optimization problem and make it end-to-end differentiable by leveraging the Gumbel-softmax technique. The parameters to be optimized in AutoDim are two-fold: (i) $\mathbf{W}$: the parameters of the DLRS, including the embedding-component and the MLP-component; (ii) $\boldsymbol{\alpha}$: the weights on different embedding spaces ($\{p_m^n\}$ are calculated based on $\boldsymbol{\alpha}$ as in Equation (6)). The DLRS parameters $\mathbf{W}$ and the architectural weights $\boldsymbol{\alpha}$ cannot be optimized simultaneously on the training dataset, as in a conventional supervised attention mechanism, since their optimizations are highly dependent on each other. In other words, simultaneous optimization may result in overfitting to the examples from the training dataset.
Our optimization method is based on the differentiable architecture search (DARTS) technique [23], where $\mathbf{W}$ and $\boldsymbol{\alpha}$ are alternately optimized through gradient descent. Specifically, we alternately update $\mathbf{W}$ by optimizing the loss $\mathcal{L}_{train}$ on the training data and update $\boldsymbol{\alpha}$ by optimizing the loss $\mathcal{L}_{val}$ on the validation data:
$$\min_{\boldsymbol{\alpha}} \ \mathcal{L}_{val}\big(\mathbf{W}^*(\boldsymbol{\alpha}), \boldsymbol{\alpha}\big) \quad \text{s.t.} \quad \mathbf{W}^*(\boldsymbol{\alpha}) = \arg\min_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \boldsymbol{\alpha}) \tag{11}$$
This forms a bilevel optimization problem [28], where the architectural weights $\boldsymbol{\alpha}$ and the DLRS parameters $\mathbf{W}$ are the upper-level and lower-level variables, respectively. Since the inner optimization of $\mathbf{W}$ is computationally expensive, directly optimizing $\boldsymbol{\alpha}$ via Eq. (11) is intractable. To address this challenge, we take advantage of the approximation scheme of DARTS:
$$\arg\min_{\boldsymbol{\alpha}} \mathcal{L}_{val}\big(\mathbf{W}^*(\boldsymbol{\alpha}), \boldsymbol{\alpha}\big) \approx \arg\min_{\boldsymbol{\alpha}} \mathcal{L}_{val}\big(\mathbf{W} - \xi \nabla_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \boldsymbol{\alpha}), \boldsymbol{\alpha}\big) \tag{12}$$
where $\xi$ is the learning rate. In the approximation scheme, when updating $\boldsymbol{\alpha}$ via Eq. (12), we estimate $\mathbf{W}^*(\boldsymbol{\alpha})$ by descending the gradient $\nabla_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \boldsymbol{\alpha})$ for only one step, rather than optimizing $\mathbf{W}$ thoroughly to obtain $\mathbf{W}^*(\boldsymbol{\alpha})$. In practice, one usually leverages the first-order approximation by setting $\xi = 0$, which further enhances the computational efficiency. The DARTS based optimization algorithm for AutoDim is detailed in Algorithm 1. Specifically, in each iteration, we first sample a batch of user-item interaction data from the validation set (line 2); next, we update the architectural weights $\boldsymbol{\alpha}$ on it (line 3); afterward, the DLRS makes predictions on a batch of training data with the current DLRS parameters $\mathbf{W}$ and architectural weights $\boldsymbol{\alpha}$ (line 5); eventually, we update the DLRS parameters $\mathbf{W}$ by descending $\nabla_{\mathbf{W}} \mathcal{L}_{train}(\mathbf{W}, \boldsymbol{\alpha})$ (line 6).
II-D Parameter Re-Training
In this subsection, we introduce how to select the optimal embedding dimension for each feature field, and give the details of retraining the recommender system with the selected embedding dimensions.
II-D1 Deriving Discrete Dimensions
During retraining, the Gumbel-softmax operation is no longer used, which means that the optimal embedding space (dimension) is selected for each feature field as the one corresponding to the largest weight, based on the well-learned $\boldsymbol{\alpha}$. It is formally defined as:
$$\mathbf{X}_m = \mathbf{X}_m^k, \quad \text{where} \ k = \arg\max_{n \in [1, N]} \alpha_m^n, \quad \forall m \in [1, M] \tag{13}$$
Figure 2 (a) illustrates the architecture of AutoDim framework with a toy example about the optimal dimension selections based on two candidate dimensions, where the largest weights corresponding to the , and feature fields are , and , then the embedding space , and are selected for these feature fields. The dimension of an embedding vector in these embedding spaces is , and , respectively.
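The search-then-derive procedure (alternating first-order updates followed by the largest-weight rule) can be sketched on a toy problem. Everything below is an illustrative assumption rather than the authors' Algorithm 1: the losses are hand-made stand-ins for the training and validation losses, gradients are taken by finite differences, the Gumbel noise is omitted for determinism, and the architectural weights are obtained by a softmax over trainable logits.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def numeric_grad(fn, point, step=1e-5):
    """Central-difference gradient of fn at point (a list of floats)."""
    grad = []
    for i in range(len(point)):
        up = point[:i] + [point[i] + step] + point[i + 1:]
        down = point[:i] + [point[i] - step] + point[i + 1:]
        grad.append((fn(up) - fn(down)) / (2 * step))
    return grad


def search(train_loss, val_loss, w, logits, lr_w=0.1, lr_a=0.1, steps=200):
    """First-order alternation: logits on validation loss, w on training loss."""
    for _ in range(steps):
        g_a = numeric_grad(lambda a: val_loss(w, a), logits)
        logits = [a - lr_a * g for a, g in zip(logits, g_a)]
        g_w = numeric_grad(lambda v: train_loss(v, logits), w)
        w = [v - lr_w * g for v, g in zip(w, g_w)]
    return w, logits


def derive_dimension(logits, candidate_dims):
    """Pick the candidate with the largest architectural weight."""
    weights = softmax(logits)
    return candidate_dims[weights.index(max(weights))]


# Toy stand-in losses: the second candidate has the lower "validation" loss.
def candidate_losses(w):
    return [0.9, 0.2 + 0.1 * (w[0] - 1.0) ** 2]


def train_loss(w, logits):
    return (w[0] - 1.0) ** 2


def val_loss(w, logits):
    return sum(p * l for p, l in zip(softmax(logits), candidate_losses(w)))


w, logits = search(train_loss, val_loss, w=[0.0], logits=[0.0, 0.0])
chosen = derive_dimension(logits, candidate_dims=[2, 4])
```

In this toy run the weight of the second candidate grows at every step because its validation loss is lower, so the derived dimension is the second entry of `candidate_dims`; the re-training stage would then keep only that embedding space.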
II-D2 Model Re-training
As shown in Figure 2 (b), given the selected embedding spaces, we can obtain unique embedding vectors for the input features. Then we concatenate these embeddings and feed them into the hidden layers. Next, the prediction is generated by the output layer. Finally, all the parameters of the DLRS, including embeddings and MLPs, are updated by minimizing the supervised loss function through back-propagation. The model retraining algorithm is detailed in Algorithm 2. Note that (i) the retraining process is based on the same training data as Algorithm 1, and (ii) the input dimension of the first hidden layer is adjusted according to the new embedding dimensions in the model retraining stage.

Table II: Statistics of the datasets.

| Data | MovieLens-1m | Avazu | Criteo |
|---|---|---|---|
| # Interactions | 1,000,209 | 13,730,132 | 23,490,876 |
| # Feature Fields | 26 | 22 | 39 |
| # Sparse Features | 13,749 | 87,249 | 373,159 |
| Positive Ratio | 0.58 | 0.5 | 0.5 |

III Experiments
In this section, we first introduce the experimental settings. Then we conduct extensive experiments to evaluate the effectiveness of the proposed AutoDim framework. We mainly seek answers to the following research questions. RQ1: How does AutoDim perform compared with representative baselines? RQ2: How do the components, i.e., the 2 embedding lookup methods and 2 transformation methods, influence the performance? RQ3: What is the impact of important parameters on the results? RQ4: Which features are assigned large embedding dimensions? RQ5: Can the proposed AutoDim be utilized by other widely used deep recommender systems?
III-A Datasets
We evaluate our model on widely used benchmark datasets:
MovieLens-1m (https://grouplens.org/datasets/movielens/1m/): This is a benchmark for evaluating recommendation algorithms, which contains users’ ratings on movies. The dataset includes 6,040 users and 3,416 movies, where each user has at least 20 ratings. We binarize the ratings to form a binary classification task, where ratings of 4 and 5 are viewed as positive and the rest as negative. After preprocessing, there are 26 categorical feature fields.
Avazu (https://www.kaggle.com/c/avazuctrprediction/): The Avazu dataset was provided for the CTR prediction challenge on Kaggle. It contains 11 days of user clicking behavior, i.e., whether a displayed mobile ad impression is clicked or not. There are 22 categorical feature fields, including user/ad features and device attributes. Some of the fields are anonymous.
Criteo (https://www.kaggle.com/c/criteodisplayadchallenge/): This is a benchmark industry dataset for evaluating ad click-through rate prediction models. It consists of 45 million users’ click records on displayed ads over one month. Each data example contains 13 numerical feature fields and 26 categorical feature fields. We normalize the numerical features using the value transformation proposed by the Criteo Competition winner (https://www.csie.ntu.edu.tw/ r01922136/kaggle2014criteo.pdf), and then convert them into categorical features through bucketing.
Since the labels of Criteo and Avazu are extremely imbalanced, we downsample the negative samples to keep the positive ratio at 50%. Features in a specific field that appear fewer than 30 times are treated as a special dummy feature [21]. Some key statistics of the datasets are shown in Table II. For each dataset, we use 90% of the user-item interactions as the training/validation set and the remaining 10% as the test set.
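The two preprocessing steps described above (negative downsampling to a 50% positive ratio, and mapping rare feature values to one dummy value) can be sketched as follows. The sample layout with a `"label"` key, the dummy token `"<rare>"`, and the toy data are hypothetical choices for illustration.

```python
import random
from collections import Counter


def downsample_negatives(samples, seed=0):
    """Keep all positives and an equally sized random subset of negatives."""
    positives = [s for s in samples if s["label"] == 1]
    negatives = [s for s in samples if s["label"] == 0]
    rng = random.Random(seed)
    kept = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + kept


def replace_rare_values(values, min_count=30, dummy="<rare>"):
    """Map values that occur fewer than min_count times to one dummy value."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else dummy for v in values]


# Toy data: 3 positives and 9 negatives; one frequent and one rare value.
samples = [{"label": 1}] * 3 + [{"label": 0}] * 9
balanced = downsample_negatives(samples)

field = ["a"] * 40 + ["b"] * 5
cleaned = replace_rare_values(field)
```

Collapsing rare values into a single dummy keeps the embedding table from spending rows on values that are seen too rarely to be learned reliably.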
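Table III: Overall performance comparison.

| Dataset | Metric | MiD | MaD | RaS | SAM | AutoDim |
|---|---|---|---|---|---|---|
| MovieLens-1m | AUC % | 76.91 ± 0.033 | 77.42 ± 0.061 | 76.96 ± 0.127 | 77.12 ± 0.056 | 77.61 ± 0.056 |
| MovieLens-1m | Logloss | 0.570 ± 0.002 | 0.565 ± 0.002 | 0.569 ± 0.006 | 0.567 ± 0.003 | 0.561 ± 0.002 |
| MovieLens-1m | Space % | 20 | 100 | 61.16 ± 12.12 | 88.82 ± 5.721 | 36.20 ± 7.635 |
| Avazu | AUC % | 74.33 ± 0.034 | 74.61 ± 0.025 | 74.52 ± 0.046 | 74.59 ± 0.027 | 74.70 ± 0.035 |
| Avazu | Logloss | 0.593 ± 0.003 | 0.591 ± 0.002 | 0.593 ± 0.004 | 0.592 ± 0.003 | 0.587 ± 0.002 |
| Avazu | Space % | 20 | 100 | 56.75 ± 9.563 | 95.92 ± 2.355 | 29.60 ± 3.235 |
| Criteo | AUC % | 76.72 ± 0.008 | 77.53 ± 0.010 | 77.16 ± 0.142 | 77.27 ± 0.007 | 77.51 ± 0.009 |
| Criteo | Logloss | 0.576 ± 0.003 | 0.568 ± 0.002 | 0.572 ± 0.002 | 0.571 ± 0.001 | 0.569 ± 0.002 |
| Criteo | Space % | 20 | 100 | 65.36 ± 15.22 | 93.45 ± 5.536 | 43.97 ± 9.432 |
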
III-B Implementation Details
Next, we detail the AutoDim architectures. For the DLRS, (i) embedding component: for each feature field, we select from candidate embedding dimensions , thus the dimension of the transformed embedding is . In the separate embedding setting, we concatenate all the candidate embeddings for each feature to speed up the embedding lookup process; (ii) MLP component: we have two hidden layers with the size and , where varies with respect to different datasets and the training/test stage, and we use batch normalization () and ReLU activation for both hidden layers. The output layer uses Sigmoid activation. The weights of the feature field are produced by a Softmax activation upon a trainable vector of length . We use an annealing schedule of temperature for Gumbel-softmax, where is the training step. The learning rates for updating the DLRS and the weights are and , respectively, and the batch size is set to 1000. We select the parameters of the proposed AutoDim framework via cross-validation. Correspondingly, we also tune the parameters of the baselines for a fair comparison. We discuss more details about parameter selection for the proposed framework in the following subsections.
III-C Evaluation Metrics
The performance is evaluated by AUC, Logloss and Space, where a higher AUC or a lower Logloss indicates better recommendation performance, and a lower Space means a lower space requirement. Area Under the ROC Curve (AUC) measures the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative one. We introduce Logloss since all methods aim to optimize the logloss in Equation (10), so it is natural to use Logloss as a straightforward metric. It is worth noting that a slightly higher AUC or lower Logloss at the 0.1% level is regarded as significant for the CTR prediction task [7, 11]. For a specific model, the Space metric is the ratio of its embedding space requirement (in the testing stage) to that of the MaD baseline detailed in the following subsection. We omit the space requirement of the MLP component to make the comparison clear; the MLP component typically occupies only a small part of the total model space, e.g., in Criteo.
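The three metrics can be sketched with the standard library. The AUC below uses the pairwise-ranking definition with ties counted as one half; the `space_ratio` helper assumes that the embedding space of a field is its dimension times its cardinality, which is an interpretation made for this example, and all inputs are toy values.

```python
import math


def auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def logloss(scores, labels, eps=1e-12):
    """Average negative log-likelihood over binary labels."""
    total = 0.0
    for s, y in zip(scores, labels):
        s = min(max(s, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -y * math.log(s) - (1 - y) * math.log(1.0 - s)
    return total / len(scores)


def space_ratio(field_dims, field_cardinalities, max_dim):
    """Embedding parameters relative to giving every field max_dim dimensions."""
    used = sum(d * c for d, c in zip(field_dims, field_cardinalities))
    full = sum(max_dim * c for c in field_cardinalities)
    return used / full


scores, labels = [0.9, 0.8, 0.3, 0.6], [1, 0, 0, 1]
model_auc = auc(scores, labels)
model_logloss = logloss(scores, labels)
ratio = space_ratio([2, 10], [100, 10], max_dim=10)
```

Under this interpretation, shrinking the dimension of a high-cardinality field dominates the Space metric, while low-cardinality fields barely affect it.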
III-D Overall Performance (RQ1)
We compare the proposed framework with the following representative baseline methods:
Minimal-Dimension (MiD): In this baseline, the embedding dimensions for all feature fields are set to the minimal size from the candidate set, i.e., 2.
Maximal-Dimension (MaD): In this baseline, we assign the same embedding dimension to all feature fields. For each feature field, the embedding dimension is set to the maximal size from the candidate set, i.e., 10.
Random Search (RaS): Random search is a strong baseline in neural network search [23]. We randomly allocate dimensions to each feature field in each experiment run (10 runs in total) and report the best performance.
Supervised Attention Model (SAM): This baseline shares the same architecture with AutoDim, but we update the DLRS parameters and architectural weights (which can be viewed as attention scores) simultaneously on the same training batch in an end-to-end back-propagation fashion. It also derives discrete embedding dimensions.
The AutoDim/SAM models have four variants, i.e., 2 embedding lookup methods × 2 transformation methods. We report their best AUC/Logloss and the corresponding Space here, and compare the variants in the following subsections. The overall results are shown in Table III. We can observe:
MiD achieves worse recommendation performance than MaD, where MiD assigns the minimal embedding dimension to all feature fields, while MaD allocates the maximal one. This result demonstrates that the performance is highly influenced by the embedding dimensions. Larger embedding sizes tend to enable the model to capture more characteristics of the features.
SAM outperforms RaS in terms of AUC/Logloss, where the embedding dimensions of SAM are determined by supervised attention scores, while those of RaS are randomly assigned. This observation suggests that properly allocating different embedding dimensions to each feature field can boost the performance. However, SAM performs worse than MaD on all datasets and saves only a little space, which means that its solution is suboptimal.
AutoDim performs better than SAM, because AutoML-based models like AutoDim update the weights on the validation set, which can enhance generalization, while supervised models like SAM update the weights together with the DLRS on the same training batch simultaneously, which may lead to overfitting. SAM has a much larger Space than AutoDim, which indicates that larger dimensions are more useful for minimizing the training loss. These results validate the effectiveness of AutoML techniques over supervised learning in recommendations.
AutoDim achieves comparable or slightly better AUC/Logloss than MaD, and saves significant space. This result validates that AutoDim indeed assigns smaller dimensions to non-predictive features and larger dimensions to highly predictive features, which not only keeps or enhances the performance but also saves space.
To sum up, we can answer the first question: compared with the representative baselines, AutoDim achieves comparable or slightly better recommendation performance than the best baseline, and saves significant space. These results demonstrate the effectiveness of the AutoDim framework.
III-E Component Analysis (RQ2)
In this paper, we propose two embedding lookup methods in Section II-B1 (i.e., separate embeddings vs. weight-sharing embeddings) and two transformation methods in Section II-B2 (i.e., linear transformation vs. zero-padding transformation). In this section, we investigate their influence on performance. We systematically combine the corresponding model components by defining the following variants of AutoDim:
AutoDim-1: In this variant, we use weight-sharing embeddings and zero-padding transformation.
AutoDim-2: This variant leverages weight-sharing embeddings and linear transformation.
AutoDim-3: We employ separate embeddings and zero-padding transformation in this variant.
AutoDim-4: This variant utilizes separate embeddings and linear transformation.
The results on the MovieLens-1m dataset are shown in Figure 6. We omit similar results on the other datasets due to limited space. We make the following observations:
In Figure 6 (a), we compare the space complexity of the total dimension search architecture of the variants, i.e., all the candidate embeddings and the transformation neural networks shown in Figure 4 or 5. We can observe that AutoDim-1 and AutoDim-2 save significant space by introducing the weight-sharing embedding architecture, which can benefit real-world recommenders where there exist thousands of feature fields and computing memory resources are expensive.

We compare the training speed of the variants in Figure 6 (b). AutoDim-1 and AutoDim-3, which leverage zero-padding transformation, train faster because of the simpler architecture, while AutoDim-2 and AutoDim-4 run slower because of the many linear transformation computations. Note that we combine the candidate embeddings for each feature in the separate embedding setting, which reduces the number of embedding lookups per feature.

From Figure 6 (c), (d) and (e), variants with weight-sharing embeddings perform better than variants using separate embeddings. This is because the relatively front digits of the shared embedding are more likely to be retrieved and trained (as shown in Figure 3 (b)), which enables the framework to capture more essential information in these digits and to make a better dimension assignment.
In summary, we can answer the second question: the combination of weight-sharing embeddings and zero-padding transformation achieves the best performance in terms of not only the training speed and space complexity, but also the test AUC/Logloss/Space metrics.
IiiF Parameter Analysis (RQ3)
In this section, we investigate how the essential hyperparameters influence performance. Besides some common hyperparameters of deep recommender systems such as the number of hidden layers and the learning rate (we omit them due to limited space), our model has one particular hyperparameter, i.e., the frequency to update weights , referred as to . In Algorithm 1, we alternately update DLRS’s parameters on the training data and update weights on the validation data. In practice, we find that updating weights can be less frequently than updating DLRS’s parameters, which apparently reduces some computations, and also enhances the performance.
To study the impact of , we investigate how the AutoDim variants perform on the MovieLens-1m dataset with the changes of , while fixing other parameters. Figure 7 shows the parameter sensitivity results, where on the x-axis, means updating weights once, then updating the DLRS’s parameters times. We can observe that AutoDim achieves the optimal AUC/Logloss when . In other words, updating weights too frequently or too infrequently results in suboptimal performance. Results on the other two datasets are similar; we omit them because of the limited space.
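The alternating update schedule studied here can be sketched as follows. This is a minimal illustration rather than Algorithm 1 itself: the function name and the `update_frequency` argument are hypothetical, and the actual gradient steps on validation and training batches are left out.

```python
def alternating_schedule(num_steps, update_frequency):
    """Return which parameter group is updated at each step.

    One update of the dimension weights (on validation data) is followed by
    `update_frequency` updates of the DLRS parameters (on training data).
    """
    schedule = []
    for step in range(num_steps):
        if step % (update_frequency + 1) == 0:
            schedule.append("weights")  # update dimension weights
        else:
            schedule.append("dlrs")     # update DLRS parameters
    return schedule

print(alternating_schedule(8, 3))
```

A larger `update_frequency` means fewer weight updates per epoch, which is the computation saving discussed above.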
III-G Case Study (RQ4)
In this section, we investigate how the AutoDim framework assigns embedding dimensions to different feature fields in the MovieLens-1m dataset (feature fields are anonymous in Avazu and Criteo). The assignments of an experiment case () are shown in Table IV. The capitalized feature fields, e.g., Adventure, are binary fields of a particular genre. It can be observed that:


No feature fields are assigned a 10-dimensional embedding space, which means the candidate embedding dimensions are sufficient to cover all possible choices. This is also the reason we do not analyze this hyperparameter in RQ3.

Compared with the userId field, the movieId field is assigned a larger embedding dimension, which means movieId is more predictive. This phenomenon is reasonable: although users have various personal biases, most of them tend to give higher ratings to movies that are universally considered to be of high quality, and vice versa. In other words, the ratings depend relatively more on movies.

For the binary genre fields, we find that some of them are assigned larger dimensions, e.g., Action, Crime, Film-Noir and Documentary, while the others are allocated the minimal dimension. Intuitively, this result means these four feature fields are more predictive and informative than the others. To verify this inference, we compare the absolute difference of averaged rating between the items that belong (or do not belong) to a specific genre :
where (or ) is the averaged rating of items that belong (or do not belong) to field . We find that the average for Action, Crime, Film-Noir and Documentary is , while that of the other feature fields is , which validates that our proposed model indeed assigns larger dimensions to highly predictive feature fields.
dimension  feature field
2  
4  Documentary, gender, userId
6  Action, Crime, Film-Noir
8  movieId
10  
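The genre-level check used in the case study above (the absolute difference between the averaged rating of items inside and outside a genre) can be sketched as follows. The function name and the toy data are hypothetical and invented purely for illustration; they are not MovieLens-1m statistics.

```python
def genre_rating_gap(items, genre):
    """Absolute difference between the mean rating of items that belong to
    `genre` and the mean rating of items that do not."""
    inside = [rating for rating, genres in items if genre in genres]
    outside = [rating for rating, genres in items if genre not in genres]
    if not inside or not outside:
        raise ValueError("genre must split the items into two non-empty groups")
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))

# Toy (averaged rating, genres) pairs, one per item (invented values).
toy_items = [
    (5.0, {"Documentary"}),
    (4.0, {"Documentary", "Crime"}),
    (3.0, {"Comedy"}),
    (2.0, {"Comedy", "Crime"}),
]

print(genre_rating_gap(toy_items, "Documentary"))
print(genre_rating_gap(toy_items, "Crime"))
```

A larger gap suggests that knowing whether an item belongs to the genre says more about its rating, which is the sense in which a genre field is called predictive here.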
III-H Model Extension (RQ5)
In this subsection, we discuss how to employ AutoDim in state-of-the-art deep recommender architectures. In the Dimension Search stage, since AutoDim maps all features into the same dimension, i.e., , it is easy to incorporate it into most existing deep recommenders. We mainly discuss the Parameter Re-Training stage in the following:


Wide&Deep [7]: This model is flexible with respect to embedding dimensions. Thus we only need to add the Wide component (i.e., a generalized linear model upon dense features) into our framework.

FM [31], DeepFM [11]: The FM (factorization machine) component requires all feature fields to share the same dimension, since the interaction between any two fields is captured by the inner product of their embeddings. If different dimensions are selected in the dimension search stage, then in the parameter re-training stage we could first project the embeddings to the same dimension via the Linear Transformation method we proposed in Section II-B2, where embeddings from the same feature field share the same weight matrix and bias vector. We do not recommend Zero Padding, since the padded zeros may lose information in the inner product. Then we can train DeepFM as in the original. This logic can be applied to most deep recommenders, such as FFM [27], AFM [42], NFM [12], FNN [46], PNN [29], AutoInt [35], Deep&Cross [39] and xDeepFM [19].
In short, the proposed AutoDim can be easily incorporated into most existing representative deep learning based recommender systems; we leave this as future work due to the limited space.
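The projection step discussed for FM-style models can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the common dimension, the helper names, and the random initialization are hypothetical, the per-field dimensions are taken from the case study table, and any normalization used by the full Linear Transformation method is omitted.

```python
import random

def make_projection(in_dim, out_dim, rng):
    """Randomly initialized weight matrix (out_dim x in_dim) and zero bias vector."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def project(embedding, weight, bias):
    """Linear transformation: weight @ embedding + bias."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weight, bias)]

def inner_product(a, b):
    """FM-style pairwise interaction; both vectors must have the same dimension."""
    if len(a) != len(b):
        raise ValueError("FM interaction needs equal dimensions")
    return sum(x * y for x, y in zip(a, b))

rng = random.Random(0)
COMMON_DIM = 8  # hypothetical shared dimension for the FM component
field_dims = {"movieId": 8, "userId": 4, "Action": 6}

# One weight matrix and bias per feature field, shared by all its embeddings.
projections = {f: make_projection(d, COMMON_DIM, rng) for f, d in field_dims.items()}
embeddings = {f: [rng.uniform(-1, 1) for _ in range(d)] for f, d in field_dims.items()}
projected = {f: project(embeddings[f], *projections[f]) for f in field_dims}

interaction = inner_product(projected["movieId"], projected["userId"])
print(interaction)
```

Calling `inner_product` on the raw, unprojected movieId and userId embeddings raises an error, which mirrors why a common dimension is needed before the FM component.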
IV Related Work
In this section, we discuss the related work from two perspectives: deep recommender systems and AutoML for neural architecture search.
Deep recommender systems have drawn increasing attention from both academia and industry thanks to their great advantages over traditional methods [44]. Various types of deep learning approaches to recommendation have been developed. Sedhain et al. [34] present an AutoEncoder-based model named AutoRec. In their work, both item-based and user-based AutoRec are introduced. They are designed to capture the low-dimensional feature embeddings of users and items, respectively. Hidasi et al. [13] introduce an RNN-based recommender system named GRU4Rec. In session-based recommendation, the model captures the information from items’ transition sequences for prediction. They also design a session-parallel mini-batch algorithm and a sampling method for the output, which make the training process more efficient. Cheng et al. [7] introduce the Wide&Deep framework for both regression and classification tasks. The framework consists of a wide part, which is a linear model implemented as one layer of a feed-forward neural network, and a deep part, which contains multiple perceptron layers to learn abstract and deep representations. Guo et al. [11] propose the DeepFM model, which combines the factorization machine (FM) and MLP. The idea is to use the former to model the lower-order feature interactions while using the latter to learn the higher-order interactions. Wang et al. [40] attempt to utilize CNN to extract visual features to help POI (Point-of-Interest) recommendations. They build a PMF-based framework that models the interactions between visual information and latent user/location factors. Chen et al. [6] introduce hierarchical attention mechanisms into recommendation models. They propose a collaborative filtering model with an item-level and a component-level attention mechanism. The item-level attention mechanism captures user representations by attending to various items, and the component-level one tries to figure out the most important features from auxiliary sources for each user. Wang et al. [38] propose a generative adversarial network (GAN) based information retrieval model, IRGAN, which is applied to recommendation as well as web search and question answering.
The research of AutoML for neural architecture search can be traced back to NAS [50], which first utilizes an RNN-based controller to design neural networks and proposes a reinforcement learning algorithm to optimize the framework. After that, many efforts have been made to reduce the high training cost of NAS. Pham et al. [28] propose ENAS, where the controller learns to search a subgraph from a large computational graph to form an optimal neural network architecture. Brock et al. [3] introduce a framework named SMASH, in which a hypernetwork is developed to generate weights for sampled networks. DARTS [23] and SNAS [43] formulate the problem of network architecture search in a differentiable manner and solve it using gradient descent. Luo et al. [24] investigate representing network architectures as embeddings. They then design a predictor that takes the architecture embedding as input to predict its performance, and utilize gradient-based optimization to find an optimal embedding and decode it back to the network architecture. Other works take a different approach, namely limiting the search space. The works [30, 48, 22, 4] focus on searching convolution cells, which are stacked repeatedly to form a convolutional neural network. Zoph et al. [51] propose a transfer learning framework called NASNet, which trains convolution cells on smaller datasets and applies them to larger datasets. Tan et al. [36] introduce MNAS. They propose to search hierarchical convolution cell blocks in an independent manner, so that a deep network can be built based on them. Neural Input Search [16] and AutoEmb [47] are designed for tuning the embedding layer of deep recommender systems. However, they aim to tune the embedding sizes within the same feature field, and are usually used for userid/itemid features.
V Conclusion
In this paper, we propose a novel framework, AutoDim, which aims to automatically assign different embedding dimensions to different feature fields in a data-driven manner. In real-world recommender systems, due to the huge number of feature fields and the highly complex relationships among embedding dimensions, feature distributions and neural network architectures, it is difficult, if not impossible, to manually allocate different dimensions to different feature fields. Thus, we propose an AutoML-based framework to automatically select from different embedding dimensions. To be specific, we first provide an end-to-end differentiable model, which computes the weights over different dimensions for different feature fields simultaneously in a soft and continuous form, and we propose an AutoML-based optimization algorithm; then, according to the maximal weights, we derive a discrete embedding architecture and re-train the DLRS parameters. We evaluate the AutoDim framework with extensive experiments on widely used benchmark datasets. The results show that our framework can maintain or achieve slightly better performance with much lower embedding space demands.
There are several interesting research directions. First, in addition to automatically selecting the embedding dimensions of categorical feature fields, we would also like to investigate methods to automatically handle numerical feature fields. Second, we would like to study AutoML-based methods to automatically design the whole DLRS architecture, including both the MLP and embedding components. Third, our proposed framework selects a unified embedding dimension for each feature field; in the future, we would like to develop a model which can assign various embedding dimensions to different items in the same feature field. Finally, the framework is quite general for addressing information retrieval problems, thus we would like to investigate more applications of the proposed framework.
References
[1] (2015) Recommendations in location-based social networks: a survey. Geoinformatica 19 (3), pp. 525–565.
[2] (1998) Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 43–52.
[3] (2017) Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344.
[4] (2018) Path-level network transformation for efficient architecture search. arXiv preprint arXiv:1806.02639.
[5] (2006) A survey of web information extraction systems. IEEE transactions on knowledge and data engineering 18 (10), pp. 1411–1428.
[6] (2017) Attentive collaborative filtering: multimedia recommendation with item- and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 335–344.
[7] (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10.
[8] (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198.
[9] (2019) Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810.
[10] (1948) Statistical theory of extreme values and some practical applications: a series of lectures. Vol. 33, US Government Printing Office.
[11] (2017) DeepFM: a factorization-machine based neural network for ctr prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1725–1731.
[12] (2017) Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 355–364.
[13] (2015) Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
[14] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
[15] (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.
[16] (2019) Neural input search for large scale recommendation models. arXiv preprint arXiv:1907.04471.
[17] (2020) Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems. arXiv preprint arXiv:2002.08530.
[18] (2015) Deep collaborative filtering via marginalized denoising autoencoder. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 811–820.
[19] (2018) Xdeepfm: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[20] (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet computing 7 (1), pp. 76–80.
[21] (2020) AutoFIS: automatic feature interaction selection in factorization models for click-through rate prediction. arXiv preprint arXiv:2003.11235.
[22] (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34.
[23] (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055.
[24] (2018) Neural architecture optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7827–7838.
[25] (2000) Content-based book recommending using learning for text categorization. In Proceedings of the fifth ACM conference on Digital libraries, pp. 195–204.
[26] (2017) Personalized deep learning for tag recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 186–197.
[27] (2018) Field-weighted factorization machines for click-through rate prediction in display advertising. In Proceedings of the 2018 World Wide Web Conference, pp. 1349–1357.
[28] (2018) Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pp. 4095–4104.
[29] (2016) Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1149–1154.
[30] (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789.
[31] (2010) Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995–1000.
[32] (1997) Recommender systems. Communications of the ACM 40 (3), pp. 56–58.
[33] (2011) Introduction to recommender systems handbook. In Recommender systems handbook, pp. 1–35.
[34] (2015) Autorec: autoencoders meet collaborative filtering. In Proceedings of the 24th international conference on World Wide Web, pp. 111–112.
[35] (2019) Autoint: automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170.
[36] (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828.
[37] (2016) Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 17–22.
[38] (2017) Irgan: a minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 515–524.
[39] (2017) Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, pp. 1–7.
[40] (2017) What your images reveal: exploiting visual contents for point-of-interest recommendation. In Proceedings of the 26th International Conference on World Wide Web, pp. 391–400.
[41] (2016) Personal recommendation using deep recurrent neural networks in netease. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pp. 1218–1229.
[42] (2017) Attentional factorization machines: learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617.
[43] (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926.
[44] (2019) Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys (CSUR) 52 (1), pp. 1–38.
[45] (2017) Deep learning based recommender system: a survey and new perspectives. arXiv preprint arXiv:1707.07435.
[46] (2016) Deep learning over multi-field categorical data. In European conference on information retrieval, pp. 45–57.
[47] (2020) AutoEmb: automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252.
[48] (2018) Practical block-wise neural network architecture generation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
[49] (2018) Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068.
[50] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
[51] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710.