Memory-efficient Embedding for Recommendations

06/26/2020 · by Xiangyu Zhao, et al. · Southern Methodist University

Practical large-scale recommender systems usually contain thousands of feature fields from users, items, contextual information, and their interactions. Most of them empirically allocate a unified dimension to all feature fields, which is memory-inefficient. It is therefore highly desirable to assign different embedding dimensions to different feature fields according to their importance and predictability. Due to the large number of feature fields and the nuanced relationship among embedding dimensions, feature distributions, and neural network architectures, manually allocating embedding dimensions in practical recommender systems can be very difficult. To this end, we propose an AutoML-based framework (AutoDim) in this paper, which can automatically select dimensions for different feature fields in a data-driven fashion. Specifically, we first propose an end-to-end differentiable framework that can calculate the weights over various dimensions for feature fields in a soft and continuous manner, with an AutoML-based optimization algorithm; we then derive a hard and discrete embedding architecture according to the maximal weights and retrain the whole recommender framework. We conduct extensive experiments on benchmark datasets to validate the effectiveness of the AutoDim framework.







I Introduction

With the explosive growth of the world-wide web, huge amounts of data have been generated, resulting in an increasingly severe information overload problem that can overwhelm users [5]. Recommender systems mitigate this problem by suggesting personalized items that best match users' preferences [20, 2, 25, 32, 33, 1]. Recent years have witnessed the increased development and popularity of deep learning based recommender systems (DLRSs) [45, 26, 41], which outperform traditional recommendation techniques, such as collaborative filtering and learning-to-rank, because of their strong capability of feature representation and deep inference [44].

Fig. 1: The typical DLRS architecture.
Fig. 2: Overview of the proposed framework.

Real-world recommender systems typically involve a massive number of categorical feature fields from users (e.g., occupation and userID), items (e.g., category and itemID), contextual information (e.g., time and location), and their interactions (e.g., a user's purchase history of items). DLRSs first map these categorical features into real-valued dense vectors via an embedding component [27, 29, 49], i.e., the embedding-lookup process, which leads to huge numbers of embedding parameters. For instance, the YouTube recommender system involves 1 million unique videoIDs and assigns each videoID a 256-dimensional embedding vector; in other words, the videoID feature field alone occupies 256 million parameters [8]. The DLRSs then nonlinearly transform the input embeddings from all feature fields and generate the outputs (predictions) via the MLP component (Multi-Layer Perceptron), which usually involves only several fully connected layers in practice. Therefore, compared to the MLP component, the embedding component dominates the number of parameters in practical recommender systems and naturally plays a tremendously impactful role in the recommendations.

The majority of existing recommender systems assign a fixed, unified embedding dimension to all feature fields, as in the famous Wide&Deep model [7], which may lead to memory inefficiency. First, the embedding dimension often determines the capacity to encode information. Allocating the same dimension to all feature fields may thus lose information from highly predictive features while wasting memory on non-predictive features. We should therefore assign large dimensions to highly informative and predictive features, for instance, the "location" feature in location-based recommender systems [1]. Second, different feature fields have different cardinalities (i.e., numbers of unique values). For example, the gender feature has only two unique values (i.e., male and female), while the itemID feature usually involves millions of unique values. Intuitively, we should allocate larger dimensions to feature fields with more unique values to encode their complex relationships with other features, and assign smaller dimensions to feature fields with smaller cardinality to avoid overfitting due to over-parameterization [16, 9, 47, 17]. For these reasons, it is highly desirable to assign different embedding dimensions to different feature fields in a capacity- and memory-efficient manner.

In this paper, we aim to enable different embedding dimensions for different feature fields for recommendations. We face tremendous challenges. First, the relationship among embedding dimensions, feature distributions, and neural network architectures is highly intricate, which makes it hard to manually assign embedding dimensions to each feature field [9]. Second, real-world recommender systems often involve hundreds or thousands of feature fields. It is difficult, if not impossible, to manually select different dimensions for each feature field via traditional techniques (e.g., auto-encoders [18]), due to the huge computation cost of the numerous feature-dimension combinations. Our attempt to address these challenges results in an end-to-end differentiable AutoML-based framework (AutoDim), which can efficiently allocate embedding dimensions to different feature fields in an automated and data-driven manner. Our experiments on benchmark datasets demonstrate the effectiveness of the proposed framework. We summarize our major contributions as: (i) we propose an end-to-end AutoML-based framework, AutoDim, which can automatically select various embedding dimensions for different feature fields; (ii) we develop two embedding lookup methods and two embedding transformation approaches, and compare the impact of their combinations on the embedding dimension allocation decision; and (iii) we demonstrate the effectiveness of the proposed framework on real-world benchmark datasets.

The rest of this paper is organized as follows. In Section 2, we introduce details about how to assign various embedding dimensions to different feature fields in an automated and data-driven fashion, and propose an AutoML-based optimization algorithm. Section 3 carries out experiments based on real-world datasets and presents experimental results. Section 4 briefly reviews related work. Finally, Section 5 concludes this work and discusses future work.

II Framework

In order to achieve the automated allocation of different embedding dimensions to different feature fields, we propose an AutoML-based framework, which effectively addresses the challenges discussed in Section I. In this section, we first give an overview of the whole framework; then we propose an end-to-end differentiable model with two embedding-lookup methods and two embedding-dimension search methods, which computes the weights of different dimensions for feature fields in a soft and continuous fashion, together with an AutoML-based optimization algorithm; finally, we derive a discrete embedding architecture based on the maximal weights and retrain the whole DLRS framework.

Notation Definition
feature from a feature field
embedding space of a feature field
dimension of an embedding space
candidate embedding of a feature
embedding in the weight-sharing method
mini-batch mean
mini-batch variance
constant for numerical stability
probability of selecting a candidate dimension
embedding of a feature to be fed into the MLP
parameters of the DLRS
weights on the different embedding spaces
Linear Transformation:
embedding vector after linear transformation
weight matrix of the linear transformation
bias vector of the linear transformation
embedding vector after batch normalization
Zero Padding:
embedding vector after batch normalization
embedding vector after zero padding
TABLE I: Main Notations

II-A Overview

Our goal is to assign various embedding dimensions to different feature fields in an automated and data-driven manner, so as to enhance the memory efficiency and the performance of the recommender system. We illustrate the overall framework in Figure 2; it consists of two major stages:

II-A1 Dimensionality search stage

It aims to find the optimal embedding dimension for each feature field. To be more specific, we first assign a set of candidate embeddings with different dimensions to a specific categorical feature via an embedding-lookup step; then, we unify the dimensions of these candidate embeddings through a transformation step, since the first MLP layer has a fixed input dimension; next, we obtain the formal embedding for this categorical feature by computing the weighted sum of all its transformed candidate embeddings, and feed it into the MLP component. The DLRS parameters, including the embeddings and MLP layers, are learned on the training set, while the architectural weights over the unified candidate embeddings are optimized on the validation set, which prevents the framework from selecting embedding dimensions that overfit the training set [28, 23].

II-A2 Parameter re-training stage

According to the architectural weights learned in the dimensionality search, we select the embedding dimension for each feature field and re-train the DLRS parameters (i.e., embeddings and MLPs) on the training dataset in an end-to-end fashion.

Table I summarizes the key notations of this work. Note that numerical features will be converted into categorical features through bucketing, and we omit this process in the following sections for simplicity. Next, we will introduce the details of each stage.

Fig. 3: The embedding lookup methods.

II-B Dimensionality Search

As discussed in Section I, different feature fields have different cardinalities and contribute differently to the final prediction. It is therefore highly desirable to enable various embedding dimensions for different feature fields. However, due to the large number of feature fields and the complex relationship among embedding dimensions, feature distributions, and neural network architectures, it is difficult to manually select embedding dimensions via conventional dimension reduction methods. An intuitive solution to this challenge is to assign several embedding spaces with various dimensions to each feature field, and let the DLRS automatically select the optimal embedding dimension for each feature field.

Fig. 4: Method 1 - Linear Transformation.

II-B1 Embedding Lookup Tricks

Suppose that for each user-item interaction instance we have a set of input features, each belonging to a specific feature field, such as gender or age. For each feature field, we assign several candidate embedding spaces. The embeddings in different spaces have different dimensions, while the cardinality of every space equals the number of unique feature values in that field. Correspondingly, we define the set of candidate embeddings for a given feature as its embeddings from all candidate spaces, as shown in Figure 3 (a). Note that we assign the same candidate dimensions to all feature fields for simplicity, but it is straightforward to introduce different candidate sets. The total space assigned to a feature is therefore the sum of all candidate dimensions. However, in real-world recommender systems with thousands of feature fields, this design poses two challenges: (i) it requires huge space to store all candidate embeddings, and (ii) the training efficiency is reduced since a large number of parameters need to be learned.

To address these challenges, we propose an alternative solution for large-scale recommendations, named the weight-sharing embedding architecture. As illustrated in Figure 3 (b), we allocate only one embedding of the largest candidate dimension to a given feature; the candidate embedding of a smaller dimension then corresponds to its leading digits. The advantages of the weight-sharing embedding method are two-fold: (i) it reduces the storage space and increases the training efficiency, and (ii) since the leading digits of the shared embedding are retrieved and trained more often (e.g., the "red part" in Figure 3 (b) is leveraged by all candidates), we intuitively expect them to capture the more essential information of the feature.
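The weight-sharing lookup can be sketched in a few lines of NumPy. This is an illustrative sketch with assumed names and sizes, not the authors' code: each feature value stores one vector of the largest candidate dimension, and a smaller candidate embedding is simply a prefix slice of it.

```python
import numpy as np

rng = np.random.default_rng(0)
cardinality, d_max = 1000, 10        # unique values in the field, largest candidate dim
candidate_dims = [2, 4, 8, 10]       # illustrative candidate dimensions

# One shared table: a single d_max-dimensional vector per feature value.
embedding_table = rng.normal(size=(cardinality, d_max))

def lookup_candidates(feature_id):
    """Return all candidate embeddings for one feature value by slicing
    the leading digits of its shared vector (no extra storage)."""
    full = embedding_table[feature_id]
    return [full[:d] for d in candidate_dims]

candidates = lookup_candidates(42)
```

Note how every candidate reuses the leading digits of the same stored vector, which is exactly why those digits are trained by all candidates.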

II-B2 Unifying Various Dimensions

Since the input dimension of the first MLP layer in existing DLRSs is often fixed, it is difficult for them to handle candidate embeddings of various dimensions. Thus we need to unify the embeddings into the same dimension, for which we develop the following two methods:

Method 1: Linear Transformation

Figure 4 (a) illustrates the linear transformation method for handling the various embedding dimensions (the difference between the two embedding lookup methods is omitted here). We introduce fully-connected layers that transform the candidate embedding vectors into the same (largest) dimension:

x̃_i = W_i x_i + b_i

where W_i is the weight matrix and b_i the bias vector of the i-th transformation. With the linear transformations, we map the original embedding vectors into the same dimensional space. In practice, we observe that the magnitudes of the transformed embeddings vary significantly, which makes them incomparable. To tackle this challenge, we conduct BatchNorm [14] on the transformed embeddings as:

x̂_i = ( x̃_i − μ_i ) / √( σ_i² + ε )

where μ_i is the mini-batch mean and σ_i² the mini-batch variance for x̃_i, and ε is a constant added to the mini-batch variance for numerical stability. After BatchNorm, the linearly transformed embeddings become magnitude-comparable embedding vectors with the same dimension.
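The two steps of Method 1 (project each candidate to the largest dimension, then batch-normalize so the results are magnitude-comparable) can be sketched with plain NumPy; the batch size, dimensions, and variable names below are illustrative assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, d_in, d_max, eps = 32, 4, 10, 1e-5

X = rng.normal(size=(batch, d_in))      # one candidate embedding space, per-batch
W = rng.normal(size=(d_in, d_max))      # learnable projection to the common dimension
b = np.zeros(d_max)                     # learnable bias

Z = X @ W + b                           # linear transformation to d_max
mu, var = Z.mean(axis=0), Z.var(axis=0) # mini-batch statistics
Z_hat = (Z - mu) / np.sqrt(var + eps)   # BatchNorm: magnitude-comparable output
```

After normalization every column has (near) zero mean and unit variance, so candidates from different spaces can be compared and summed on an equal footing.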

Fig. 5: Method 2 - Zero Padding Transformation.
Method 2: Zero Padding

Inspired by zero-padding techniques from the computer vision community, which pad the input volume with zeros around the border, we address the problem of various embedding dimensions by padding the shorter embedding vectors with zeros up to the largest candidate dimension, as illustrated in Figure 5. For the embedding vectors with different dimensions, we first apply the BatchNorm process, which forces the original embeddings to become magnitude-comparable:

x̂_i = ( x_i − μ_i ) / √( σ_i² + ε )

where μ_i and σ_i² are the mini-batch mean and variance, and ε is the constant for numerical stability. The transformed embeddings x̂_i are magnitude-comparable. We then pad them to the same length with zeros, where the number of appended zeros is the difference between the largest candidate dimension and the dimension of each embedding. The embeddings then share the same dimension. Compared with the linear transformation (Method 1), the zero-padding method avoids a large number of linear-transformation computations and the corresponding parameters. A possible drawback is that the final embeddings become spatially unbalanced, since the tail parts of some final embeddings are always zeros. Next, we will introduce the embedding dimension selection process.
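Method 2 can be sketched similarly; the following is a hedged NumPy sketch with assumed shapes (not the paper's code): batch-normalize each candidate space, then append zeros up to the largest candidate dimension.

```python
import numpy as np

rng = np.random.default_rng(2)
batch, d_max, eps = 32, 10, 1e-5

def bn_and_pad(X, d_max):
    """BatchNorm a candidate embedding batch, then zero-pad it to d_max columns."""
    mu, var = X.mean(axis=0), X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)       # magnitude-comparable
    n_zeros = d_max - X.shape[1]                # zeros to append per row
    return np.pad(X_hat, ((0, 0), (0, n_zeros)))

# Unify four candidate spaces of dimensions 2, 4, 8, 10 into dimension 10.
padded = [bn_and_pad(rng.normal(size=(batch, d)), d_max) for d in (2, 4, 8, 10)]
```

No learned parameters are involved, which is exactly the computational saving over Method 1; the trade-off is the all-zero tails the text mentions.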

II-B3 Dimension Selection

In this paper, we aim to select the optimal embedding dimension for each feature field in an automated and data-driven manner. This is a hard (categorical) selection over the embedding spaces, which would make the whole framework not end-to-end differentiable. To tackle this challenge, we approximate the hard selection over different dimensions by introducing the Gumbel-softmax operation [15], which simulates non-differentiable sampling from a categorical distribution with differentiable sampling from the Gumbel-softmax distribution.

To be specific, suppose the weights α_1, …, α_K are the class probabilities over the K candidate dimensions. Then a hard selection can be drawn via the Gumbel-max trick [10] as:

argmax_i ( log α_i + g_i ),   g_i = −log(−log u_i),   u_i ∼ Uniform(0, 1)
The Gumbel noises g_i are i.i.d. samples, which perturb the log α_i terms and make the argmax operation equivalent to drawing a sample according to the weights. However, this trick is non-differentiable due to the argmax operation. To deal with this problem, we use the softmax function as a continuous, differentiable approximation of the argmax operation, i.e., the straight-through Gumbel-softmax [15]:

p_i = exp( (log α_i + g_i) / τ ) / Σ_j exp( (log α_j + g_j) / τ )
where τ is the temperature parameter, which controls the smoothness of the Gumbel-softmax output. As τ approaches zero, the output of the Gumbel-softmax becomes closer to a one-hot vector. Then p_i is the probability of selecting the i-th candidate embedding dimension for the feature, and its embedding can be formulated as the weighted sum of the unified candidate embeddings x̂_i:

x = Σ_i p_i · x̂_i
We illustrate the weighted sum operations in Figure 4 and 5. With gumbel-softmax operation, the dimensionality search process is end-to-end differentiable. The discrete embedding dimension selection conducted based on the weights will be detailed in the following subsections.
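The soft selection step can be sketched end to end in NumPy. This is an illustrative sketch (weights, dimensions, and seed are assumptions, not the paper's values): perturb the log-probabilities with Gumbel noise, apply a temperature-controlled softmax, and take the weighted sum of the already-unified candidate embeddings.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cand, d_max, tau = 4, 10, 0.5

alpha = np.array([0.1, 0.2, 0.3, 0.4])            # class probabilities over dimensions
g = -np.log(-np.log(rng.uniform(size=n_cand)))    # i.i.d. Gumbel(0, 1) noise
logits = (np.log(alpha) + g) / tau                # perturbed, temperature-scaled
p = np.exp(logits - logits.max())
p = p / p.sum()                                   # differentiable soft weights

candidates = rng.normal(size=(n_cand, d_max))     # unified candidate embeddings
embedding = p @ candidates                        # weighted sum fed into the MLP
```

Lowering `tau` sharpens `p` toward a one-hot vector, which is what makes the soft search converge toward a discrete dimension choice.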

Then, we concatenate the embeddings of all feature fields into a single vector and feed it into the multilayer perceptron layers:

h_{l+1} = σ( W_l h_l + b_l )
where W_l and b_l are the weight matrix and the bias vector of the l-th MLP layer, and σ(·) is the activation function, such as ReLU or Tanh. Finally, the output layer, subsequent to the last MLP layer, produces the prediction of the current user-item interaction instance as:

ŷ = σ_o( W_o h_L + b_o )
where W_o and b_o are the weight matrix and bias vector of the output layer. The activation function σ_o is selected based on the recommendation task, such as the Sigmoid function for regression [7] and Softmax for multi-class classification [37]. Correspondingly, the objective function between the prediction ŷ and the ground-truth label also varies with the recommendation task. In this work, we leverage the negative log-likelihood function:

L = − Σ [ y log ŷ + (1 − y) log(1 − ŷ) ]

where y is the ground truth (1 for like or click, 0 for dislike or non-click). By minimizing the objective function L, the dimensionality search framework updates the parameters of all embeddings, hidden layers, and weights through back-propagation. The high-level idea of the dimensionality search is illustrated in Figure 2 (a), where we omit some details of the embedding lookup, transformations, and Gumbel-softmax for the sake of simplicity.
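The forward pass and objective described above can be sketched with NumPy; the layer sizes, weight scales, and data here are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(5)
batch, d_in, hidden = 8, 30, 16

h = rng.normal(size=(batch, d_in))                  # concatenated field embeddings
W1 = rng.normal(size=(d_in, hidden)) * 0.1          # small init keeps logits moderate
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, 1)) * 0.1
b2 = np.zeros(1)

h1 = np.maximum(h @ W1 + b1, 0)                     # ReLU hidden layer
y_hat = 1.0 / (1.0 + np.exp(-(h1 @ W2 + b2)))       # Sigmoid output (CTR-style)
y = rng.integers(0, 2, size=(batch, 1))             # illustrative click labels

eps = 1e-12                                         # numerical safety in the logs
loss = -np.mean(y * np.log(y_hat + eps)
                + (1 - y) * np.log(1 - y_hat + eps))  # negative log-likelihood
```

During the search stage, gradients of this loss flow back through the weighted-sum embeddings and hence through the Gumbel-softmax weights as well.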

II-C Optimization

In this subsection, we detail the optimization method of the proposed AutoDim framework. In AutoDim, we formulate the selection over different embedding dimensions as an architectural optimization problem and make it end-to-end differentiable by leveraging the Gumbel-softmax technique. The parameters to be optimized in AutoDim are two-fold: (i) the parameters of the DLRS, including the embedding component and the MLP component; and (ii) the architectural weights on the different embedding spaces (from which the selection probabilities are calculated as in Equation (6)). The DLRS parameters and the architectural weights cannot be optimized simultaneously on the training dataset as in a conventional supervised attention mechanism, since their optimizations are highly dependent on each other; simultaneous optimization may result in overfitting on the training examples.

Our optimization method is based on the differentiable architecture search (DARTS) technique [23], where the DLRS parameters W and the architectural weights α are alternately optimized through gradient descent. Specifically, we alternately update W by optimizing the loss L_train on the training data and update α by optimizing the loss L_val on the validation data:

min_α L_val( W*(α), α )   s.t.   W*(α) = argmin_W L_train( W, α )     (11)

This optimization forms a bilevel optimization problem [28], where the architectural weights α and the DLRS parameters W are identified as the upper-level and lower-level variables, respectively. Since the inner optimization of W is computationally expensive, directly optimizing α via Eq. (11) is intractable. To address this challenge, we take advantage of the approximation scheme of DARTS:

∇_α L_val( W − ξ ∇_W L_train(W, α), α )     (12)

Input: the features of user-item interactions and the corresponding ground-truth labels
Output: the well-learned DLRS parameters; the well-learned weights on the various embedding spaces

1:  while not converged do
2:     Sample a mini-batch of user-item interactions from validation data
3:     Update by descending with the approximation in Eq.(12)
4:     Collect a mini-batch of training data
5:     Generate predictions via DLRS with the current DLRS parameters and architectural weights
6:     Update by descending
7:  end while
Algorithm 1 DARTS based Optimization for AutoDim.

where ξ is the learning rate of the inner optimization step. In this approximation scheme, when updating the architectural weights via Eq. (12), we estimate the optimal DLRS parameters by descending the gradient for only one step, rather than optimizing them thoroughly. In practice, the first-order approximation with ξ = 0 is usually leveraged, which further enhances computational efficiency. The DARTS-based optimization algorithm for AutoDim is detailed in Algorithm 1. Specifically, in each iteration, we first sample a batch of user-item interaction data from the validation set (line 2); next, we update the architectural weights on it (line 3); afterward, the DLRS makes predictions on a batch of training data with the current DLRS parameters and architectural weights (line 5); eventually, we update the DLRS parameters by descending the training gradient (line 6).

II-D Parameter Re-Training

In this subsection, we introduce how to select the optimal embedding dimension for each feature field and how to re-train the recommender system with the selected embedding dimensions.

II-D1 Deriving Discrete Dimensions

During re-training, the Gumbel-softmax operation is no longer used; instead, based on the well-learned architectural weights α, the optimal embedding space (dimension) for each feature field is selected as the one corresponding to the largest weight:

argmax_i α_i   (for each feature field)

Figure 2 (a) illustrates the architecture of the AutoDim framework with a toy example of the optimal dimension selection over two candidate dimensions: for each feature field, the embedding space with the largest weight is selected, and the dimension of an embedding vector in that space becomes the field's dimension.
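Deriving the discrete dimensions is then a per-field argmax; the weights below are made up for illustration, not learned values from the paper.

```python
import numpy as np

candidate_dims = [2, 4, 8, 10]          # illustrative candidate dimensions
alpha = np.array([                      # one row of learned weights per field
    [0.1, 0.2, 0.3, 0.4],               # field 1: largest weight -> dim 10
    [0.7, 0.1, 0.1, 0.1],               # field 2: largest weight -> dim 2
    [0.2, 0.5, 0.2, 0.1],               # field 3: largest weight -> dim 4
])

# Keep, for each field, the candidate dimension with the maximal weight.
selected = [candidate_dims[i] for i in alpha.argmax(axis=1)]
```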

II-D2 Model Re-training

As shown in Figure 2 (b), given the selected embedding spaces, we obtain a unique embedding vector for each feature. We then concatenate these embeddings and feed them into the hidden layers. Next, the prediction is generated by the output layer. Finally, all the parameters of the DLRS, including embeddings and MLPs, are updated by minimizing the supervised loss function through back-propagation. The model re-training algorithm is detailed in Algorithm 2. Note that (i) the re-training process uses the same training data as Algorithm 1, and (ii) the input dimension of the first hidden layer is adjusted according to the new embedding dimensions in the model re-training stage.

Input: the features of user-item interactions and the corresponding ground-truth labels
Output: the well-learned DLRS parameters

1:  while not converged do
2:     Sample a mini-batch of training data
3:     Generate predictions via DLRS with the current DLRS parameters
4:     Update by descending
5:  end while
Algorithm 2 The Optimization of DLRS Re-training Process.
Data              | MovieLens-1m | Avazu      | Criteo
# Interactions    | 1,000,209    | 13,730,132 | 23,490,876
# Feature Fields  | 26           | 22         | 39
# Sparse Features | 13,749       | 87,249     | 373,159
Pos Ratio         | 0.58         | 0.5        | 0.5
TABLE II: Statistics of the datasets.

III Experiments

In this section, we first introduce the experimental settings. Then we conduct extensive experiments to evaluate the effectiveness of the proposed AutoDim framework. We mainly seek to answer the following research questions - RQ1: How does AutoDim perform compared with representative baselines? RQ2: How do the components, i.e., the 2 embedding lookup methods and the 2 transformation methods, influence the performance? RQ3: What is the impact of important parameters on the results? RQ4: Which features are assigned large embedding dimensions? RQ5: Can the proposed AutoDim be utilized by other widely used deep recommender systems?

III-A Datasets

We evaluate our model on widely used benchmark datasets:


  • MovieLens-1m: This is a benchmark dataset for evaluating recommendation algorithms, which contains users' ratings on movies. The dataset includes 6,040 users and 3,416 movies, where each user has at least 20 ratings. We binarize the ratings into a binary classification task, where ratings of 4 and 5 are viewed as positive and the rest as negative. After preprocessing, there are 26 categorical feature fields.

  • Avazu: This dataset was provided for the CTR prediction challenge on Kaggle and contains 11 days of user click records indicating whether a displayed mobile ad was clicked or not. There are 22 categorical feature fields, including user/ad features and device attributes. Some of the fields are anonymized.

  • Criteo: This is a benchmark industry dataset for evaluating ad click-through rate prediction models. It consists of 45 million users' click records on displayed ads over one month. Each data example contains 13 numerical feature fields and 26 categorical feature fields. We normalize the numerical features with the transformation proposed by the Criteo Competition winner (r01922136/kaggle-2014-criteo.pdf), and then convert them into categorical features through bucketing.

Since the labels of Criteo and Avazu are extremely imbalanced, we down-sample the negative samples to keep the positive ratio at 50%. Features appearing fewer than 30 times in a specific field are treated as a special dummy feature [21]. Some key statistics of the datasets are shown in Table II. For each dataset, we use 90% of the user-item interactions as the training/validation set and the remaining 10% as the test set.

Datasets     | Metrics | MiD           | MaD           | RaS           | SAM           | AutoDim
MovieLens-1m | AUC %   | 76.91 ± 0.033 | 77.42 ± 0.061 | 76.96 ± 0.127 | 77.12 ± 0.056 | 77.61 ± 0.056
             | Logloss | 0.570 ± 0.002 | 0.565 ± 0.002 | 0.569 ± 0.006 | 0.567 ± 0.003 | 0.561 ± 0.002
             | Space % | 20            | 100           | 61.16 ± 12.12 | 88.82 ± 5.721 | 36.20 ± 7.635
Avazu        | AUC %   | 74.33 ± 0.034 | 74.61 ± 0.025 | 74.52 ± 0.046 | 74.59 ± 0.027 | 74.70 ± 0.035
             | Logloss | 0.593 ± 0.003 | 0.591 ± 0.002 | 0.593 ± 0.004 | 0.592 ± 0.003 | 0.587 ± 0.002
             | Space % | 20            | 100           | 56.75 ± 9.563 | 95.92 ± 2.355 | 29.60 ± 3.235
Criteo       | AUC %   | 76.72 ± 0.008 | 77.53 ± 0.010 | 77.16 ± 0.142 | 77.27 ± 0.007 | 77.51 ± 0.009
             | Logloss | 0.576 ± 0.003 | 0.568 ± 0.002 | 0.572 ± 0.002 | 0.571 ± 0.001 | 0.569 ± 0.002
             | Space % | 20            | 100           | 65.36 ± 15.22 | 93.45 ± 5.536 | 43.97 ± 9.432
TABLE III: Performance comparison of different embedding search methods.

III-B Implementation Details

Next we detail the AutoDim architecture. For the DLRS, (i) embedding component: for each feature field, we select from a set of candidate embedding dimensions, and the transformed embeddings share the dimension of the largest candidate. In the separate-embedding setting, we concatenate all the candidate embeddings for each feature to speed up the embedding-lookup process; (ii) MLP component: we have two hidden layers, whose sizes vary with respect to the dataset and the training/test stage, and we use batch normalization and ReLU activation for both hidden layers. The output layer uses Sigmoid activation. The weights for each feature field are produced by a Softmax activation over a trainable vector whose length equals the number of candidate dimensions. We use an annealing schedule that decays the Gumbel-softmax temperature with the training step. The learning rates for updating the DLRS and the weights are set separately, and the batch size is set to 1000. The parameters of the proposed AutoDim framework are selected via cross-validation. Correspondingly, we also tune the parameters of the baselines for a fair comparison. We will discuss more details about parameter selection for the proposed framework in the following subsections.
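A linear annealing schedule with a temperature floor is one common way to implement the temperature decay mentioned above. The constants below (starting temperature, decay rate, floor) are illustrative assumptions; the paper's exact schedule is not reproduced here.

```python
def temperature(step, tau_start=1.0, decay=1e-5, tau_min=0.01):
    """Linearly anneal the Gumbel-softmax temperature with the training step,
    clamped at a small floor so sampling never becomes degenerate."""
    return max(tau_min, tau_start - decay * step)

# Temperature shrinks as training progresses, sharpening the soft selection.
taus = [temperature(t) for t in (0, 50_000, 200_000)]
```

Early in training a high temperature keeps the dimension weights soft (good exploration); late in training the near-zero temperature makes the weighted sum approach a hard, one-hot selection.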

III-C Evaluation Metrics

The performance is evaluated by AUC, Logloss, and Space, where a higher AUC or a lower Logloss indicates better recommendation performance, and a lower Space means a lower space requirement. Area Under the ROC Curve (AUC) measures the probability that a positive instance is ranked higher than a randomly chosen negative one. We introduce Logloss since all methods aim to optimize the logloss in Equation (10); it is thus natural to utilize Logloss as a straightforward metric. It is worth noting that a slightly higher AUC or lower Logloss at the 0.1% level is regarded as significant for the CTR prediction task [7, 11]. For a specific model, the Space metric is the ratio of its embedding-space requirement (in the testing stage) to that of the MaD baseline detailed in the following subsection. We omit the space requirement of the MLP component to make the comparison clear; the MLP component typically occupies only a small part of the total model space, e.g., in Criteo.
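Both metrics are easy to compute from scratch; the following sketch implements them in plain NumPy on a tiny made-up batch of predictions (illustrative data, not the paper's results), using the pairwise-ranking definition of AUC stated above.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])                 # illustrative labels
y_pred = np.array([0.9, 0.3, 0.8, 0.4, 0.2])       # illustrative scores

def auc(y, p):
    """Probability a random positive is scored above a random negative
    (ties count as half a win)."""
    pos, neg = p[y == 1], p[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def logloss(y, p, eps=1e-12):
    """Negative log-likelihood of binary labels under predicted probabilities."""
    p = np.clip(p, eps, 1 - eps)                   # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

On this toy batch every positive outscores every negative, so the AUC is exactly 1.0 while the Logloss stays strictly positive, illustrating that the two metrics capture ranking and calibration respectively.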

III-D Overall Performance (RQ1)

We compare the proposed framework with the following representative baseline methods:


  • Minimal-Dimension (MiD): In this baseline, the embedding dimensions for all feature fields are set as the minimal size from the candidate set, i.e., 2.

  • Maximal-Dimension (MaD): In this baseline, we assign the same embedding dimensions to all feature fields. For each feature field, the embedding dimension is set as the maximal size from the candidate set, i.e., 10.

  • Random Search (RaS): Random search is a strong baseline in neural architecture search [23]. We randomly allocate dimensions to each feature field in each of 10 experiment runs and report the best performance.

  • Supervised Attention Model (SAM): This baseline shares the same architecture as AutoDim, but we update the DLRS parameters and the architectural weights (which can be viewed as attention scores) simultaneously on the same training batch in an end-to-end back-propagation fashion. It also derives discrete embedding dimensions.

The AutoDim/SAM models each have four variants, i.e., 2 embedding lookup methods × 2 transformation methods. We report their best AUC/Logloss and the corresponding Space here, and will compare the variants in the following subsections. The overall results are shown in Table III. We can observe:

  1. MiD achieves worse recommendation performance than MaD, where MiD assigns the minimal embedding dimension to all feature fields while MaD allocates the maximal one. This result demonstrates that the performance is highly influenced by the embedding dimensions: larger embedding sizes tend to enable the model to capture more characteristics of the features.

  2. SAM outperforms RaS in terms of AUC/Logloss, where the embedding dimensions of SAM are determined by supervised attention scores while those of RaS are randomly assigned. This observation shows that properly allocating different embedding dimensions to each feature field can boost the performance. However, SAM performs worse than MaD on all datasets while saving little space, which means that its solution is suboptimal.

  3. AutoDim performs better than SAM, because AutoML-based models like AutoDim update the architectural weights on the validation set, which enhances generalization, while supervised models like SAM update the weights together with the DLRS on the same training batch, which may lead to overfitting. SAM also has a much larger Space than AutoDim, which indicates that larger dimensions are favored when minimizing the training loss. These results validate the effectiveness of AutoML techniques over supervised learning in recommendations.

  4. AutoDim achieves comparable or slightly better AUC/Logloss than MaD while saving significant space. This result validates that AutoDim indeed assigns smaller dimensions to non-predictive features and larger dimensions to highly predictive features, which not only maintains or enhances the performance but also saves space.

To sum up, we can answer the first research question: compared with representative baselines, AutoDim achieves comparable or slightly better recommendation performance than the best baseline while saving significant space. These results prove the effectiveness of the AutoDim framework.

Fig. 6: Component analysis on Movielens-1m dataset.

III-E Component Analysis (RQ2)

In this paper, we propose two embedding lookup methods in Section II-B1 (i.e., separate embeddings vs. weight-sharing embeddings) and two transformation methods in Section II-B2 (i.e., linear transformation vs. zero-padding transformation). In this section, we investigate their influence on performance. We systematically combine the corresponding model components by defining the following variants of AutoDim:

  • AutoDim-1: In this variant, we use weight-sharing embeddings and zero-padding transformation.

  • AutoDim-2: This variant leverages weight-sharing embeddings and linear transformation.

  • AutoDim-3: We employ separate embeddings and zero-padding transformation in this variant.

  • AutoDim-4: This variant utilizes separate embeddings and linear transformation.
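The two embedding lookup methods and the zero-padding transformation compared by these variants can be sketched as follows. This is a minimal illustration, not the paper's implementation: the candidate dimensions, vocabulary size, and helper names are all hypothetical.

```python
import numpy as np

# Illustrative candidate dimensions for one feature field.
CANDIDATE_DIMS = [2, 4, 6, 8]
MAX_DIM = max(CANDIDATE_DIMS)

rng = np.random.default_rng(0)

# Weight-sharing lookup: one max-dimension table per field; a candidate
# embedding of size d is simply the first d digits of the full vector,
# so all candidates share parameters.
shared_table = rng.normal(size=(1000, MAX_DIM))  # vocab_size x max_dim

def weight_sharing_lookup(feature_id: int, d: int) -> np.ndarray:
    return shared_table[feature_id, :d]

# Zero-padding transformation: pad every candidate embedding to MAX_DIM
# so all candidates live in a common space and can be weighted together.
def zero_pad(e: np.ndarray) -> np.ndarray:
    return np.pad(e, (0, MAX_DIM - e.shape[0]))

e4 = weight_sharing_lookup(feature_id=42, d=4)
padded = zero_pad(e4)
```

Under weight sharing, the front digits of the table are reused by every candidate dimension, which is the property the component analysis below attributes AutoDim-1's advantage to.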

The results on the Movielens-1m dataset are shown in Figure 6. We omit similar results on other datasets due to the limited space. We make the following observations:

  1. In Figure 6 (a), we compare the space complexity of the entire dimension-search architecture of the variants, i.e., all the candidate embeddings and the transformation neural networks shown in Figure 4 or 5. We can observe that AutoDim-1 and AutoDim-2 save significant space by introducing the weight-sharing embedding architecture, which benefits real-world recommenders, where thousands of feature fields exist and computing memory resources are expensive.

  2. We compare the training speed of the variants in Figure 6 (b). AutoDim-1 and AutoDim-3, which leverage the zero-padding transformation, train faster because of their simpler architecture, while AutoDim-2 and AutoDim-4 run slower due to the large number of linear transformation computations. Note that we combine the candidate embeddings for each feature in the separate-embedding setting, which reduces the number of embedding lookups.

  3. We illustrate the test AUC/Logloss in Figure 6 (c) and (d). It is observed that AutoDim-1 achieves the best performance among all the variants (its results are reported in Table III). In Figure 6 (e), we can also find that the discrete embedding dimensions generated by AutoDim-1 save the most space.

  4. From Figure 6 (c), (d) and (e), variants with weight-sharing embeddings perform better than variants using separate embeddings. This is because the front digits of the shared embedding space are more likely to be recalled and trained (as shown in Figure 3 (b)), which enables the framework to capture more essential information in these digits and to make a better dimension-assignment selection.

In summary, we can answer the second question: the combination of weight-sharing embeddings and the zero-padding transformation achieves the best performance in terms of training speed and space complexity, as well as the test AUC/Logloss/Space metrics.

Fig. 7: Parameter analysis on Movielens-1m dataset.

III-F Parameter Analysis (RQ3)

In this section, we investigate how the essential hyper-parameters influence performance. Besides common hyper-parameters of deep recommender systems, such as the number of hidden layers and the learning rate (we omit them due to limited space), our model has one particular hyper-parameter: the frequency of updating the weights. In Algorithm 1, we alternately update the DLRS's parameters on the training data and update the weights on the validation data. In practice, we find that the weights can be updated less frequently than the DLRS's parameters, which reduces computation and also enhances performance.

To study the impact of this frequency, we investigate how the AutoDim variants perform on the Movielens-1m dataset as it changes, while fixing the other parameters. Figure 7 shows the parameter sensitivity results, where each value on the x-axis means updating the weights once and then updating the DLRS's parameters that many times. We can observe that AutoDim achieves the optimal AUC/Logloss at an intermediate frequency; in other words, updating the weights too frequently or too infrequently results in suboptimal performance. Results on the other two datasets are similar; we omit them because of limited space.
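The alternating schedule described above can be sketched as a simple generator of update steps. This is a hypothetical illustration of the training loop's structure, not the paper's Algorithm 1; the function name and step labels are assumptions.

```python
def alternating_schedule(num_steps: int, f: int):
    """Return a list of update steps: one architectural-weight update on
    validation data ('alpha'), followed by f DLRS parameter updates on
    training data ('dlrs'), repeated until num_steps steps are produced."""
    steps = []
    i = 0
    while i < num_steps:
        steps.append("alpha")        # one weight update on a validation batch
        for _ in range(f):
            steps.append("dlrs")     # f parameter updates on training batches
        i += 1 + f
    return steps[:num_steps]
```

With f = 3, for example, the loop interleaves one weight update with three DLRS parameter updates, so the weights are touched only a quarter of the time.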

III-G Case Study (RQ4)

In this section, we investigate how the AutoDim framework assigns embedding dimensions to different feature fields in the MovieLens-1m dataset (feature fields are anonymous in Avazu and Criteo). The assignments of an experiment case are shown in Table IV. The capitalized feature fields, e.g., Adventure, are binary fields for a particular genre. It can be observed that:

  1. No feature field is assigned the 10-dimensional embedding space, which means the candidate embedding dimensions are sufficient to cover all possible choices. This is also the reason we do not analyze this hyper-parameter in RQ3.

  2. Compared with the userId field, the movieId field is assigned a larger embedding dimension, which means movieId is more predictive. This phenomenon is reasonable: although users have various personal biases, most of them tend to give higher ratings to movies that are universally considered to be of high quality, and vice versa. In other words, the ratings are relatively more dependent on movies.

  3. For the binary genre fields, we find that some of them are assigned larger dimensions, e.g., Action, Crime, Film-Noir and Documentary, while the others are allocated the minimal dimension. Intuitively, this result means these four feature fields are more predictive and informative than the others. To demonstrate this inference, we compare the absolute difference of the averaged ratings between the items that belong (or do not belong) to a specific genre $g$:

     $\Delta_g = |\bar{r}_g - \bar{r}_{\neg g}|$

     where $\bar{r}_g$ (or $\bar{r}_{\neg g}$) is the averaged rating of items that belong (or do not belong) to field $g$. We find that the average $\Delta_g$ for Action, Crime, Film-Noir and Documentary is larger than that of the other feature fields, which validates that our proposed model indeed assigns larger dimensions to highly predictive feature fields.
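The predictiveness measure above can be computed with a few lines of code. This is a toy sketch on made-up ratings, not the MovieLens data; the function name is an assumption.

```python
def genre_rating_gap(ratings, in_genre):
    """Absolute difference between the mean rating of items in a genre and
    the mean rating of items outside it.
    ratings: list of floats; in_genre: parallel list of booleans."""
    inside = [r for r, g in zip(ratings, in_genre) if g]
    outside = [r for r, g in zip(ratings, in_genre) if not g]
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))

# A genre whose items rate systematically differently from the rest gets
# a larger gap, i.e., the genre flag carries more predictive signal.
gap = genre_rating_gap([5, 5, 4, 2, 1], [True, True, True, False, False])
```

A genre field with a near-zero gap barely shifts the expected rating, which matches the intuition that such fields deserve only the minimal embedding dimension.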

dimension   feature fields
2           Adventure, Animation, Children’s, Comedy, Drama, Fantasy, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western, year, timestamp, age, occupation, zip
4           Documentary, gender, userId
6           Action, Crime, Film-Noir
8           movieId
10          -
TABLE IV: Embedding dimensions for Movielens-1m

III-H Model Extension (RQ5)

In this subsection, we discuss how to employ AutoDim in state-of-the-art deep recommender architectures. In the Dimension Search stage, since AutoDim maps all features into the same dimension, it is easy to incorporate it into most existing deep recommenders. We mainly discuss the Parameter Re-Training stage in the following:

  1. Wide&Deep [7]: This model is flexible with respect to various embedding dimensions. Thus we only need to add the Wide component (i.e., a generalized linear model upon dense features) into our framework.

  2. FM [31], DeepFM [11]: The FM (factorization machine) component requires all feature fields to share the same dimension, since the interaction between any two fields is captured by the inner product of their embeddings. If different dimensions are selected in the dimension search stage, then in the parameter re-training stage we can first project the embeddings into the same dimension via the Linear Transformation method proposed in Section II-B2, where embeddings from the same feature field share the same weight matrix and bias vector. We do not recommend Zero Padding, since the padded zeros may lose information during the inner product. Then we can train the DeepFM as in the original model. This logic can be applied to most deep recommenders, such as FFM [27], AFM [42], NFM [12], FNN [46], PNN [29], AutoInt [35], Deep&Cross [39] and xDeepFM [19].
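The per-field linear projection described above can be sketched as follows. This is an illustrative example only: the common dimension, searched dimensions, and helper names are hypothetical, and the weights here are random rather than learned in re-training.

```python
import numpy as np

rng = np.random.default_rng(1)
COMMON_DIM = 8  # the shared dimension required by the FM inner product

def make_projection(d_in: int):
    """One weight matrix and bias per feature field (shared by all
    embeddings within the field), projecting d_in -> COMMON_DIM."""
    W = rng.normal(size=(d_in, COMMON_DIM))
    b = np.zeros(COMMON_DIM)
    return lambda e: e @ W + b

# Two fields whose searched dimensions differ:
proj_user = make_projection(4)   # field searched to dimension 4
proj_item = make_projection(8)   # field searched to dimension 8

e_user = proj_user(rng.normal(size=4))
e_item = proj_item(rng.normal(size=8))
fm_interaction = float(e_user @ e_item)  # inner product is now well-defined
```

After projection, every pair of fields can interact via an inner product regardless of the dimensions chosen in the search stage, which is exactly what the FM component requires.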

In short, the proposed AutoDim can be easily incorporated into most existing representative deep-learning-based recommender systems; we leave this as future work due to limited space.

IV Related Work

In this section, we discuss related work from two perspectives: deep recommender systems and AutoML for neural architecture search.

Deep recommender systems have drawn increasing attention from both academia and industry thanks to their great advantages over traditional methods [44]. Various types of deep learning approaches to recommendation have been developed. Sedhain et al. [34] present an AutoEncoder based model named AutoRec. In their work, both item-based and user-based AutoRec are introduced, designed to capture the low-dimensional feature embeddings of users and items, respectively. Hidasi et al. [13] introduce an RNN based recommender system named GRU4Rec. In session-based recommendation, the model captures the information from items' transition sequences for prediction. They also design a session-parallel mini-batch algorithm and a sampling method for the output, which make the training process more efficient. Cheng et al. [7] introduce a Wide&Deep framework for both regression and classification tasks. The framework consists of a wide part, which is a linear model implemented as one layer of a feed-forward neural network, and a deep part, which contains multiple perceptron layers to learn abstract and deep representations. Guo et al. [11] propose the DeepFM model, which combines the factorization machine (FM) and an MLP, using the former to model the lower-order feature interactions and the latter to learn the higher-order interactions. Wang et al. [40] utilize a CNN to extract visual features to help POI (Point-of-Interest) recommendations; they build a PMF based framework that models the interactions between visual information and latent user/location factors. Chen et al. [6] introduce hierarchical attention mechanisms into recommendation models. They propose a collaborative filtering model with an item-level and a component-level attention mechanism: the item-level attention mechanism captures user representations by attending to various items, and the component-level one tries to figure out the most important features from auxiliary sources for each user. Wang et al. [38] propose a generative adversarial network (GAN) based information retrieval model, IRGAN, which is applied to recommendation as well as web search and question answering.

The research on AutoML for neural architecture search can be traced back to NAS [50], which first utilizes an RNN based controller to design neural networks and proposes a reinforcement learning algorithm to optimize the framework. After that, many endeavors have been devoted to reducing the high training cost of NAS. Pham et al. [28] propose ENAS, where the controller learns to search for a subgraph of a large computational graph to form an optimal neural network architecture. Brock et al. [3] introduce a framework named SMASH, in which a hyper-network is developed to generate weights for sampled networks. DARTS [23] and SNAS [43] formulate the problem of network architecture search in a differentiable manner and solve it using gradient descent. Luo et al. [24] investigate representing network architectures as embeddings; they design a predictor that takes the architecture embedding as input to predict its performance, utilize gradient-based optimization to find an optimal embedding, and decode it back to the network architecture. Some works take another direction, which is to limit the search space. The works [30, 48, 22, 4] focus on searching for convolution cells, which are stacked repeatedly to form a convolutional neural network. Zoph et al. [51] propose a transfer learning framework called NASNet, which trains convolution cells on smaller datasets and applies them to larger datasets. Tan et al. [36] introduce MNAS; they propose to search hierarchical convolution cell blocks in an independent manner, so that a deep network can be built based on them. Neural Input Search [16] and AutoEmb [47] are designed for tuning the embedding layer of deep recommender systems, but they aim to tune the embedding sizes within the same feature field, and are usually used for user-id/item-id features.

V Conclusion

In this paper, we propose a novel framework, AutoDim, which aims to automatically assign different embedding dimensions to different feature fields in a data-driven manner. In real-world recommender systems, due to the huge number of feature fields and the highly complex relationships among embedding dimensions, feature distributions and neural network architectures, it is difficult, if not impossible, to manually allocate different dimensions to different feature fields. Thus, we propose an AutoML based framework to automatically select among different embedding dimensions. To be specific, we first provide an end-to-end differentiable model, which computes the weights over different dimensions for different feature fields simultaneously in a soft and continuous form, together with an AutoML-based optimization algorithm; then, according to the maximal weights, we derive a discrete embedding architecture and re-train the DLRS parameters. We evaluate the AutoDim framework with extensive experiments on widely used benchmark datasets. The results show that our framework can maintain or achieve slightly better performance with much less embedding space.

There are several interesting research directions. First, in addition to automatically selecting the embedding dimensions of categorical feature fields, we would like to investigate methods to automatically handle numerical feature fields. Second, we would like to study AutoML-based methods to automatically design the whole DLRS architecture, including both the MLP and embedding components. Third, our proposed framework selects a unified embedding dimension for each feature field; in the future, we would like to develop a model that can assign various embedding dimensions to different items within the same feature field. Finally, the framework is general enough to address other information retrieval problems, so we would like to investigate more applications of the proposed framework.


  • [1] J. Bao, Y. Zheng, D. Wilkie, and M. Mokbel (2015) Recommendations in location-based social networks: a survey. Geoinformatica 19 (3), pp. 525–565. Cited by: §I, §I.
  • [2] J. S. Breese, D. Heckerman, and C. Kadie (1998) Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 43–52. Cited by: §I.
  • [3] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2017) Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344. Cited by: §IV.
  • [4] H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu (2018) Path-level network transformation for efficient architecture search. arXiv preprint arXiv:1806.02639. Cited by: §IV.
  • [5] C. Chang, M. Kayed, M. R. Girgis, and K. F. Shaalan (2006) A survey of web information extraction systems. IEEE transactions on knowledge and data engineering 18 (10), pp. 1411–1428. Cited by: §I.
  • [6] J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T. Chua (2017) Attentive collaborative filtering: multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 335–344. Cited by: §IV.
  • [7] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10. Cited by: §I, §II-B3, item 1, §III-C, §IV.
  • [8] P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pp. 191–198. Cited by: §I.
  • [9] A. Ginart, M. Naumov, D. Mudigere, J. Yang, and J. Zou (2019) Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810. Cited by: §I, §I.
  • [10] E. J. Gumbel (1948) Statistical theory of extreme values and some practical applications: a series of lectures. Vol. 33, US Government Printing Office. Cited by: §II-B3.
  • [11] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for ctr prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 1725–1731. Cited by: item 2, §III-C, §IV.
  • [12] X. He and T. Chua (2017) Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 355–364. Cited by: item 2.
  • [13] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2015) Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939. Cited by: §IV.
  • [14] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §II-B2.
  • [15] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §II-B3, §II-B3.
  • [16] M. R. Joglekar, C. Li, J. K. Adams, P. Khaitan, and Q. V. Le (2019) Neural input search for large scale recommendation models. arXiv preprint arXiv:1907.04471. Cited by: §I, §IV.
  • [17] W. Kang, D. Z. Cheng, T. Chen, X. Yi, D. Lin, L. Hong, and E. H. Chi (2020) Learning multi-granular quantized embeddings for large-vocab categorical features in recommender systems. arXiv preprint arXiv:2002.08530. Cited by: §I.
  • [18] S. Li, J. Kawale, and Y. Fu (2015) Deep collaborative filtering via marginalized denoising auto-encoder. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 811–820. Cited by: §I.
  • [19] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun (2018) Xdeepfm: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Cited by: item 2.
  • [20] G. Linden, B. Smith, and J. York (2003) Amazon. com recommendations: item-to-item collaborative filtering. IEEE Internet computing 7 (1), pp. 76–80. Cited by: §I.
  • [21] B. Liu, C. Zhu, G. Li, W. Zhang, J. Lai, R. Tang, X. He, Z. Li, and Y. Yu (2020) AutoFIS: automatic feature interaction selection in factorization models for click-through rate prediction. arXiv preprint arXiv:2003.11235. Cited by: §III-A.
  • [22] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34. Cited by: §IV.
  • [23] H. Liu, K. Simonyan, and Y. Yang (2018) Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055. Cited by: §II-A1, §II-C, 3rd item, §IV.
  • [24] R. Luo, F. Tian, T. Qin, E. Chen, and T. Liu (2018) Neural architecture optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7827–7838. Cited by: §IV.
  • [25] R. J. Mooney and L. Roy (2000) Content-based book recommending using learning for text categorization. In Proceedings of the fifth ACM conference on Digital libraries, pp. 195–204. Cited by: §I.
  • [26] H. T. Nguyen, M. Wistuba, J. Grabocka, L. R. Drumond, and L. Schmidt-Thieme (2017) Personalized deep learning for tag recommendation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 186–197. Cited by: §I.
  • [27] J. Pan, J. Xu, A. L. Ruiz, W. Zhao, S. Pan, Y. Sun, and Q. Lu (2018) Field-weighted factorization machines for click-through rate prediction in display advertising. In Proceedings of the 2018 World Wide Web Conference, pp. 1349–1357. Cited by: §I, item 2.
  • [28] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean (2018) Efficient neural architecture search via parameters sharing. In International Conference on Machine Learning, pp. 4095–4104. Cited by: §II-A1, §II-C, §IV.
  • [29] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, and J. Wang (2016) Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1149–1154. Cited by: §I, item 2.
  • [30] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4780–4789. Cited by: §IV.
  • [31] S. Rendle (2010) Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pp. 995–1000. Cited by: item 2.
  • [32] P. Resnick and H. R. Varian (1997) Recommender systems. Communications of the ACM 40 (3), pp. 56–58. Cited by: §I.
  • [33] F. Ricci, L. Rokach, and B. Shapira (2011) Introduction to recommender systems handbook. In Recommender systems handbook, pp. 1–35. Cited by: §I.
  • [34] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie (2015) Autorec: autoencoders meet collaborative filtering. In Proceedings of the 24th international conference on World Wide Web, pp. 111–112. Cited by: §IV.
  • [35] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, and J. Tang (2019) Autoint: automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170. Cited by: item 2.
  • [36] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §IV.
  • [37] Y. K. Tan, X. Xu, and Y. Liu (2016) Improved recurrent neural networks for session-based recommendations. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 17–22. Cited by: §II-B3.
  • [38] J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, and D. Zhang (2017) Irgan: a minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pp. 515–524. Cited by: §IV.
  • [39] R. Wang, B. Fu, G. Fu, and M. Wang (2017) Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, pp. 1–7. Cited by: item 2.
  • [40] S. Wang, Y. Wang, J. Tang, K. Shu, S. Ranganath, and H. Liu (2017) What your images reveal: exploiting visual contents for point-of-interest recommendation. In Proceedings of the 26th International Conference on World Wide Web, pp. 391–400. Cited by: §IV.
  • [41] S. Wu, W. Ren, C. Yu, G. Chen, D. Zhang, and J. Zhu (2016) Personal recommendation using deep recurrent neural networks in netease. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pp. 1218–1229. Cited by: §I.
  • [42] J. Xiao, H. Ye, X. He, H. Zhang, F. Wu, and T. Chua (2017) Attentional factorization machines: learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617. Cited by: item 2.
  • [43] S. Xie, H. Zheng, C. Liu, and L. Lin (2018) SNAS: stochastic neural architecture search. arXiv preprint arXiv:1812.09926. Cited by: §IV.
  • [44] S. Zhang, L. Yao, A. Sun, and Y. Tay (2019) Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys (CSUR) 52 (1), pp. 1–38. Cited by: §I, §IV.
  • [45] S. Zhang, L. Yao, and A. Sun (2017) Deep learning based recommender system: a survey and new perspectives. arXiv preprint arXiv:1707.07435. Cited by: §I.
  • [46] W. Zhang, T. Du, and J. Wang (2016) Deep learning over multi-field categorical data. In European conference on information retrieval, pp. 45–57. Cited by: item 2.
  • [47] X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, and J. Tang (2020) AutoEmb: automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252. Cited by: §I, §IV.
  • [48] Z. Zhong, J. Yan, W. Wu, J. Shao, and C. Liu (2018) Practical block-wise neural network architecture generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §IV.
  • [49] G. Zhou, X. Zhu, C. Song, Y. Fan, H. Zhu, X. Ma, Y. Yan, J. Jin, H. Li, and K. Gai (2018) Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068. Cited by: §I.
  • [50] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §IV.
  • [51] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §IV.