Differentiable Neural Input Search for Recommender Systems

06/08/2020 ∙ by Weiyu Cheng, et al. ∙ Shanghai Jiao Tong University

Latent factor models are the driving forces of the state-of-the-art recommender systems, with an important insight of vectorizing raw input features into dense embeddings. The dimensions of different feature embeddings are often set to a uniform value manually or through grid search, which may yield suboptimal model performance. Existing work applied heuristic methods or reinforcement learning to search for varying embedding dimensions. However, the embedding dimension per feature is rigidly chosen from a restricted set of candidates due to the scalability issue involved in the optimization process over a large search space. In this paper, we propose a differentiable neural input search algorithm towards learning more flexible dimensions of feature embeddings, namely a mixed dimension scheme, leading to better recommendation performance and lower memory cost. Our method can be seamlessly incorporated with various existing architectures of latent factor models for recommendation. We conduct experiments with 6 state-of-the-art model architectures on two typical recommendation tasks: Collaborative Filtering (CF) and Click-Through-Rate (CTR) prediction. The results demonstrate that our method achieves the best recommendation performance compared with 3 neural input search approaches over all the model architectures, and can reduce the number of embedding parameters by 2x and 20x on CF and CTR prediction, respectively.




1 Introduction

Most state-of-the-art recommender systems employ latent factor models that vectorize raw input features into dense embeddings. A key question often asked of feature embeddings is: “How should we determine the dimensions of feature embeddings?” The common practice is to set a uniform dimension for all the features and to treat the dimension as a hyperparameter that is tuned on the validation set. However, the manual search for a uniform embedding dimension can be computationally intensive and can even result in suboptimal model performance, since a single dimension is not necessarily suitable for all the features. Intuitively, a larger dimension is needed for popular features that appear in most data samples, encouraging a higher model capacity to fit the related data samples Joglekar et al. (2019); Zhao et al. (2020). Likewise, less frequent features should be assigned smaller dimensions to avoid overfitting on scarce data samples. As such, it is desirable to impose a mixed dimension scheme for different features towards better recommendation performance. Another notable fact is that embedding layers in industrial web-scale recommender systems Park et al. (2018); Covington et al. (2016) account for the majority of model parameters and can consume hundreds of gigabytes of memory space. Replacing a uniform feature embedding dimension with varying dimensions is key to removing redundant embedding weights for infrequent and less predictive features, leading to lower memory cost.

Some recent works Ginart et al. (2019); Joglekar et al. (2019) have focused on searching for varying feature dimensions automatically, which is defined as the Neural Input Search (NIS) problem. Ginart et al. (2019) proposed to use an empirical function to heuristically decide the embedding dimensions for different features according to their frequencies of occurrence, where the empirical function involves several hyperparameters that need to be carefully tuned to yield a good search result. Joglekar et al. (2019) proposed a reinforcement learning-based method for addressing the NIS problem. They first divided a base feature dimension equally into several blocks, and then applied reinforcement learning to produce decision sequences for different features on the selection of dimension blocks. These methods, however, restrict each feature dimension to be chosen from a small set of candidate dimensions that is explicitly predefined Joglekar et al. (2019) or implicitly controlled by hyperparameters Ginart et al. (2019). Although this restriction reduces the search space and thereby improves computational efficiency, another question then arises: how should the candidate dimensions be decided? Notably, a suboptimal set of candidate dimensions could result in a suboptimal search result that hurts the model’s recommendation performance.

In this paper, we propose Differentiable Neural Input Search (DNIS) for approaching the NIS problem in a differentiable manner through gradient descent. Instead of searching over a predefined discrete set of candidate dimensions, DNIS relaxes the search space to be continuous and optimizes the selection of each feature dimension by descending the model’s validation loss. More specifically, we introduce a soft selection layer between the embedding layer and the feature interaction layers of latent factor models. Each input feature embedding is fed into the soft selection layer to perform an element-wise multiplication with a scaling vector. The soft selection layer directly controls the significance of each dimension of the feature embedding, and it is essentially a part of the model architecture that can be optimized according to the model’s validation performance. We also propose a gradient normalization technique to keep the backpropagated gradients steady during the optimization of the selection layer. After training, we merge the soft selection layer with the feature embedding layer to prune redundant or less informative embedding dimensions per feature, leading to feature embeddings with a mixed dimension scheme. DNIS can be seamlessly applied to various existing architectures of latent factor models for recommendation. We conduct extensive experiments with different model architectures on the Collaborative Filtering (CF) task and the Click-Through-Rate (CTR) prediction task. The results demonstrate that our DNIS method achieves the best performance compared with the existing neural input search baselines over all the model architectures, and can increase parameter efficiency by reducing the number of embedding parameters by 2x and 20x for CF and CTR prediction, respectively.

The major contributions of this paper can be summarized as follows:

  • We propose DNIS, a differentiable neural input search method to relax the NIS search space to be continuous, which allows searching for varying feature dimensions automatically in a differentiable manner with gradient descent.

  • We introduce a soft selection layer to optimize the selection of embedding dimensions for different features. A gradient normalization technique is proposed to keep the backpropagated gradients steady during the training of the soft selection layer.

  • Our method can be incorporated with various existing architectures of latent factor models to improve recommendation performance and reduce memory cost of embedding parameters.

  • We conduct experiments with different model architectures on CF and CTR prediction tasks. The results demonstrate our DNIS method outperforms the existing NIS baselines in terms of recommendation performance, training efficiency and parameter size.

2 Differentiable Neural Input Search

2.1 Background

Latent factor models. We consider a recommender system involving $M$ feature fields (e.g., user ID, item ID, item price). Typically, $M$ is 2 (user ID and item ID) in collaborative filtering (CF) problems, whereas in the context of click-through rate (CTR) prediction, $M$ is much larger than 2 to include more feature fields. Each categorical feature field consists of a collection of discrete features, while a numerical feature field contains one scalar feature. Let $\mathcal{F}$ denote the list of features over all the fields, and let $N$ be the size of $\mathcal{F}$. For the $i$-th feature in $\mathcal{F}$, its initial representation is an $N$-dimensional sparse vector $\mathbf{x}_i$, where the $i$-th element is 1 (for a discrete feature) or a scalar number (for a scalar feature), and the others are 0s. Latent factor models generally consist of two parts: one feature embedding layer, followed by the feature interaction layers. Without loss of generality, the input instances to the latent factor model include several features belonging to the respective feature fields. The feature embedding layer transforms all the features in an input instance into dense embedding vectors. Specifically, a sparsely encoded input feature vector $\mathbf{x}_i \in \mathbb{R}^{N}$ is transformed into a $K$-dimensional embedding vector $\mathbf{e}_i \in \mathbb{R}^{K}$ as follows:

$$\mathbf{e}_i = \mathbf{W}^{\top} \mathbf{x}_i \tag{1}$$

where $\mathbf{W} \in \mathbb{R}^{N \times K}$ is known as the embedding matrix. The output of the feature embedding layer is the collection of dense embedding vectors for all the input features, which is denoted as $\mathbf{E}_x$. The feature interaction layers, whose architecture differs from model to model, essentially compose a parameterized function $f$ that predicts the objective based on the collected dense feature embeddings $\mathbf{E}_x$ for the input instance. That is,

$$\hat{y} = f(\theta; \mathbf{E}_x) \tag{2}$$

where $\hat{y}$ is the model’s prediction, and $\theta$ denotes the set of parameters in the interaction layers. Prior works have developed various architectures for $f$, including the simple inner product function Rendle (2010), and deep neural networks-based interaction functions He et al. (2017); Cheng et al. (2018); Lian et al. (2018); Cheng et al. (2016); Guo et al. (2017). Most of the proposed architectures for the interaction layers require all the feature embeddings to be in a uniform dimension.
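As a minimal sketch of the embedding step and the simplest interaction function described above, the following NumPy snippet multiplies a sparse encoding by the transposed embedding matrix and feeds two embeddings to an inner product. The toy sizes, the helper names (`one_hot`, `embed`, `inner_product_interaction`) and the random initialization are assumptions made for illustration, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3                      # toy sizes: N features, K embedding dimensions
W = rng.normal(size=(N, K))      # embedding matrix, one row per feature

def one_hot(i, n, value=1.0):
    """Sparse encoding of feature i: `value` at position i, zeros elsewhere."""
    x = np.zeros(n)
    x[i] = value
    return x

def embed(x, W):
    """Embedding lookup written as a matrix product: e = W^T x."""
    return W.T @ x

def inner_product_interaction(E):
    """Parameter-free stand-in for the interaction function f."""
    return float(E[0] @ E[1])

x_user, x_item = one_hot(1, N), one_hot(4, N)
E_x = np.stack([embed(x_user, W), embed(x_item, W)])   # embeddings of one instance
y_hat = inner_product_interaction(E_x)
```

For a discrete feature the product simply selects one row of `W`, which is why practical implementations use an index lookup instead of a dense multiplication; for a scalar feature the row is scaled by the feature value.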

Neural architecture search. Neural Architecture Search (NAS) has been proposed to automatically search for the best neural network architecture. To explore the space of neural architectures, different search strategies have been studied, including random search Li and Talwalkar (2019), evolutionary methods Elsken et al. (2019); Miller et al. (1989); Real et al. (2017), Bayesian optimization Bergstra et al. (2013); Domhan et al. (2015); Mendoza et al. (2016), reinforcement learning Baker et al. (2017); Zhong et al. (2018); Zoph and Le (2017), and gradient-based methods Cai et al. (2019); Liu et al. (2019a); Xie et al. (2019). Since being proposed in Baker et al. (2017); Zoph and Le (2017), NAS has achieved remarkable performance in various tasks such as image classification Real et al. (2019); Zoph et al. (2018), semantic segmentation Chen et al. (2018) and object detection Zoph et al. (2018). However, most of this research has focused on searching for optimal network structures automatically, while little attention has been paid to the design of the input component. This is because the input component in visual tasks is already given in the form of floating point values of image pixels. In recommender systems, by contrast, an input component based on the embedding layer is deliberately developed to transform raw features (e.g., discrete user identifiers) into dense embeddings. In this paper, we focus on the problem of neural input search, which can be considered as NAS on the input component (i.e., the embedding layer) of recommender systems.

2.2 Search Space and Problem

Search space. The key idea of neural input search is to use embeddings with mixed dimensions to represent different features. To formulate feature embeddings with different dimensions, we adopt the representation for sparse vectors (with a base dimension $K$). Specifically, for each feature, we maintain a dimension index vector $\mathbf{d}$ which contains the ordered locations of the feature’s existing dimensions from the set $\{1, \dots, K\}$, and an embedding value vector $\mathbf{v}$ which stores the embedding values in the respective existing dimensions. The conversion from the index and value vectors of a feature into the $K$-dimensional embedding vector $\mathbf{e}$ is straightforward. Note that $\mathbf{e}$ corresponds to a row in the embedding matrix $\mathbf{W}$. Figure 1(a) gives an example of $\mathbf{d}_i$, $\mathbf{v}_i$ and $\mathbf{e}_i$ for the $i$-th feature in $\mathcal{F}$.

Figure 1: A demonstration of notations and model structure. (a) An example of problem notations. (b) Model structure.
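The conversion between the index/value representation and the dense base-dimension embedding can be sketched as follows. The function names and the sample values are hypothetical, and positions are zero-based here only because that is the Python convention.

```python
def to_dense(d, v, K):
    """Expand index vector d and value vector v into a K-dimensional embedding."""
    if len(d) != len(v):
        raise ValueError("d and v must have the same length")
    e = [0.0] * K
    for idx, val in zip(d, v):
        e[idx] = val
    return e

def to_sparse(e):
    """Inverse conversion: keep only the non-zero dimensions of e."""
    d = [k for k, val in enumerate(e) if val != 0.0]
    v = [e[k] for k in d]
    return d, v

K = 6                                   # base dimension
d_i, v_i = [0, 2, 5], [0.3, -1.2, 0.7]  # a feature that keeps 3 of 6 dimensions
e_i = to_dense(d_i, v_i, K)
```

Here the feature stores three values instead of six, and `to_sparse(e_i)` recovers the original pair of vectors.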

The size of $\mathbf{d}$ varies among different features to enforce a mixed dimension scheme. Formally, given the feature set $\mathcal{F}$, we define the mixed dimension scheme $\mathbf{D} = \{\mathbf{d}_1, \dots, \mathbf{d}_N\}$ to be the collection of dimension index vectors for all the features in $\mathcal{F}$. We use $\mathcal{D}$ to denote the search space of the mixed dimension scheme $\mathbf{D}$ for $\mathcal{F}$, which includes $2^{NK}$ possible choices. Besides, we denote by $\mathcal{V} = \{\mathbf{v}_1, \dots, \mathbf{v}_N\}$ the set of the embedding value vectors for all the features in $\mathcal{F}$. Then we can derive the embedding matrix $\mathbf{W}$ from $\mathbf{D}$ and $\mathcal{V}$ to make use of the feature interaction layers.

Problem formulation. Let $\Theta = \{\theta, \mathcal{V}\}$ be the set of trainable model parameters, and let $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ be the model’s training loss and validation loss, respectively. The two losses are determined by both the mixed dimension scheme $\mathbf{D}$ and the trainable parameters $\Theta$. The goal of neural input search is to find a mixed dimension scheme $\mathbf{D} \in \mathcal{D}$ that minimizes the validation loss $\mathcal{L}_{val}$, where the parameters $\Theta$ given any mixed dimension scheme are obtained by minimizing the training loss. This can be formulated as:

$$\min_{\mathbf{D} \in \mathcal{D}} \; \mathcal{L}_{val}(\Theta^{*}(\mathbf{D}), \mathbf{D}) \quad \text{s.t.} \quad \Theta^{*}(\mathbf{D}) = \arg\min_{\Theta} \mathcal{L}_{train}(\Theta, \mathbf{D}) \tag{3}$$

The above problem formulation is consistent with hyperparameter optimization in a broader scope Maclaurin et al. (2015); Pedregosa (2016); Franceschi et al. (2018), since the mixed dimension scheme $\mathbf{D}$ can be considered as a set of model hyperparameters to be determined according to the model’s validation performance. The main difference is that the search space $\mathcal{D}$ in our problem is much larger than the search space of conventional hyperparameter optimization problems.

2.3 Feature Blocking

Feature blocking is an ingredient used in the existing neural input search methods Joglekar et al. (2019); Ginart et al. (2019) to reduce the search space. The intuition behind it is that features with similar frequencies can be grouped into a block sharing the same embedding dimension. Following the existing works, we first employ feature blocking to control the search space of the mixed dimension scheme. We sort all the features in $\mathcal{F}$ in descending order of frequency (i.e., the number of feature occurrences in the training instances). Let $\eta_f$ denote the frequency of feature $f$. We can obtain a sorted list of features $\tilde{\mathcal{F}} = [\tilde{f}_1, \dots, \tilde{f}_N]$ such that $\eta_{\tilde{f}_i} \geq \eta_{\tilde{f}_j}$ for any $i < j$. We then separate $\tilde{\mathcal{F}}$ into $L$ blocks, where the features in a block share the same dimension index vector $\mathbf{d}$. We denote by $\tilde{\mathbf{D}}$ the mixed dimension scheme after feature blocking. The length of the mixed dimension scheme $\tilde{\mathbf{D}}$ then becomes $L$, and the search space size is reduced from $2^{NK}$ to $2^{LK}$ accordingly, where $L \ll N$.
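The blocking step can be sketched in a few lines of Python. The text does not say how block boundaries are chosen, so equal-size blocks are an assumption of this sketch, as are the function name `block_features` and the toy occurrence list.

```python
from collections import Counter

def block_features(feature_occurrences, num_blocks):
    """Sort features by descending frequency and split them into blocks.

    Returns (ordered, block_of), where block_of maps feature -> block id.
    Equal-size splitting is an assumption of this sketch.
    """
    freq = Counter(feature_occurrences)
    # Ties are broken by feature name so the ordering is deterministic.
    ordered = sorted(freq, key=lambda name: (-freq[name], name))
    size = -(-len(ordered) // num_blocks)          # ceiling division
    block_of = {name: rank // size for rank, name in enumerate(ordered)}
    return ordered, block_of

occurrences = ["a"] * 5 + ["b"] * 3 + ["c"] * 3 + ["d", "e", "f"]
ordered, block_of = block_features(occurrences, num_blocks=3)

K = 4                                   # base dimension
log2_before = len(ordered) * K          # search space 2^(N*K) without blocking
log2_after = 3 * K                      # search space 2^(L*K) with L = 3 blocks
```

With six features, three blocks and a base dimension of four, the exponent of the search space size drops from 24 to 12; for realistic feature counts the reduction is far larger because the number of blocks is much smaller than the number of features.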

2.4 Continuous Relaxation and Differentiable Optimization

Continuous relaxation. After feature blocking, in order to optimize the mixed dimension scheme $\tilde{\mathbf{D}}$, we first transform $\tilde{\mathbf{D}}$ into a binary dimension indicator matrix $\tilde{\mathbf{D}} \in \mathbb{R}^{L \times K}$, where each element is either 1 or 0, indicating the existence of the corresponding embedding dimension according to the scheme. We then introduce a soft selection layer to relax the search space of $\tilde{\mathbf{D}}$ to be continuous. The soft selection layer is essentially a numerical matrix $\alpha \in \mathbb{R}^{L \times K}$, where each element in $\alpha$ satisfies $0 \leq \alpha_{l,k} \leq 1$. That is, each binary choice $\tilde{D}_{l,k}$ (the existence of the $k$-th embedding dimension in the $l$-th feature block) is relaxed to be a continuous variable $\alpha_{l,k}$ within the range of $[0, 1]$. We insert the soft selection layer between the feature embedding layer and the interaction layers in the latent factor model, as illustrated in Figure 1(b). Given $\alpha$ and the embedding matrix $\mathbf{W}$, the output embedding $\tilde{\mathbf{e}}_i$ of a feature $f_i$ in the $l$-th block produced by the bottom two layers can be computed as follows:

$$\tilde{\mathbf{e}}_i = \mathbf{e}_i \odot \alpha_{l*} \tag{4}$$

where $\alpha_{l*}$ is the $l$-th row in $\alpha$, and $\odot$ is the element-wise product. By applying Equation (4) to all the input features, we can obtain the output feature embeddings $\tilde{\mathbf{E}}_x$. Next, we supply $\tilde{\mathbf{E}}_x$ to the feature interaction layers for the final prediction as specified in Equation (2). Note that $\alpha$ is used to softly select the dimensions of feature embeddings during model training, and the discrete mixed dimension scheme $\mathbf{D}$ will be derived after training.
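A minimal NumPy sketch of the soft selection layer: each looked-up embedding is multiplied element-wise by the row of the selection matrix that belongs to the feature's block. The array names (`W`, `alpha`, `block_of`) and the toy values are assumptions for illustration.

```python
import numpy as np

def soft_select(W, alpha, block_of, feature_ids):
    """Gate each embedding with the soft selection row of its feature block."""
    e = W[feature_ids]                      # (batch, K) raw embeddings
    gates = alpha[block_of[feature_ids]]    # (batch, K) one row of alpha per feature
    return e * gates                        # element-wise product

N, K, L = 4, 3, 2
W = np.arange(1.0, N * K + 1).reshape(N, K)   # rows [1,2,3], [4,5,6], ...
alpha = np.array([[1.0, 1.0, 1.0],            # block 0 keeps every dimension
                  [1.0, 0.5, 0.0]])           # block 1 damps dim 1 and drops dim 2
block_of = np.array([0, 0, 1, 1])             # block id of each feature
out = soft_select(W, alpha, block_of, np.array([0, 3]))
```

Feature 0 passes through unchanged, while feature 3 has its second dimension halved and its third dimension zeroed, which is the kind of entry that the later pruning step removes.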

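Differentiable optimization. Now that we have relaxed the mixed dimension scheme $\tilde{\mathbf{D}}$ (after feature blocking) via the soft selection layer $\alpha$, our problem stated in Equation (3) can be transformed into:

$$\min_{\alpha} \; \mathcal{L}_{val}(\Theta^{*}(\alpha), \alpha) \quad \text{s.t.} \quad \Theta^{*}(\alpha) = \arg\min_{\Theta} \mathcal{L}_{train}(\Theta, \alpha), \quad \alpha_{l,k} \in [0, 1] \tag{5}$$

where $\Theta = \{\theta, \mathbf{W}\}$ represents the model parameters in both the embedding layer and the interaction layers. Equation (5) essentially defines a bi-level optimization problem Colson et al. (2007), which has been studied in differentiable NAS Liu et al. (2019a) and gradient-based hyperparameter optimization Chen et al. (2019); Franceschi et al. (2018); Pedregosa (2016). Basically, $\alpha$ and $\Theta$ are respectively treated as the upper-level and lower-level variables to be optimized in an interleaving way. To deal with the expensive optimization of $\Theta$, we follow the common practice that approximates $\Theta^{*}(\alpha)$ by adapting $\Theta$ using a single training step:

$$\Theta^{*}(\alpha) \approx \Theta - \xi \nabla_{\Theta} \mathcal{L}_{train}(\Theta, \alpha) \tag{6}$$

where $\xi$ is the learning rate for the one-step update of the model parameters $\Theta$. Then we can optimize $\alpha$ based on the following gradient:

$$\nabla_{\alpha} \mathcal{L}_{val}\big(\Theta - \xi \nabla_{\Theta} \mathcal{L}_{train}(\Theta, \alpha), \alpha\big) = \nabla_{\alpha} \mathcal{L}_{val}(\Theta', \alpha) - \xi \nabla^{2}_{\alpha, \Theta} \mathcal{L}_{train}(\Theta, \alpha) \, \nabla_{\Theta'} \mathcal{L}_{val}(\Theta', \alpha) \tag{7}$$

where $\Theta' = \Theta - \xi \nabla_{\Theta} \mathcal{L}_{train}(\Theta, \alpha)$ denotes the model parameters after the one-step update. Equation (7) can be solved efficiently using existing deep learning libraries that allow automatic differentiation, such as PyTorch Paszke et al. (2019). The second-order derivative term in Equation (7) can be omitted to further improve computational efficiency by considering $\xi$ to be near zero, which is called the first-order approximation. In this paper, we adopt the first-order approximation in DNIS by default since we find the final performance is similar with and without the approximation. Algorithm 1 (lines 5-10) summarizes the bi-level optimization procedure for solving Equation (5).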
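The gradient in Equation (7) follows from the chain rule applied to the one-step update. The derivation below is a sketch under the usual assumption that the current parameters are treated as constants with respect to the soft selection layer; the notation ($\Theta$ for model parameters, $\alpha$ for the soft selection layer, $\xi$ for the one-step learning rate, $\Theta'$ for the updated parameters) is assumed here for readability.

```latex
\begin{aligned}
\Theta'(\alpha) &= \Theta - \xi \nabla_{\Theta} \mathcal{L}_{train}(\Theta, \alpha), \\
\frac{d}{d\alpha} \mathcal{L}_{val}(\Theta'(\alpha), \alpha)
  &= \nabla_{\alpha} \mathcal{L}_{val}(\Theta', \alpha)
   + \left( \frac{\partial \Theta'}{\partial \alpha} \right)^{\top}
     \nabla_{\Theta'} \mathcal{L}_{val}(\Theta', \alpha), \\
\frac{\partial \Theta'}{\partial \alpha}
  &= -\xi \nabla^{2}_{\Theta, \alpha} \mathcal{L}_{train}(\Theta, \alpha), \\
\Rightarrow \quad \frac{d}{d\alpha} \mathcal{L}_{val}(\Theta'(\alpha), \alpha)
  &= \nabla_{\alpha} \mathcal{L}_{val}(\Theta', \alpha)
   - \xi \nabla^{2}_{\alpha, \Theta} \mathcal{L}_{train}(\Theta, \alpha)
     \, \nabla_{\Theta'} \mathcal{L}_{val}(\Theta', \alpha), \\
\text{first-order approximation:} \quad
\frac{d}{d\alpha} \mathcal{L}_{val}(\Theta'(\alpha), \alpha)
  &\approx \nabla_{\alpha} \mathcal{L}_{val}(\Theta', \alpha).
\end{aligned}
```

The first-order form drops the mixed second derivative, so each update of the soft selection layer costs one ordinary backward pass on a validation batch.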

Gradient normalization. During the optimization of $\alpha$ by the gradient $\nabla_{\alpha} \mathcal{L}_{val}(\Theta', \alpha)$, we propose a gradient normalization technique to normalize the row-wise gradients of $\alpha$ over each training batch:

$$g_{norm}(\alpha_{l*}) = \frac{g(\alpha_{l*})}{\sum_{k=1}^{K} |g(\alpha_{l,k})| + \epsilon_g} \tag{8}$$

where $g$ and $g_{norm}$ denote the gradients before and after normalization respectively, and $\epsilon_g$ is a small value (e.g., 1e-7) to avoid numerical overflow. The consideration is that the magnitude of the gradients of $\alpha$ varies a lot over feature blocks due to the significant difference in feature frequency. By normalizing the gradients for each block, we can apply a single learning rate to different rows of $\alpha$ during optimization. Otherwise, a single learning rate shared by different feature blocks may easily fall short in optimizing all the rows of $\alpha$.
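The row-wise normalization can be sketched as follows. This sketch assumes the normalizer is the sum of absolute gradient values in each row plus a small constant; the function name and the toy gradient values are hypothetical.

```python
import numpy as np

def normalize_row_gradients(grad, eps=1e-7):
    """Divide each row of grad by the sum of its absolute values plus eps."""
    row_scale = np.abs(grad).sum(axis=1, keepdims=True) + eps
    return grad / row_scale

g = np.array([[200.0, -600.0, 200.0],    # frequent block: large gradients
              [0.002, 0.001, -0.001],    # rare block: tiny gradients
              [0.0, 0.0, 0.0]])          # block absent from the batch
g_norm = normalize_row_gradients(g)
```

After normalization the first two rows have comparable magnitudes even though the raw gradients differ by five orders of magnitude, so one learning rate can serve both blocks. The all-zero row stays zero because the small constant keeps the denominator positive.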

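
1  Input: training dataset, validation dataset.
2  Output: mixed dimension scheme $\mathbf{D}$, embedding values $\mathcal{V}$, interaction function parameters $\theta$.
3  Sort features into $\tilde{\mathcal{F}}$ and divide them into $L$ blocks;
4  Initialize the soft selection layer $\alpha$ to be an all-one matrix, and randomly initialize $\Theta$;
5  while not converged do
6      Update trainable parameters $\Theta$ by descending $\nabla_{\Theta} \mathcal{L}_{train}(\Theta, \alpha)$;
7      Calculate the gradients of $\alpha$ as: $\nabla_{\alpha} \mathcal{L}_{val}(\Theta, \alpha) - \xi \nabla^{2}_{\alpha, \Theta} \mathcal{L}_{train}(\Theta, \alpha) \, \nabla_{\Theta} \mathcal{L}_{val}(\Theta, \alpha)$; // (set $\xi = 0$ if using first-order approximation)
8      Perform Equation (8) to normalize the gradients in $\alpha$;
9      Update $\alpha$ by descending the gradients, and then clip its values into the range of $[0, 1]$;
10 end while
11 Calculate the output embedding matrix $\tilde{\mathbf{E}}$ using $\alpha$ and $\mathbf{W}$ according to Equation (4);
12 Prune $\tilde{\mathbf{E}}$ into a sparse matrix $\tilde{\mathbf{E}}'$ following Equation (9);
13 Derive the mixed dimension scheme $\mathbf{D}$ and embedding values $\mathcal{V}$ with $\tilde{\mathbf{E}}'$;
Algorithm 1: DNIS - Differentiable Neural Input Search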
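
The alternating loop of Algorithm 1 can be illustrated on a deliberately tiny model. Everything in the sketch below is an assumption made to keep it self-contained: the toy predictor (the sum of the gated embedding of a single feature), plain gradient descent on full batches instead of Adam on mini-batches, hand-written gradients instead of automatic differentiation, a fixed number of steps instead of a convergence test, and the first-order approximation.

```python
import numpy as np

def loss_and_grads(W, alpha, block_of, ids, y):
    """MSE of the toy model y_hat = sum_k (W[i] * alpha[block(i)])_k, with gradients."""
    gates = alpha[block_of[ids]]                 # (n, K) soft selection rows
    err = (W[ids] * gates).sum(axis=1) - y       # (n,) prediction errors
    dy = 2.0 * err / len(y)                      # d loss / d y_hat
    g_W = np.zeros_like(W)
    g_alpha = np.zeros_like(alpha)
    np.add.at(g_W, ids, dy[:, None] * gates)                  # sum per feature
    np.add.at(g_alpha, block_of[ids], dy[:, None] * W[ids])   # sum per block
    return float(np.mean(err ** 2)), g_W, g_alpha

def dnis_first_order(train, val, block_of, N, K, L, steps=200,
                     lr_w=0.1, lr_alpha=0.01, eps=1e-7, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(N, K))       # random init of model parameters
    alpha = np.ones((L, K))                      # all-one soft selection layer
    for _ in range(steps):
        _, g_W, _ = loss_and_grads(W, alpha, block_of, *train)
        W -= lr_w * g_W                          # lower level: descend training loss
        _, _, g_a = loss_and_grads(W, alpha, block_of, *val)
        g_a /= np.abs(g_a).sum(axis=1, keepdims=True) + eps   # row-wise normalization
        alpha = np.clip(alpha - lr_alpha * g_a, 0.0, 1.0)     # upper level, then clip
    return W, alpha

rng = np.random.default_rng(1)
N, K, L = 6, 4, 2
block_of = np.array([0, 0, 0, 1, 1, 1])          # two blocks of three features
train = (rng.integers(0, N, size=64), rng.normal(size=64))
val = (rng.integers(0, N, size=32), rng.normal(size=32))
W, alpha = dnis_first_order(train, val, block_of, N, K, L)
```

The model parameters only ever see training data and the soft selection layer only ever sees validation data, which is the separation that the bi-level formulation asks for.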

2.5 Deriving Feature Embeddings in Mixed Dimensions

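After optimization, we have the learned parameters for $\theta$, $\mathbf{W}$ and $\alpha$. This allows us to derive the discrete mixed dimension scheme $\mathbf{D}$. Specifically, for feature $f_i$ in the $l$-th block, we can compute its output embedding $\tilde{\mathbf{e}}_i$ with $\mathbf{e}_i$ and $\alpha_{l*}$ following Equation (4). By merging the embedding layer with the soft selection layer, we collect the output embeddings for all the features in $\mathcal{F}$ and form an output embedding matrix $\tilde{\mathbf{E}} \in \mathbb{R}^{N \times K}$. We then prune non-informative embedding dimensions in $\tilde{\mathbf{E}}$ as follows:

$$\tilde{E}_{i,j} = \begin{cases} 0, & \text{if } |\tilde{E}_{i,j}| < \epsilon \\ \tilde{E}_{i,j}, & \text{otherwise} \end{cases} \tag{9}$$

where $\epsilon$ is a threshold that can be manually tuned according to the requirements on model performance and computational resources. The pruned output embedding matrix is sparse and can be used to derive the discrete mixed dimension scheme $\mathbf{D}$ and the embedding value vectors $\mathcal{V}$ for $\mathcal{F}$ accordingly.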
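
The merge-and-prune step can be sketched in NumPy as below. The function name, the threshold value and the toy matrices are hypothetical, and dimension positions are zero-based as usual in Python.

```python
import numpy as np

def prune_and_derive(W, alpha, block_of, threshold):
    """Merge embedding and soft selection layers, zero small entries, derive D and V."""
    merged = W * alpha[block_of]                                  # output embeddings
    pruned = np.where(np.abs(merged) < threshold, 0.0, merged)    # magnitude pruning
    D = [np.flatnonzero(row).tolist() for row in pruned]          # index vectors
    V = [row[idx].tolist() for row, idx in zip(pruned, D)]        # value vectors
    return pruned, D, V

W = np.array([[0.9, -0.4, 0.2],
              [0.5, 0.3, -0.8]])
alpha = np.array([[1.0, 1.0, 0.1],      # block 0 nearly switches off dim 2
                  [1.0, 0.0, 0.5]])     # block 1 switches off dim 1
block_of = np.array([0, 1])             # feature 0 in block 0, feature 1 in block 1
pruned, D, V = prune_and_derive(W, alpha, block_of, threshold=0.05)
```

Both features end up with two of the three base dimensions, but not the same two, which is what a mixed dimension scheme looks like after pruning.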

Relation to network pruning. Network pruning, as a kind of model compression technique, improves the efficiency of over-parameterized deep neural networks by removing redundant neurons or connections without damaging model performance Cheng et al. (2017); Liu et al. (2019b); Frankle and Carbin (2019). Recent works on network pruning Han et al. (2015); Molchanov et al. (2017); Li et al. (2017) generally perform iterative pruning and finetuning over a pretrained over-parameterized deep network. Instead of simply removing redundant weights, our proposed method DNIS optimizes feature embeddings with the gradients from the validation set, and only prunes non-informative embedding dimensions and their values in one shot after model training. This also avoids manually tuning thresholds and regularization terms per iteration. We have conducted experiments to compare the performance of DNIS and network pruning methods in Section 3.4.

3 Experiments

3.1 Experimental Settings

Datasets. We used two benchmark datasets Movielens Harper and Konstan (2016) and Criteo Labs (2014) for collaborative filtering (CF) and click-through rate (CTR) prediction tasks, respectively. For each dataset, we randomly split the instances by 8:1:1 to obtain the training, validation and test sets. The statistics of the two datasets are summarized in Table 1.
(1) Movielens consists of more than 20 million user ratings ranging from 1 to 5 on different movies.
(2) Criteo is a popular industry benchmark dataset for CTR prediction, which contains 13 numerical feature fields and 26 categorical feature fields. Each label indicates whether a user has clicked the corresponding item.

| Dataset | Task Type | Instance# | Field# | Feature# |
|---|---|---|---|---|
| Movielens | Rating Prediction | 20,000,263 | 2 | 165,771 |
| Criteo | CTR Prediction | 45,840,617 | 39 | 2,086,936 |

Table 1: Statistics of the datasets.

Evaluation metrics. We adopt MSE (mean squared error) for rating prediction in CF, and use AUC (Area Under the ROC Curve) and Logloss (cross entropy) for CTR prediction. In addition to model performance, we also report the parameter size and the search cost of each method.

Comparison methods. We compare our DNIS method with the following three approaches.

Grid Search. This is the traditional approach to searching for a uniform embedding dimension. In our experiments, we searched 16 groups of dimensions, ranging from 4 to 64 with a stride of 4.

Random Search. Random search has been recognized as a strong baseline for NAS problems Liu et al. (2019a). When randomly searching for a mixed dimension scheme, we applied the same feature blocking as we did for DNIS. Following the intuition that high-frequency features desire larger numbers of dimensions, we generated 16 random descending sequences as the search space of the mixed dimension scheme for each model and report the best results.

MDE (Mixed Dimension Embeddings Ginart et al. (2019)). This method performs feature blocking and applies a heuristic scheme where the number of dimensions per feature block is proportional to some fractional power of its frequency. We tested 16 groups of hyperparameter settings as suggested in the original paper and report the best results.

For DNIS, we show its performance before and after the dimension pruning in Equation (9), and report the storage size of the pruned sparse matrix using the COO sparse matrix format Virtanen et al. (2020). We show the results with different compression rates (CR), i.e., the unpruned embedding parameter size divided by the pruned size.

Implementation details. We implement our method using PyTorch Paszke et al. (2019). We apply the Adam optimizer with a learning rate of 0.001 for the model parameters $\Theta$ and of 0.01 for the soft selection layer parameters $\alpha$. The mini-batch size is set to 4096 and the uniform base dimension $K$ is set to 64 for all the models. We apply the same blocking scheme for random search, MDE and DNIS for a fair comparison. The default number of feature blocks $L$ is set to 10 and 6 for the Movielens and Criteo datasets, respectively. We employ various latent factor models: MF, MLP He et al. (2017) and NeuMF He et al. (2017) for the CF task, and FM Rendle (2010), Wide&Deep Cheng et al. (2016) and DeepFM Guo et al. (2017) for CTR prediction, where the configurations of the latent factor models are the same across the different methods. Besides, we use early stopping for all the methods according to the change of validation loss during model training. All the experiments were performed using NVIDIA GeForce RTX 2080Ti GPUs.
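
To make the COO storage and the compression rate concrete, here is a small pure-Python sketch. It assumes the rate is counted over embedding values only (stored indices are ignored), which is one possible reading of the definition above; the function names and the toy matrix are hypothetical.

```python
def coo_from_dense(matrix):
    """Return the COO triplets (rows, cols, values) of the non-zero entries."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(matrix):
        for j, val in enumerate(row):
            if val != 0.0:
                rows.append(i)
                cols.append(j)
                vals.append(val)
    return rows, cols, vals

def compression_rate(matrix):
    """Unpruned parameter count divided by the number of surviving values."""
    total = sum(len(row) for row in matrix)
    kept = len(coo_from_dense(matrix)[2])
    return total / kept

pruned = [[0.9, 0.0, 0.0, 0.0],
          [0.0, 0.0, -0.4, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [0.5, 0.0, 0.0, 0.1]]
rows, cols, vals = coo_from_dense(pruned)
cr = compression_rate(pruned)
```

Four of the sixteen base-dimension entries survive, giving a compression rate of 4 under this counting. Because COO also stores a row and a column index per value, the actual storage saving is smaller than the rate computed on values alone.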

3.2 Comparison Results

| Search Methods | MF Params (M) | MF Time Cost | MF MSE | MLP Params (M) | MLP Time Cost | MLP MSE | NeuMF Params (M) | NeuMF Time Cost | NeuMF MSE |
|---|---|---|---|---|---|---|---|---|---|
| Grid Search | 33 | 16 | 0.622 | 35 | 8 | 0.640 | 61 | 4 | 0.625 |
| Random Search | 33 | 16 | 0.6153 | 22 | 4 | 0.6361 | 30 | 2 | 0.6238 |
| MDE | 35 | 24 | 0.6138 | 35 | 5 | 0.6312 | 27 | 3 | 0.6249 |
| DNIS (unpruned) | 37 | 1 | 0.6096 | 36 | 1 | 0.6255 | 72 | 1 | 0.6146 |
| DNIS (CR = 2) | 21 | 1 | 0.6126 | 20 | 1 | 0.6303 | 40 | 1 | 0.6169 |
| DNIS () | 17 | 1 | 0.6167 | 17 | 1 | 0.6361 | 32 | 1 | 0.6213 |

Table 2: Comparison between DNIS and baselines on the CF task using the Movielens dataset. We also report the storage size of the derived feature embeddings and the training time per method. For DNIS, we show its results with and without pruning at different compression rates (CR), i.e., the ratio of the embedding parameter size without pruning to that after pruning.

| Search Methods | FM Params (M) | FM Time Cost | FM AUC | FM Logloss | Wide&Deep Params (M) | Wide&Deep Time Cost | Wide&Deep AUC | Wide&Deep Logloss | DeepFM Params (M) | DeepFM Time Cost | DeepFM AUC | DeepFM Logloss |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Grid Search | 441 | 16 | 0.7987 | 0.4525 | 254 | 16 | 0.8079 | 0.4435 | 382 | 14 | 0.8080 | 0.4435 |
| Random Search | 73 | 12 | 0.7997 | 0.4518 | 105 | 16 | 0.8084 | 0.4434 | 105 | 12 | 0.8084 | 0.4434 |
| MDE | 397 | 16 | 0.7986 | 0.4530 | 196 | 16 | 0.8076 | 0.4439 | 396 | 16 | 0.8077 | 0.4438 |
| DNIS (unpruned) | 441 | 3 | 0.8004 | 0.4510 | 395 | 3 | 0.8088 | 0.4429 | 416 | 3 | 0.8090 | 0.4427 |
| DNIS (CR = 20) | 26 | 3 | 0.8004 | 0.4510 | 29 | 3 | 0.8087 | 0.4430 | 29 | 3 | 0.8088 | 0.4428 |
| DNIS (CR = 30) | 17 | 3 | 0.8004 | 0.4510 | 19 | 3 | 0.8085 | 0.4432 | 20 | 3 | 0.8086 | 0.4430 |

Table 3: Comparison between DNIS and baselines on the CTR prediction task using the Criteo dataset.

Table 2 and Table 3 show the comparison results of different NIS methods on CF and CTR prediction tasks, respectively. First, we can see that DNIS achieves the best prediction performance over all the model architectures for both tasks. It is worth noting that DNIS also has the lowest time cost in every column of both tables. The results confirm that DNIS is able to learn discriminative feature embeddings with significantly higher efficiency than the existing search methods. Second, DNIS with dimension pruning achieves competitive or better performance than the baselines, and can yield a significant reduction in model parameter size. For example, DNIS with a compression rate (CR) of 2 outperforms all the baselines on Movielens, and yet reaches the minimal parameter size. The advantages of DNIS with the CR of 20 and 30 are more significant on Criteo. We observe that DNIS can achieve a higher CR on Criteo than on Movielens without sacrificing prediction performance. This is because the distribution of feature frequency on Criteo is severely skewed, leading to a large number of redundant dimensions for low-frequency features. Third, among all the baselines, MDE performs the best on Movielens and Random Search performs the best on Criteo, while Grid Search gets the worst results on both tasks. This verifies the importance of applying mixed dimension embeddings to latent factor models. Note that all three baselines have searched over 16 groups of feature dimensions, and their time costs differ slightly due to the early stopping of model training. Fourth, we find that MF achieves better prediction performance on the CF task than the other two model architectures. The reason may be the overfitting problem of MLP and NeuMF that results in poor generalization. Besides, DeepFM shows the best results on the CTR prediction task, suggesting that the ensemble of DNN and FM is beneficial to improving CTR prediction accuracy.

3.3 Hyperparameter Investigation

Figure 2: Effects of hyperparameters on the performance of DNIS. We report the MSE results of MF on the Movielens dataset w.r.t. (a) different base embedding dimensions $K$ and (b) different feature block numbers $L$.

We investigate the effects of two important hyperparameters, $K$ and $L$, in DNIS. Figure 2(a) shows the performance change of MF w.r.t. different settings of $K$. We can see that increasing $K$ is beneficial to reducing MSE. This is because a larger $K$ allows a larger search space that could improve the representations of high-frequency features by giving them more embedding dimensions. Besides, we observe diminishing returns. Specifically, the MSE is reduced by 0.005 when $K$ increases from 64 to 128, whereas the MSE reduction is merely 0.001 when $K$ changes from 512 to 1024. This implies that $K$ may have exceeded the largest number of dimensions required by all the features, leading to minor improvements. Figure 2(b) shows the effects of the number of feature blocks $L$. We find that increasing $L$ improves the prediction performance of DNIS, and the performance improvement decreases as $L$ becomes larger. This is because dividing features into more blocks facilitates a finer-grained control on the embedding dimensions of different features, leading to more flexible mixed dimension schemes. Since both $K$ and $L$ affect the computational complexity of DNIS, we suggest choosing reasonably large values for $K$ and $L$ to balance computational efficiency and predictive performance based on the application requirements.

3.4 Analysis on DNIS Results

Figure 3: (a) The distribution of trained parameters $\alpha$ of the soft selection layer in different feature blocks. Here we show the result of MF on the Movielens dataset, where $L$ is set to 10. (b) The joint distribution plot of feature embedding dimensions and feature frequencies after dimension pruning. (c) Comparison of DNIS and network pruning performance over different pruning rates.

We first study the learned feature dimensions of DNIS through the learned soft selection layer $\alpha$ and the feature embedding dimensions after dimension pruning. Figure 3(a) depicts the distributions of the trained parameters in $\alpha$ for the 10 feature blocks on Movielens. Recall that the blocks are sorted in descending order of feature frequency. It can be seen that the learned parameters in $\alpha$ for the feature blocks with lower frequencies converge to smaller values, indicating that lower-frequency features tend to be represented by smaller numbers of embedding dimensions. Figure 3(b) provides the number of embedding dimensions per feature after dimension pruning. The results show that features with higher frequencies end up with more embedding dimensions, whereas the dimensions are more likely to be pruned for low-frequency features. Nevertheless, there is no strong correlation between the derived embedding dimension and the feature frequency. Note that the embedding dimensions for low-frequency features scatter over a long range of numbers. This is consistent with the inferior performance of MDE, which directly determines the number of feature embedding dimensions according to the frequency.

We further compare DNIS with the network pruning method of Han et al. (2015). For illustration purposes, we provide the results of the FM model on the Criteo dataset. Figure 3(c) shows the performance of the two methods at different pruning rates (i.e., the ratio of pruned embedding values). DNIS achieves better AUC and Logloss results than network pruning over all the pruning rates. This is because DNIS optimizes feature embeddings with the gradients from the validation set, which benefits the selection of predictive dimensions, instead of simply removing redundant weights in the embeddings.

4 Conclusion

In this paper, we introduced Differentiable Neural Input Search (DNIS), which searches for a mixed dimension scheme for different features adaptively from data. Instead of selecting from a predefined discrete set of candidate dimension schemes, DNIS is able to optimize embedding dimensions in a continuous search space with gradient descent. The key idea is to develop a soft dimension selection layer that controls the significance of each embedding dimension, and can be optimized with model’s validation performance through gradient descent. We show that DNIS can be seamlessly incorporated with various existing latent factor models for recommendation. We conduct extensive experiments on collaborative filtering and click-through rate prediction tasks, where DNIS outperforms the existing NIS baselines in terms of recommendation performance, training efficiency and parameter size.


References

  • B. Baker, O. Gupta, N. Naik, and R. Raskar (2017) Designing neural network architectures using reinforcement learning. In ICLR 2017.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML 2013, pp. 115–123.
  • H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR 2019.
  • L. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens (2018) Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS 2018, pp. 8713–8724.
  • Y. Chen, B. Chen, X. He, C. Gao, Y. Li, J. Lou, and Y. Wang (2019) λOpt: learn to regularize recommender models in finer levels. In KDD 2019, pp. 978–986.
  • H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah (2016) Wide & deep learning for recommender systems. In DLRS@RecSys 2016, pp. 7–10.
  • W. Cheng, Y. Shen, Y. Zhu, and L. Huang (2018) DELF: a dual-embedding based deep latent factor model for recommendation. In IJCAI 2018, pp. 3329–3335.
  • Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2017) A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
  • B. Colson, P. Marcotte, and G. Savard (2007) An overview of bilevel optimization. Annals of Operations Research 153 (1), pp. 235–256.
  • P. Covington, J. Adams, and E. Sargin (2016) Deep neural networks for YouTube recommendations. In RecSys 2016, pp. 191–198.
  • T. Domhan, J. T. Springenberg, and F. Hutter (2015) Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI 2015, pp. 3460–3468.
  • T. Elsken, J. H. Metzen, and F. Hutter (2019) Efficient multi-objective neural architecture search via Lamarckian evolution. In ICLR 2019.
  • L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil (2018) Bilevel programming for hyperparameter optimization and meta-learning. In ICML 2018, pp. 1563–1572.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR 2019.
  • A. Ginart, M. Naumov, D. Mudigere, J. Yang, and J. Zou (2019) Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI 2017, pp. 1725–1731.
  • S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
  • F. M. Harper and J. A. Konstan (2016) The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4), pp. 19:1–19:19.
  • X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In WWW 2017, pp. 173–182.
  • M. R. Joglekar, C. Li, J. K. Adams, P. Khaitan, and Q. V. Le (2019) Neural input search for large scale recommendation models. arXiv preprint arXiv:1907.04471.
  • C. Labs (2014)
  • H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient convnets. In ICLR 2017.
  • L. Li and A. Talwalkar (2019) Random search and reproducibility for neural architecture search. In UAI 2019, pp. 129.
  • J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, and G. Sun (2018) xDeepFM: combining explicit and implicit feature interactions for recommender systems. In KDD 2018, pp. 1754–1763.
  • H. Liu, K. Simonyan, and Y. Yang (2019a) DARTS: differentiable architecture search. In ICLR 2019.
  • Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019b) Rethinking the value of network pruning. In ICLR 2019.
  • D. Maclaurin, D. Duvenaud, and R. P. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In ICML 2015, pp. 2113–2122.
  • H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, and F. Hutter (2016) Towards automatically-tuned neural networks. In AutoML Workshop@ICML 2016, pp. 58–65.
  • G. F. Miller, P. M. Todd, and S. U. Hegde (1989) Designing neural networks using genetic algorithms. In ICGA 1989, pp. 379–384.
  • P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2017) Pruning convolutional neural networks for resource efficient inference. In ICLR 2017.
  • J. Park, M. Naumov, P. Basu, S. Deng, A. Kalaiah, D. Khudia, J. Law, P. Malani, A. Malevich, S. Nadathur, et al. (2018) Deep learning inference in Facebook data centers: characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS 2019, pp. 8024–8035.
  • F. Pedregosa (2016) Hyperparameter optimization with approximate gradient. In ICML 2016, pp. 737–746.
  • E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In AAAI 2019, pp. 4780–4789.
  • E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In ICML 2017, pp. 2902–2911.
  • S. Rendle (2010) Factorization machines. In ICDM 2010, pp. 995–1000.
  • P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. Jarrod Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17, pp. 261–272.
  • S. Xie, H. Zheng, C. Liu, and L. Lin (2019) SNAS: stochastic neural architecture search. In ICLR 2019.
  • X. Zhao, C. Wang, M. Chen, X. Zheng, X. Liu, and J. Tang (2020) AutoEmb: automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252.
  • Z. Zhong, J. Yan, W. Wu, J. Shao, and C. Liu (2018) Practical block-wise neural network architecture generation. In CVPR 2018, pp. 2423–2432.
  • B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR 2017.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR 2018, pp. 8697–8710.