1 Introduction
Most state-of-the-art recommender systems employ latent factor models that vectorize raw input features into dense embeddings. A key question often asked of feature embeddings is: “How should we determine the dimensions of feature embeddings?” The common practice is to set a uniform dimension for all the features and treat it as a hyperparameter that is tuned on the validation set. However, the manual search for a uniform embedding dimension can be computationally intensive and may even result in suboptimal model performance, since a single dimension is not necessarily suitable for all the features. Intuitively, a larger dimension is needed for popular features that appear in most data samples, giving the model a higher capacity to fit the related data samples Joglekar et al. (2019); Zhao et al. (2020). Likewise, less frequent features are better assigned smaller dimensions to avoid overfitting on scarce data samples. As such, it is desirable to impose a mixed dimension scheme for different features to achieve better recommendation performance. Another notable fact is that embedding layers in industrial web-scale recommender systems Park et al. (2018); Covington et al. (2016) account for the majority of model parameters and can consume hundreds of gigabytes of memory. Replacing a uniform feature embedding dimension with varying dimensions is the key to removing redundant embedding weights for infrequent and less predictive features, leading to lower memory cost.

Some recent works Ginart et al. (2019); Joglekar et al. (2019) have focused on searching for varying feature dimensions automatically, which is defined as the Neural Input Search (NIS) problem. Ginart et al. (2019) proposed to use an empirical function to heuristically decide the embedding dimensions for different features according to their frequencies of occurrence, where the empirical function involves several hyperparameters that need to be carefully tuned to yield a good search result. Joglekar et al. (2019) proposed a reinforcement learning-based method for addressing the NIS problem. They first divided a base feature dimension equally into several blocks, and then applied reinforcement learning to produce decision sequences for different features on the selection of dimension blocks. These methods, however, restrict each feature dimension to be chosen from a small set of candidate dimensions that is explicitly predefined Joglekar et al. (2019) or implicitly controlled by hyperparameters Ginart et al. (2019). Although this restriction reduces the search space and thereby improves computational efficiency, another question then arises: how should the candidate dimensions be decided? Notably, a suboptimal set of candidate dimensions could lead to a suboptimal search result that hurts the model’s recommendation performance.
In this paper, we propose Differentiable Neural Input Search (DNIS), which approaches the NIS problem in a differentiable manner through gradient descent. Instead of searching over a predefined discrete set of candidate dimensions, DNIS relaxes the search space to be continuous and optimizes the selection of each feature dimension by descending the model’s validation loss. More specifically, we introduce a soft selection layer between the embedding layer and the feature interaction layers of latent factor models. Each input feature embedding is fed into the soft selection layer to perform an element-wise multiplication with a scaling vector. The soft selection layer directly controls the significance of each dimension of the feature embedding, and it is essentially a part of the model architecture that can be optimized according to the model’s validation performance. We also propose a gradient normalization technique to keep the back-propagated gradients steady during the optimization of the selection layer. After training, we merge the soft selection layer with the feature embedding layer to prune redundant or less informative embedding dimensions per feature, leading to feature embeddings with a mixed dimension scheme. DNIS can be seamlessly applied to various existing architectures of latent factor models for recommendation. We conduct extensive experiments with different model architectures on the Collaborative Filtering (CF) task and the Click-Through-Rate (CTR) prediction task. The results demonstrate that our DNIS method achieves the best performance compared with the existing neural input search baselines over all the model architectures, and can increase parameter efficiency by pruning redundant embedding weights for both the CF and CTR prediction tasks.

The major contributions of this paper can be summarized as follows:

We propose DNIS, a differentiable neural input search method to relax the NIS search space to be continuous, which allows searching for varying feature dimensions automatically in a differentiable manner with gradient descent.

We introduce a soft selection layer to optimize the selection of embedding dimensions for different features. A gradient normalization technique is proposed to keep the backpropagated gradients steady during the training of the soft selection layer.

Our method can be incorporated with various existing architectures of latent factor models to improve recommendation performance and reduce memory cost of embedding parameters.

We conduct experiments with different model architectures on CF and CTR prediction tasks. The results demonstrate our DNIS method outperforms the existing NIS baselines in terms of recommendation performance, training efficiency and parameter size.
2 Differentiable Neural Input Search
2.1 Background
Latent factor models. We consider a recommender system involving $M$ feature fields (e.g., user ID, item ID, item price). Typically, $M$ is 2 (including user ID and item ID) in collaborative filtering (CF) problems, whereas in the context of click-through rate (CTR) prediction, $M$ is much larger than 2 to include more feature fields. Each categorical feature field consists of a collection of discrete features, while a numerical feature field contains one scalar feature. Let $\mathcal{F}$ denote the list of features over all the fields, and let $N$ be the size of $\mathcal{F}$. For the $i$-th feature in $\mathcal{F}$, its initial representation is an $N$-dimensional sparse vector $\mathbf{x}_i$, where the $i$-th element is 1 (for a discrete feature) or a scalar number (for a scalar feature), and the others are 0s. Latent factor models generally consist of two parts: one feature embedding layer, followed by the feature interaction layers. Without loss of generality, the input instances to the latent factor model include several features belonging to the respective feature fields. The feature embedding layer transforms all the features in an input instance into dense embedding vectors. Specifically, a sparsely encoded input feature vector $\mathbf{x}_i \in \mathbb{R}^{N}$ is transformed into a $K$-dimensional embedding vector $\mathbf{e}_i \in \mathbb{R}^{K}$ as follows:
$$\mathbf{e}_i = \mathbf{W}^{\top} \mathbf{x}_i \tag{1}$$
where $\mathbf{W} \in \mathbb{R}^{N \times K}$ is known as the embedding matrix. The output of the feature embedding layer is the collection of dense embedding vectors for all the input features, which is denoted as $\mathbf{e}$. The feature interaction layers, which can take different architectures, essentially compose a parameterized function $f$ that predicts the objective based on the collected dense feature embeddings $\mathbf{e}$ for the input instance. That is,
$$\hat{y} = f(\theta, \mathbf{e}) \tag{2}$$
where $\hat{y}$ is the model’s prediction, and $\theta$ denotes the set of parameters in the interaction layers. Prior works have developed various architectures for $f$, including the simple inner product function Rendle (2010), and deep neural network-based interaction functions He et al. (2017); Cheng et al. (2018); Lian et al. (2018); Cheng et al. (2016); Guo et al. (2017). Most of the proposed architectures for the interaction layers require all the feature embeddings to be in a uniform dimension.

Neural architecture search. Neural Architecture Search (NAS) has been proposed to automatically search for the best neural network architecture. To explore the space of neural architectures, different search strategies have been studied, including random search Li and Talwalkar (2019), evolutionary methods Elsken et al. (2019); Miller et al. (1989); Real et al. (2017), Bayesian optimization Bergstra et al. (2013); Domhan et al. (2015); Mendoza et al. (2016), reinforcement learning Baker et al. (2017); Zhong et al. (2018); Zoph and Le (2017), and gradient-based methods Cai et al. (2019); Liu et al. (2019a); Xie et al. (2019). Since being proposed in Baker et al. (2017); Zoph and Le (2017), NAS has achieved remarkable performance in various tasks such as image classification Real et al. (2019); Zoph et al. (2018), semantic segmentation Chen et al. (2018) and object detection Zoph et al. (2018). However, most of these studies have focused on searching for optimal network structures automatically, while little attention has been paid to the design of the input component. This is because the input component in visual tasks is already given in the form of floating point values of image pixels. As for recommender systems, an input component based on the embedding layer is deliberately developed to transform raw features (e.g., discrete user identifiers) into dense embeddings. In this paper, we focus on the problem of neural input search, which can be considered as NAS on the input component (i.e., the embedding layer) of recommender systems.
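To make the latent factor formulation in Equations (1) and (2) concrete, the following is a minimal sketch in plain Python. It assumes a toy setting with four features and three embedding dimensions, and it uses the inner product as the interaction function; the names `embed` and `inner_product_interaction` and all numeric values are illustrative and do not come from the paper.

```python
# Sketch of the embedding lookup (Equation 1) and a simple interaction
# function (Equation 2). All names and values are illustrative assumptions.

def embed(W, x):
    """Compute e = W^T x for a sparse input vector x of length N.

    W is an N x K embedding matrix stored as a list of rows.
    """
    K = len(W[0])
    e = [0.0] * K
    for i, x_i in enumerate(x):
        if x_i != 0:  # only non-zero entries of x contribute
            for j in range(K):
                e[j] += x_i * W[i][j]
    return e


def inner_product_interaction(e_user, e_item):
    """A parameter-free interaction function: the inner product."""
    return sum(u * v for u, v in zip(e_user, e_item))


# N = 4 features (2 users, 2 items), K = 3 embedding dimensions.
W = [
    [0.1, 0.2, 0.3],  # user 0
    [0.4, 0.5, 0.6],  # user 1
    [1.0, 0.0, 1.0],  # item 0
    [0.0, 2.0, 0.0],  # item 1
]
x_user = [0, 1, 0, 0]  # one-hot encoding of user 1
x_item = [0, 0, 1, 0]  # one-hot encoding of item 0

e_user = embed(W, x_user)  # equals row 1 of W
e_item = embed(W, x_item)  # equals row 2 of W
y_hat = inner_product_interaction(e_user, e_item)
```

For a one-hot input, the product reduces to selecting one row of the embedding matrix, which is why embedding layers are usually implemented as table lookups.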
2.2 Search Space and Problem
Search space. The key idea of neural input search is to use embeddings with mixed dimensions to represent different features. To formulate feature embeddings with different dimensions, we adopt the representation for sparse vectors (with a base dimension $K$). Specifically, for each feature $f_i$, we maintain a dimension index vector $\mathbf{d}_i$ which contains the ordered locations of the feature’s existing dimensions from the set $\{1, 2, \dots, K\}$, and an embedding value vector $\mathbf{v}_i$ which stores the embedding values in the respective existing dimensions. The conversion from the index and value vectors of a feature into the $K$-dimensional embedding vector $\mathbf{e}_i$ is straightforward. Note that $\mathbf{e}_i$ corresponds to a row in the embedding matrix $\mathbf{W}$. Figure 0(a) gives an example of $\mathbf{d}_i$, $\mathbf{v}_i$ and $\mathbf{e}_i$ for the $i$-th feature in $\mathcal{F}$.
The size of $\mathbf{d}_i$ varies among different features to enforce a mixed dimension scheme. Formally, given the feature set $\mathcal{F}$, we define the mixed dimension scheme $\mathbf{d} = \{\mathbf{d}_1, \dots, \mathbf{d}_N\}$ to be the collection of dimension index vectors for all the features in $\mathcal{F}$.
We use $\mathcal{D}$ to denote the search space of the mixed dimension scheme $\mathbf{d}$ for $\mathcal{F}$, which includes $2^{NK}$ possible choices, since each of the $N$ features may keep or drop each of the $K$ base dimensions. Besides, we denote by $\mathbf{V} = \{\mathbf{v}_1, \dots, \mathbf{v}_N\}$ the set of the embedding value vectors for all the features in $\mathcal{F}$. Then we can derive the embedding matrix $\mathbf{W}$ with $\mathbf{d}$ and $\mathbf{V}$ to make use of the feature interaction layers.
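The conversion from a feature's dimension index vector and embedding value vector into a dense embedding of the base dimension, together with the size of the search space, can be sketched as follows. This is an illustration under assumptions: positions are 0-based here for convenience, and the helper names `to_dense` and `search_space_size` are hypothetical.

```python
# Sketch: sparse (index, value) representation of a mixed-dimension embedding.
# Helper names and the 0-based indexing are illustrative assumptions.

def to_dense(dim_index, values, K):
    """Expand index/value vectors into a K-dimensional embedding.

    Dimensions that do not appear in dim_index are filled with zeros.
    """
    if len(dim_index) != len(values):
        raise ValueError("index and value vectors must have the same length")
    e = [0.0] * K
    for j, v in zip(dim_index, values):
        e[j] = v
    return e


def search_space_size(N, K):
    """Each of N features keeps or drops each of K dimensions: 2^(N*K)."""
    return 2 ** (N * K)


K = 6
d_i = [0, 2, 3]          # existing dimensions of one feature
v_i = [0.5, -1.2, 0.7]   # embedding values in those dimensions
e_i = to_dense(d_i, v_i, K)

tiny_space = search_space_size(N=2, K=3)
```

Even for this tiny case with two features and three base dimensions the space already has 64 schemes, which illustrates why exhaustive search is infeasible at realistic scales.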
Problem formulation. Let $\Theta$ be the set of trainable model parameters, and let $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ be the model’s training loss and validation loss, respectively. The two losses are determined by both the mixed dimension scheme $\mathbf{d}$ and the trainable parameters $\Theta$. The goal of neural input search is to find a mixed dimension scheme $\mathbf{d} \in \mathcal{D}$ that minimizes the validation loss $\mathcal{L}_{val}(\Theta^{*}, \mathbf{d})$, where the parameters $\Theta^{*}$ given any mixed dimension scheme are obtained by minimizing the training loss. This can be formulated as:
$$\min_{\mathbf{d} \in \mathcal{D}} \ \mathcal{L}_{val}\big(\Theta^{*}(\mathbf{d}), \mathbf{d}\big) \quad \text{s.t.} \quad \Theta^{*}(\mathbf{d}) = \arg\min_{\Theta} \ \mathcal{L}_{train}(\Theta, \mathbf{d}) \tag{3}$$
The above problem formulation is consistent with hyperparameter optimization in a broader scope Maclaurin et al. (2015); Pedregosa (2016); Franceschi et al. (2018), since the mixed dimension scheme $\mathbf{d}$ can be considered as a set of model hyperparameters to be determined according to the model’s validation performance. However, the main difference is that the search space $\mathcal{D}$ in our problem is much larger than the search space of conventional hyperparameter optimization problems.
2.3 Feature Blocking
Feature blocking has been a novel ingredient used in the existing neural input search methods Joglekar et al. (2019); Ginart et al. (2019) to facilitate the reduction of the search space. The intuition is that features with similar frequencies can be grouped into a block sharing the same embedding dimension. Following the existing works, we first employ feature blocking to control the search space of the mixed dimension scheme. We sort all the features in $\mathcal{F}$ in descending order of frequency (i.e., the number of feature occurrences in the training instances). Let $q_i$ denote the frequency of feature $f_i$. We can obtain a sorted list of features $\tilde{f}_1, \tilde{f}_2, \dots, \tilde{f}_N$ such that $\tilde{q}_i \geq \tilde{q}_j$ for any $i < j$, where $\tilde{q}_i$ is the frequency of $\tilde{f}_i$. We then separate the sorted list into $L$ blocks, where the features in a block share the same dimension index vector. We denote by $\tilde{\mathbf{d}}$ the mixed dimension scheme after feature blocking. Then the length of the mixed dimension scheme becomes $L$, and the search space size is reduced from $2^{NK}$ to $2^{LK}$ accordingly, where $L \ll N$.
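A minimal sketch of the blocking step described above is given below. The text does not state how block boundaries are chosen, so this example assumes contiguous blocks of (nearly) equal size over the frequency-sorted list; the function name `block_features` and the frequencies are made up for illustration.

```python
# Sketch of feature blocking: sort by descending frequency, then split into
# L contiguous blocks. Equal-size splitting is an assumption of this sketch.

def block_features(freq, L):
    """Return L lists of feature ids, from most to least frequent."""
    N = len(freq)
    if not 1 <= L <= N:
        raise ValueError("L must be between 1 and the number of features")
    # Feature ids ordered by descending frequency (ties keep their id order).
    order = sorted(range(N), key=lambda i: -freq[i])
    base, extra = divmod(N, L)
    blocks, start = [], 0
    for l in range(L):
        size = base + (1 if l < extra else 0)  # spread the remainder
        blocks.append(order[start:start + size])
        start += size
    return blocks


# Hypothetical occurrence counts for 7 features.
freq = [50, 3, 120, 7, 7, 1, 30]
blocks = block_features(freq, L=3)
# blocks[0] holds the most frequent features, blocks[-1] the rarest ones.
```

All features in one block then share a single dimension index vector, so only one set of dimension choices per block has to be searched.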
2.4 Continuous Relaxation and Differentiable Optimization
Continuous relaxation. After feature blocking, in order to optimize the mixed dimension scheme $\tilde{\mathbf{d}}$, we first transform $\tilde{\mathbf{d}}$ into a binary dimension indicator matrix $\tilde{\mathbf{D}} \in \mathbb{R}^{L \times K}$, where each element in $\tilde{\mathbf{D}}$ is either 1 or 0, indicating the existence of the corresponding embedding dimension according to $\tilde{\mathbf{d}}$. We then introduce a soft selection layer to relax the search space of $\tilde{\mathbf{D}}$ to be continuous. The soft selection layer is essentially a numerical matrix $\alpha \in \mathbb{R}^{L \times K}$, where each element in $\alpha$ satisfies $0 \leq \alpha_{l,j} \leq 1$. That is, each binary choice $\tilde{\mathbf{D}}_{l,j}$ (the existence of the $j$-th embedding dimension in the $l$-th feature block) in $\tilde{\mathbf{D}}$ is relaxed to be a continuous variable $\alpha_{l,j}$ within the range $[0, 1]$. We insert the soft selection layer between the feature embedding layer and the interaction layers in the latent factor model, as illustrated in Figure 0(b). Given $\alpha$ and the embedding matrix $\mathbf{W}$, the output embedding $\tilde{\mathbf{e}}_i$ of a feature $f_i$ in the $l$-th block produced by the bottom two layers can be computed as follows:
$$\tilde{\mathbf{e}}_i = \mathbf{e}_i \odot \alpha_{l*} \tag{4}$$
where $\alpha_{l*}$ is the $l$-th row in $\alpha$, and $\odot$ is the element-wise product. By applying Equation (4) to all the input features, we can obtain the output feature embeddings $\tilde{\mathbf{e}}$. Next, we supply $\tilde{\mathbf{e}}$ to the feature interaction layers for final prediction as specified in Equation (2). Note that $\alpha$ is used to softly select the dimensions of feature embeddings during model training, and the discrete mixed dimension scheme will be derived after training.
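The soft selection in Equation (4) amounts to an element-wise product between a feature's embedding and the row of the soft selection matrix that belongs to the feature's block. The sketch below illustrates this with plain Python lists; the matrix values, the `block_of` mapping and the function name `soft_select` are assumptions made for the example.

```python
# Sketch of the soft selection layer (Equation 4). Values are illustrative.

def soft_select(e, alpha_row):
    """Scale every embedding dimension by its soft selection weight."""
    return [e_j * a_j for e_j, a_j in zip(e, alpha_row)]


# Soft selection matrix for 2 feature blocks and 4 base dimensions.
alpha = [
    [1.0, 0.9, 0.6, 0.2],  # block 0: high-frequency features
    [0.8, 0.3, 0.0, 0.0],  # block 1: low-frequency features
]
block_of = {0: 0, 1: 1}    # feature id -> block id (hypothetical)

# Embedding matrix with one row per feature.
W = [
    [0.5, -1.0, 2.0, 4.0],
    [1.0, 1.0, 1.0, 1.0],
]

outputs = [soft_select(W[i], alpha[block_of[i]]) for i in range(len(W))]
# outputs[1] has zeros in its last two dimensions, so those dimensions
# contribute nothing to the interaction layers for feature 1.
```

Because the weights are continuous, the loss is differentiable with respect to them, which is what allows the dimension selection to be optimized by gradient descent.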
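Differentiable optimization. Now that we have relaxed the mixed dimension scheme $\tilde{\mathbf{d}}$ (after feature blocking) via the soft selection layer $\alpha$, our problem stated in Equation (3) can be transformed into:
$$\min_{\alpha} \ \mathcal{L}_{val}\big(\Theta^{*}(\alpha), \alpha\big) \quad \text{s.t.} \quad \Theta^{*}(\alpha) = \arg\min_{\Theta} \ \mathcal{L}_{train}(\Theta, \alpha), \quad \alpha_{l,j} \in [0, 1] \tag{5}$$
where $\Theta = \{\theta, \mathbf{W}\}$ represents the model parameters in both the embedding layer and the interaction layers. Equation (5) essentially defines a bilevel optimization problem Colson et al. (2007), which has been studied in differentiable NAS Liu et al. (2019a) and gradient-based hyperparameter optimization Chen et al. (2019); Franceschi et al. (2018); Pedregosa (2016). Basically, $\alpha$ and $\Theta$ are respectively treated as the upper-level and lower-level variables to be optimized in an interleaving way. To deal with the expensive optimization of $\Theta$, we follow the common practice that approximates $\Theta^{*}(\alpha)$ by adapting $\Theta$ using a single training step:
$$\Theta^{*}(\alpha) \approx \Theta - \xi \nabla_{\Theta} \mathcal{L}_{train}(\Theta, \alpha) \tag{6}$$
where $\xi$ is the learning rate for the one-step update of the model parameters $\Theta$. Then we can optimize $\alpha$ based on the following gradient:
$$\nabla_{\alpha} \mathcal{L}_{val}\big(\Theta - \xi \nabla_{\Theta} \mathcal{L}_{train}(\Theta, \alpha), \alpha\big) = \nabla_{\alpha} \mathcal{L}_{val}(\Theta', \alpha) - \xi \nabla^{2}_{\alpha, \Theta} \mathcal{L}_{train}(\Theta, \alpha) \, \nabla_{\Theta'} \mathcal{L}_{val}(\Theta', \alpha) \tag{7}$$
where $\Theta' = \Theta - \xi \nabla_{\Theta} \mathcal{L}_{train}(\Theta, \alpha)$ denotes the model parameters after the one-step update. Equation (7) can be solved efficiently using existing deep learning libraries that allow automatic differentiation, such as PyTorch Paszke et al. (2019). The second-order derivative term in Equation (7) can be omitted to further improve computational efficiency by treating $\xi$ as near zero, which is called the first-order approximation. In this paper, we adopt the first-order approximation in DNIS by default, since we find the final performance is similar with and without the approximation. Algorithm 1 (lines 5-10) summarizes the bilevel optimization procedure for solving Equation (5).

Gradient normalization. During the optimization of $\alpha$ by the gradient $\nabla_{\alpha} \mathcal{L}_{val}(\Theta', \alpha)$, we propose a gradient normalization technique to normalize the row-wise gradients of $\alpha$ over each training batch: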
$$g_{norm}(\alpha_{l*}) = \frac{g(\alpha_{l*})}{\lVert g(\alpha_{l*}) \rVert_{2} + \epsilon_{g}}, \quad l = 1, 2, \dots, L \tag{8}$$
where $\alpha_{l*}$ is the $l$-th row of the soft selection layer $\alpha$, $g$ and $g_{norm}$ denote the gradients before and after normalization respectively, and $\epsilon_{g}$ is a small value (e.g., 1e-7) to avoid numerical overflow. The consideration is that the magnitude of the gradients of $\alpha$ varies a lot over feature blocks due to the significant difference in feature frequency. By normalizing the gradients for each block, we can apply a single learning rate to different rows of $\alpha$ during optimization. Otherwise, a single learning rate shared by different feature blocks may easily fall short in optimizing all the rows of $\alpha$.
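The effect of row-wise gradient normalization can be illustrated with a small sketch. It assumes the norm in the denominator is the Euclidean norm and that the soft selection weights are kept in the range from 0 to 1 by clipping after each step; both are assumptions of this example, as are the function names and the gradient values.

```python
import math

# Sketch of row-wise gradient normalization for the soft selection layer.
# The L2 norm and the clipping to [0, 1] are assumptions of this example.

def normalize_rows(grad, eps=1e-7):
    """Divide each row of the gradient by its L2 norm plus a small eps."""
    normalized = []
    for row in grad:
        norm = math.sqrt(sum(g * g for g in row))
        normalized.append([g / (norm + eps) for g in row])
    return normalized


def update_alpha(alpha, grad, lr):
    """One gradient step with normalized gradients, clipped to [0, 1]."""
    g_norm = normalize_rows(grad)
    return [
        [min(1.0, max(0.0, a - lr * g)) for a, g in zip(a_row, g_row)]
        for a_row, g_row in zip(alpha, g_norm)
    ]


# Hypothetical gradients for 2 blocks and 2 dimensions: the gradients of
# the high-frequency block (row 0) are 1000 times larger than row 1.
grad = [[3.0, 4.0], [0.003, 0.004]]
g_norm = normalize_rows(grad)

alpha = [[0.5, 0.5], [0.5, 0.5]]
alpha_new = update_alpha(alpha, grad, lr=0.01)
```

After normalization both rows have roughly unit norm, so a single learning rate moves the weights of frequent and infrequent blocks by comparable amounts.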
2.5 Deriving Feature Embeddings in Mixed Dimensions
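After optimization, we have the learned parameters for $\theta$, $\mathbf{W}$ and $\alpha$. This allows us to derive the discrete mixed dimension scheme $\mathbf{d}$. Specifically, for feature $f_i$ in the $l$-th block, we can compute its output embedding $\tilde{\mathbf{e}}_i$ with its embedding $\mathbf{e}_i$ and the $l$-th row $\alpha_{l*}$ of $\alpha$ following Equation (4). By merging the embedding layer with the soft selection layer, we collect the output embeddings for all the features in $\mathcal{F}$ and form an output embedding matrix $\tilde{\mathbf{E}} \in \mathbb{R}^{N \times K}$. We then prune non-informative embedding dimensions in $\tilde{\mathbf{E}}$ as follows:
$$\tilde{\mathbf{E}}_{i,j} = \begin{cases} 0, & \text{if } \lvert \tilde{\mathbf{E}}_{i,j} \rvert < \epsilon \\ \tilde{\mathbf{E}}_{i,j}, & \text{otherwise} \end{cases} \tag{9}$$
where $\epsilon$ is a threshold that can be manually tuned according to the requirements on model performance and computational resources. The pruned output embedding matrix $\tilde{\mathbf{E}}$ is sparse and can be used to derive the discrete mixed dimension scheme $\mathbf{d}$ and the embedding value vectors $\mathbf{V}$ for $\mathcal{F}$ accordingly.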
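The one-shot pruning step and the derivation of per-feature index and value vectors can be sketched as follows. The matrix entries, the threshold value and the function names `prune` and `to_sparse` are illustrative assumptions; positions are 0-based in this sketch.

```python
# Sketch of threshold pruning (Equation 9) followed by extraction of the
# dimension index vector and embedding value vector of each feature.

def prune(E, threshold):
    """Zero out every entry whose absolute value is below the threshold."""
    return [[v if abs(v) >= threshold else 0.0 for v in row] for row in E]


def to_sparse(row):
    """Return (dimension indices, values) of the surviving entries."""
    dims = [j for j, v in enumerate(row) if v != 0.0]
    vals = [row[j] for j in dims]
    return dims, vals


# Hypothetical output embedding matrix for 2 features and 4 dimensions.
E = [
    [0.9, -0.4, 0.05, 0.0],    # frequent feature: larger values survive
    [0.2, 0.01, -0.003, 0.0],  # rare feature: mostly near zero
]
E_pruned = prune(E, threshold=0.1)
scheme = [to_sparse(row) for row in E_pruned]
```

In this toy case the first feature keeps two dimensions and the second keeps one, which is an instance of a mixed dimension scheme.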
Relation to network pruning. Network pruning, as one kind of model compression technique, improves the efficiency of over-parameterized deep neural networks by removing redundant neurons or connections without damaging model performance Cheng et al. (2017); Liu et al. (2019b); Frankle and Carbin (2019). Recent works on network pruning Han et al. (2015); Molchanov et al. (2017); Li et al. (2017) generally perform iterative pruning and fine-tuning over a pre-trained over-parameterized deep network. Instead of simply removing redundant weights, our proposed method DNIS optimizes feature embeddings with the gradients from the validation set, and only prunes non-informative embedding dimensions and their values in one shot after model training. This also avoids manually tuning thresholds and regularization terms per iteration. We have conducted experiments to compare the performance of DNIS and network pruning methods in Section 3.4.
3 Experiments
3.1 Experimental Settings
Datasets. We used two benchmark datasets, Movielens Harper and Konstan (2016) and Criteo Labs (2014), for the collaborative filtering (CF) and click-through rate (CTR) prediction tasks, respectively.
For each dataset, we randomly split the instances by 8:1:1 to obtain the training, validation and test sets. The statistics of the two datasets are summarized in Table 1.
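A random 8:1:1 split can be sketched in a few lines. The fixed seed and the function name `split_8_1_1` are assumptions of this example; the text does not say how the randomness was controlled.

```python
import random

# Sketch of a random 8:1:1 split into training, validation and test sets.
# The seed and the rounding of set sizes are assumptions of this example.

def split_8_1_1(instances, seed=0):
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = (8 * n) // 10
    n_val = n // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remaining instances
    return train, val, test


train, val, test = split_8_1_1(range(100))
```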
(1) Movielens consists of more than 20 million user ratings ranging from 1 to 5 on different movies.
(2) Criteo is a popular industry benchmark dataset for CTR prediction, which contains 13 numerical feature fields and 26 categorical feature fields. Each label indicates whether a user has clicked the corresponding item.
Table 1: Statistics of the two datasets.
Dataset  Task Type  Instance#  Field#  Feature#
Movielens  Rating Prediction  20,000,263  2  165,771
Criteo  CTR Prediction  45,840,617  39  2,086,936
Evaluation metrics.
We adopt MSE (mean squared error) for rating prediction in CF, and use AUC (Area Under the ROC Curve) and Logloss (cross entropy) for CTR prediction. In addition to model performance, we also report the parameter size and the search cost of each method.
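For reference, the three metrics can be written out in plain Python as below. This is a sketch rather than the evaluation code used in the experiments: the AUC is computed by direct pairwise comparison, which is quadratic in the number of instances and only suitable for small inputs, and the clipping constant in `logloss` is an assumption.

```python
import math

# Reference sketches of MSE, Logloss (binary cross entropy) and AUC.

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def logloss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive is ranked above a random
    negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


example_mse = mse([1, 2, 3], [1, 2, 5])
example_logloss = logloss([1, 0], [0.5, 0.5])
example_auc = auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```

Lower is better for MSE and Logloss, while higher is better for AUC.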
Comparison methods.
We compare our DNIS method with the following three approaches.
Grid Search. This is the traditional approach to searching for a uniform embedding dimension. In our experiments, we searched 16 groups of dimensions, ranging from 4 to 64 with a stride of 4.
Random Search. Random search has been recognized as a strong baseline for NAS problems Liu et al. (2019a). When randomly searching for a mixed dimension scheme, we applied the same feature blocking as we did for DNIS. Following the intuition that high-frequency features desire larger numbers of dimensions, we generated 16 random descending sequences as the search space of the mixed dimension scheme for each model and report the best results.
MDE (Mixed Dimension Embeddings Ginart et al. (2019)). This method performs feature blocking and applies a heuristic scheme where the number of dimensions per feature block is proportional to some fractional power of its frequency. We tested 16 groups of hyperparameter settings as suggested in the original paper and report the best results.
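The random search baseline over descending sequences could be generated along the following lines. The sampling distribution is not specified in the text, so this sketch assumes that each block's dimension is drawn uniformly between 1 and the base dimension and that the draws are then sorted in descending order; the function name and seed are hypothetical.

```python
import random

# Sketch: candidate mixed dimension schemes for random search, one
# dimension count per feature block, non-increasing from the most to the
# least frequent block. Uniform sampling is an assumption of this example.

def random_descending_scheme(num_blocks, base_dim, rng):
    dims = [rng.randint(1, base_dim) for _ in range(num_blocks)]
    return sorted(dims, reverse=True)


rng = random.Random(0)
candidates = [
    random_descending_scheme(num_blocks=10, base_dim=64, rng=rng)
    for _ in range(16)
]
```

Each candidate would then be trained and evaluated on the validation set, and the best-performing scheme kept.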
For DNIS, we show its performance before and after the dimension pruning in Equation (9), and report the storage size of the pruned sparse matrix in the COO sparse matrix format Virtanen et al. (2020). We show the results with different compression rates (CR), i.e., the ratio of the unpruned embedding parameter size to the pruned size.
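The COO representation and a compression rate can be illustrated with the short sketch below. It is an assumption of this example that the compression rate counts only the stored embedding values; the text does not say whether the row and column indices of the COO format are included in the pruned size, so the index overhead is reported separately.

```python
# Sketch: COO (coordinate) storage of a pruned embedding matrix and a
# compression rate. Counting only stored values is an assumption here.

def to_coo(E):
    """Return parallel lists (rows, cols, vals) of the non-zero entries."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(E):
        for j, v in enumerate(row):
            if v != 0.0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


def compression_rate(E):
    """Dense parameter count divided by the number of stored values."""
    dense_size = len(E) * len(E[0])
    nnz = len(to_coo(E)[2])
    return dense_size / nnz


# Hypothetical pruned embedding matrix: 4 features, 4 base dimensions.
E_pruned = [
    [0.9, -0.4, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0],
]
rows, cols, vals = to_coo(E_pruned)
cr = compression_rate(E_pruned)
stored_numbers = 3 * len(vals)  # row index, column index and value each
```

Since COO keeps two indices per value, the actual memory saving is smaller than the value-only ratio suggests, which is worth keeping in mind when comparing sizes.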
Implementation details. We implement our method using PyTorch Paszke et al. (2019). We apply the Adam optimizer with a learning rate of 0.001 for the model parameters $\Theta$ and a learning rate of 0.01 for the soft selection layer parameters $\alpha$. The mini-batch size is set to 4096 and the uniform base dimension $K$ is set to 64 for all the models. We apply the same blocking scheme for random search, MDE and DNIS for a fair comparison. The default number of feature blocks $L$ is set to 10 and 6 for the Movielens and Criteo datasets, respectively. We employ various latent factor models: MF, MLP He et al. (2017) and NeuMF He et al. (2017) for the CF task, and FM Rendle (2010), Wide&Deep Cheng et al. (2016) and DeepFM Guo et al. (2017) for CTR prediction, where the configurations of the latent factor models are the same over different methods. Besides, we apply early stopping for all the methods according to the change of validation loss during model training. All the experiments were performed using NVIDIA GeForce RTX 2080Ti GPUs.
3.2 Comparison Results
Table 2: Comparison of NIS methods on the CF task (Movielens). Params are reported in millions (M).
Search Method  MF: Params (M)  MF: Time Cost  MF: MSE  MLP: Params (M)  MLP: Time Cost  MLP: MSE  NeuMF: Params (M)  NeuMF: Time Cost  NeuMF: MSE
Grid Search  33  16  0.622  35  8  0.640  61  4  0.625
Random Search  33  16  0.6153  22  4  0.6361  30  2  0.6238
MDE  35  24  0.6138  35  5  0.6312  27  3  0.6249
DNIS (unpruned)  37  1  0.6096  36  1  0.6255  72  1  0.6146
DNIS (CR = 2)  21  1  0.6126  20  1  0.6303  40  1  0.6169
DNIS ()  17  1  0.6167  17  1  0.6361  32  1  0.6213

Table 3: Comparison of NIS methods on the CTR prediction task (Criteo). Params are reported in millions (M).
Search Method  FM: Params (M)  FM: Time Cost  FM: AUC  FM: Logloss  Wide&Deep: Params (M)  Wide&Deep: Time Cost  Wide&Deep: AUC  Wide&Deep: Logloss  DeepFM: Params (M)  DeepFM: Time Cost  DeepFM: AUC  DeepFM: Logloss
Grid Search  441  16  0.7987  0.4525  254  16  0.8079  0.4435  382  14  0.8080  0.4435
Random Search  73  12  0.7997  0.4518  105  16  0.8084  0.4434  105  12  0.8084  0.4434
MDE  397  16  0.7986  0.4530  196  16  0.8076  0.4439  396  16  0.8077  0.4438
DNIS (unpruned)  441  3  0.8004  0.4510  395  3  0.8088  0.4429  416  3  0.8090  0.4427
DNIS (CR = 20)  26  3  0.8004  0.4510  29  3  0.8087  0.4430  29  3  0.8088  0.4428
DNIS (CR = 30)  17  3  0.8004  0.4510  19  3  0.8085  0.4432  20  3  0.8086  0.4430
Table 2 and Table 3 show the comparison results of different NIS methods on the CF and CTR prediction tasks, respectively. First, we can see that DNIS achieves the best prediction performance over all the model architectures for both tasks. It is worth noticing that DNIS is also considerably more efficient to train: its time cost is 1 on Movielens and 3 on Criteo, whereas the time costs of the baselines range from 2 to 24 and from 12 to 16, respectively. The results confirm that DNIS is able to learn discriminative feature embeddings with significantly higher efficiency than the existing search methods. Second, DNIS with dimension pruning achieves competitive or better performance than the baselines, and can yield a significant reduction in model parameter size. For example, DNIS with a compression rate (CR) of 2 outperforms all the baselines on Movielens, and yet reaches the minimal parameter size. The advantages of DNIS with the CR of 20 and 30 are more significant on Criteo. We observe that DNIS can achieve a higher CR on Criteo than on Movielens without sacrificing prediction performance. This is because the distribution of feature frequency on Criteo is severely skewed, leading to a significantly large number of redundant dimensions for low-frequency features. Third, among all the baselines, MDE performs the best on Movielens and Random Search performs the best on Criteo, while Grid Search gets the worst results on both tasks. This verifies the importance of applying mixed dimension embeddings to latent factor models. Note that all of the three baselines have searched over 16 groups of feature dimensions, and their time costs are slightly different due to the early stopping of model training. Fourth, we find that MF achieves better prediction performance on the CF task than the other two model architectures. The reason may be the overfitting problem of MLP and NeuMF that results in poor generalization. Besides, DeepFM shows the best results on the CTR prediction task, suggesting that the ensemble of DNN and FM is beneficial to improving CTR prediction accuracy.
3.3 Hyperparameter Investigation
We investigate the effects of two important hyperparameters in DNIS: the base dimension $K$ and the number of feature blocks $L$. Figure 1(a) shows the performance change of MF w.r.t. different settings of $K$. We can see that increasing $K$ is beneficial to reducing MSE. This is because a larger $K$ allows a larger search space that could improve the representations of high-frequency features by giving them more embedding dimensions. Besides, we observe diminishing returns in the performance gain. Specifically, the MSE is reduced by 0.005 when $K$ increases from 64 to 128, whereas the MSE reduction is merely 0.001 when $K$ changes from 512 to 1024. This implies that $K$ may have exceeded the largest number of dimensions required by all the features, leading to minor improvements. Figure 1(b) shows the effects of the number of feature blocks $L$. We find that increasing $L$ improves the prediction performance of DNIS, and the performance improvement decreases as $L$ becomes larger. This is because dividing features into more blocks facilitates a finer-grained control on the embedding dimensions of different features, leading to more flexible mixed dimension schemes. Since both $K$ and $L$ affect the computational complexity of DNIS, we suggest choosing reasonably large values for $K$ and $L$ to balance computational efficiency and predictive performance based on the application requirements.
3.4 Analysis on DNIS Results
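Figure 2: (a) The distributions of the trained parameters of the soft selection layer $\alpha$ for the feature blocks on Movielens, where $L$ is set to 10. (b) The joint distribution plot of feature embedding dimensions and feature frequencies after dimension pruning. (c) Comparison of DNIS and network pruning performance over different pruning rates.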
We first study the learned feature dimensions of DNIS through the learned soft selection layer $\alpha$ and the feature embedding dimensions after dimension pruning. Figure 2(a) depicts the distributions of the trained parameters in $\alpha$ for the 10 feature blocks on Movielens. Recall that the blocks are sorted in descending order of feature frequency. It can be seen that the learned parameters in $\alpha$ for the feature blocks with lower frequencies converge to smaller values, indicating that lower-frequency features tend to be represented by smaller numbers of embedding dimensions. Figure 2(b) provides the number of embedding dimensions per feature after dimension pruning. The results show that features with higher frequencies end up with more embedding dimensions, whereas the dimensions are more likely to be pruned for low-frequency features. Nevertheless, there is no strong correlation between the derived embedding dimension and the feature frequency. Note that the embedding dimensions for low-frequency features scatter over a long range of numbers. This is consistent with the inferior performance of MDE, which directly determines the number of feature embedding dimensions according to the frequency.
We further compare DNIS with the network pruning method of Han et al. (2015). For illustration purposes, we provide the results of the FM model on the Criteo dataset. Figure 2(c) shows the performance of the two methods at different pruning rates (i.e., the ratio of pruned embedding values). DNIS achieves better AUC and Logloss results than network pruning over all the pruning rates. This is because DNIS optimizes feature embeddings with the gradients from the validation set, which benefits the selection of predictive dimensions, instead of simply removing redundant weights in the embeddings.
4 Conclusion
In this paper, we introduced Differentiable Neural Input Search (DNIS), which searches for a mixed dimension scheme for different features adaptively from data. Instead of selecting from a predefined discrete set of candidate dimension schemes, DNIS is able to optimize embedding dimensions in a continuous search space with gradient descent. The key idea is to develop a soft dimension selection layer that controls the significance of each embedding dimension and can be optimized with the model’s validation performance through gradient descent. We show that DNIS can be seamlessly incorporated with various existing latent factor models for recommendation. We conduct extensive experiments on collaborative filtering and click-through rate prediction tasks, where DNIS outperforms the existing NIS baselines in terms of recommendation performance, training efficiency and parameter size.
References
Baker et al. (2017). Designing neural network architectures using reinforcement learning. In ICLR, 2017.
Bergstra et al. (2013). Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML, 2013, pp. 115–123.
Cai et al. (2019). ProxylessNAS: direct neural architecture search on target task and hardware. In ICLR, 2019.
Chen et al. (2018). Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, 2018, pp. 8713–8724.
Chen et al. (2019). λOpt: learn to regularize recommender models in finer levels. In KDD, 2019, pp. 978–986.
Cheng et al. (2016). Wide & deep learning for recommender systems. In DLRS@RecSys, 2016, pp. 7–10.
Cheng et al. (2018). DELF: a dual-embedding based deep latent factor model for recommendation. In IJCAI, 2018, pp. 3329–3335.
Cheng et al. (2017). A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
Colson et al. (2007). An overview of bilevel optimization. Annals of Operations Research 153 (1), pp. 235–256.
Covington et al. (2016). Deep neural networks for YouTube recommendations. In RecSys, 2016, pp. 191–198.
Domhan et al. (2015). Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In IJCAI, 2015, pp. 3460–3468.
Elsken et al. (2019). Efficient multi-objective neural architecture search via Lamarckian evolution. In ICLR, 2019.
Franceschi et al. (2018). Bilevel programming for hyperparameter optimization and meta-learning. In ICML, 2018, pp. 1563–1572.
Frankle and Carbin (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR, 2019.
Ginart et al. (2019). Mixed dimension embeddings with application to memory-efficient recommendation systems. arXiv preprint arXiv:1909.11810.
Guo et al. (2017). DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI, 2017, pp. 1725–1731.
Han et al. (2015). Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
Harper and Konstan (2016). The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4), pp. 19:1–19:19.
He et al. (2017). Neural collaborative filtering. In WWW, 2017, pp. 173–182.
Joglekar et al. (2019). Neural input search for large scale recommendation models. arXiv preprint arXiv:1907.04471.
Labs (2014). Criteo dataset.
Li et al. (2017). Pruning filters for efficient convnets. In ICLR, 2017.
Li and Talwalkar (2019). Random search and reproducibility for neural architecture search. In UAI, 2019, pp. 129.
Lian et al. (2018). xDeepFM: combining explicit and implicit feature interactions for recommender systems. In KDD, 2018, pp. 1754–1763.
Liu et al. (2019a). DARTS: differentiable architecture search. In ICLR, 2019.
Liu et al. (2019b). Rethinking the value of network pruning. In ICLR, 2019.
Maclaurin et al. (2015). Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015, pp. 2113–2122.
Mendoza et al. (2016). Towards automatically-tuned neural networks. In AutoML@ICML, 2016, pp. 58–65.
Miller et al. (1989). Designing neural networks using genetic algorithms. In ICGA, 1989, pp. 379–384.
Molchanov et al. (2017). Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
Park et al. (2018). Deep learning inference in Facebook data centers: characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886.
Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, 2019, pp. 8024–8035.
Pedregosa (2016). Hyperparameter optimization with approximate gradient. In ICML, 2016, pp. 737–746.
Real et al. (2019). Regularized evolution for image classifier architecture search. In AAAI, 2019, pp. 4780–4789.
Real et al. (2017). Large-scale evolution of image classifiers. In ICML, 2017, pp. 2902–2911.
Rendle (2010). Factorization machines. In ICDM, 2010, pp. 995–1000.
Virtanen et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17, pp. 261–272.
Xie et al. (2019). SNAS: stochastic neural architecture search. In ICLR, 2019.
Zhao et al. (2020). AutoEmb: automated embedding dimensionality search in streaming recommendations. arXiv preprint arXiv:2002.11252.
Zhong et al. (2018). Practical block-wise neural network architecture generation. In CVPR, 2018, pp. 2423–2432.
Zoph and Le (2017). Neural architecture search with reinforcement learning. In ICLR, 2017.
Zoph et al. (2018). Learning transferable architectures for scalable image recognition. In CVPR, 2018, pp. 8697–8710.