I Introduction
With the rapid development of the Internet and mobile devices, our daily activities connect closely to online services, such as online shopping, online news and videos, online social networks, and many more. Recommender systems are a powerful information filter for guiding users to find the items of interest in a gigantic and rapidly expanding pool of candidates. As elaborated in [1], if an IR system’s response to each query is a ranking of documents in decreasing order of probability of relevance, the overall effectiveness of the system to its user will be maximized. With this principle, the prediction of click-through rate (CTR) is crucial for recommender systems, where the task is to estimate the probability that a user will click on a recommended item. In many recommender systems, the goal is to maximize the number of clicks, and the items returned to a user are ranked by the estimated CTR; in other application scenarios such as online advertising it is also important to improve revenue, so the ranking strategy can be adjusted accordingly, for example to rank by CTR × bid, with “bid” being the benefit the system receives once the item is clicked. In either case, the key is in estimating CTR precisely.
It is important for CTR prediction to learn implicit feature interactions behind user click behaviors. By our study in a mainstream apps market, we found that people often download apps for food delivery at meal-time, suggesting that the (order-2) interaction between app category and time-stamp can be used as a signal for CTR prediction. As a second observation, male teenagers like shooting games and RPG games, which means that the (order-3) interaction of app category, user gender and age is another signal for CTR prediction. In general, such interactions of features behind user click behaviors can be highly sophisticated, where both low- and high-order feature interactions should play important roles. According to the insights of the Wide & Deep model [2] from Google, considering low- and high-order feature interactions simultaneously brings additional improvement over the cases of considering either alone.
The key challenge is in effectively modeling feature interactions. Some feature interactions can be easily understood and engineered by experts (like the instances above). However, most other feature interactions are hidden in data and difficult to identify a priori (for instance, the classic association rule “diaper and beer” is mined from a large amount of data, instead of being discovered by experts), which can only be captured automatically by machine learning. Even for easy-to-understand interactions, it seems unlikely for experts to model them exhaustively, especially when the number of raw features is huge.
Despite their simplicity, generalized linear models, such as FTRL [3], have shown decent performance in practice. However, a (generalized) linear model lacks the ability to learn feature interactions, and a common practice is to manually include pairwise feature interactions in designing its feature vector. Such a method is hard to generalize to model high-order feature interactions or those that never or rarely appear in the training data [4]. Factorization Machines (FM) [4] model pairwise feature interactions as the inner product of latent vectors between features and show very promising results. While in principle FM can model high-order feature interactions, in practice usually only order-2 feature interactions are considered due to high complexity.
As a powerful approach to learning feature representations, deep neural networks have shown the potential to learn sophisticated feature interactions automatically. Some ideas extend CNN and RNN for CTR prediction [5, 6], but CNN-based models are biased to the interactions between neighboring features while RNN-based models are more suitable for click data with sequential dependency. [8] studies feature representations and proposes the Factorization-machine supported Neural Network (FNN). This model pre-trains FM before applying DNN, and is thus limited by the capability of FM. Feature interaction is studied in [9], which introduces a product layer between the embedding layer and the fully-connected layers and proposes the Product-based Neural Network (PNN). As noted in [2], PNN and FNN, like other deep models, capture few low-order feature interactions, which are also essential for CTR prediction. To model both low- and high-order feature interactions, [2] proposes a hybrid network structure (Wide & Deep) that combines a linear (“wide”) model and a deep model. In this hybrid model, two different inputs are required for the “wide” part and the “deep” part, respectively, and the input of the “wide” part still relies on feature engineering from domain experts.
One can see that existing models are biased to low- or high-order feature interactions, or rely on feature engineering. In this paper, we develop a learning framework that is able to learn feature interactions of all orders in an end-to-end manner, without any feature engineering besides raw features. Our main contributions are summarized as follows:


We propose a new neural network framework DeepFM (left part of Figure 1) that integrates the architectures of FM and deep neural networks in a way that models low-order feature interactions like FM and models high-order feature interactions like deep neural networks. Unlike the Wide & Deep model [2], DeepFM can be trained end-to-end without any feature engineering. There is no implicit requirement on the network structure of the “deep” part of DeepFM, so various deep architectures are plausible.^{1}
^{1} In all figures of this paper, a Normal Connection in black refers to a connection with weight to be learned; a Weight-1 Connection, red arrow, is a connection with weight 1 by default; Embedding, blue dashed arrow, means a latent vector to be learned; Addition means adding all inputs together; Product, including Inner- and Outer-Product, means the output of this unit is the product of two input vectors; Sigmoid Function is used as the output function in CTR prediction; Activation Functions, such as relu and tanh, are used for nonlinearly transforming the signal; the yellow and blue circles in the sparse features layer represent one and zero in the one-hot encoding of the input, respectively.
DeepFM can be trained efficiently because its “wide” part and “deep” part share the same input and also the embedding vector. In contrast, the input vector to Google’s Wide & Deep model [2] can be of huge size as it includes manually designed pairwise feature interactions in its wide part, which also greatly increases its complexity. Moreover, no expert feature engineering is needed in our DeepFM framework, while it is required in [2], as the performance of linear models relies heavily on feature engineering. We study two instances of DeepFM in detail, namely DeepFM-D and DeepFM-P (see the right part of Figure 1), where the “deep” part of the DeepFM framework is DNN and PNN, respectively.

We evaluate DeepFM-D and DeepFM-P on both benchmark data and commercial data; the results show consistent improvement over existing models for CTR prediction.

To investigate whether DeepFM is well suited to a production environment, as a further exploration of our previous work [10], we conduct an online A/B test in Huawei App Market. The results reveal that the DeepFM-D model leads to more than 10% improvement of CTR, compared to a well-engineered LR model. We also cover our practice in deploying our framework, such as multi-GPU data parallelism and asynchronous data reading. Extensive experiments are conducted to show the effectiveness of our proposed techniques.
The rest of the paper is organized as follows. Section II gives an overview of some related works on recommender systems. Section III presents our DeepFM framework, as well as two instances, DeepFM-D and DeepFM-P, of our framework. In Section IV, we extensively evaluate the effectiveness and efficiency of DeepFM-D and DeepFM-P on benchmark datasets and a commercial dataset, and we conduct an online A/B test in Huawei App Market. Finally, we conclude our work in Section V.
II Related Work
In this section, we review three important categories of models in recommender systems, namely collaborative filtering models, linear models and deep learning models.
II-A Collaborative Filtering in Recommender Systems
Collaborative filtering (CF)-based models have been well studied for recommender systems over the last decade. The basic assumption is that users with similar past behaviors will like the same kind of items, and the items attracting similar users will share similar ratings from a user. Collaborative filtering models consist of memory-based and model-based methods. The memory-based methods directly define the similarities between pairs of users and pairs of items, which are used to calculate the ratings of unknown user-item pairs [11]. Although they are easy to implement and explain, memory-based methods have several limitations. For instance, the similarity values are unreliable when the data is sparse and the common items are few. As a complement, model-based methods define a model to fit the known user-item interactions and predict the rating of unknown user-item pairs using the learned model. The most widely used model-based method is matrix factorization (MF) [12, 13]. Based on the low-rank assumption and the observed user-item interactions, MF models characterize both items and users by vectors in the same space and predict the unknown rating of a user-item pair from the item and user vectors.
Different from CF-based models, content-based models rely on user portraits or product information [14]. However, both CF-based and content-based models have limitations. While the former do not explicitly incorporate the users’ and items’ feature information, the latter do not necessarily consider the information in preference similarity across individuals. Therefore hybrid methods have gained popularity in recent years. In order to incorporate user-item interaction information together with auxiliary information, such as text, temporal information and location information, several hybrid methods [15, 16, 17] have been proposed. By learning side information and user-item interactions simultaneously, hybrid methods are able to alleviate the cold-start problem and give better recommendations.
The above models are seldom applied to CTR prediction, for reasons such as poor scalability and unsatisfactory performance on sparse data.
II-B Linear Models in Recommender Systems
Because of their robustness and efficiency, generalized Logistic Regression (LR) models, such as FTRL [3], are widely used in CTR prediction. To learn feature interactions, a common practice is to manually include pairwise feature interactions in the feature vector. The Poly2 [18] model is proposed to model all order-2 feature interactions to avoid feature engineering. Factorization Machines (FM) [4] adopt a factorization technique to improve the ability to learn feature interactions when data is very sparse. Recently, to model interaction features from different fields, the authors of FFM [19] introduce field information into the FM model. LR, Poly2 and FM variants are widely used in CTR prediction in industry. In addition, a few other models have also been proposed for CTR prediction, such as the LR+GBDT model [20], a tensor-based model [21], and a Bayesian model [22].
II-C Deep Learning in Recommender Systems
Due to their powerful ability of feature learning, deep learning models have achieved great success in various areas, such as computer vision [23, 24], audio recognition [25] and gaming [26]. In order to take advantage of this feature learning ability to enhance recommender systems, several deep learning models have been proposed for recommendation (e.g., [27, 28, 29, 30, 31, 6, 32, 9, 8, 5]). These works can be divided into CTR-based and rating-based models.
CTR-based models [9, 8, 5, 6] are already mentioned in Section I and some of them will be discussed again in Section III; therefore we omit them here.
There are two kinds of rating-based deep models: CF-based and hybrid models. The CF-based models, such as [28, 33, 34, 30], are proposed to improve collaborative filtering via deep learning. For instance, [33] and [28] complete the rating matrix by autoencoder [35] and restricted Boltzmann machine, respectively. Unlike CF-based models, the hybrid models [36, 29, 32] use deep learning to learn features of various domains. Specifically, [31] proposes a recurrent recommender system, which is able to capture temporal information and predict future behavioral trajectories. [32] utilizes both review information and user-item interactions. For the purpose of learning better features, [36] designs a novel end-to-end model to learn features from audio content and user-item interactions simultaneously to make personalized recommendations. In order to ease the cold-start problem when recommending new and unpopular songs, [29] adds a deep convolutional neural network to the latent factor framework to learn audio features better.
Several models have been proposed in industry. Google develops a two-stage deep learning framework for YouTube video recommendation [27]. In order to learn the relationship between image features and other features, Alibaba proposes an end-to-end deep model [7] for CTR prediction, which incorporates a convolutional neural network for learning image features and a multilayer perceptron for other features. Readers interested in deep learning models in recommender systems can refer to a comprehensive survey [37].
III Our Approach
Suppose the training data consists of $n$ instances $(\chi, y)$, where $\chi$ is an $m$-fields data record usually involving a pair of user and item, and $y \in \{0, 1\}$ is the associated label indicating user click behaviors ($y = 1$ means the user clicked the item, and $y = 0$ otherwise). $\chi$ may include categorical fields (e.g., gender, location) and continuous fields (e.g., age). Each categorical field is represented as a one-hot vector, and each continuous field is represented as the value itself, or a one-hot vector after discretization. Thus, each instance is converted to $(x, y)$ where $x = [x_{field_1}, x_{field_2}, \ldots, x_{field_m}]$ is a $d$-dimensional vector, with $x_{field_j}$ being the vector representation of the $j$-th field of $\chi$. Normally, $x$ is high-dimensional and extremely sparse. The task of CTR prediction is to develop a prediction model to estimate the probability of a user clicking a specific item in a given context.
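As an illustration of this input representation, the following minimal Python sketch converts one record into a sparse vector by concatenating per-field one-hot vectors. The field vocabularies, the age bucket boundaries, and the names `FIELD_VOCABS`, `AGE_BUCKETS` and `encode` are assumptions made for the example and are not taken from our data sets.

```python
# Hypothetical field vocabularies; names and values are illustrative only.
FIELD_VOCABS = {
    "gender": ["female", "male"],
    "location": ["Beijing", "London", "Shanghai"],
}
AGE_BUCKETS = [18, 30, 45]  # assumed bucket boundaries for discretizing age


def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def bucketize(value, boundaries):
    """Map a continuous value to the index of its bucket."""
    return sum(value >= b for b in boundaries)


def encode(record):
    """Concatenate the per-field one-hot vectors into one sparse vector x."""
    x = []
    for field, vocab in FIELD_VOCABS.items():
        x += one_hot(vocab.index(record[field]), len(vocab))
    x += one_hot(bucketize(record["age"], AGE_BUCKETS), len(AGE_BUCKETS) + 1)
    return x


x = encode({"gender": "male", "location": "London", "age": 27})
print(x)  # [0, 1, 0, 1, 0, 0, 1, 0, 0]
```

Exactly one entry per field is non-zero, so the number of non-zero entries equals the number of fields regardless of the total dimension.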
III-A DeepFM
To learn both low- and high-order feature interactions, we propose an end-to-end deep learning framework for CTR prediction, namely Factorization-Machine based neural network (DeepFM). As depicted in Figure 1, DeepFM consists of two components, the FM Component and the Deep Component, that share the same input. For feature $i$, a scalar $w_i$ is used to weigh its order-1 importance, and a latent vector $V_i$ is used to measure its impact of interactions with other features. $V_i$ is fed into the FM component to model order-2 feature interactions, and fed into the deep component to model high-order feature interactions. All parameters, including $w_i$, $V_i$, and the network parameters ($W^{(l)}$, $b^{(l)}$ below) are trained jointly for the combined prediction model:
$$\hat{y} = \mathrm{sigmoid}(y_{FM} + y_{DEEP}) \qquad (1)$$
where $\hat{y} \in (0, 1)$ is the predicted CTR, $y_{FM}$ is the output of the FM component, and $y_{DEEP}$ is the output of the deep component. We also present two instances of the DeepFM framework in Figure 1, namely DeepFM-D and DeepFM-P, whose deep component is DNN and PNN, respectively. The prediction formulae of DeepFM-D and DeepFM-P update Equation 1 by setting $y_{DEEP} = y_{DNN}$ and $y_{DEEP} = y_{PNN}$, respectively. The definitions of $y_{FM}$, $y_{DNN}$ and $y_{PNN}$ are introduced in the following sections.
III-A1 FM Component of DeepFM
The FM component is a factorization machine, which is proposed in [4] to learn feature interactions for recommendation. Besides linear (order-1) interactions among features, FM models pairwise (order-2) feature interactions as the inner product of the respective feature latent vectors. It can capture order-2 feature interactions much more effectively than previous approaches (such as LR and Poly2 [18]), especially when the data is sparse. For comparison, we show the prediction models for LR and Poly2:
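$$y_{LR} = \mathrm{sigmoid}\Big(\langle w, x\rangle + \sum_{(j_1, j_2)\ \text{selected}} w_{h(j_1, j_2)}\, x_{j_1} x_{j_2}\Big) \qquad (2)$$
$$y_{Poly2} = \mathrm{sigmoid}\Big(\langle w, x\rangle + \sum_{j_1=1}^{d} \sum_{j_2=j_1+1}^{d} w_{h(j_1, j_2)}\, x_{j_1} x_{j_2}\Big) \qquad (3)$$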
where $y_{LR}$ and $y_{Poly2}$ are the predictions of the LR and Poly2 models, and $h(j_1, j_2)$ is a function encoding $j_1$ and $j_2$ into a natural number. LR models a linear combination of the features and some order-2 feature interactions that are selected by experts (i.e., $j_1$ and $j_2$ are picked by humans). To avoid feature engineering, Poly2 chooses to model all possible order-2 feature interactions. In these two approaches, each feature interaction is assigned its own parameter, so that the number of parameters in the model is huge. Moreover, the parameter of an interaction of features $i$ and $j$ can be learned only when feature $i$ and feature $j$ both appear in a sample.
In FM, by contrast, the interaction is measured via the inner product of their latent vectors $V_i$ and $V_j$. Thanks to this flexible design, FM can train latent vector $V_i$ ($V_j$) whenever $i$ (or $j$) appears in a data record. Therefore, feature interactions that never or rarely appear in the training data are better learned by FM. As Figure 2 shows, the output of FM is the summation of an Addition unit and a number of Inner Product units:
$$y_{FM} = \langle w, x\rangle + \sum_{j_1=1}^{d} \sum_{j_2=j_1+1}^{d} \langle V_{j_1}, V_{j_2}\rangle\, x_{j_1} x_{j_2} \qquad (4)$$
where $w \in \mathbb{R}^d$ and $V_i \in \mathbb{R}^k$ ($k$ is given).^{2} The Addition unit ($\langle w, x\rangle$) reflects the importance of order-1 features, and the Inner Product units represent the impact of order-2 feature interactions. As presented in Equation 1, the output of the FM component is part of the final CTR prediction.
^{2} We omit a constant offset for simplicity.
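The FM output described above can be sketched in a few lines of Python, assuming the standard FM form of [4]: an order-1 term plus pairwise inner products of latent vectors, with the constant offset omitted. The function names and the toy values are illustrative. The second function uses the well-known reformulation from [4] that evaluates the pairwise term in time linear in the number of features; it is included only as a cross-check of the direct double sum.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def fm_output(x, w, V):
    """Order-1 term <w, x> plus pairwise terms <V[j1], V[j2]> * x[j1] * x[j2]."""
    d = len(x)
    pairwise = 0.0
    for j1 in range(d):
        for j2 in range(j1 + 1, d):
            pairwise += dot(V[j1], V[j2]) * x[j1] * x[j2]
    return dot(w, x) + pairwise


def fm_output_fast(x, w, V):
    """Same value via 0.5 * sum_f [(sum_j V[j][f] x[j])^2 - sum_j (V[j][f] x[j])^2]."""
    k = len(V[0])
    pairwise = 0.0
    for f in range(k):
        terms = [V[j][f] * x[j] for j in range(len(x))]
        pairwise += 0.5 * (sum(terms) ** 2 - sum(t * t for t in terms))
    return dot(w, x) + pairwise


# Toy example: d = 4 one-hot positions, k = 2 latent dimensions.
x = [1, 0, 1, 0]
w = [0.1, 0.2, 0.3, 0.4]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
y_fm = fm_output(x, w, V)
assert math.isclose(y_fm, fm_output_fast(x, w, V))
```

Only features 0 and 2 are active in this example, so a single pairwise term contributes; the latent vectors of inactive features do not affect the output.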
III-A2 Deep Component of DeepFM
The deep component is a feed-forward neural network, which is used to learn high-order feature interactions. A data record (a vector) is fed into the neural network. Compared to neural networks with image [23] or audio [25] data as input, which is purely continuous and dense, the input of CTR prediction is quite different and requires a new network architecture design. Specifically, the raw feature input vector for CTR prediction is usually highly sparse^{3}, super high-dimensional^{4}, categorical-continuous-mixed, and grouped in fields (e.g., gender, location, age). This suggests an embedding layer to compress the input vector to a low-dimensional, dense real-value vector before further feeding into the first hidden layer; otherwise the network can be overwhelming to train.
^{3} Only one entry is non-zero for each field vector.
^{4} E.g., in an app store with a billion users, the single field vector for user ID already has a billion dimensions.
Figure 3 highlights the network structure from the input layer to the embedding layer. We would like to point out two interesting designs of this network structure: 1) while input field vectors can be of different sizes, their embeddings are of the same size ($k$); 2) the latent feature vectors ($V$) in FM now serve as network weights which are learned and used to compress the input field vectors to the embedding vectors. In [8], $V$ is pre-trained by FM and used as initialization. In this work, rather than using the latent feature vectors of FM to initialize the networks as in [8], we include the FM model as part of our overall learning architecture. As such, we eliminate the need of pre-training by FM and instead jointly train the entire network in an end-to-end manner.
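Denote the output of the embedding layer as:
$$a^{(0)} = [e_1, e_2, \ldots, e_m] \qquad (5)$$
$$e_i = V_{field_i} \cdot x_{field_i} \qquad (6)$$
where $e_i$ is the embedding of the $i$-th field, $m$ is the number of fields, $V_{field_i}$ is the parameters between the embedding layer and the input layer of the $i$-th field (as shown in Figure 3), and $x_{field_i}$ is the one-hot vector of the $i$-th field of the raw input data.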
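A minimal Python sketch of such an embedding layer is given below. It assumes that each field has its own weight matrix whose rows are the latent vectors of that field's feature values, so that multiplying by a one-hot field vector selects one row. The embedding size `K`, the field sizes, and the random initialization are hypothetical choices for the example.

```python
import random

random.seed(0)
K = 4                      # embedding size shared by every field (assumed value)
FIELD_SIZES = [2, 3, 4]    # hypothetical one-hot sizes of m = 3 fields

# One weight matrix per field; row r is the latent vector of feature value r.
V = [[[random.gauss(0.0, 0.01) for _ in range(K)] for _ in range(size)]
     for size in FIELD_SIZES]


def embed_field(V_field, x_field):
    """Product of the field's weight matrix with its one-hot vector."""
    return [sum(V_field[r][f] * x_field[r] for r in range(len(x_field)))
            for f in range(K)]


def embedding_layer(x_fields):
    """Concatenate the field embeddings e_1, ..., e_m into one dense vector."""
    a0 = []
    for V_field, x_field in zip(V, x_fields):
        a0 += embed_field(V_field, x_field)
    return a0


x_fields = [[0, 1], [0, 1, 0], [0, 1, 0, 0]]
a0 = embedding_layer(x_fields)
```

Although the three fields have different input sizes, each contributes exactly `K` entries to the output, and for a one-hot input the product reduces to a row lookup.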
It is worth pointing out that the FM component and the deep component share the same feature embedding, which brings two important benefits: 1) it learns both low- and high-order feature interactions from raw features; 2) there is no need for expert feature engineering of the input, as required in Google’s Wide & Deep model [2].
Note that in the proposed DeepFM framework, there is no implicit requirement on the network structure of the deep component. In this section, we show only a general deep component of DeepFM. In the next sections, we will present in detail the network structure of the deep component of two instances of the DeepFM framework, called DeepFM-D and DeepFM-P.
III-A3 Deep Component of DeepFM-D Model
The deep component of DeepFM-D is a fully-connected deep neural network (DNN, or equivalently Multilayer Perceptron). The structure of DNN is presented in Figure 4.
In such a network structure, the output of the embedding layer is fed into the deep neural network, and the forward process is:
$$a^{(l+1)} = \sigma(W^{(l)} a^{(l)} + b^{(l)}) \qquad (7)$$
where $l$ is the layer depth and $\sigma$ is an activation function. $a^{(l)}$, $W^{(l)}$, $b^{(l)}$ are the output, model weight, and bias of the $l$-th layer, respectively. After going through several hidden layers, a dense real-value feature vector is generated as
$$y_{DNN} = \sigma(W^{(|H|)} a^{(|H|)} + b^{(|H|)}) \qquad (8)$$
where $|H|$ is the number of hidden layers. This feature vector is finally fed into the sigmoid function for CTR prediction, as described in Equation 1.
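The forward process of the fully-connected deep component can be sketched in Python as follows. The sketch assumes that every layer, including the single-unit output layer, applies the same activation to an affine transform of the previous layer's output; tanh is used here as one of the activations mentioned for the figures, and the weights are toy values chosen for the example.

```python
import math


def affine(W, a, b):
    """Return W a + b, with W given as a list of rows."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]


def dnn_output(a0, layers, activation=math.tanh):
    """Apply a <- activation(W a + b) for each (W, b); the last layer has one unit."""
    a = a0
    for W, b in layers:
        a = [activation(t) for t in affine(W, a, b)]
    return a[0]


layers = [
    ([[0.5, 0.5], [1.0, -1.0]], [0.0, 0.0]),  # hidden layer with 2 units
    ([[1.0, 1.0]], [0.0]),                    # output layer with 1 unit
]
y_dnn = dnn_output([1.0, -1.0], layers)
```

The resulting scalar plays the role of the deep component's output, which would then be added to the FM output before the final sigmoid.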
III-A4 Deep Component of DeepFM-P Model
The deep component of DeepFM-P is a product-based neural network (PNN) [9]. As presented in [9], there are three variants of PNN models, i.e., IPNN, OPNN and PNN*. PNN introduces a product layer between the embedding layer and the first hidden layer (the middle part of Figure 8). The three variants differ from each other in the product operations they define between features as feature interactions. More specifically, IPNN uses inner product, OPNN uses outer product, and PNN* uses both inner and outer product. The details of the three PNN variants are presented as follows.
The product layer of the PNN model (the middle part of Figure 8) consists of two parts: linear neurons (right part of the layer) and quadratic neurons (“product symbol” in the left part of the layer). Linear neurons are the concatenation of the embedding vectors of all fields, while quadratic neurons are the products of embedding vectors from a pair of fields. The output of the first hidden layer is
$$a^{(1)} = \sigma(W_z z + W_p p + b^{(0)}) \qquad (9)$$
In Equation 9, $z$ is the output vector of the linear neurons in the product layer, i.e., the embedding vectors of the different fields themselves. $p$ is the output vector of the quadratic neurons, which includes the interactions $g(e_i, e_j)$ between any two embedding vectors $e_i$ and $e_j$. $W_z$ and $W_p$ are the parameters between the product layer and the first hidden layer connecting to the linear neurons and quadratic neurons, respectively. The three variants of PNN define the function $g$ differently: IPNN and OPNN define $g$ to be the inner product and outer product of two vectors, respectively, while PNN* considers both inner and outer product. Finally, going through several fully-connected hidden layers (as defined in Equation 7), $y_{PNN}$ has a similar output value as $y_{DNN}$ (as defined in Equation 8).
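A minimal Python sketch of the product layer for the inner-product variant (IPNN) is shown below. It builds the linear part as the concatenation of the field embeddings and the quadratic part as the inner products of all field pairs, and then combines both in the first hidden layer. The relu activation, the weight values, and the function names are assumptions made for the example, and the approximations used in [9] are not reproduced.

```python
def inner(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def product_layer(embeddings):
    """Return (z, p): concatenated embeddings and pairwise inner products."""
    z = [t for e in embeddings for t in e]
    m = len(embeddings)
    p = [inner(embeddings[i], embeddings[j])
         for i in range(m) for j in range(i + 1, m)]
    return z, p


def first_hidden_layer(z, p, W_z, W_p, bias):
    """relu(W_z z + W_p p + bias), computed one hidden unit at a time."""
    out = []
    for row_z, row_p, b in zip(W_z, W_p, bias):
        out.append(max(0.0, inner(row_z, z) + inner(row_p, p) + b))
    return out


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # m = 3 fields, k = 2
z, p = product_layer(embeddings)
W_z = [[0.1] * 6, [-0.1] * 6]
W_p = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
h1 = first_hidden_layer(z, p, W_z, W_p, [0.0, 0.0])
```

With $m$ fields the quadratic part has $m(m-1)/2$ entries, each connected to every unit of the first hidden layer, which is where the computational cost of the product layer comes from.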
III-B Practical Issues
III-B1 Learning


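Objective function: In the domain of CTR prediction, the most commonly used objective function is Logloss, which is equivalent to the KL divergence between two distributions:
$$L(x, y) = -y \log \hat{y}(x) - (1 - y) \log(1 - \hat{y}(x)) \qquad (10)$$
$$L(D) = \frac{1}{|D|} \sum_{(x, y) \in D} L(x, y) \qquad (11)$$
where $(x, y)$ is a data instance, $x$ is the feature vector and $y$ is the label, $\hat{y}(x)$ is the prediction of the instance $(x, y)$, $L(x, y)$ is the Logloss of $(x, y)$, and $L(D)$ is the Logloss of dataset $D$.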

Overfitting: In machine learning, one important issue is to prevent overfitting. An overfit model performs poorly on unseen data since it overreacts to the given training data. The authors of [38] state that FM may suffer from overfitting and thus they utilize the L2-norm to regularize the objective function. On the other hand, as complicated models, neural networks are also prone to overfitting. The authors of [39] propose a simple strategy to prevent neural networks from overfitting, which is known as dropout. Therefore, in our DeepFM framework, we adopt the L2-norm to regularize the FM component and adopt dropout for the deep component.
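The objective described in the two items above can be sketched as follows. The sketch assumes that the Logloss of a data set is the average of the per-instance Logloss and that the L2-norm term is a coefficient times the sum of squared FM parameters; the coefficient `lam`, the clipping constant `eps`, and the toy values are hypothetical. Dropout is not shown.

```python
import math


def logloss(y, y_hat, eps=1e-15):
    """Logloss of one instance; y_hat is clipped to avoid log(0)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def dataset_logloss(labels, predictions):
    """Average Logloss over a data set."""
    return sum(logloss(y, p) for y, p in zip(labels, predictions)) / len(labels)


def l2_penalty(params, lam):
    """L2-norm regularization term lam * sum(theta^2)."""
    return lam * sum(t * t for t in params)


labels = [1, 0, 1]
predictions = [0.9, 0.2, 0.6]
fm_params = [0.5, -0.5]  # stand-in for the FM weights and latent vectors
objective = dataset_logloss(labels, predictions) + l2_penalty(fm_params, lam=0.01)
```

A prediction of 0.5 for a clicked instance costs log 2, and confident wrong predictions are penalized much more heavily than confident correct ones are rewarded.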
III-B2 Accelerating Strategy


Multi-GPU Architecture: In real applications, the amount of training data is so huge that the training process takes a long time. To accelerate this process, we utilize multi-GPU data parallelism when deploying DeepFM in the production environment (as presented in Figure 5). At first, we split a batch of data records into Num_GPU pieces (Num_GPU is the number of GPU cards) and feed each piece into a different GPU card simultaneously. Then, the gradient of the data records in each piece is computed by an individual GPU card. After that, the gradients are collected and averaged. Finally, the model parameters are updated by the averaged gradient.
We evaluate the effectiveness of multi-GPU data parallelism for the DeepFM-D model on the Company data set, and the test curves of AUC and Logloss under different settings are shown in Figure 6. Specifically, the batch sizes of the different settings are the same, and the learning rates of 1-GPU, 4-GPU and 4-GPU-A are 0.0001, 0.0001 and 0.0001, respectively. Compared with 1-GPU, the test curves of 4-GPU indicate that the training process of 4-GPU is slower. This is because the number of updates of 4-GPU is only a quarter of that of 1-GPU when adopting the same batch size $b$. As a result, it converges slower than 1-GPU if we set the same learning rate $lr$. In fact, the variance of the gradient in a mini-batch can be denoted as follows,
$$\mathrm{Var}\Big(\frac{1}{b}\sum_{i=1}^{b} g_i\Big) = \frac{1}{b^2}\sum_{i=1}^{b} \mathrm{Var}(g_i) = \frac{1}{b}\,\mathrm{Var}(g_1)$$
where $g_i$ is the gradient of a randomly selected instance. The reason for the second equal sign is that the variances of the gradients of the randomly selected instances are equal to each other [40]. So the variance of the gradient decreases $b$ times when we increase the batch size by $b$ times. In other words, the gradient becomes more accurate. Then we can increase $lr$ to accelerate the training process. Adding $lr$ into the equation of the gradient’s variance:
$$\mathrm{Var}\Big(lr \cdot \frac{1}{b}\sum_{i=1}^{b} g_i\Big) = \frac{lr^2}{b}\,\mathrm{Var}(g_1)$$
Therefore, when using 4 GPU cards, we can increase the value of $lr$ correspondingly. As a result, the learning curve of 4-GPU-A in Figure 6 is similar to that of 1-GPU.

Asynchronous Data Reading: Training a neural network is usually done in a mini-batch style, in which a mini-batch of data records (e.g., several thousand) is read and fed into the neural network, and then the model parameters are updated in a forward-backward way. There are two possible ways to handle data reading and model updating: sequential and parallel, as shown in Figure 7. In the sequential approach, data reading and model updating are interleaved and processed sequentially, i.e., model updating starts when the current mini-batch of data has been read and fed, and the next mini-batch of data will not be read until the model finishes updating with the current mini-batch. Obviously, this is not efficient. We propose a parallel manner to handle data reading and model updating, namely, asynchronous data reading. A thread is created to read the data records regardless of the model updating, so that the model parameters keep updating without being interrupted by data reading.

Efficiency Validation Experiment:
As shown in Table I, we record the speed up rate^{5} of 4-GPU over 1-GPU data parallelism, and of asynchronous (Asyn) over synchronous (Syn) data reading, for DeepFM models on the Company and Criteo data sets. We omit the validation result on the Criteo-Sequential data set, since Criteo-Sequential and Criteo-Random come from the same original data set with different splitting strategies.
^{5} In this paper, we define the speed up rate of strategy A over strategy B to be the processing time of strategy B divided by the processing time of strategy A.
                 | Company | Criteo-Random
4-GPU over 1-GPU | 2.25 X  | 2.15 X
Asyn over Syn    | 1.12 X  | 1.19 X
TABLE I: Speed up rate for DeepFM-D model.
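The multi-GPU data-parallel update described above can be simulated on a single machine with the following Python sketch. A plain logistic regression stands in for the real model, the "GPU cards" are simulated by a sequential loop, and the batch size is assumed to be divisible by the number of cards; none of this reflects the production implementation.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def mean_gradient(w, records):
    """Mean Logloss gradient of a logistic regression over (x, y) records."""
    grad = [0.0] * len(w)
    for x, y in records:
        err = sigmoid(sum(w_j * x_j for w_j, x_j in zip(w, x))) - y
        for j, x_j in enumerate(x):
            grad[j] += err * x_j
    return [g / len(records) for g in grad]


def data_parallel_gradient(w, batch, num_gpu):
    """Split the batch into num_gpu pieces, compute one gradient per piece, average."""
    piece = len(batch) // num_gpu
    pieces = [batch[i * piece:(i + 1) * piece] for i in range(num_gpu)]
    grads = [mean_gradient(w, p) for p in pieces]  # one piece per simulated card
    return [sum(col) / num_gpu for col in zip(*grads)]


def sgd_step(w, grad, lr):
    """Update the parameters with the averaged gradient."""
    return [w_j - lr * g_j for w_j, g_j in zip(w, grad)]


batch = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([1.0, 0.0], 0),
         ([0.0, 1.0], 1), ([1.0, 1.0], 0), ([1.0, 0.0], 1), ([0.0, 1.0], 0)]
w = [0.2, -0.1]
g_single = mean_gradient(w, batch)
g_parallel = data_parallel_gradient(w, batch, num_gpu=4)
w_new = sgd_step(w, g_parallel, lr=0.0001)
```

With equally sized pieces, the average of the per-piece gradients equals the gradient of the whole batch, so the parameter update is the same as on a single card while the per-piece work can proceed concurrently.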
III-C Relationship with Other Neural Networks
Inspired by the enormous success of deep learning in various applications, several deep models for CTR prediction have been developed recently. This section compares the proposed DeepFM-D and DeepFM-P models with existing deep models for CTR prediction.
III-C1 FNN
As Figure 8 (left) shows, FNN is an FM-initialized feed-forward neural network [8]. The FM pre-training strategy results in two limitations: 1) the embedding parameters might be overly affected by FM; 2) the efficiency is reduced by the overhead introduced by the pre-training stage. In addition, FNN captures only high-order feature interactions. In contrast, DeepFM-D and DeepFM-P models need no pre-training and learn both high- and low-order feature interactions in an end-to-end manner.
There is a detail about the embedding vectors of different models that needs to be mentioned. As shown in Figure 1 of [8], each embedding vector of a field includes both the latent vector of this field and an additional neuron representing the weights of the feature values in this field. In other words, if the dimension of the latent vectors in FM is $k$, then the latent vectors are of size $k + 1$ in FNN. In our experiments, in order to keep the same representation ability, PNN and Wide & Deep use the same size of embedding vectors as FNN. On the other hand, DeepFM models have an FM component to model the weights of individual feature values, so there is no need to include an additional neuron in the embedding vector as FNN does. For this reason, the size of embedding vectors in DeepFM models is smaller than that in FNN by one.
III-C2 PNN
For the purpose of capturing high-order feature interactions, PNN imposes a product layer between the embedding layer and the first hidden layer [9]. According to the different types of product operations, there are three variants: IPNN, OPNN, and PNN*, where IPNN is based on the inner product of vectors, OPNN is based on the outer product, and PNN* is based on both inner and outer products. To make the computation more efficient, the authors proposed approximated computations of both inner and outer products: 1) the inner product is approximately computed by eliminating some neurons; 2) the outer product is approximately computed by compressing $m$ $k$-dimensional feature vectors into one $k$-dimensional vector. However, we find that the outer product is less reliable than the inner product, since the approximated computation of the outer product loses much information, which makes the result unstable. Although the inner product is more reliable, it still suffers from high computational complexity, because the output of the product layer is connected to all neurons of the first hidden layer. Like FNN, all PNNs ignore low-order feature interactions.
III-C3 Wide & Deep
Wide & Deep (Figure 8 (right)) is proposed by Google to model low- and high-order feature interactions simultaneously. As shown in [2], there is a need for expert feature engineering on the input to the “wide” part (for instance, the cross-product of users’ installed apps and impression apps in app recommendation). In contrast, DeepFM-D and DeepFM-P need no such expert knowledge to handle the input, because they learn directly from the raw input features.
A straightforward extension to this model is replacing LR by FM (we also evaluate this extension in Section IV). This extension is similar to DeepFM-D, but the DeepFM framework shares the feature embedding between the FM component and the deep component. Through back-propagation, this sharing strategy lets both low- and high-order feature interactions influence the feature representation, which models the representation more precisely.
III-C4 Summarizations
To summarize, the relationship between the DeepFM framework and the other deep models in four aspects is presented in Table II. DeepFM is the only framework that requires no pre-training and no feature engineering, and captures both low- and high-order feature interactions.
             | No Pre-training | High-order Features | Low-order Features | No Feature Engineering
FNN          | ×               | √                   | ×                  | √
PNN          | √               | √                   | ×                  | √
Wide & Deep  | √               | √                   | √                  | ×
DeepFM-D (P) | √               | √                   | √                  | √
IV Experiments
In this section, we conduct both offline and online experiments to evaluate our proposed DeepFM framework.
In the offline experiments, we empirically compare two instances of our proposed DeepFM framework (namely, DeepFM-D and DeepFM-P) with other state-of-the-art models. The evaluation results indicate that DeepFM-D and DeepFM-P are more effective than any other state-of-the-art model. Efficiency tests of DeepFM-D, DeepFM-P and the baseline models are also performed.
In the online experiments, we conduct an A/B test over seven consecutive days to evaluate the performance of the DeepFM framework. Among the DeepFM models, DeepFM-D has relatively better efficiency and performance. Therefore, we adopt DeepFM-D as our model to compare with a well-engineered LR model, which is one of the most popular CTR prediction models in industry. In addition, to better understand the result of the A/B test, we compare the recommendation lists generated by LR and DeepFM-D through an online simulation experiment.
IV-A Setup of Offline Experiments
IV-A1 Data sets
We evaluate the effectiveness and efficiency of our DeepFM-D and DeepFM-P models on the following three data sets.

Criteo data set: The Criteo data set (Footnote 6: http://labs.criteo.com/downloads/2014-kaggle-display-advertising-challenge-dataset/) includes 45 million users’ click records. There are 13 continuous features and 26 categorical ones. We split the data set in two different ways, randomly and sequentially, resulting in Criteo-Random and Criteo-Sequential. To obtain Criteo-Random, the original Criteo data set is randomly split into two parts: 9/10 for training and the remaining 1/10 for testing. To obtain Criteo-Sequential, the original data set is split sequentially: the first 6/7 is for training and the remaining 1/7 is for testing, since the original data set consists of data instances from 7 consecutive days. In Criteo-Sequential, no information is leaked, but the data is significantly biased between the training set and the test set, because the training set contains only six days’ records instead of a full week’s records. In Criteo-Random, on the contrary, information may be leaked, but there is no significant bias between the training set and the test set.

Company data set: The Company data set is a commercial industrial data set. We collect 8 consecutive days of users’ click records from the game center of the Company App Store: the first 7 days’ records are used for training, and the next day’s records for testing. There are around 1 billion records in the whole collected data set. The data set contains app features (e.g., identification and category), user features (e.g., the user’s downloaded apps), and context features (e.g., operation time).
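The two ways of splitting the Criteo data can be sketched as follows. This is an illustrative sketch rather than the authors' preprocessing code: it assumes the records are indexed in chronological order, and the function names and the fixed seed are hypothetical.

```python
import numpy as np

def random_split(n_records, test_fraction=0.1, seed=0):
    """Random split (Criteo-Random style): 9/10 train, 1/10 test by default."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_records)
    n_test = int(round(n_records * test_fraction))
    return np.sort(perm[n_test:]), np.sort(perm[:n_test])

def sequential_split(n_records, train_fraction=6 / 7):
    """Sequential split (Criteo-Sequential style): earliest 6/7 train, rest test.

    Assumes index order equals time order, so no future record leaks
    into the training set.
    """
    cut = int(round(n_records * train_fraction))
    idx = np.arange(n_records)
    return idx[:cut], idx[cut:]
```

With the sequential split every training index precedes every test index, which avoids leakage but exposes the model to a distribution shift between the two periods; the random split has the opposite trade-off.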
IV-A2 Evaluation Metrics
We use two evaluation metrics in our experiments: AUC (Area Under ROC Curve) and Logloss (Logistic loss). AUC and Logloss are two of the most commonly used evaluation metrics for binary classification problems. For such machine learning models, the prediction is the probability that a given data record belongs to a certain class. AUC and Logloss are more suitable than precision and recall, because computing precision and recall requires a user-defined threshold to convert a probability value into a class label, and the choice of this threshold affects the precision and recall significantly. AUC and Logloss avoid such user-defined threshold values.
AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming positive ranks higher than negative) [41]. Logloss (or cross entropy) measures the distance between two distributions, one of which is predicted by the model while the other is given by the labels of the data instances. Note that Logloss is also the objective function of our proposed model. The formula of Logloss is presented in Equation 11.
IV-A3 Model Comparison
In our experiments, we compare the performance of 12 models, which are divided into four categories: Wide models, Deep models, Wide & Deep models and DeepFM models.

Wide models: LR, FM.

Deep models: DNN, FNN, PNN. There are three variants of PNN, namely IPNN, OPNN and PNN.

Wide & Deep models: The original Wide & Deep model is discussed in Section III-C. To eliminate the feature engineering effort, we adapt the original Wide & Deep model by replacing LR with FM as the wide part. To distinguish these two variants of Wide & Deep, we name them LR & DNN and FM & DNN, respectively. (Footnote 7: We do not use the Wide & Deep API released by Google, as the efficiency of that implementation is very low. We implement Wide & Deep ourselves and simplify it by using a shared optimizer for both the deep and the wide part.)

DeepFM models: DeepFM-D and DeepFM-P. Our DeepFM-P also has three variants, denoted as DeepFM-IP, DeepFM-OP and DeepFM-P, whose deep components are the three corresponding variants of PNN.
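Every model listed above is scored with the two metrics of Section IV-A2. As a minimal sketch of those definitions (not the evaluation code used in the paper), AUC can be computed directly from its pairwise-ranking interpretation and Logloss from the binary cross entropy. The pairwise AUC below costs O(P·N) memory and is meant for small samples only; the tie handling (half credit) and the clipping constant `eps` are assumptions.

```python
import numpy as np

def logloss(y_true, y_prob, eps=1e-15):
    """Binary cross entropy averaged over records; eps avoids log(0)."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y = np.asarray(y_true)
    p = np.asarray(y_prob, dtype=float)
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))
```

Neither function takes a decision threshold, which is the property that makes these metrics preferable to precision and recall for probability outputs.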
IV-A4 Parameter Settings
To achieve the best performance for each individual model on the Criteo-Random, Criteo-Sequential and Company data sets, we conducted a careful parameter study on all the data sets. Due to the space limit, we only discuss the parameter study on the Company data set, which is presented in Section IV-C. The hyper-parameters of the compared deep models on the Criteo-Sequential and Criteo-Random data sets are stated in Table III, which gives the activation function, dropout and structure of hidden layers. The optimizer for LR is FTRL [3], and the optimizer for the other models is Adam [42]. The embedding dimension of the FM and DeepFM models is 10, and that of the other models is 11 (discussed in Section III-C). Note that the hyper-parameters of the baseline models on the Criteo-Random data set follow the setting in [9], and we keep the same setting for the deep components of the DeepFM models to validate the superiority of our models.
Model  Criteo-Random (activation; dropout; hidden layers)  Criteo-Sequential (activation; dropout; hidden layers) 
DNN  relu; 0.5; 400-400-400  relu; 0.8; 800-800-800-800-800 
FNN  relu; 0.5; 400-400-400  relu; 0.9; 1100-1000-900-800-700-600-500 
IPNN  tanh; 0.5; 400-400-400  tanh; 0.8; 800-800-800 
OPNN  relu; 0.5; 400-400-400  relu; 0.9; 800-800-800-800-800 
PNN  relu; 0.5; 400-400-400  relu; 0.8; 800-800-800-800-800 
LR & DNN  relu; 0.5; 400-400-400  relu; 1.0; 1100-1000-900-800-700-600-500 
FM & DNN  relu; 0.5; 400-400-400  tanh; 0.7; 1000-900-800-700-600 
DeepFM-D  relu; 0.5; 400-400-400  relu; 0.9; 800-800-800-800-800-800-800 
DeepFM-IP  relu; 0.5; 400-400-400  relu; 0.8; 800-800-800 
DeepFM-OP  relu; 0.5; 400-400-400  relu; 0.9; 800-800-800-800-800 
DeepFM-P  relu; 0.5; 400-400-400  relu; 0.8; 800-800-800-800-800 
IV-B Performance of Offline Evaluation
In this section, we evaluate the efficiency and effectiveness of the models listed in Section IV-A3 on the three data sets.
IV-B1 Efficiency Comparison
The efficiency of deep learning models is important for real-world applications. We compare the efficiency of different models on the Company data set by the following formula: $\frac{|\text{training time of deep CTR model}|}{|\text{training time of LR}|}$, which is the running time normalized by that of the LR model. The results are shown in Figure 9, including the tests on CPU and GPU, from which we make the following observations:

Pre-training makes FNN less efficient, especially on GPU, since pre-training with the FM model is not well suited to GPU acceleration.

IPNN, PNN, DeepFM-IP and DeepFM-P are the least efficient models on both CPU and GPU. Although their speed-up on GPU is higher than that of the other models, they remain computationally expensive because of the inefficient inner product operations.

DNN and OPNN are the most efficient models on both CPU and GPU. Testing on GPU shows a much more obvious gap between these two models and the other models.

FNN, FM & DNN, LR & DNN, DeepFM-D and DeepFM-OP have similar efficiency on both CPU and GPU.
In real industrial applications, we are equipped with powerful servers with GPUs. The efficiency of DeepFM-D and DeepFM-OP is acceptable for us, since they are only 41% and 14% slower than the LR model on GPU.
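As a small worked example of the efficiency measure, the sketch below assumes that the normalized running time is the ratio of a model's training time to the LR training time on the same hardware. The function names and the timings in the usage note are hypothetical, not measurements from the paper.

```python
def normalized_time(model_seconds: float, lr_seconds: float) -> float:
    """Running time of a model expressed in units of the LR running time."""
    if lr_seconds <= 0:
        raise ValueError("LR running time must be positive")
    return model_seconds / lr_seconds

def percent_slower(ratio: float) -> float:
    """Convert a normalized time into a 'percent slower than LR' figure."""
    return (ratio - 1.0) * 100.0
```

Under this reading, a model that takes 141 seconds where LR takes 100 seconds has a normalized time of 1.41, which corresponds to being 41% slower than LR.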
IV-B2 Effectiveness Comparison
Model  AUC (Company)  AUC (Criteo-Random)  AUC (Criteo-Sequential)  LogLoss (Company)  LogLoss (Criteo-Random)  LogLoss (Criteo-Sequential) 
LR  0.8641  0.7804  0.7777  0.02648  0.46782  0.4794 
FM  0.8679  0.7894  0.7843  0.02632  0.46059  0.4739 
DNN  0.8650  0.7860  0.7953  0.02643  0.4697  0.4580 
FNN  0.8684  0.7959  0.8038  0.02628  0.46350  0.4507 
IPNN  0.8662  0.7971  0.7995  0.02639  0.45347  0.4543 
OPNN  0.8657  0.7981  0.8002  0.02640  0.45293  0.4536 
PNN  0.8663  0.7983  0.8005  0.02638  0.453305  0.4533 
LR & DNN  0.8671  0.7858  0.7973  0.02635  0.46596  0.4565 
FM & DNN  0.8658  0.7980  0.7985  0.02639  0.45343  0.4551 
DeepFM-D  0.8715  0.8016  0.8048  0.02619  0.44985  0.4497 
DeepFM-IP  0.8720  0.8019  0.8019  0.02616  0.4496  0.4525 
DeepFM-OP  0.8713  0.8008  0.8020  0.02619  0.4510  0.4524 
DeepFM-P  0.8716  0.7995  0.8015  0.02619  0.4515  0.4530 
The performance (in terms of AUC and Logloss) of the compared models on the Criteo-Random, Criteo-Sequential and Company data sets is presented in Table IV (the values in the table are averaged over 5 runs, and the variances of AUC and Logloss are in the order of ). We draw the following conclusions:

Learning feature interactions improves the performance. LR, the only model that does not consider feature interactions, performs worse than the other models. As the best models, the DeepFM models outperform LR by 0.91%, 2.75% and 3.48% in terms of AUC (1.21%, 3.89% and 6.2% in terms of Logloss) on the Company, Criteo-Random and Criteo-Sequential data sets, respectively.

A DeepFM model performs better than the model that keeps only its FM component or only its deep component. That is to say, the performance of DeepFM-D (DeepFM-IP, DeepFM-OP, DeepFM-P, respectively) is better than that of both FM and DNN (IPNN, OPNN, PNN, respectively). Table V presents the performance improvement of the four DeepFM models over the FM component and over their deep components on the three data sets.

Learning high- and low-order feature interactions simultaneously and properly improves the performance. Among the DeepFM models, DeepFM-D and DeepFM-IP perform the best. DeepFM-D and DeepFM-IP outperform the models that learn only low-order feature interactions (namely, LR and FM) or only high-order feature interactions (namely, FNN, IPNN, OPNN, PNN). Compared with the best baseline that learns high- or low-order feature interactions alone, DeepFM-D and DeepFM-IP achieve improvements of more than 0.41%, 0.45% and 0.12% in terms of AUC (0.46%, 0.82% and 0.22% in terms of Logloss) on the Company, Criteo-Random and Criteo-Sequential data sets.

Learning high- and low-order feature interactions simultaneously while sharing the same feature embedding for both improves the performance. DeepFM-D outperforms the models that learn high- and low-order feature interactions using separate feature embeddings (namely, LR & DNN and FM & DNN). DeepFM-D achieves improvements of more than 0.48%, 0.44% and 0.79% in terms of AUC (0.58%, 0.80% and 1.2% in terms of Logloss) on the Company, Criteo-Random and Criteo-Sequential data sets, respectively.
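The percentages quoted in the conclusions above are relative improvements over a baseline. The short sketch below shows the arithmetic, using values copied from Table IV (LR versus the best DeepFM model on each data set); the function name is hypothetical, and for Logloss the sign is flipped because lower is better.

```python
def relative_improvement(baseline: float, model: float,
                         higher_is_better: bool = True) -> float:
    """Relative improvement of `model` over `baseline`, in percent."""
    delta = (model - baseline) if higher_is_better else (baseline - model)
    return 100.0 * delta / baseline

# AUC, LR vs. best DeepFM model (values from Table IV)
auc_company = relative_improvement(0.8641, 0.8720)
auc_criteo_seq = relative_improvement(0.7777, 0.8048)
# Logloss, LR vs. best DeepFM model on Company (lower is better)
logloss_company = relative_improvement(0.02648, 0.02616, higher_is_better=False)
```

These three calls reproduce the reported 0.91%, 3.48% and 1.21% after rounding to two decimals.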
Data set  Metric  Wide: DeepFM-D  Wide: DeepFM-IP  Wide: DeepFM-OP  Wide: DeepFM-P  Deep: DeepFM-D  Deep: DeepFM-IP  Deep: DeepFM-OP  Deep: DeepFM-P  
Company  AUC  0.4%  0.47%  0.39%  0.43%  0.75%  0.67%  0.65%  0.61% 
Company  LogLoss  0.49%  0.61%  0.49%  0.49%  0.91%  0.87%  0.80%  0.72%  
Criteo-Random  AUC  1.54%  1.58%  1.44%  1.28%  1.98%  0.60%  0.34%  0.15% 
Criteo-Random  LogLoss  2.33%  2.39%  2.08%  1.97%  4.22%  0.85%  0.43%  0.40%  
Criteo-Sequential  AUC  2.61%  2.24%  2.26%  2.19%  1.19%  0.30%  0.22%  0.12% 
Criteo-Sequential  LogLoss  5.1%  4.52%  4.54%  4.41%  1.81%  0.40%  0.26%  0.07% 
Overall, our four proposed DeepFM models perform better than the baseline models in all the cases. In particular, our proposed DeepFM-D model (Footnote 8: Although DeepFM-IP performs slightly better than DeepFM-D on the Company data set, we still choose DeepFM-D in our real scenario to avoid the high time complexity of DeepFM-IP.) beats the competitors by more than 0.36% and 0.34% in terms of AUC and Logloss on the Company data set. In fact, a small improvement in offline AUC is likely to lead to a significant increase in online CTR. As reported in [2], compared with LR, Wide & Deep improves AUC by 0.275% (offline) and the improvement of online CTR is 3.9%. Moreover, we also conduct t-tests between our proposed DeepFM models and the baseline models on the three data sets. We find that the p-values are all less than , which indicates that our improvement over existing models is significant.
IV-C Offline Hyper-Parameter Study
We study the impact of different hyper-parameters on different deep models, using the Company data set. The order is: 1) activation functions; 2) dropout rate; 3) number of neurons per layer; 4) number of hidden layers; 5) network shape. It can be clearly observed from the following sections that our proposed DeepFM models are significantly superior to the baseline models in all the studied cases.
IV-C1 Activation Function
According to [9], relu and tanh are more suitable for deep models than sigmoid. The detailed discussion of different activation functions is presented in Section III-B. In this paper, we compare the performance of deep models when applying relu and tanh as the activation function. As shown in Figure 10, relu is more appropriate than tanh for all the deep models except IPNN, for the reason stated in Section III-B.
IV-C2 Dropout
In this paper, dropout [39] refers to the probability that a neuron is kept in the network, so a value of 1.0 means that no neuron is dropped. Dropout is a regularization technique that trades off the precision and the complexity of the neural network. We set the dropout to 1.0, 0.9, 0.8, 0.7, 0.6 and 0.5. As shown in Figure 11, all the models are able to reach their own best performance when the dropout is properly set (from 0.6 to 0.9). The result shows that adding a reasonable amount of randomness to a model can strengthen its robustness and generalization.
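Since the dropout value here is a keep probability rather than a drop rate, a brief sketch may help avoid confusion with libraries that use the opposite convention. The code below is a minimal NumPy illustration under the assumption of "inverted" dropout (kept activations are rescaled by `1 / keep_prob` at training time), which is a common implementation choice and not something the paper specifies; the function name is hypothetical.

```python
import numpy as np

def dropout(x, keep_prob, rng, training=True):
    """Keep each activation with probability keep_prob (inverted dropout)."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training or keep_prob == 1.0:
        return x                              # 1.0 keeps every neuron
    mask = rng.random(x.shape) < keep_prob    # True where the neuron is kept
    return x * mask / keep_prob               # rescale so the expectation is unchanged
```

With `keep_prob = 0.8`, roughly 80% of the activations survive and each survivor is scaled by 1.25; at evaluation time (`training=False`) the input passes through unchanged.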
IV-C3 Number of Neurons per Layer
When other factors remain the same, increasing the number of neurons per layer increases model complexity. To study the effect of the number of neurons per layer on the performance, we set the number of hidden layers to 3 and keep the number of neurons the same for each hidden layer. As observed from Figure 12, increasing the number of neurons does not always bring benefit. For instance, the performance of DeepFM-D stays stable when the number of neurons per layer is increased from 400 to 800, and OPNN even performs worse when the number of neurons is increased from 400 to 800. This is because an over-complicated model is prone to overfitting. On our data set, 200 or 400 neurons per layer is a good choice.
IV-C4 Number of Hidden Layers
When varying the number of hidden layers, we fix the number of neurons in each hidden layer. As presented in Figure 13, increasing the number of hidden layers improves the model performance at the beginning; however, the performance degrades if the number of hidden layers keeps increasing, because of overfitting.
IV-C5 Network Shape
We test four different network shapes: constant, increasing, decreasing, and diamond. When we change the network shape, we fix the number of hidden layers and the total number of neurons. For instance, when the number of hidden layers is 3 and the total number of neurons is 600, the four shapes are: constant (200-200-200), increasing (100-200-300), decreasing (300-200-100), and diamond (150-300-150). As we can see from Figure 14, the “constant” network shape is empirically better than the other three options, which is consistent with previous studies [43].
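The four shapes in the example above can be generated from a fixed layer count and neuron budget. The sketch below is one possible way to do this, assuming linearly growing or shrinking layer widths; the paper only gives the 3-layer, 600-neuron example, so the generalization to other depths and the function name are assumptions.

```python
def layer_sizes(shape: str, n_layers: int, total: int) -> list:
    """Distribute `total` neurons over `n_layers` hidden layers for a given shape."""
    weights = {
        "constant": [1] * n_layers,
        "increasing": list(range(1, n_layers + 1)),
        "decreasing": list(range(n_layers, 0, -1)),
        # widest in the middle, narrow at both ends
        "diamond": [min(i + 1, n_layers - i) for i in range(n_layers)],
    }[shape]
    scale = total / sum(weights)
    # Rounding may make the sizes sum to slightly more or less than `total`
    # when the budget does not divide evenly.
    return [round(w * scale) for w in weights]
```

For 3 hidden layers and 600 neurons, this yields 200-200-200, 100-200-300, 300-200-100 and 150-300-150 for the constant, increasing, decreasing and diamond shapes.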
IV-D Online Experiments
According to the results of the offline experiments on the Criteo-Sequential, Criteo-Random and Company data sets, the DeepFM models are more effective than the other existing models in terms of AUC and Logloss. To verify the superiority of DeepFM in a production environment, we implement and deploy DeepFM-D in the recommendation engine of Huawei App Market, which is one of the most popular Android app markets in China.
Furthermore, we conduct two kinds of experiments to reveal the discriminative power of DeepFM-D compared to the LR model in the CTR prediction task.

A/B test: Besides offline evaluation, it is more valuable to verify whether DeepFM-D retains its superiority in the real production environment of Huawei App Market. Therefore, we conduct an A/B test over seven consecutive days to compare the performance of DeepFM-D against a well-engineered LR model.

Online simulation: From a model perspective, compared with LR, DeepFM-D is able to capture high-order feature interactions, resulting in highly personalized recommendation. We aim to verify this statement with an online simulation, by analyzing the difference between the recommendation results generated by DeepFM-D and LR.
In the rest of this section, we give a brief description of the experiment settings in Section IV-D1, and then present the results of the online A/B test and the online simulation in Section IV-D2 and Section IV-D3, respectively.
IV-D1 Setting
In this section, we present the settings of the A/B test and the online simulation, including the experiment setup and evaluation metrics.

A/B test: Considering the project launching schedule, we split the users into 2 groups: one group receives recommendations from an LR model, which is one of the most popular CTR prediction models in industry; the other group receives recommendations from DeepFM-D. Both the DeepFM-D and the LR models are updated on a daily basis. The A/B test is conducted on the “fun games” scenario in Huawei App Market. There are hundreds of millions of real users in Huawei App Market, from whom consent has been acquired. After running the online A/B test for seven consecutive days, we collect the number of browsing and downloading records for both user groups on each day. We use CTR (Click-Through Rate) and CVR (ConVersion Rate) as evaluation metrics:
$$\mathrm{CTR} = \frac{N_{\text{download}}}{N_{\text{browse}}} \tag{12}$$
$$\mathrm{CVR} = \frac{N_{\text{download}}}{N_{\text{user}}} \tag{13}$$
where $N_{\text{download}}$ is the number of download records, $N_{\text{browse}}$ is the number of browsing records, and $N_{\text{user}}$ is the number of visited users.

Online simulation: The online simulation analyzes the properties of the recommendation lists generated by LR and DeepFM-D. To study how the recommendation results of LR and DeepFM-D differ in terms of personalization and diversity, we compare the cases for different types of users. Differentiated by users’ downloading history, we generate $T$ types of users ($T = 6$), and each type includes 100 users. The user set is denoted as $U = \{u_i^t\}$, where $u_i^t$ represents the $i$-th user of type $t$ and $U^t$ represents all the users of type $t$. A user of type $t$ is generated by sampling several apps of type $t$ as the user’s downloading history. Then, for the user set $U$, we use the trained LR model (and DeepFM-D model) to generate the recommendation lists $R = \{R_u\}$, where $u \in U$ and $R_u$ is the list for user $u$. Based on the recommendation lists, we adopt personalization@L [44], coverage@L and popularity@L [45] as the evaluation metrics to investigate the differences between the LR and DeepFM-D models.

The personalization@L considers the diversity of the Top-$L$ places in different users’ recommendation lists. We define the inter-list distance between the recommendation lists of user $u$ and user $v$ as
$$h_{uv}(L) = 1 - \frac{q_{uv}(L)}{L} \tag{14}$$
where $q_{uv}(L)$ is the number of common items in the Top-$L$ places of both lists. The inter-group distance between the recommendation lists of user groups $U^a$ and $U^b$ is defined as the aggregated inter-list distance between the users across $U^a$ and $U^b$, as
$$H_{ab}(L) = \frac{1}{|U^a|\,|U^b|} \sum_{u \in U^a} \sum_{v \in U^b} h_{uv}(L) \tag{15}$$
where $|U^a|$ is the cardinality of user group $U^a$. Finally, the personalization of the recommendation lists produced by a model is defined as the aggregated inter-group distance between all pairs of user groups, as
$$\text{personalization@}L = \frac{2}{T(T-1)} \sum_{a < b} H_{ab}(L) \tag{16}$$

The coverage@L considers the percentage of apps recommended in the Top-$L$ places of all the users’ recommendation lists over all the candidate apps:
$$\text{coverage@}L = \frac{\left|\bigcup_{u \in U} R_u^L\right|}{|I|} \tag{17}$$
where $R_u^L$ is the set of recommended apps in the Top-$L$ places in $R_u$ and $I$ is the set of all candidate apps.

In addition, we also measure the popularity@L, which is defined as:
$$\text{popularity@}L = \frac{1}{L} \sum_{i \in R_u^L} \frac{d_i}{d_{\max}} \tag{18}$$
where $d_i$ is the number of historical cumulative downloads of the recommended app $i$ in $R_u^L$, and $d_{\max}$ is the number of historical cumulative downloads of the most downloaded app.
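The three list-level metrics can be sketched in code as follows. This is one plausible reading of the verbal definitions above, not the authors' code: the inter-list distance follows [44] (one minus the overlap ratio of two Top-L lists), while the choice to aggregate by plain averaging over user pairs, over group pairs and over list positions is an assumption. All function and parameter names are hypothetical; `groups` is a list of user groups, each group being a list of ranked recommendation lists.

```python
from itertools import combinations

def interlist_distance(list_u, list_v, L):
    """1 - (number of common items in the two Top-L lists) / L."""
    common = len(set(list_u[:L]) & set(list_v[:L]))
    return 1.0 - common / L

def intergroup_distance(group_a, group_b, L):
    """Average inter-list distance over all user pairs across two groups."""
    total = sum(interlist_distance(u, v, L) for u in group_a for v in group_b)
    return total / (len(group_a) * len(group_b))

def personalization_at_L(groups, L):
    """Average inter-group distance over all pairs of user groups."""
    pairs = list(combinations(groups, 2))
    return sum(intergroup_distance(a, b, L) for a, b in pairs) / len(pairs)

def coverage_at_L(rec_lists, n_candidates, L):
    """Share of candidate apps that appear in at least one Top-L list."""
    recommended = set()
    for rec in rec_lists:
        recommended.update(rec[:L])
    return len(recommended) / n_candidates

def popularity_at_L(rec_list, downloads, L):
    """Mean download count of the Top-L apps, normalized by the most downloaded app."""
    d_max = max(downloads.values())
    return sum(downloads[app] / d_max for app in rec_list[:L]) / L
```

Under this reading, a model whose groups receive nearly identical Top-L lists scores close to 0 on personalization@L and low on coverage@L, while a model that always recommends best-sellers scores close to 1 on popularity@L.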

IV-D2 Performance of Online Experiments
In this section, we present the results of the online A/B test. Because of commercial concerns, we only report the improvements of DeepFM-D over LR in terms of CTR and CVR, as shown in Figure 15. The x-axis represents different days, and the y-axis is the improvement of DeepFM-D over LR. Note that the blue bar with slash lines represents the improvement of CTR, while the red bar with horizontal lines represents the improvement of CVR.
The bar chart shows that the performance of DeepFM-D is consistently better than that of LR throughout the whole A/B testing period. Specifically, the improvements of DeepFM-D over LR are at least 10% in terms of both CTR and CVR, except for the CVR on day 7 (which is still very close to 10%). In addition, the highest improvement of CTR reaches about 24% on day 4, and the maximum CVR improvement is about 25% on day 1. The online A/B test results reveal that DeepFM-D leads to a higher CTR and CVR than LR in a recommendation engine of industrial scale.
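The reported improvements are relative lifts of the DeepFM-D group over the LR group. The sketch below shows how such a lift would be computed from daily counts, using the definitions of CTR (download records over browsing records) and CVR (download records over visited users) given in Section IV-D1. The function names and all counts in the usage note are hypothetical; the paper does not disclose absolute numbers.

```python
def ctr(n_downloads: int, n_browses: int) -> float:
    """Download records divided by browsing records."""
    return n_downloads / n_browses

def cvr(n_downloads: int, n_users: int) -> float:
    """Download records divided by visited users."""
    return n_downloads / n_users

def relative_lift(treatment: float, control: float) -> float:
    """Relative improvement of the treatment group over the control group."""
    return (treatment - control) / control
```

For example, if the LR group had 100 downloads from 1,000 browsing records and the DeepFM-D group had 110 downloads from 1,000 browsing records, the CTR lift would be 10%.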
IV-D3 The Property of Online Recommendations
To better understand the results of the online experiments, we conduct online simulation experiments and compare the personalization@L, coverage@L and popularity@L () of the recommendation lists produced by LR and DeepFM-D. The results are presented in Figure 16, Figure 17 and Figure 18, respectively. In these three figures, the blue bar with slash lines and the red bar with horizontal lines represent the measurements of the recommendation lists produced by LR and DeepFM-D, respectively.
As shown in Figure 16 and Figure 17, the personalization@L and coverage@L of the recommendation lists generated by DeepFM-D are much larger than those of LR. The personalization@L is the aggregated inter-group distance between the recommendation lists of different user groups; therefore, a low personalization@L means that the Top-L places in the recommendation lists of different users are similar. The coverage@L has similar semantics: when coverage@L is low, the Top-L places in the recommendation lists concentrate on a small range of apps. The results of personalization@L and coverage@L demonstrate that the Top-L places in the recommendation lists of DeepFM-D are more diverse than those of LR.
Figure 18 presents the comparison of the popularity@L of the recommendation lists generated by LR and DeepFM-D. There are two indices in this experiment: the average and the variance of the historical cumulative downloads of the apps contained in the Top-L places of the recommendation lists. Specifically, the black line on the top of each bar is the variance over the 600 recommendation lists (100 users per type × 6 types). Compared with DeepFM-D, LR generates recommendation lists with higher average popularity and lower variance. That is to say, the LR model tends to recommend popular apps in the top positions and is more likely to ignore the specific interests of different users. In contrast, due to its superior ability to capture feature interactions, DeepFM-D is better able to capture the specific interests of different users.
V Conclusions
In this paper, we proposed DeepFM, an end-to-end wide & deep learning framework for CTR prediction, to overcome the shortcomings of the state-of-the-art models. DeepFM trains a deep component and an FM component jointly. It gains performance improvement from these advantages: 1) it does not need any pre-training; 2) it learns both high- and low-order feature interactions; 3) it introduces a sharing strategy of feature embedding to avoid feature engineering. We studied two instances of the DeepFM framework, namely DeepFM-D and DeepFM-P, whose deep components are DNN and PNN, respectively. The offline experiments on three real-world data sets demonstrate that 1) our proposed DeepFM-D and DeepFM-P outperform the state-of-the-art models in terms of AUC and Logloss on all three data sets; 2) as one of the best-performing models, DeepFM-D has efficiency comparable to that of the LR model on GPU, which is acceptable in industrial applications.
To verify the superiority of the DeepFM framework in a production environment, we deployed DeepFM-D in the recommendation engine of Huawei App Market. We also covered related practice in deploying our framework, such as the multi-GPU architecture and asynchronous data reading. Compared with a well-engineered LR model, which is one of the most popular CTR prediction models, DeepFM-D achieves more than 10% improvement of CTR in the online A/B test.
References
 [1] C. L. A. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon, “Novelty and diversity in information retrieval evaluation,” in ACM SIGIR, 2008, pp. 659–666.
 [2] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” CoRR, vol. abs/1606.07792, 2016.
 [3] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica, “Ad click prediction: a view from the trenches,” in ACM SIGKDD, 2013.
 [4] S. Rendle, “Factorization machines,” in ICDM.
 [5] Q. Liu, F. Yu, S. Wu, and L. Wang, “A convolutional click prediction model,” in CIKM.

 [6] Y. Zhang, H. Dai, C. Xu, J. Feng, T. Wang, J. Bian, B. Wang, and T. Liu, “Sequential click prediction for sponsored search with recurrent neural networks,” in AAAI.
 [7] J. Chen, B. Sun, H. Li, H. Lu, and X. Hua, “Deep CTR prediction in display advertising,” in ACM MM.
 [8] W. Zhang, T. Du, and J. Wang, “Deep learning over multi-field categorical data - A case study on user response prediction,” in ECIR.
 [9] Y. Qu, H. Cai, K. Ren, W. Zhang, Y. Yu, Y. Wen, and J. Wang, “Product-based neural networks for user response prediction,” CoRR, vol. abs/1611.00144, 2016.

 [10] H. Guo, R. Tang, Y. Ye, Z. Li, and X. He, “Deepfm: A factorization-machine based neural network for CTR prediction,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017, pp. 1725–1731.
 [11] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in WWW, pp. 285–295.
 [12] Y. Koren, R. M. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” IEEE Computer, vol. 42, no. 8, pp. 30–37, 2009.
 [13] Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in ACM SIGKDD. ACM, pp. 426–434.
 [14] L. Si and R. Jin, “Flexible mixture model for collaborative filtering,” in ICML, 2003, pp. 704–711.
 [15] C. Wang and D. M. Blei, “Collaborative topic modeling for recommending scientific articles,” in ACM SIGKDD, 2011, pp. 448–456.
 [16] T. Chen, W. Zhang, Q. Lu, K. Chen, Z. Zheng, and Y. Yu, “Svdfeature: a toolkit for feature-based collaborative filtering,” JMLR, vol. 13, pp. 3619–3622, 2012.
 [17] X. Li, G. Cong, X. Li, T. N. Pham, and S. Krishnaswamy, “Rankgeofm: A ranking based geographical factorization method for point of interest recommendation,” in ACM SIGIR, 2015, pp. 433–442.
 [18] Y. Chang, C. Hsieh, K. Chang, M. Ringgaard, and C. Lin, “Training and testing low-degree polynomial data mappings via linear svm,” JMLR, vol. 11, pp. 1471–1490, 2010.
 [19] Y. Juan, Y. Zhuang, W. Chin, and C. Lin, “Field-aware factorization machines for CTR prediction,” in RecSys, 2016, pp. 43–50.
 [20] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, and J. Q. Candela, “Practical lessons from predicting clicks on ads at facebook,” in ADKDD, 2014, pp. 5:1–5:9.
 [21] S. Rendle and L. Schmidt-Thieme, “Pairwise interaction tensor factorization for personalized tag recommendation,” in WSDM, 2010, pp. 81–90.
 [22] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich, “Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine,” in ICML, 2010, pp. 13–20.
 [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.

 [24] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS, 2013, pp. 3111–3119.
 [25] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE ICASSP, 2013, pp. 6645–6649.
 [26] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [27] P. Covington, J. Adams, and E. Sargin, “Deep neural networks for youtube recommendations,” in RecSys, 2016, pp. 191–198.
 [28] R. Salakhutdinov, A. Mnih, and G. E. Hinton, “Restricted boltzmann machines for collaborative filtering,” in ICML, 2007, pp. 791–798.
 [29] A. van den Oord, S. Dieleman, and B. Schrauwen, “Deep content-based music recommendation,” in NIPS, 2013, pp. 2643–2651.
 [30] Y. Wu, C. DuBois, A. X. Zheng, and M. Ester, “Collaborative denoising auto-encoders for top-n recommender systems,” in ACM WSDM, 2016, pp. 153–162.
 [31] C. Wu, A. Ahmed, A. Beutel, A. J. Smola, and H. Jing, “Recurrent recommender networks,” in WSDM, 2017, pp. 495–503.
 [32] L. Zheng, V. Noroozi, and P. S. Yu, “Joint deep modeling of users and items using reviews for recommendation,” in WSDM, 2017, pp. 425–434.

 [33] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “Autorec: Autoencoders meet collaborative filtering,” in WWW, 2015, pp. 111–112.
 [34] H. Wang, N. Wang, and D. Yeung, “Collaborative deep learning for recommender systems,” in ACM SIGKDD, 2015, pp. 1235–1244.
 [35] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” JMLR, vol. 11, pp. 3371–3408, 2010.
 [36] X. Wang and Y. Wang, “Improving content-based and hybrid music recommendation using deep learning,” in ACM MM, 2014, pp. 627–636.
 [37] S. Zhang, L. Yao, and A. Sun, “Deep learning based recommender system: A survey and new perspectives,” CoRR, vol. abs/1707.07435, 2017.
 [38] S. Rendle, Z. Gantner, C. Freudenthaler, and L. Schmidt-Thieme, “Fast context-aware recommendations with factorization machines,” in ACM SIGIR, 2011, pp. 635–644.
 [39] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” JMLR, vol. 15, no. 1, pp. 1929–1958, 2014.
 [40] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” CoRR, vol. abs/1609.04836, 2016.
 [41] T. Fawcett, “An introduction to roc analysis,” Pattern Recogn. Lett., vol. 27, no. 8, pp. 861–874, Jun. 2006.
 [42] I. J. Goodfellow, Y. Bengio, and A. C. Courville, Deep Learning, ser. Adaptive computation and machine learning. MIT Press, 2016.
 [43] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring strategies for training deep neural networks,” JMLR, vol. 10, pp. 1–40, 2009.
 [44] T. Zhou, Z. Kuscsik, J.-G. Liu, M. Medo, J. R. Wakeling, and Y.-C. Zhang, “Solving the apparent diversity-accuracy dilemma of recommender systems,” Proceedings of the National Academy of Sciences, vol. 107, no. 10, pp. 4511–4515, 2010.
 [45] D. Jannach and M. Ludewig, “When recurrent neural networks meet the neighborhood for session-based recommendation,” in Proceedings of the Eleventh ACM Conference on Recommender Systems, RecSys 2017, Como, Italy, August 27-31, 2017, 2017, pp. 306–310.