1. Introduction
Predicting a user's response to an item (e.g., movie, news article, advertising post) in a certain context (e.g., website) is a crucial component of personalized information retrieval (IR) and filtering scenarios, such as online advertising (McMahan et al., 2013; Ren et al., 2016), recommender systems (Koren et al., 2009; Rendle, 2010), and web search (Agichtein et al., 2006; Chapelle and Zhang, 2009).
The core of personalized service
is to estimate the probability that a user will “like”, “click”, or “purchase” an item, given features about the user, the item, and the context
(Menon et al., 2011). This probability indicates the user's interest in the specific item and influences subsequent decision-making such as item ranking (Xue et al., 2004) and ad bidding (Zhang et al., 2014b). Taking online advertising as an example, the estimated Click-Through Rate (CTR) is utilized to calculate a bid price in an ad auction, improving the advertisers' budget efficiency and the platform revenue (Zhang et al., 2014b; Perlich et al., 2012; Ren et al., 2016). Hence, accurate predictions are highly desirable, not only to improve the user experience, but also to boost the volume and profit of the service providers.

Table 1. An example of multi-field categorical data; the last row gives each field's size.

TARGET  WEEKDAY  GENDER  CITY
1       Tuesday  Male    London
0       Monday   Female  New York
1       Tuesday  Female  Hong Kong
0       Tuesday  Male    Tokyo
Number  7        2       1000
The data collected in these tasks is mostly in a multi-field categorical form, and each data instance is normally transformed into a high-dimensional sparse (binary) vector via one-hot encoding (He et al., 2014). Taking Table 1 as an example, the 3 fields are one-hot encoded and concatenated as

$x = [\underbrace{0, 1, 0, 0, 0, 0, 0}_{\text{WEEKDAY=Tuesday}} \;|\; \underbrace{1, 0}_{\text{GENDER=Male}} \;|\; \underbrace{0, \dots, 1, \dots, 0}_{\text{CITY=London}}]$

Each field is represented as a binary vector, of which only 1 entry corresponding to the input is set as 1 while the others are 0. The dimension of a vector is determined by its field size, i.e., the number of unique categories[1] in that field. The one-hot vectors of these fields are then concatenated together in a predefined order.

[1] For clarity, we use "category" instead of "feature" to represent a certain value in a categorical field. For consistency with previous literature, we preserve "feature" in some terminologies, e.g., feature combination, feature interaction, and feature representation.
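The multi-field one-hot encoding described above can be sketched as follows. This is a minimal illustration: the category orderings within each field are arbitrary, and the CITY vocabulary is truncated for brevity (its real size in Table 1 is 1000).

```python
# Field vocabularies, in a predefined field order. Category orderings are
# arbitrary; CITY is truncated here (1000 categories in Table 1).
fields = {
    "WEEKDAY": ["Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday"],           # field size 7
    "GENDER": ["Male", "Female"],                          # field size 2
    "CITY": ["London", "New York", "Hong Kong", "Tokyo"],  # truncated
}

def one_hot_encode(instance):
    """Concatenate the per-field one-hot vectors in the predefined field order."""
    vec = []
    for name, categories in fields.items():
        field_vec = [0] * len(categories)          # dimension = field size
        field_vec[categories.index(instance[name])] = 1  # one active entry
        vec.extend(field_vec)
    return vec

x = one_hot_encode({"WEEKDAY": "Tuesday", "GENDER": "Male", "CITY": "London"})
```

The resulting vector has exactly one active entry per field, and its dimension is the sum of the field sizes.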
Without loss of generality, user response prediction can be regarded as a binary classification problem, and 1/0 are used to denote positive/negative responses respectively (Richardson et al., 2007; Graepel et al., 2010).
A main challenge of such a problem is sparsity. For parametric models, they usually convert the sparse (binary) input into dense representations (e.g., weights, latent vectors), and then search for a separable hyperplane. Fig.
1 shows the model decomposition. In this paper, we mainly focus on modeling and training, thus we exclude preliminary feature engineering (Cui et al., 2011). Many machine learning models have been leveraged or proposed to work on such a problem, including linear models, latent vector-based models, tree models, and DNN-based models. Linear models, such as Logistic Regression (LR)
(Lee et al., 2012) and Bayesian Probit Regression (Graepel et al., 2010), are easy to implement and highly efficient. A typical latent vector-based model is the Factorization Machine (FM) (Rendle, 2010). FM uses weights and latent vectors to represent categories. According to their parametric representations, LR has a linear feature extractor, and FM has a bilinear[2] feature extractor. The predictions of LR and FM are simply based on the sum over weights, thus their classifiers are linear.

[2] Although FM has higher-order formulations (Rendle, 2010), due to efficiency and practical performance, FM is usually implemented with second-order interactions.
FM works well on sparse data and has inspired a lot of extensions, including Field-aware FM (FFM) (Juan et al., 2016). FFM introduces field-aware latent vectors, which give FFM higher capacity and better performance. However, FFM is restricted by space complexity. Inspired by FFM, we identify a coupled gradient issue of latent vector-based models and refine feature interactions[3] as field-aware feature interactions. To solve this issue as well as to save memory, we propose kernel product methods and derive Kernel FM (KFM) and Network in FM (NIFM). Trees and DNNs are potent function approximators.

[3] In (Rendle, 2010), the cross features learned by FM are called feature interactions.
Tree models, such as Gradient Boosting Decision Tree (GBDT)
(Chen and Guestrin, 2016), are popular in various data science contests as well as industrial applications. GBDT explores very high-order feature combinations in a non-parametric way, yet its exploration ability is restricted when the feature space becomes extremely high-dimensional and sparse.
DNNs have also been preliminarily studied in the information system literature (Zhang et al., 2016; Covington et al., 2016; Shan et al., 2016; Qu et al., 2016). In (Zhang et al., 2016), the FM-supported Neural Network (FNN) is proposed. FNN has a pretrained embedding[4] layer and several fully connected layers. Since the embedding layer indeed performs a linear transformation, FNN mainly extracts linear information from the input.

[4] We use "latent vector" in shallow models, and "embedding vector" in DNN models.
Inspired by (Shalev-Shwartz et al., 2017), we identify an insensitive gradient issue: fully connected layers cannot fit such target functions perfectly. From the model decomposition perspective, the above models are restricted by inferior feature extractors or weak classifiers. Incorporating product operations in DNN, we propose the Product-based Neural Network (PNN). PNN consists of an embedding layer, a product layer, and a DNN classifier. The product layer serves as the feature extractor, which can make up for the deficiency of DNNs in modeling feature interactions. We take FM, KFM, and NIFM as feature extractors in PNN, leading to the Inner Product-based Neural Network (IPNN), the Kernel Product-based Neural Network (KPNN), and the Product-network In Network (PIN).
CTR estimation is a fundamental task in personalized advertising and recommender systems, and we take CTR estimation as the working example to evaluate our models. Extensive experiments on 4 large-scale real-world datasets and 1 contest dataset demonstrate the consistent superiority of our models over 8 baselines (Lee et al., 2012; Rendle, 2010; Liu et al., 2015; Zhang et al., 2016; Juan et al., 2016; Guo et al., 2017; Chen and Guestrin, 2016; Xiao et al., 2017) on both AUC and log loss. Besides, PIN achieves a great CTR improvement (34.67%) in an online A/B test. To sum up, our contributions can be highlighted as follows:

We analyze a coupled gradient issue of latent vector-based models. We refine feature interactions as field-aware feature interactions and extend FM with kernel product methods. Our experiments on KFM and NIFM successfully verify our assumption.

We analyze an insensitive gradient issue of DNN-based models and propose PNNs to tackle this issue. PNN has a flexible architecture which can generalize previous models.

We study optimization, regularization, and other practical issues in training and generalization. In our extensive experiments, our models achieve consistently good results on 4 offline datasets, 1 contest, and 1 online A/B test.
The rest of this paper is organized as follows. In Section 2, we introduce related work in user response prediction and other involved techniques. In Sections 3 and 4, we present our PNN models in detail and discuss several practical issues. In Section 5, we show offline evaluation, parameter study, online A/B test, and synthetic experiments respectively. We finally conclude this paper and discuss future work in Section 6.
2. Background and Related Work
2.1. User Response Prediction
User response prediction is normally formulated as a binary classification problem with log-likelihood or cross entropy as the training objective (Richardson et al., 2007; Graepel et al., 2010; Agarwal et al., 2010). Area under the ROC curve (AUC), log loss, and relative information gain are common evaluation metrics (Graepel et al., 2010). Due to the one-hot encoding of categorical data, sparsity is a big challenge in user response prediction. From the modeling perspective, linear Logistic Regression (LR) (Lee et al., 2012; Ren et al., 2016), bilinear Factorization Machine (FM) (Rendle, 2010), and Gradient Boosting Decision Tree (GBDT) (He et al., 2014) are widely used in industrial applications. As illustrated in Fig. 2, LR extracts linear information from the input, FM further extracts bilinear information, while GBDT explores feature combinations in a non-parametric way. From the training perspective, many adaptive optimization algorithms can speed up the training of sparse data, including Follow the Regularized Leader (FTRL) (McMahan et al., 2013)
, Adaptive Moment Estimation (Adam)
(Kingma and Ba, 2014), etc. These algorithms follow a per-coordinate learning rate scheme, making them converge much faster than stochastic gradient descent (SGD).
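The per-coordinate scheme can be sketched with a single Adam update step (a minimal sketch following Kingma and Ba, 2014; the learning rate and decay constants are the usual defaults, not values prescribed by this paper):

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: each coordinate gets its own effective learning rate,
    scaled by running estimates of the first and second gradient moments."""
    out = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g          # first moment (mean)
        v[i] = b2 * v[i] + (1 - b2) * g * g      # second moment (uncentered variance)
        m_hat = m[i] / (1 - b1 ** t)             # bias correction
        v_hat = v[i] / (1 - b2 ** t)
        out.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return out

# Two coordinates with gradients differing by 100x take nearly equal steps,
# because each step is normalized by that coordinate's own moment estimates.
new = adam_step([1.0, 1.0], [0.1, 10.0], m=[0.0, 0.0], v=[0.0, 0.0], t=1)
```

This per-coordinate normalization is what lets rarely-updated (sparse) coordinates keep taking sizable steps.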
From the representation perspective, latent vectors are expressive in representing categorical data. In FM, the side information and user/item identifiers are represented by low-dimensional latent vectors, and the feature interactions are modeled as the inner products of latent vectors. As an extension of FM, Field-aware FM (FFM) (Juan et al., 2016) enables each category to have multiple latent vectors. From the classification perspective, powerful function approximators like GBDT and DNN are more suitable for continuous input. Therefore, in many contests, the winning solutions take FM/FFM as feature extractors to process discrete data, and use the latent vectors or interactions as the input of successive classifiers (e.g., GBDT, DNN). According to model decomposition (Fig. 1), latent vector-based models make predictions simply based on the sum of interactions. This weakness motivates the DNN variants of latent vector-based models.
2.2. DNNbased Models
With the great success of deep learning, it is not surprising there emerge some deep learning techniques for recommender systems (Zhang et al., 2017)
. Primary work includes: (i) Pretraining autoencoders to extract feature representations, e.g., Collaborative Deep Learning
(Wang et al., 2015). (ii) Using DNNs to model general interactions (He et al., 2017; He and Chua, 2017; Qu et al., 2016). (iii) Using DNNs to process images/texts in content-based recommendation (Wang et al., 2017). We mainly focus on (ii) in this paper. The input to a DNN is usually dense and numerical, while the case of multi-field categorical data has not been well studied. The FM-supported Neural Network (FNN) (Zhang et al., 2016) (Fig. 3(a)) has an embedding layer and a DNN classifier. Besides, FNN uses FM to pretrain the embedding layer. Other models use DNNs to improve FM, e.g., Neural Collaborative Filtering (NCF) (He et al., 2017), Neural FM (NFM) (He and Chua, 2017), and Attentional FM (AFM) (Xiao et al., 2017). NCF uses a DNN to solve the collaborative filtering problem. NFM extends NCF to more general recommendation scenarios. Based on NFM, AFM uses an attention mechanism to improve feature interactions, and becomes a state-of-the-art model.
The Convolutional Click Prediction Model (CCPM) (Liu et al., 2015) (Fig. 3(b)) uses convolutional layers to explore local-global dependencies. CCPM performs convolutions on the neighbor fields in a certain alignment, but fails to model convolutions among non-neighbor fields. RNNs have been leveraged to model historical user behaviors (Zhang et al., 2014a). In this paper, we do not consider sequential patterns.
Wide & Deep Learning (WDL) (Cheng et al., 2016) trains a wide model and a deep model jointly. The wide part uses LR to "memorize", meanwhile the deep part uses a DNN to "generalize". Compared with single models, WDL achieves better AUC in offline evaluations and higher CTR in an online A/B test. WDL requires human efforts for feature engineering on the input to the wide part, thus it is not end-to-end. DeepFM (Guo et al., 2017), as illustrated in Fig. 3(c), can both utilize the strengths of WDL and avoid expertise in feature engineering. It replaces the wide part of WDL with FM. Besides, the FM component and the deep component share the same embedding parameters. DeepFM is regarded as one state-of-the-art model in user response prediction.
To complete our discussion of DNNbased models, we list some less relevant work, such as the following. Product Unit Neural Network (Engelbrecht et al., 1999)
defines the output of each neuron as the product of all its inputs. Multilinear FM
(Lu et al., 2017) studies FM in a multi-task setting. DeepMood (Cao et al., 2017) presents a neural network view for FM and the Multi-view Machine.
2.3. Net-in-Net Architecture
Network In Network (NIN) (Lin et al., 2013)
is originally proposed in CNN. NIN builds micro neural networks between convolutional layers to abstract the data within the receptive field. Multilayer perceptron as a potent function approximator is used in micro neural networks of NIN. GoogLeNet
(Szegedy et al., 2015) makes use of the micro neural networks suggested in (Lin et al., 2013) and achieves great success. NIN is powerful in modeling local dependencies. In this paper, we borrow the idea of NIN, and propose to explore inter-field feature interactions with flexible micro networks.
3. Methodology
As in Fig. 1, the difficulties in learning multi-field categorical data are decomposed into 2 phases: representation and classification. Following this idea, we first explain field-aware feature interactions, then we study the deficiency of DNN classifiers, and finally we present our Product-based Neural Networks (PNNs) in detail.
A commonly used objective for user response prediction is to minimize cross entropy, or namely log loss, defined as
(1)  $\mathcal{L}(y, \hat{y}) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$

where $y \in \{0, 1\}$ is the label and $\hat{y} \in (0, 1)$ is the predicted probability of $y = 1$, more specifically, the probability of a user giving a positive response on the item. We adopt this training objective in all experiments.
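Eq. (1) can be computed directly; a minimal sketch (the clipping constant is an implementation convenience, not part of the definition):

```python
import math

def log_loss(y, y_hat, eps=1e-12):
    """Cross entropy for one binary label y in {0,1} and prediction y_hat in (0,1)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))
```

The loss is 0 for a perfectly confident correct prediction and grows without bound as a confident prediction turns out wrong.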
3.1. Fieldaware Feature Interactions
In user response prediction, the input data contains multiple fields, e.g., WEEKDAY, GENDER, CITY. A field contains multiple categories and takes one category in each data instance. Table 1 shows 4 data examples, each of which contains 3 fields, and each field takes a single value. For example, a male customer located in London buys some beer on Tuesday. From this record we can extract a useful feature combination: "Male and London and Tuesday implies True". The efficacy of feature combinations (a.k.a., cross features) has already been proved (Menon et al., 2011; Rendle, 2010). In FM, the 2nd-order combinations are called feature interactions.
Assume the input data has $n$ categorical fields, $x = [x_1, \dots, x_n]$, where $x_i$ is an ID indicating a category of the $i$-th field. The feature representations learned by parametric models could be weight coefficients (e.g., in LR) or latent vectors (e.g., in FM). For an input instance $x$, each field is converted into a corresponding weight $w_{x_i}$ or latent vector $v_{x_i}$. For an output $\hat{y} \in (0, 1)$, the probability is obtained from the sigmoid function $\sigma(\cdot)$. For simplicity, we use $x$ and $x_i$ to represent the input, and we use $\hat{y}$ to represent the output.
3.1.1. A Coupled Gradient Issue of Latent Vector-based Models
The prediction of FM (Rendle, 2010) can be formulated as

(2)  $\hat{y} = \sigma\big(b + \sum_{i=1}^{n} w_{x_i} + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{x_i}, v_{x_j} \rangle\big)$

where $w_{x_i}$ is the weight of category $x_i$, $v_{x_i}$ is the latent vector of $x_i$, and $b$ is the global bias. Take the first example in Table 1,

$\hat{y} = \sigma\big(b + w_{Tue} + w_{Male} + w_{Lon} + \langle v_{Tue}, v_{Male} \rangle + \langle v_{Tue}, v_{Lon} \rangle + \langle v_{Male}, v_{Lon} \rangle\big)$

The gradient of the latent vector $v_{Tue}$ is proportional to $v_{Male} + v_{Lon}$. FM makes an implicit assumption that a field interacts with different fields in the same manner, which may not be realistic. Suppose GENDER is independent of WEEKDAY; it is desirable to learn $\langle v_{Tue}, v_{Male} \rangle \to 0$. However, the gradient continuously updates $v_{Tue}$ in the direction of $v_{Male} + v_{Lon}$. Conversely, the latent vector $v_{Male}$ is updated in the direction of $v_{Tue} + v_{Lon}$. To summarize, FM uses the same latent vectors in different types of inter-field interactions, which is an oversimplification and degrades the model capacity. We call this problem a coupled gradient issue.
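The coupled gradient can be verified numerically on the first example of Table 1 (a sketch with random latent vectors; the weights and bias are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
# Latent vectors for the active categories of Table 1's first example.
v = {c: rng.normal(size=k) for c in ["Tuesday", "Male", "London"]}
w = {c: 0.1 for c in v}
bias = 0.0

def fm_logit(v, w, bias):
    """FM logit: bias + sum of weights + sum of pairwise inner products."""
    cats = list(v)
    pairwise = sum(v[a] @ v[b] for i, a in enumerate(cats) for b in cats[i + 1:])
    return bias + sum(w.values()) + pairwise

# Gradient of the logit w.r.t. v["Tuesday"] is v["Male"] + v["London"]:
# every inter-field interaction pulls on the same latent vector.
grad_tuesday = v["Male"] + v["London"]
```

A finite-difference check confirms that both interaction terms contribute to the same gradient, which is exactly the coupling described above.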
The gradients of latent vectors could be decoupled by assigning different weights to different interactions, such as the attention mechanism in Attentional FM (AFM) (Xiao et al., 2017):

(3)  $\hat{y} = \sigma\big(b + \sum_{i=1}^{n} w_{x_i} + \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_{ij} \langle v_{x_i}, v_{x_j} \rangle\big)$

where $a_{ij} = A(v_{x_i}, v_{x_j})$ denotes the attention network which takes embedding vectors as input and assigns weights to interactions. In AFM, the gradient of $v_{Tue}$ becomes proportional to $a_{Tue,Male} v_{Male} + a_{Tue,Lon} v_{Lon}$, where the gradients of $v_{Male}$ and $v_{Lon}$ are decoupled when the corresponding attention score approaches 0. However, when the attention score approaches 0, the attention network becomes hard to train.
This problem is solved by Field-aware FM (FFM) (Juan et al., 2016):

(4)  $\hat{y} = \sigma\big(b + \sum_{i=1}^{n} w_{x_i} + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{x_i, j}, v_{x_j, i} \rangle\big)$

where $v_{x_i, j}$ is the latent vector of category $x_i$ used when interacting with field $j$, meaning each category has independent $k$-dimensional latent vectors for the fields it interacts with. Excluding intra-field interactions, we usually use $n - 1$ latent vectors per category in practice. Using field-aware latent vectors, the gradients of different interactions are decoupled, e.g., the gradient of $v_{Tue, GENDER}$ is proportional to $v_{Male, WEEKDAY}$, and the gradient of $v_{Tue, CITY}$ is proportional to $v_{Lon, WEEKDAY}$. This leads to the main advantage of FFM over FM and brings a higher capacity.
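The decoupling can be checked numerically (a sketch; field indices and random vectors are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_fields, k = 3, 4
FIELD = {"Tuesday": 0, "Male": 1, "London": 2}  # WEEKDAY, GENDER, CITY
# In FFM each category keeps one latent vector per other field.
V = {c: {f: rng.normal(size=k) for f in range(n_fields) if f != FIELD[c]}
     for c in FIELD}

def ffm_pairwise(V):
    """Sum of field-aware pairwise interactions (Eq. 4 without bias/weights)."""
    cats = list(FIELD)
    s = 0.0
    for i, a in enumerate(cats):
        for b in cats[i + 1:]:
            # a uses its vector for b's field, and vice versa.
            s += V[a][FIELD[b]] @ V[b][FIELD[a]]
    return s

# The gradient w.r.t. V["Tuesday"][FIELD["Male"]] is V["Male"][FIELD["Tuesday"]]
# alone: the GENDER-facing and CITY-facing updates no longer interfere.
```

Unlike the FM case, perturbing the GENDER-facing vector of "Tuesday" leaves the CITY interaction untouched.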
FFM has achieved great success in various data science contests. However, it has a memory bottleneck, because its latent vectors require $O(Nnk)$ parameters (FM requires $O(Nk)$), where $N$, the total number of categories, is usually at million to billion scales in practice. When $N$ is large, $k$ must be small enough to fit FFM in memory, which severely restricts the expressive ability of the latent vectors. This problem is also discussed in Section 5.3 through visualization. To tackle the problems of FM and FFM, we propose kernel product.
3.1.2. Kernel Product
Matrix factorization (MF) learns low-rank representations of a matrix. A matrix $Y$ can be represented by the product of two low-rank matrices $U$ and $V$, i.e., $Y \approx U^\top V$. MF estimates each observed element $y_{i,j}$ with the inner product of two latent vectors $u_i$ and $v_j$. The target of MF is to find optimal latent vectors which minimize the empirical error

(5)  $\hat{y}_{i,j} = \langle u_i, v_j \rangle$
(6)  $\min_{U, V} \sum_{i,j} \mathcal{L}(y_{i,j}, \hat{y}_{i,j})$

where $\langle \cdot, \cdot \rangle$ is the inner product of two vectors, and $\mathcal{L}$ represents the loss function (e.g., root mean square error, log loss).
MF has another form, $Y \approx (PU)^\top (PV)$, where $P$ can be regarded as a projection matrix. $PU$ and $PV$ factorize $Y$ in the projected space like $U$ and $V$ do in the original space. We define the inner product in a projected space, namely the kernel product, $\langle u, v \rangle_\phi = u^\top \phi v$ (e.g., $\phi = P^\top P$ recovers $\langle Pu, Pv \rangle$); then we can extend MF

(7)  $\hat{y}_{i,j} = \langle u_i, v_j \rangle_\phi = u_i^\top \phi v_j$
(8)  $\min_{U, V, \phi \in \Phi} \sum_{i,j} \mathcal{L}(y_{i,j}, \hat{y}_{i,j})$

where $\phi$ is a projection matrix, namely a kernel, and $\Phi$ is the parameter space. The vector inner product can be regarded as a special case of the kernel product when $\phi = I$, and likewise MF can be regarded as a special case of kernel MF. The kernel product also generalizes the vector outer product: the convolution sum of an outer product is equivalent to a kernel product

(9)  $\sum_{p, q} \phi_{p,q} (u v^\top)_{p,q} = u^\top \phi v$

It is worth noting that the outer product form requires roughly twice as many multiplication operations as the kernel product form. Therefore, the kernel product generalizes both the vector inner product and the outer product. In Eq. (8), the kernel matrix is optimized in a parameter space; yet a kernel matrix maps two vectors to a real value, and from this point of view a kernel is equivalent to a function. We can define kernel products in parameter or function spaces to adjust to different problems. In this paper, we study (i) the linear kernel, and (ii) the micro network kernel.
In practice, field size (the number of categories in a field) varies dramatically (e.g., GENDER = 2, CITY = 1000 in Table 1). The field size reflects the amount of information contained in one field. It is natural to represent a large (respectively, small) field in a large (respectively, small) latent space, and we call this adaptive embedding. In (Covington et al., 2016), an adaptive embedding size is decided by the logarithm of the field size. The challenge is how to combine adaptive embeddings with MF, since the inner product requires the latent vectors to have the same length. The kernel product can solve this problem

(10)  $\hat{y}_{i,j} = \langle u_i, v_j \rangle_\phi = u_i^\top \phi v_j$

where $u_i \in \mathbb{R}^{k_1}$, $v_j \in \mathbb{R}^{k_2}$, and $\phi \in \mathbb{R}^{k_1 \times k_2}$.
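A rectangular kernel bridges embeddings of unequal length (a sketch; the embedding sizes below are hypothetical, not values used in the paper's experiments):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical adaptive embedding sizes: a small field gets a short vector.
k_gender, k_city = 2, 8
u = rng.normal(size=k_gender)              # e.g., a GENDER embedding
v = rng.normal(size=k_city)                # e.g., a CITY embedding
phi = rng.normal(size=(k_gender, k_city))  # rectangular kernel bridges the sizes

interaction = u @ phi @ v  # a scalar interaction despite unequal lengths
```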
3.1.3. Field-aware Feature Interactions
The coupled gradient issue of latent vector-based models can be solved by field-aware feature interactions. FFM learns field-aware feature interactions with field-aware latent vectors. However, the space complexity of FFM is $O(Nnk)$, which restricts FFM from using large latent vectors. A relaxation of FFM is projecting latent vectors into different kernel spaces. Corresponding to the $n(n-1)/2$ inter-field interactions, the kernels require $O(n^2 k^2)$ extra space. Since $n^2 k^2 \ll Nk$ in practice, the total space complexity is still $O(Nk)$. In this paper, we extend FM with (i) the linear kernel, and (ii) the micro network kernel.
Kernel FM (KFM):
(11)  $\hat{y} = \sigma\big(b + \sum_{i=1}^{n} w_{x_i} + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{x_i}, v_{x_j} \rangle_{\phi_{i,j}}\big)$

where $\phi_{i,j} \in \mathbb{R}^{k \times k}$ is the kernel matrix of field $i$ and field $j$.
Network in FM (NIFM):
(12)  $\hat{y} = \sigma\big(b + \sum_{i=1}^{n} w_{x_i} + \sum_{i=1}^{n} \sum_{j=i+1}^{n} f_{i,j}(v_{x_i}, v_{x_j})\big)$
(13)  $f_{i,j}(v_{x_i}, v_{x_j}) = w_{i,j}^\top \mathrm{relu}\big(W_{i,j} [v_{x_i}; v_{x_j}] + b_{i,j}\big)$

where $f_{i,j}$ denotes a micro network taking latent vectors as input and producing feature interactions with nonlinearity. In Eq. (13), the micro network output has no bias term because it is redundant with respect to the global bias $b$. This model is inspired by the net-in-net architecture (Lin et al., 2013; Szegedy et al., 2015). With flexible micro networks, we can control the complexity and take advantage of numerous deep learning techniques.
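A field-aware micro network kernel can be sketched as follows (a minimal sketch of Eq. (12)-(13); the hidden size, initialization scale, and single-hidden-layer structure are illustrative choices, not the exact architecture of Fig. 5(d)):

```python
import numpy as np

rng = np.random.default_rng(4)
k, hidden = 4, 8

def make_micro_net(k, hidden, rng):
    """One micro network per field pair: relu hidden layer, bias-free scalar
    output (the global bias makes an output bias redundant)."""
    W1 = rng.normal(size=(2 * k, hidden)) * 0.1
    b1 = np.zeros(hidden)
    w2 = rng.normal(size=hidden) * 0.1
    def net(vi, vj):
        h = np.maximum(np.concatenate([vi, vj]) @ W1 + b1, 0.0)  # relu
        return h @ w2                                            # scalar, no bias
    return net

# Field-aware: an independent micro network for each inter-field pair.
n = 3
nets = {(i, j): make_micro_net(k, hidden, rng)
        for i in range(n) for j in range(i + 1, n)}
v = [rng.normal(size=k) for _ in range(n)]
interactions = {pair: net(v[pair[0]], v[pair[1]]) for pair, net in nets.items()}
```

Each pair of fields gets its own nonlinear interaction function, which is the field-aware property the linear kernel shares.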
Recall the first example in Table 1. Suppose GENDER is independent of WEEKDAY; we can have the interaction of $v_{Tue}$ and $v_{Male}$ approach 0 if (i) the kernel projects the two latent vectors into orthogonal directions, or (ii) the micro network output $f_{i,j}(v_{Tue}, v_{Male})$ approaches 0. Thus, the gradients of the latent vectors are decoupled. Comparing KFM with FM/FFM: (i) KFM bridges FM with FFM. (ii) The capacity of KFM is between FM and FFM, because KFM shares kernels among certain types of inter-field interactions. (iii) KFM reparametrizes FFM, trading time for space. Comparing KFM/NIFM with AFM: AFM uses a universal attention network which is field-sharing, while KFM/NIFM use multiple kernels which are field-aware. If we share one kernel among all inter-field interactions, it becomes an attention network; thus KFM/NIFM generalize AFM. Comparing the kernel product with CNNs: their kernels are both used to extract feature representations. Besides, the kernel product shares projection matrices/functions among interactions, while a CNN shares kernels in space.
3.2. Training Feature Interactions with Trees or DNNs is Difficult
In the previous section, we proposed the kernel product to solve the coupled gradient issue in latent vector-based models. On the other hand, trees and DNNs can approximate very complex functions. In this section, we analyze the difficulties of trees and DNNs in learning feature interactions.
3.2.1. A Sparsity Issue of Trees
Growing a decision tree performs a greedy search among categories and splitting points (Quinlan, 1996). Tree models encounter a sparsity issue in multi-field categorical data. Here is an example: a tree with depth 10 has at most $2^{10}$ leaf nodes ($\approx 10^3$), and a tree with depth 20 has at most $2^{20}$ leaf nodes ($\approx 10^6$). Suppose we have a dataset with 10 categorical fields, each field containing 100 categories, so the input dimension is 1000 after one-hot encoding. This dataset has $10^3$ categories, $\binom{10}{2} \cdot 100^2 = 4.5 \times 10^5$ 2nd-order feature combinations, and $100^{10} = 10^{20}$
full-order feature combinations. From this example, we can see that even a very deep tree model can only explore a small fraction of all possible feature combinations on such a small dataset. Therefore, the modeling capability of tree models, e.g., Decision Tree, Random Forest and Gradient Boosting Decision Tree (GBDT)
(Chen and Guestrin, 2016), is restricted in multi-field categorical settings.
3.2.2. An Insensitive Gradient Issue of DNNs
Gradient-based DNNs refer to the DNNs trained with gradient descent and backpropagation. Despite the universal approximation property (Hornik et al., 1989), there is no guarantee that a DNN naturally converges to any expected function using gradient-based optimization. In user response prediction, the target function can be represented by a set of rules, e.g., "Male and London and Tuesday implies True". Here we show the periodic property of the target function via a parity check. Recall the examples in Table 1; a feasible classifier accepts an input if and only if it matches an odd number of the conditions of a checking rule, and the set of all such rules induces the hypothesis space. The input (Tuesday, Male, London) is accepted by this predictor because 3 (which is odd) conditions are matched, and (Tuesday, Female, Hong Kong) is also accepted because 1 (which is also odd) condition is matched. In contrast, (Tuesday, Male, Tokyo) and (Monday, Female, New York) are rejected since 2 and 0 (which are even) conditions are matched. Two examples are shown in Fig. 4. From this example, we observe that, in multi-field categorical data: (i) Basic rules can be represented by feature combinations, and several basic rules induce a parity check. (ii) The periodic property reveals that a feature set giving positive results implies neither its superset nor its subset being positive. (iii) Any target function should also reflect the periodic property.
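The parity behavior described above can be checked against the four rows of Table 1 with a few lines of code (a sketch; the rule is the one stated in the text):

```python
def matched(instance, rule):
    """Count how many conditions of the rule an instance satisfies."""
    return sum(1 for field, cat in rule.items() if instance.get(field) == cat)

def parity_classifier(instance, rule):
    """Accept iff an odd number of rule conditions are matched."""
    return matched(instance, rule) % 2 == 1

rule = {"WEEKDAY": "Tuesday", "GENDER": "Male", "CITY": "London"}
rows = [  # the four examples of Table 1 with their targets
    (1, {"WEEKDAY": "Tuesday", "GENDER": "Male", "CITY": "London"}),     # 3 matches
    (0, {"WEEKDAY": "Monday", "GENDER": "Female", "CITY": "New York"}),  # 0 matches
    (1, {"WEEKDAY": "Tuesday", "GENDER": "Female", "CITY": "Hong Kong"}),# 1 match
    (0, {"WEEKDAY": "Tuesday", "GENDER": "Male", "CITY": "Tokyo"}),      # 2 matches
]
```

The classifier reproduces every target in Table 1, even though adding or removing a matched condition flips the output.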
A recent work (Shalev-Shwartz et al., 2017) proves an insensitive gradient issue of DNNs: (i) If the target is a large collection of uncorrelated functions, the variance (sensitivity) of the DNN gradient w.r.t. the target function decreases linearly with the size of the collection. (ii) When the variance decreases, the gradient at any point will be extremely concentrated around a fixed point independent of the target function. (iii) When the gradient is independent of the target, it contains little useful information to optimize the DNN, thus gradient-based optimization fails to make progress. The authors in (Shalev-Shwartz et al., 2017) use the variance of the gradient w.r.t. the hypothesis space to measure the useful information in the gradient, and explain the failure of gradient-based deep learning with an example of parity check. Considering the large hypothesis space of DNNs, we conclude that learning feature interactions with gradient-based DNNs is difficult. In other words, DNNs can hardly learn feature interactions implicitly or automatically. We conduct a synthetic experiment to support this idea in Section 5.4, and we propose product layers to help DNNs tackle this problem.
3.3. Productbased Neural Networks
3.3.1. DNNbased Models
For consistency, we introduce DNN-based models according to 3 components: the embedding module, the interaction module, and the DNN classifier

(14)  $\hat{y} = \mathrm{DNN}\big(\mathrm{interact}(\mathrm{embed}(x))\big)$
AFM (Xiao et al., 2017) has already been discussed in Section 3.1. AFM uses an attention network to improve FM, yet its prediction is simply based on the sum of interactions,

(15)  $\hat{y} = \sigma\big(b + \sum_{i=1}^{n} w_{x_i} + \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_{ij} \langle v_{x_i}, v_{x_j} \rangle\big)$
The FM-supported Neural Network (FNN) (Zhang et al., 2016) (Fig. 3(a)) is formulated as

(16)  $\hat{y} = \sigma\big(\mathrm{DNN}([v_{x_1}; \dots; v_{x_n}])\big)$
where the embedding layer is initialized from a pretrained FM model. Compared with shallow models, FNN has a powerful DNN classifier, which brings it significant improvement. However, without the interaction module, FNN may fail to learn the expected feature interactions automatically.
Similarly, the Convolutional Click Prediction Model (CCPM) (Liu et al., 2015) (Fig. 3(b)) is formulated as

(17)  $\hat{y} = \sigma\big(\mathrm{DNN}(\mathrm{conv}([v_{x_1}; \dots; v_{x_n}]))\big)$

where $\mathrm{conv}$ denotes the convolutional layers which are expected to explore local-global dependencies. CCPM only performs convolutions on the neighbor fields in a certain alignment, but fails to model the full convolutions among non-neighbor fields.
DeepFM (Guo et al., 2017) (Fig. 3(c)) improves Wide & Deep Learning (WDL) (Cheng et al., 2016). It replaces the wide model with FM and gets rid of the need for feature engineering expertise.

(18)  $\hat{y} = \sigma\big(\mathrm{FM}(x) + \mathrm{DNN}([v_{x_1}; \dots; v_{x_n}])\big)$

where $\mathrm{FM}(x)$ denotes the FM part before the sigmoid. Note that the embedding vectors of the FM part and the deep part are shared in DeepFM. WDL and DeepFM follow a joint training scheme. In other words, other single models can also be integrated into a mixture model, yet the joint training scheme is beyond our discussion.
FNN has a linear feature extractor (without pretraining) and a DNN classifier. CCPM additionally explores local-global dependencies with convolutional layers, but the exploration is limited to neighbor fields. DeepFM has a bilinear feature extractor, yet the bilinear feature representations are not fed to its DNN classifier. The insensitive gradient issue of DNN-based models has already been discussed in Section 3.2. To solve this issue, we propose the Product-based Neural Network (PNN). The general architecture of PNN is illustrated in Fig. 5(a).
(19)  $v_i = \mathrm{embed}(x_i), \quad i = 1, \dots, n$
(20)  $p_{i,j} = \mathrm{product}(v_i, v_j), \quad 1 \le i < j \le n$
(21)  $\hat{y} = \sigma\big(\mathrm{DNN}([v_1; \dots; v_n; p_{1,2}; \dots; p_{n-1,n}])\big)$

The embedding layer (19) is field-wisely connected. This layer looks up the corresponding embedding vector $v_i$ for each field $x_i$, and produces dense representations of the sparse input. The product layer (20) uses multiple product operators to explore the feature interactions $\{p_{i,j}\}$. The DNN classifier (21) takes both the embeddings and the interactions as input, and conducts the final prediction $\hat{y}$.
Using FM, KFM, and NIFM as feature extractors, we develop 3 PNN models: the Inner Product-based Neural Network (IPNN), the Kernel Product-based Neural Network (KPNN), and the Product-network In Network (PIN), respectively. We decompose all discussed parametric models in Table 2. A component-level comparison is in Table 3; e.g., FM extended with the kernel product yields KFM.
One may be concerned about the complexity, initialization, or training of PNNs. As for the complexity, there are $n(n-1)/2$ interactions, yet the complexity may not be a bottleneck: (i) In practice, $n$ is usually a small number; in our experiments, the datasets involved contain at most 39 fields. (ii) The computation of the interactions is independent and can be sped up via parallelization. PNN concatenates embeddings and interactions as the DNN input. The embeddings and the interactions follow different distributions, which may cause problems in initialization and training. One solution to this potential risk is careful initialization and normalization. These practical issues are discussed in Section 4, and the corresponding experiments are in Section 5.2.





Table 2. Decomposition of the discussed models into feature extractors and classifiers, covering LR, FM, DeepFM, IPNN, KPNN, and PIN.
Table 3. Component-level comparison of the models.

Component   FM   KFM  NIFM  FNN  KPNN  PIN
embedding   yes  yes  yes   yes  yes   yes
kernel      -    yes  -     -    yes   -
net-in-net  -    -    yes   -    -     yes
DNN         -    -    -     yes  yes   yes
3.3.2. Inner Productbased Neural Network
IPNN uses FM as the feature extractor, where the feature interactions are defined as the inner products of the embeddings, as illustrated in Fig. 5(b). The embeddings $\{v_i\}$ and the interactions $\{p_{i,j}\}$ are flattened and fully connected with the successive hidden layer

(22)  $p_{i,j} = \langle v_i, v_j \rangle$
(23)  $\hat{y} = \sigma\big(\mathrm{DNN}([v_1; \dots; v_n; p_{1,2}; \dots; p_{n-1,n}])\big)$
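The product-layer input of Eq. (22)-(23) can be sketched as follows (the field count and embedding size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 3, 4
V = rng.normal(size=(n, k))  # embeddings of one instance's n fields

def ipnn_input(V):
    """Concatenate the flattened embeddings with all pairwise inner products
    (the product layer), yielding the DNN classifier's input."""
    n = V.shape[0]
    products = [V[i] @ V[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([V.reshape(-1), np.array(products)])

z = ipnn_input(V)  # length n*k + n*(n-1)/2, fed to the DNN classifier
```

Note the interactions are concatenated, not summed: each $p_{i,j}$ remains a separate input coordinate for the classifier.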
Comparing IPNN with Neural FM (Section 2.2), their inputs to the DNN classifiers are quite different: (i) In Neural FM, the interactions are summed up and passed to the DNN. (ii) In IPNN, the interactions are concatenated and passed to the DNN. Since AFM has no DNN classifier, it is compared with KFM/NIFM in Section 3.1. Comparing IPNN with FNN: (i) FNN explores feature interactions through pretraining. (ii) IPNN explores feature interactions through the product layer. Comparing IPNN with DeepFM: (i) DeepFM adds the feature interactions to the model output. (ii) IPNN feeds the feature interactions to the DNN classifier.
3.3.3. Kernel Productbased Neural Network
KPNN utilizes KFM as the feature extractor, where the feature interactions are defined as the kernel products of the embeddings, as illustrated in Fig. 5(c). Since the kernel product generalizes the outer product, we use the kernel product as a general form. The formulation of KPNN is similar to IPNN, except that

(24)  $p_{i,j} = \langle v_i, v_j \rangle_{\phi_{i,j}} = v_i^\top \phi_{i,j} v_j$
(25)  $\hat{y} = \sigma\big(\mathrm{DNN}([v_1; \dots; v_n; p_{1,2}; \dots; p_{n-1,n}])\big)$
3.3.4. Product-network In Network
A micro network is illustrated in Fig. 5(d)[5]. In PIN, the computation of several sub-network[6] forward passes is merged into a single tensor multiplication

(26)  $\tilde{v}_{i,j} = [v_i; v_j; v_i \odot v_j]$
(27)  $p_{i,j} = \mathrm{subnet}_{i,j}(\tilde{v}_{i,j})$

where $\tilde{v}_{i,j} \in \mathbb{R}^{d}$ is the input to sub-network $(i, j)$, and $d$ is the input size. $\tilde{V}$ concatenates all $\tilde{v}_{i,j}$ together, $\tilde{V} \in \mathbb{R}^{m \times d}$, where $m = n(n-1)/2$ is the number of sub-networks. The weights of the sub-networks are also concatenated into a weight tensor $W \in \mathbb{R}^{m \times d \times s}$, where $s$ is the output size of a sub-network. Applying tensor multiplication on the dimension $d$ and an element-wise nonlinearity $\mathrm{relu}$, we get $H = \mathrm{relu}(\tilde{V} \otimes W) \in \mathbb{R}^{m \times s}$.

[5] We test several subnet architectures and the structure in Fig. 5(d) is a relatively good choice.
[6] In this paper, we use micro network and sub-network interchangeably.
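The fused forward pass can be sketched with a single einsum (a sketch; the sizes are illustrative, and the sub-net input layout $[v_i; v_j; v_i \odot v_j]$ is the assumption stated above):

```python
import numpy as np

rng = np.random.default_rng(6)
n, k, s = 4, 8, 5
m = n * (n - 1) // 2   # number of sub-networks, one per field pair
d = 3 * k              # sub-net input size for [v_i; v_j; v_i * v_j]

V_tilde = rng.normal(size=(m, d))       # all sub-network inputs, stacked
W = rng.normal(size=(m, d, s)) * 0.1    # all sub-network weights, stacked

# One einsum replaces m separate matrix-vector products.
H = np.maximum(np.einsum("md,mds->ms", V_tilde, W), 0.0)  # relu

# Equivalent per-sub-network loop (what the fused form merges):
H_loop = np.maximum(np.stack([V_tilde[i] @ W[i] for i in range(m)]), 0.0)
```

Batching the $m$ small matrix-vector products into one tensor contraction is what makes the many micro networks cheap on modern hardware.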
Layer normalization (LN) (Ba et al., 2016) was proposed to stabilize the activation distributions of DNNs. In PIN, we use fused LN on the sub-networks to stabilize training. For each data instance, LN collects statistics from different neurons and normalizes the output of these neurons. However, the sub-networks are too small to provide stable statistics. Fused LN instead collects statistics from all sub-networks, and thus is more stable than LN. More detailed discussions are in Section 4.4, and the corresponding experiments are in Section 5.2.
(28)  $\mathrm{LN}(h_{i,j}) = \frac{h_{i,j} - \mu_{i,j}}{\sigma_{i,j}} \odot g_{i,j} + b_{i,j}$
(29)  $\mathrm{FLN}(h_{i,j}) = \frac{h_{i,j} - \mu}{\sigma} \odot g_{i,j} + b_{i,j}$

where $\mu_{i,j}, \sigma_{i,j}$ are the mean and standard deviation of the neurons within sub-network $(i, j)$, while $\mu, \sigma$ in fused LN are collected from all sub-networks.
Replacing the inner product with fused sub-networks, the PIN model is presented as follows,

(30)  $v_i = \mathrm{embed}(x_i)$
(31)  $\tilde{v}_{i,j} = [v_i; v_j; v_i \odot v_j]$
(32)  $p_{i,j} = \mathrm{subnet}_{i,j}(\tilde{v}_{i,j})$
(33)  $\hat{y} = \sigma\big(\mathrm{DNN}([p_{1,2}; \dots; p_{n-1,n}])\big)$

where $\odot$ denotes the element-wise product instead of the convolution sum. To stabilize the micro network outputs, LN can be inserted into the hidden layers of the micro networks.
Compared with NIFM, the subnetworks of PIN are slightly different: (i) each subnetwork takes an additional product term v_i \odot v_j as input; (ii) the subnetwork bias is necessary because there is no global bias as in NIFM; (iii) the subnetwork output is a scalar in NIFM, while it can be a vector in PIN. Compared with other PNNs, the embedding vectors are no longer fed to the DNN classifier, because the embedding-DNN connections are redundant. The embedding-DNN connections:
(34)  w^\top [v_1; \ldots; v_n] = \sum_{i=1}^{n} w_i^\top v_i
In terms of the embedding-subnet connections, the input contains several concatenated embedding pairs [v_i; v_j], and each pair is passed through some micro network. For simplicity, we regard the micro networks as linear transformations, thus the weight matrix can be represented by \mu = [\mu_i; \mu_j], where the input dimension 2k is twice the embedding size, and the output dimension is determined by the following classifier.
(35)  \mu^\top [v_i; v_j] = \mu_i^\top v_i + \mu_j^\top v_j
From these two formulas, we find the embedding-DNN connections can be recovered from the embedding-subnet connections: choosing \mu_i = w_i and \mu_j = w_j in Eq. (35) reproduces the corresponding terms of Eq. (34).
4. Practical Issues
This section discusses several practical issues, some of which are mentioned in Section 3.3 (e.g., initialization), while the others are related to applications (e.g., data processing). The corresponding experiments are located in Section 5.2.
4.1. Data Processing
The data in user response prediction is usually categorical, and sometimes numerical or multi-valued. When the model input contains both one-hot vectors and real values, the model is hard to train: (i) one-hot vectors are usually sparse and high-dimensional, while real values are dense and low-dimensional, so they have quite different distributions; (ii) unlike real values, one-hot vectors are not numerically comparable. For these reasons, categorical and numerical data should be processed differently.
Numerical data is usually processed by bucketing/clustering: first, a numerical field is partitioned by a series of thresholds according to its histogram distribution; then a real value is assigned to a bucket; finally, the bucket identifier is used to replace the original value. For instance, an age field may be partitioned by the thresholds {18, 30, 50}, so that the value 25 falls into the second bucket and is replaced by that bucket's identifier. Another solution is trees (He et al., 2014). Since decision trees split data examples into several groups, each group can be regarded as a category.
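A minimal sketch of the bucketing step (the field, thresholds, and values are hypothetical; in practice the thresholds come from the field's histogram):

```python
import numpy as np

# Hypothetical numerical field: user ages.
ages = np.array([3, 17, 25, 33, 41, 70])

# Thresholds assumed for illustration (normally chosen from the histogram).
thresholds = [10, 20, 30, 50]

# np.digitize maps each value to the index of its bucket;
# the bucket id then replaces the original value as a category.
bucket_ids = np.digitize(ages, thresholds)
print(bucket_ids.tolist())  # [0, 1, 2, 3, 3, 4]
```

Each bucket id is then one-hot encoded like any other categorical value.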
Different from numerical and categorical data, set data takes multiple values in one data instance, e.g., an item may have multiple tags. The permutation invariance of set data has been studied in (Zaheer et al., 2017). In this paper, the set data embeddings are averaged before being fed to the DNN.
4.2. Initialization
Initializing parameters with small random numbers is widely adopted, e.g., a uniform or normal distribution with zero mean and a small variance. For a hidden layer in a DNN, an empirical choice for the standard deviation is 1/\sqrt{n}, where n is the input size of that layer. Xavier initialization (Glorot and Bengio, 2010) takes both the forward and backward passes into consideration: taking a uniform distribution as an example, the bound should be \sqrt{6} / \sqrt{n_{in} + n_{out}}, where n_{in} and n_{out} are the input and output sizes of a hidden layer. This setting stabilizes the activations and gradients of a hidden layer at the early stage of training.

The above discussion is limited to (i) dense input and (ii) fully connected layers. Fig. 6 shows an embedding layer followed by a fully connected layer. An embedding layer has sparse input and is field-wisely connected, i.e., each field is connected with only a part of the neurons. Since an embedding layer usually has an extremely large input dimension, typical choices of the standard deviation include a small constant c, 1/\sqrt{N}, and 1/\sqrt{nk}, as well as pre-training, where c is a constant, N is the input dimension, n is the number of fields, and k is the embedding size. Pre-training is used in (Zhang et al., 2016; Xiao et al., 2017). For convenience, we use random initialization in most experiments for end-to-end training, and we compare random initialization with pre-training in Section 5.2.
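These two dense-layer choices can be sketched as follows (the layer sizes are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 400, 300

# Empirical choice: std = 1 / sqrt(n_in) for a dense hidden layer.
W_simple = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))

# Xavier/Glorot uniform: bound = sqrt(6 / (n_in + n_out)).
bound = np.sqrt(6.0 / (n_in + n_out))
W_xavier = rng.uniform(-bound, bound, size=(n_in, n_out))

# With unit-variance dense input, W_simple keeps activations near unit scale.
x = rng.normal(size=n_in)
print(float(np.std(x @ W_simple)))  # roughly 1
```

For the sparse embedding layer the effective fan-in per example is only n (one active category per field), which is why the smaller standard deviations above are used instead.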
4.3. Optimization
In this section, we discuss potential risks of adaptive optimization algorithms in the scenario of sparse input. Compared with SGD, adaptive optimization algorithms converge much faster, e.g., AdaGrad (Duchi et al., 2011), Adam (Kingma and Ba, 2014), FTRL (McMahan et al., 2013; Ta, 2015), among which Adam is an empirically good choice (Xu et al., 2015; Goodfellow et al., 2016).
Even though adaptive algorithms speed up training and sometimes escape from local minima in practice, there is no theoretical guarantee of better performance. Take Adam as an example,
(36)  m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t

(37)  v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

(38)  \hat{g}_t = \frac{m_t / (1 - \beta_1^t)}{\sqrt{v_t / (1 - \beta_2^t)} + \epsilon}

where m_t and v_t are estimations of the first and the second moments, g_t is the real gradient at training step t, \hat{g}_t is the estimated gradient at training step t, \beta_1 and \beta_2 are decay coefficients, and \epsilon is a small constant for numerical stability. Empirical values for \beta_1, \beta_2, and \epsilon are 0.9, 0.999, and 10^{-8}, respectively.
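A minimal sketch of Eqs. (36)-(38) (the function name is ours; the learning rate is omitted since only the estimated gradient matters in the discussion below):

```python
import math

def adam_estimated_gradient(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam's estimated gradients g_hat_t (Eqs. 36-38)."""
    m = v = 0.0
    out = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        out.append(m_hat / (math.sqrt(v_hat) + eps))
    return out

# A constant true gradient yields estimated gradients close to +-1,
# regardless of the gradient's magnitude.
print(adam_estimated_gradient([0.5, 0.5, 0.5])[-1])  # ~1.0
```

This magnitude-normalizing behavior is exactly what the following analysis probes on sparse, vanishing gradients.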
Fig. 7. The gradient magnitude of the logit decays exponentially at the early stage of training. Note: FNN_ReLU, FNN_Tanh, and FNN_ELU are FNNs with different activation functions. The x-axis is the number of mini-batches fed into a model (mini-batch size 2000); the y-axis is the absolute gradient of the logit.
Before our discussion, we should notice that the gradient of the logit z,

(39)  \partial l / \partial z = \sigma(z) - y

is bounded in (-1, 1). Fig. 7 shows |\partial l / \partial z| of typical models with SGD/Adam. From this figure, we find the logit gradient decays exponentially at the early stage of training. Because of the chain rule and backpropagation, the gradients of all parameters depend on \partial l / \partial z. This indicates that the gradients of all parameters decrease dramatically at the early stage of training.

The following discussion uses Adam to analyze the behaviors of adaptive optimization algorithms on (unbalanced) sparse datasets; the parameter sensitivity of Adam is studied in Section 5.2. Considering an input that appears for the first time at training step \tau, without loss of generality, we assume g_t = 0 for t < \tau.
4.3.1. Sensitive Gradient
Firstly, we discuss the instant behavior of \hat{g}_t at time \tau,

(40)  \hat{g}_\tau = \frac{k_1 g_\tau}{k_2 |g_\tau| + \epsilon}, \quad k_1 = \frac{1 - \beta_1}{1 - \beta_1^\tau}, \quad k_2 = \sqrt{\frac{1 - \beta_2}{1 - \beta_2^\tau}}

The estimated gradient mainly depends on g_\tau, \tau, and \epsilon, as shown in Fig. 8(a)-(c). At a certain training step \tau, the estimated gradient changes dramatically when g_\tau approaches some threshold; in other words, \hat{g}_\tau saturates across part of the value domain of g_\tau. From Eq. (40), we can find the threshold g^* where the change of \hat{g}_\tau is sharpest,

(41)  \frac{\partial \hat{g}_\tau}{\partial g_\tau} = \frac{k_1 \epsilon}{(k_2 |g_\tau| + \epsilon)^2}

(42)  g^* = \frac{\epsilon}{k_2} = \epsilon \sqrt{\frac{1 - \beta_2^\tau}{1 - \beta_2}}

Fig. 8(d)-(f) show the thresholds over the training steps. From these figures, we find g^* increases with \epsilon and \tau. As training goes on, more and more gradients will fall below the threshold and vanish. Thus, we conclude that \epsilon has a large impact on model convergence and training stability, and this impact is amplified if the dataset is unbalanced. Suppose the positive ratio of a dataset is r; near convergence, \sigma(z) \approx r, so the expected gradient magnitude is

(43)  E[|\partial l / \partial z|] = r(1 - \sigma(z)) + (1 - r)\sigma(z) \approx 2r(1 - r)

Thus the gradient magnitude is proportional to r when r \ll 1, which pushes even more gradients below the threshold g^*.
4.3.2. Longtailed Gradient
Secondly, we discuss the long-tailed behavior of \hat{g}_t in a time window after time \tau. If a sparse input appears only once, at training step \tau, then g_t = 0 for t > \tau, and m_t and v_t decay in the time window [\tau, \tau + T]. Assuming \tau \gg 1, so that the bias corrections are close to 1,

(44)  \hat{g}_{\tau + \Delta} = \frac{\beta_1^\Delta (1 - \beta_1) g_\tau}{\sqrt{\beta_2^\Delta (1 - \beta_2)} |g_\tau| + \epsilon}

Fig. 9 illustrates the gradient decay and the cumulative gradient in a window at \tau, respectively. From Fig. 9(a)-(c) we can see that a gradient larger than a threshold (different from g^*) is scaled up to a "constant", while a gradient smaller than that threshold shrinks towards a "constant". If a parameter is continuously updated with \hat{g}_t over the window, the cumulative effect is shown in Fig. 9(d)-(f). The long-tailed effect may result in training instability or parameter divergence on sparse input. A solution is sparse update, i.e., the estimated moments m_t and v_t are updated normally, but the estimated gradient is only applied to the parameters involved in the forward propagation.
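This window can be simulated in a few lines under the empirical \beta_1, \beta_2, \epsilon values, assuming a single occurrence with gradient g = 10^{-3} at step \tau and dropping the bias corrections (\tau \gg 1):

```python
import math

beta1, beta2, eps = 0.9, 0.999, 1e-8

def g_hat_after(delta, g=1e-3):
    """Estimated gradient delta steps after a single occurrence of gradient g
    (bias corrections omitted, i.e. assuming a large step tau)."""
    m = (beta1 ** delta) * (1 - beta1) * g
    v = (beta2 ** delta) * (1 - beta2) * g * g
    return m / (math.sqrt(v) + eps)

decay = [g_hat_after(d) for d in range(0, 200, 20)]

# At delta = 0 the small raw gradient 1e-3 is scaled up to ~3.2, and since
# m_t decays as beta1^delta while sqrt(v_t) only decays as beta2^(delta/2),
# the estimated gradient keeps shrinking over the whole window.
assert g_hat_after(0) > 1.0
assert all(a > b > 0 for a, b in zip(decay, decay[1:]))
```

The amplification at \Delta = 0 followed by a long, slowly decaying tail of non-zero updates is exactly the effect that sparse update suppresses for parameters absent from the current forward pass.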
4.4. Regularization
4.4.1. L2 Regularization
L2 regularization is usually used to control the overfitting of latent vectors, yet it may cause severe computation overhead when the dataset is extremely sparse. Denoting E \in R^{N \times k} as the embedding matrix, L2 regularization adds a term \lambda \|E\|_2^2 to the loss function. This term results in an extra gradient term 2\lambda E. For sparse input, L2 regularization is very expensive because it updates all the parameters, usually N \times k of them, in E.
An alternative to L2 regularization is sparse L2 regularization (Koren et al., 2009), i.e., we only penalize the parameters involved in the forward propagation rather than all of them. A simple implementation is to penalize the rows of E selected by the input x instead of the whole E. Since x is a binary input, its nonzero entries indicate the parameters involved in the forward propagation.
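A sketch of the difference, assuming a toy embedding matrix and hypothetical active rows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, lam = 1000, 8, 1e-4
E = rng.normal(0.0, 0.01, size=(N, k))   # embedding matrix

active = np.array([3, 42, 777])          # rows selected by the one-hot input x

# Full L2: the gradient 2 * lam * E touches all N * k parameters.
grad_full = 2 * lam * E

# Sparse L2: penalize only the embeddings selected by x.
grad_sparse = np.zeros_like(E)
grad_sparse[active] = 2 * lam * E[active]

assert np.allclose(grad_sparse[active], grad_full[active])
assert not grad_sparse[0].any()          # inactive rows are untouched
```

Per example, the sparse variant costs O(nk) for n active fields instead of O(Nk), which is what makes it affordable on extremely sparse data.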
4.4.2. Dropout
Dropout (Srivastava et al., 2014) is a technique developed for DNNs, which introduces noise during training and improves robustness. However, on sparse data, a small mini-batch tends to have a large bias, and dropout may amplify this bias. We conduct a simple experiment to show this problem, and the results are shown in Fig. 10.
We generate a categorical dataset from a discrete distribution p over N categories, where every sample has 10 values drawn without replacement. For a mini-batch size bs, we draw bs samples as a batch and use this batch to estimate the distribution, denoted q. The evaluation metric is the KL divergence KL(p \| q), and we use a small constant \epsilon for numerical stability. For each N and bs, we sample 100 batches, and the results are shown in Fig. 10(a), where the mean values are calculated and plotted as lines. From this figure, we conclude: (i) a mini-batch tends to have a large bias on sparse data; (ii) this bias increases with data sparsity and decreases with batch size.
Then we test dropout on q. Denoting the dropout rate as d, we randomly generate a mask of length 10 with a fraction d of the elements being 0 and the others being 1/(1-d) to simulate the dropout process. Every sample is weighted by a mask before estimating the distribution. The results are shown in Fig. 10(b). This simple experiment illustrates the noise sensitivity of a mini-batch on sparse data. Thus, we turn to normalization techniques when a dataset is extremely sparse.
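The mini-batch bias can be reproduced in a few lines. We assume a uniform ground-truth distribution for simplicity; the exact distribution, N, and \epsilon used in Fig. 10 are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 1000                       # number of categories
p = np.ones(N) / N             # assumed uniform ground-truth distribution
eps = 1e-8                     # smoothing constant for numerical stability

def batch_kl(bs, values_per_sample=10):
    """KL(p || q) where q is estimated from a batch of bs sparse samples."""
    counts = np.zeros(N)
    for _ in range(bs):
        ids = rng.choice(N, size=values_per_sample, replace=False)
        counts[ids] += 1
    q = counts / counts.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

small, large = batch_kl(10), batch_kl(1000)
assert small > large > 0       # larger batches estimate the distribution better
```

With bs = 10 most categories never appear in the batch, so q is badly biased; growing the batch shrinks the KL divergence, matching conclusion (ii) above.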
4.4.3. DNN Normalization
Normalization has been carefully studied recently to solve internal covariate shift (Ioffe and Szegedy, 2015) in DNNs, and it stabilizes activation distributions. Typical methods include batch normalization (BN) (Ioffe and Szegedy, 2015), weight normalization (WN) (Salimans and Kingma, 2016), layer normalization (LN) (Ba et al., 2016), and the self-normalizing network (SNN) (Klambauer et al., 2017). In general, BN, WN, and LN use certain statistics to normalize hidden layer activations, while SNN uses the activation function SELU, making activations naturally converge to the standard normal distribution.

Denote x as the input to a hidden layer with weight matrix W, x^{(i)} as the i-th instance of a mini-batch, and x_j as the j-th dimension of x. In the following, we discuss BN, WN, LN, and SELU in detail.
Batch Normalization
BN normalizes activations using statistics within a mini-batch,

(45)  BN(x_j) = \gamma_j \frac{x_j - \mu_{B,j}}{\sqrt{\sigma_{B,j}^2 + \epsilon}} + \beta_j

where \mu_{B,j} and \sigma_{B,j}^2 are the mean and variance of dimension j over the mini-batch, and \gamma_j and \beta_j scale and shift the normalized values. These parameters are learned along with the other parameters and restore the representation power of the network. Similar parameters are also used in WN and LN.
BN solves internal covariate shift (Ioffe and Szegedy, 2015) and accelerates DNN training. However, BN may fail when the input is sparse, because BN relies on the statistics of a mini-batch. As shown in Fig. 10(a), a mini-batch tends to have a large bias when the input data is sparse, yet a large mini-batch is not always practical due to computation resource limits such as GPU memory.
Weight Normalization
WN re-parametrizes the weight vector w, learning the direction and the scale of w separately,

(46)  w = \frac{g}{\|v\|} v
WN does not depend on the mini-batch, and thus can be applied to noise-sensitive models (Salimans and Kingma, 2016). However, WN is hardly feasible on high-dimensional data, because WN depends on the L2 norm of the parameters, which results in even higher complexity than L2 regularization. Thus, WN faces a similar complexity problem as L2 regularization when the input space is extremely sparse.
Layer Normalization
LN normalizes activations using statistics of different neurons within the same layer,

(47)  LN(x) = \gamma \odot \frac{x - \mu_L}{\sigma_L} + \beta

where \mu_L and \sigma_L are computed over all neurons of the layer for each data instance.
LN stabilizes the hidden state dynamics in recurrent networks (Ba et al., 2016). In our experiments, we apply LN on fully connected layers and inner/kernel product layers, and we apply fused LN in micro networks. Since LN does not work well in CNN (Ba et al., 2016), we exclude LN in CCPM.
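The contrast between per-subnetwork LN and the fused LN used in PIN can be sketched in numpy (sizes assumed; the learned gain \gamma and bias \beta are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

m, k = 50, 3                           # 50 micro networks, 3 outputs each
H = rng.normal(2.0, 5.0, size=(m, k))  # raw subnetwork outputs of one instance

# Per-subnetwork LN: statistics over only k = 3 neurons, hence noisy.
mu = H.mean(axis=1, keepdims=True)
sd = H.std(axis=1, keepdims=True)
H_ln = (H - mu) / (sd + 1e-6)

# Fused LN: statistics pooled over all m * k subnetwork outputs, hence stable.
H_fused = (H - H.mean()) / (H.std() + 1e-6)

assert abs(H_fused.mean()) < 1e-6 and abs(H_fused.std() - 1) < 1e-3
```

With only 3 values per subnetwork, the per-row estimates of mean and variance fluctuate wildly from instance to instance; pooling over all 150 outputs gives the stable statistics motivating fused LN.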
Selfnormalizing Network
SNN uses SELU as the activation function,

(48)  selu(x) = \lambda x if x > 0, and \lambda \alpha (e^x - 1) otherwise

with \lambda \approx 1.0507 and \alpha \approx 1.6733 (Klambauer et al., 2017).
Based on the Banach fixed-point theorem, activations propagated through many SELU layers converge to zero mean and unit variance. Besides, SELU is reported to bring significant improvements to feedforward neural networks on a large variety of tasks (Klambauer et al., 2017).

To summarize, we use (sparse) L2 regularization to penalize embedding vectors, and we use dropout, LN, and SELU to regularize DNNs. BN is not applied because of the mini-batch problem discussed above, and WN is not applied because of its high complexity. The corresponding experiments are in Section 5.2.
5. Experiments
In Section 5.1, we present the overall comparison. In Section 5.2, we discuss practical issues: complexity, initialization (Section 4.2), optimization (Section 4.3), and regularization (Section 4.4). In Section 5.3, we propose a visualization method to analyze feature interactions, corresponding to Section 3.1. Finally, in Section 5.4, we conduct a synthetic experiment to illustrate the deficiency of DNNs, corresponding to Section 3.2.
5.1. Offline and Online Evaluations
In this section, we conduct offline and online evaluations for a thorough comparison: (i) we compare KFM/NIFM with other latent vector-based models to verify the effectiveness of kernel product methods; (ii) we compare PNNs with other DNN-based models to verify the effectiveness of product layers; (iii) we also participate in the Criteo challenge and compare KFM with libFFM directly; (iv) we deploy PIN in a real recommender system.
5.1.1. Datasets
Criteo
Criteo (http://labs.criteo.com/downloads/download-terabyte-click-logs/) contains one month of click logs with billions of data examples. A small subset of Criteo was published in the Criteo Display Advertising Challenge, 2013, and FFM was the winning solution (Juan et al., 2016). We select day 6-12 for training and day 13 for evaluation. Because of the enormous data volume and serious label unbalance (only 3% of the samples are positive), we apply negative down-sampling to keep the positive ratio close to 50%. We convert the 13 numerical fields into categorical fields through bucketing (Section 4.1), and we map the categories appearing less than 20 times to a dummy category "other".
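The rare-category cutoff can be sketched as follows (toy values; the threshold 20 matches the one used above):

```python
from collections import Counter

# Hypothetical raw values of one categorical field.
values = ["a"] * 25 + ["b"] * 21 + ["c"] * 3 + ["d"] * 1

counts = Counter(values)
min_count = 20
# Categories appearing fewer than min_count times collapse into "other".
mapped = [v if counts[v] >= min_count else "other" for v in values]

vocab = sorted(set(mapped))
print(vocab)  # ['a', 'b', 'other']
```

This keeps the vocabulary (and hence the embedding matrix) small while still giving every raw value a valid category id.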
Avazu
Avazu (http://www.kaggle.com/c/avazu-ctr-prediction) was published in the Avazu Click-Through Rate Prediction contest, 2014, and FFM was the winning solution (Juan et al., 2016). We randomly split the public dataset into training and test sets at 4:1, and remove categories appearing less than 20 times to reduce dimensionality.
iPinYou
iPinYou (http://data.computational-advertising.org) was published in the iPinYou RTB Bidding Algorithm Competition, 2013. We only use the click data from seasons 2 and 3 because they share the same data schema. We follow the data processing of (Zhang et al., 2014b), and we remove the "user tags" field to prevent label leakage.
Huawei
Huawei (Guo et al., 2017) was collected from the game center of the Huawei App Store in 2016, containing app, user, and context features. We use the same training and test sets as (Guo et al., 2017), and the same hyperparameter settings to reproduce their results.
Table 4 shows the statistics of the 4 datasets (datasets: https://github.com/Atomu2014/Ads-RecSys-Datasets and https://github.com/Atomu2014/make-ipinyou-data).
Dataset  # fields  pos ratio
Criteo  39  0.5
Avazu  24  0.17
iPinYou  16  0.0007
Huawei  9  0.008
5.1.2. Compared Models
We compare 8 baseline models: LR (Lee et al., 2012), GBDT (Chen and Guestrin, 2016), FM (Rendle, 2010), FFM (Juan et al., 2016), FNN (Zhang et al., 2016), CCPM (Liu et al., 2015), AFM (Xiao et al., 2017), and DeepFM (Guo et al., 2017), all of which are discussed in Section 2 and Section 3. We use XGBoost (https://xgboost.readthedocs.io/en/latest/) and libFFM (https://github.com/guestwalk/libffm) as the GBDT and FFM implementations in our experiments. We implement all the other models (code: https://github.com/Atomu2014/product-nets-distributed) with TensorFlow (https://www.tensorflow.org/) and MXNet (https://mxnet.apache.org/). We also implement FFM with TensorFlow and MXNet to compare its training speed with the other models on GPU. In particular, our FFM implementation (Avazu log loss = 0.37805) has almost the same performance as libFFM (Avazu log loss = 0.37803).

5.1.3. Evaluation Metrics
The evaluation metrics are AUC and log loss. AUC is a widely used metric for binary classification because it is insensitive to the classification threshold and the positive ratio. If the prediction scores of all the positive samples are higher than those of the negative samples, the model achieves AUC = 1 (it separates positive and negative samples perfectly). The upper bound of AUC is 1, and larger is better. Log loss is another widely used metric for binary classification, measuring the distance between two distributions. The lower bound of log loss is 0, indicating the two distributions match perfectly, and a smaller value indicates better performance.
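The ranking view of AUC can be illustrated with a tiny pairwise implementation (hypothetical scores; O(n^2), for illustration only, with ties counted as 0.5):

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = [(1.0 if p > n else 0.5 if p == n else 0.0) for p in pos for n in neg]
    return sum(wins) / len(wins)

# Perfectly separated scores give AUC = 1, regardless of the score scale.
assert auc([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9]) == 1.0
```

Because only the relative order of the scores matters, AUC is unchanged by any monotone re-scaling of the predictions, which is why it is insensitive to the classification threshold and the positive ratio.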
5.1.4. Parameter Setting
Param  Criteo  Avazu  iPinYou  Huawei
FFM  k=4  k=4  k=4  k=4
NIFM  subnet=[40,1]  subnet=[80,1]  subnet=[40,1]  subnet=[20,1]

Note: bs=Batch Size, opt=Optimizer, lr=Learning Rate, l2=L2 Regularization on the Embedding Layer, k=Embedding Size, kernel=Convolution Kernel Size, net=DNN Structure, subnet=Micro Network, t=Softmax Temperature, l2_a=L2 Regularization on the Attention Network, h=Attention Network Hidden Size, drop=Dropout Rate, LN=Layer Normalization (T: True, F: False)
Table 5 shows key hyperparameters of the models.
For a fair comparison, on Criteo, Avazu, and iPinYou, we (i) fix the embedding size according to the best-performing FM (searched among {10, 20, 40, 80}), and (ii) fix the DNN structure according to the best-performing FNN (width searched in [100, 1000], depth searched in [1, 9]). In terms of initialization, we initialize the DNN hidden layers with Xavier initialization (Glorot and Bengio, 2010), and we initialize the embedding vectors from uniform distributions whose ranges are selected as discussed in Section 4.2. For Huawei, we follow the parameter settings of (Guo et al., 2017).
With these constraints, all latent vector-based models have the same embedding size, and all DNN-based models additionally share the same DNN classifier. Therefore, all these models have similar numbers of parameters and are evaluated with the same training effort. We also conduct a parameter study on 4 typical models, where grid search is performed.
5.1.5. Overall Performance
       Criteo          Avazu           iPinYou         Huawei
Model  AUC (%)  Log Loss  AUC (%)  Log Loss  AUC (%)  Log Loss  AUC (%)  Log Loss
LR  78.00  0.5631  76.76  0.3868  76.38  0.005691  86.40  0.02648 
GBDT  78.62  0.5560  77.53  0.3824  76.90  0.005578  86.45  0.02656 
FM  79.09  0.5500  77.93  0.3805  77.17  0.005595  86.78  0.02633 
FFM  79.80  0.5438  78.31  0.3781  76.18  0.005695  87.04  0.02626 
CCPM  79.55  0.5469  78.12  0.3800  77.53  0.005640  86.92  0.02633 
FNN  79.87  0.5428  78.30  0.3778  77.82  0.005573  86.83  0.02629 
AFM  79.13  0.5517  78.06  0.3794  77.71  0.005562  86.89  0.02649 
DeepFM  79.91  0.5423  78.36  0.3777  77.92  0.005588  87.15  0.02618 
KFM  79.85  0.5427  78.40  0.3775  76.90  0.005630  87.00  0.02624 
NIFM  79.80  0.5437  78.13  0.3788  77.07  0.005607  87.16  0.02620 
IPNN  80.13  0.5399  78.68  0.3757  78.17  0.005549  87.27  0.02617 
KPNN  80.17  0.5394  78.71  0.3756  78.21  0.005563  87.28  0.02617 
PIN  80.21  0.5390  78.72  0.3755  78.22  0.005547  87.30  0.02614 