1. Introduction
Normalization has become one of the most fundamental components in many deep neural networks for machine learning tasks, especially in Computer Vision (CV) and Natural Language Processing (NLP). However, very different kinds of normalization are used in CV and NLP. For example, Batch Normalization (BatchNorm or BN) (Ioffe and Szegedy, 2015) is widely adopted in CV, but it leads to significant performance degradation when naively used in NLP. Instead, Layer Normalization (LayerNorm or LN) (Ba et al., 2016) is the standard normalization method utilized in NLP. On the other side, deep neural networks have also been widely used in the CTR estimation field (McMahan et al., 2013; Zhou et al., 2018; He et al., 2014; Xiao et al., 2017; Qu et al., 2016; He and Chua, 2017; Zhang et al., 2016; Cheng et al., 2016; Guo et al., 2017; Lian et al., 2018; Beutel et al., 2018). Some deep learning based models have been introduced and achieved success, such as Wide & Deep (Cheng et al., 2016), DeepFM (Guo et al., 2017) and xDeepFM (Lian et al., 2018). Most DNN ranking models use feature embedding to represent information and shallow MLP layers to model high-order interactions in an implicit way. These two commonly used components play important roles in current state-of-the-art ranking systems.

Taking both sides into consideration, we care about the following questions: What is the effect of various normalization methods on deep neural network models for CTR estimation? Is there one method that outperforms the other normalization approaches, just like LayerNorm in NLP or BatchNorm in CV? If some normalization works, what is the reason?

Among most of the proposed deep neural network models, few utilize normalization approaches. Though some works such as DCN (Wang et al., 2017) and Neural Factorization Machine (NFM) (He and Chua, 2017) use BatchNorm in the MLP part of the structure, no work has thoroughly explored the effect of normalization on DNN ranking systems. In this paper, we conduct a systematic study of the effect of widely used normalization schemes by applying various normalization approaches to both the feature embedding and the MLP part of the DNN model. Experimental results show that the correct normalization helps the training of DNN models and boosts model performance by a large margin. We also simplify LayerNorm and propose a new and effective normalization method in this work. A normalization-enhanced DNN model named NormDNN is also proposed based on the above observations. Furthermore, we find that the variance of normalization mainly contributes to this positive effect. To the best of our knowledge, this is the first work to verify the importance of normalization in DNN ranking systems through a systematic study.
The contributions of our work are summarized as follows:

In this work, we propose a new normalization approach based on LayerNorm: variance-only LayerNorm (VO-LN). The experimental results show that the proposed normalization method has comparable performance with layer normalization and significantly enhances the DNN model's performance.

We apply various normalization approaches to the feature embedding part and the MLP part of the DNN model, including commonly used normalization methods and our proposed approach. Extensive experiments are conducted on three real-world datasets, and the results demonstrate that the correct normalization or normalization combination significantly enhances model performance. As far as we know, this is the first work to apply normalization to feature embedding and prove its effectiveness.

We propose the NormDNN model in this paper, a normalization-enhanced DNN adopting the following normalization strategy: variance-only LayerNorm or LayerNorm for numerical features, BatchNorm for categorical features, and variance-only LayerNorm for the MLP. NormDNN achieves significantly better performance than complex models such as xDeepFM. NormDNN is more applicable in many industrial applications because of its better performance and high computational efficiency compared with many state-of-the-art complex neural network models.

To prove the universal validity of normalization for neural network ranking models, we also apply several normalization approaches to the DeepFM and xDeepFM models. The experimental results imply that the correct normalization also boosts these models' performance by a large margin.

As for the reason why normalization works for DNN models in CTR estimation, we find that the variance term of normalization plays the main role and give an explanation in this paper.
The rest of this paper is organized as follows. Section 2 introduces related work. We introduce our proposed models in detail in Section 3. The experimental results on three real-world datasets are presented and discussed in Section 4. Section 5 concludes the paper.
2. Related Work
2.1. Normalization
Normalization techniques have been recognized as very effective components in deep learning. Many normalization approaches have been proposed, with the three most popular being BatchNorm (Ioffe and Szegedy, 2015), LayerNorm (Ba et al., 2016) and GroupNorm (Wu and He, 2018). Batch Normalization (BatchNorm or BN) (Ioffe and Szegedy, 2015) normalizes the features by the mean and variance computed within a mini-batch. This has been shown by many practices to ease optimization and enable very deep networks to converge. Another example is Layer Normalization (LayerNorm or LN) (Ba et al., 2016), which was proposed to ease the optimization of recurrent neural networks. Statistics of layer normalization are not computed across the N samples in a mini-batch but are estimated in a layer-wise manner for each sample independently. It is easy to extend LayerNorm to GroupNorm (GN) (Wu and He, 2018), where the normalization is performed across a partition of the features/channels with different pre-defined groups. Normalization methods have shown success in accelerating the training of deep networks. In general, BatchNorm (Ioffe and Szegedy, 2015) and GroupNorm (Wu and He, 2018) are widely adopted in CV, and LayerNorm (Ba et al., 2016) is the standard normalization scheme used in NLP.

Another line of research on normalization tries to understand why BatchNorm helps training in CV and LayerNorm helps training in NLP. For example, the original explanation was that BatchNorm reduces the so-called "Internal Covariate Shift" (Ioffe and Szegedy, 2015). However, this explanation was viewed as incorrect or incomplete, and the study of (Santurkar et al., 2018) argued that the underlying reason BatchNorm helps training is that it results in a smoother loss landscape. Shen et al. (Shen et al., 2020) perform a systematic study of NLP transformer models to understand why BatchNorm performs poorly, and find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training, which results in instability. Xu et al. (Xu et al., 2019) find that the derivatives of the mean and variance in the LayerNorm used in Transformers for NLP tasks are more important than forward normalization, by re-centering and re-scaling backward gradients. Furthermore, they find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases.
2.2. Neural Network based CTR Models
Many deep learning based CTR models have been proposed in recent years, and the key factor for most of these neural network based models is how to effectively model the feature interactions.

Factorization-Machine Supported Neural Networks (FNN) (Zhang et al., 2016) is a feed-forward neural network that uses FM to pre-train the embedding layer. Wide & Deep Learning (Cheng et al., 2016) jointly trains wide linear models and deep neural networks to combine the benefits of memorization and generalization for recommender systems. However, expert feature engineering is still needed on the input to the wide part of the Wide & Deep model. To alleviate manual efforts in feature engineering, DeepFM (Guo et al., 2017) replaces the wide part of the Wide & Deep model with FM and shares the feature embedding between the FM and deep components. While most DNN ranking models process high-order feature interactions in an implicit way, some works explicitly introduce high-order feature interactions via a sub-network. Deep & Cross Network (DCN) (Wang et al., 2017) efficiently captures feature interactions of bounded degree in an explicit fashion. Similarly, eXtreme Deep Factorization Machine (xDeepFM) (Lian et al., 2018) also models the low-order and high-order feature interactions in an explicit way by proposing a novel Compressed Interaction Network (CIN) part. AutoInt (Song et al., 2019) proposes a multi-head self-attentive neural network with residual connections to explicitly model the feature interactions in a low-dimensional space.
3. Our Work
In this section, we first describe the proposed variance-only LayerNorm. We conduct extensive experiments to verify the effectiveness of normalization in Section 4, and the details of how to apply normalization to the feature embedding and the MLP are introduced in this section. Finally, the reason why normalization works is discussed.
3.1. Variance-Only LayerNorm
First, we briefly review the formulation of LayerNorm. Let $\mathbf{x} = (x_1, x_2, \ldots, x_H)$ denote the vector representation of an input of size $H$ to a normalization layer. LayerNorm re-centers and re-scales the input as

(1) $\quad \mathbf{h} = \mathbf{g} \odot \dfrac{\mathbf{x} - \mu}{\sigma} + \mathbf{b}, \qquad \mu = \dfrac{1}{H}\sum_{i=1}^{H} x_i, \qquad \sigma = \sqrt{\dfrac{1}{H}\sum_{i=1}^{H}(x_i - \mu)^2}$

where $\mathbf{h}$ is the output of the LayerNorm layer, $\odot$ is an element-wise product operation, and $\mu$ and $\sigma$ are the mean and standard deviation of the input. Bias $\mathbf{b}$ and gain $\mathbf{g}$ are parameters with the same dimension $H$.

As we all know, LayerNorm has been widely used and proven effective in many NLP tasks. However, Xu et al. (Xu et al., 2019) point out that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Their experiments on four NLP datasets show that a simple version of LayerNorm without the bias and gain outperforms LayerNorm. However, their conclusion is drawn mainly from NLP tasks, and they primarily consider normalization in Transformer and Transformer-XL networks. We wonder whether the same conclusion can be drawn for DNN ranking models, and design a similar simpler version of LayerNorm by removing the bias and gain from LN as follows:
(2) $\quad \mathbf{h} = \dfrac{\mathbf{x} - \mu}{\sigma}$
We call this version of LayerNorm simple LayerNorm (SLN), as the original paper (Xu et al., 2019) named it. Our experimental results show that simple LayerNorm has comparable performance with LayerNorm, which implies that the bias and gain in LayerNorm bring neither a good nor a bad effect to DNN models in the CTR estimation field. Our conclusion is slightly different from that in the NLP field, because their experimental results (Xu et al., 2019) show an advantage for simple LayerNorm over the standard LayerNorm in several NLP tasks. We deem that this may come from the difference in network structure and research field.
According to our empirical observations, re-centering the input $\mathbf{x}$ in simple LayerNorm has little effect on the performance of DNN ranking models. So we propose variance-only LayerNorm (VO-LN) by further removing the mean from simple LayerNorm as follows:
(3) $\quad \mathbf{h} = \dfrac{\mathbf{x}}{\sigma}$
Though variance-only LayerNorm seems rather simple, our experimental results demonstrate that it has comparable or even better performance on several CTR datasets than standard LayerNorm.
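The three variants above differ only in which statistics and parameters they keep. As a minimal NumPy sketch (not the paper's TensorFlow implementation), they can be written as:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-8):
    """Standard LayerNorm, Eq. (1): re-center, re-scale, then apply gain and bias."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return gain * (x - mu) / (sigma + eps) + bias

def simple_layer_norm(x, eps=1e-8):
    """Simple LayerNorm (SLN), Eq. (2): LayerNorm without the bias and gain."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def variance_only_layer_norm(x, eps=1e-8):
    """Variance-only LayerNorm (VO-LN), Eq. (3): keep the re-scaling by sigma,
    drop the re-centering; sigma is still computed around the mean."""
    sigma = x.std(axis=-1, keepdims=True)
    return x / (sigma + eps)
```

The small `eps` term, a common numerical-stability convention we add here, guards against division by zero for constant inputs.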
3.2. NormDNN
Most DNN ranking models use feature embedding to represent information and shallow MLP layers to model high-order interactions in an implicit way. These two commonly used components play important roles in current state-of-the-art ranking systems. So we have three options when applying normalization: normalization only on the feature embedding, normalization only on the MLP part, or normalization on both the feature embedding and the MLP part.
We also find that different parts of the DNN model need different normalization methods, and propose the following unified normalization combination strategy: variance-only LayerNorm or LayerNorm for numerical features, BatchNorm for categorical features, and variance-only LayerNorm for the MLP. We call the normalization-enhanced DNN model with this unified normalization strategy "NormDNN" in this paper. NormDNN achieves significantly better performance than complex models such as xDeepFM. We will discuss this in Section 4.6.
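The unified strategy can be sketched as a single forward pass. The following NumPy sketch is illustrative only: the function names and shapes are our own, BatchNorm is shown in training mode without running statistics, and all gain/bias parameters are omitted.

```python
import numpy as np

def voln(x, eps=1e-8):
    # variance-only LayerNorm over the last axis
    return x / (x.std(axis=-1, keepdims=True) + eps)

def batch_norm(x, eps=1e-8):
    # training-mode BatchNorm: statistics over the batch axis (axis 0);
    # running averages for inference are omitted in this sketch
    return (x - x.mean(axis=0, keepdims=True)) / (x.std(axis=0, keepdims=True) + eps)

def norm_dnn_forward(num_emb, cat_emb, weights):
    """Forward pass with the unified strategy: VO-LN for numerical-feature
    embeddings, BatchNorm for categorical-feature embeddings, and VO-LN
    placed before ReLU in every MLP layer."""
    num_emb = voln(num_emb)                    # (batch, num_fields, dim)
    cat_emb = batch_norm(cat_emb)              # (batch, cat_fields, dim)
    h = np.concatenate([num_emb, cat_emb], axis=1)
    h = h.reshape(h.shape[0], -1)              # wide concatenated vector
    for W in weights:
        h = np.maximum(voln(h @ W), 0.0)       # normalization before ReLU
    return h
```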
3.2.1. Normalization on Feature Embedding
The input data of CTR tasks usually consists of sparse and dense features, and the sparse features are mostly of categorical type. Such features are encoded as one-hot vectors, which often lead to excessively high-dimensional feature spaces for large vocabularies. The common solution to this problem is to introduce an embedding layer.

An embedding layer is applied to the raw feature input to compress it to a low-dimensional, dense real-valued vector. The result of the embedding layer is a wide concatenated vector:
(4) $\quad \mathbf{e} = \mathrm{concat}(\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_f)$

where $f$ denotes the number of fields and $\mathbf{e}_i$ denotes the embedding of one field. Although the feature lengths of instances can vary, their embeddings are of the same length $f \times k$, where $k$ is the dimension of the field embedding.
As we all know, features in CTR tasks can usually be segregated into categorical features and numerical features. There are two widely used approaches to convert a numerical feature into an embedding. The first is to quantize each numerical feature into discrete buckets, with the feature then represented by its bucket ID; the bucket ID can be mapped to an embedding vector. The second method maps the feature field into an embedding vector as follows:

(5) $\quad \mathbf{e}_j = \mathbf{v}_j \cdot x_j$

where $\mathbf{v}_j$ is an embedding vector for field $j$, and $x_j$ is a scalar value. In our experiments, we adopt the second approach to convert numerical features into embeddings.
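As an illustration of Eqs. (4) and (5), the following NumPy sketch builds one categorical and one numerical field embedding and concatenates them; the table size, feature id and feature value are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                    # illustrative field-embedding size

# Categorical field: one-hot id -> row lookup in a (hypothetical) embedding table.
vocab_table = rng.normal(size=(100, dim))  # vocabulary of 100 feature ids
e_cat = vocab_table[17]                    # embedding of feature id 17

# Numerical field, Eq. (5): one learned vector per field, scaled by the value.
v_field = rng.normal(size=dim)             # embedding vector v_j for the field
x_value = 3.2                              # scalar feature value x_j
e_num = v_field * x_value

# Eq. (4): concatenate the per-field embeddings into one wide vector.
e = np.concatenate([e_cat, e_num])
```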
We apply normalization on the feature embedding field-wise as follows:

(6) $\quad \hat{\mathbf{e}}_i = \mathrm{Norm}(\mathbf{e}_i)$

where $\mathrm{Norm}$ can be BatchNorm, LayerNorm, GroupNorm, simple LayerNorm or variance-only LayerNorm. The bias and gain are shared by features in the same feature field if the normalization contains these parameters.
For the LayerNorm-based normalization approaches (LayerNorm, simple LayerNorm and variance-only LayerNorm), we regard each feature's embedding as a "layer" when computing the mean and variance of the normalization. For GroupNorm, the feature embedding is divided into several groups to compute the mean and variance. BatchNorm computes the statistics within a mini-batch.
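The field-wise statistics described above can be sketched as follows; `normalize_embeddings` is our own illustrative helper, with the per-field gain and bias omitted.

```python
import numpy as np

def normalize_embeddings(emb, method="voln", groups=2, eps=1e-8):
    """Field-wise normalization of a (batch, fields, dim) embedding tensor, Eq. (6).
    LayerNorm-style methods treat each field embedding as one 'layer'; GroupNorm
    splits each field embedding into groups; BatchNorm uses mini-batch statistics."""
    if method in ("ln", "sln", "voln"):
        sigma = emb.std(axis=-1, keepdims=True)
        if method == "voln":
            return emb / (sigma + eps)
        return (emb - emb.mean(axis=-1, keepdims=True)) / (sigma + eps)
    if method == "gn":
        b, f, d = emb.shape
        g = emb.reshape(b, f, groups, d // groups)
        g = (g - g.mean(axis=-1, keepdims=True)) / (g.std(axis=-1, keepdims=True) + eps)
        return g.reshape(b, f, d)
    if method == "bn":  # training-mode statistics only
        return (emb - emb.mean(axis=0, keepdims=True)) / (emb.std(axis=0, keepdims=True) + eps)
    raise ValueError(f"unknown method: {method}")
```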
In real-life applications, CTR tasks usually contain both categorical and numerical features. We find that the different kinds of feature need corresponding normalization methods, and we will discuss this in detail in Section 4.5.
3.2.2. Normalization on MLP Part
As for the feed-forward layers in the DNN model, normalization on the MLP follows the usual practice. That is, BatchNorm's mean and variance are computed within a mini-batch, and the statistics of LayerNorm-based normalizations are estimated in a layer-wise manner. For GroupNorm, we divide the neural units of the MLP into several groups, and the statistics are estimated group-wise.
Notice that there are two places to put the normalization operation in the MLP: before the non-linear operation or after it. For clarity, we use LayerNorm as an example. If we put normalization after the non-linear operation, we have:

(7) $\quad \mathbf{h} = \mathrm{LN}(\mathrm{ReLU}(\mathbf{x}\mathbf{W}))$

where $\mathbf{x} \in \mathbb{R}^{d}$ refers to the input of the feed-forward layer, $\mathbf{W} \in \mathbb{R}^{d \times t}$ are the parameters of the layer, and $d$ and $t$ respectively denote the size of the input layer and the number of neurons of the feed-forward layer.

If we put normalization before the non-linear operation, we have:

(8) $\quad \mathbf{h} = \mathrm{ReLU}(\mathrm{LN}(\mathbf{x}\mathbf{W}))$
We find that the performance of normalization placed before the non-linearity consistently outperforms that of normalization placed after the non-linear operation. So all normalization used in the MLP part is placed before the non-linear operation in our paper.
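The two placements can be contrasted in a short sketch, using LayerNorm without gain and bias as in the equations above:

```python
import numpy as np

def ln(x, eps=1e-8):
    # LayerNorm without gain/bias, statistics over the last axis
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def ff_post(x, W):
    # Eq. (7): normalization applied after the non-linearity
    return ln(np.maximum(x @ W, 0.0))

def ff_pre(x, W):
    # Eq. (8): normalization applied before the non-linearity; this variant
    # consistently performed better in our experiments
    return np.maximum(ln(x @ W), 0.0)
```

Note that in the pre-activation form the layer output stays non-negative (ReLU is last), while the post-activation form re-centers the ReLU output and can emit negative values.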
3.3. Understanding Why Normalization Works
In this section, we discuss the reason why normalization works in the DNN model. The related experimental results are presented in Section 4.2.
As mentioned in Section 3.1, we find that the three LayerNorm-based methods (LayerNorm, simple LayerNorm and variance-only LayerNorm) have comparable performance on the three real-life datasets, and all three normalization approaches work. From these observations, we can draw the following conclusions. Because simple LayerNorm just removes the bias and gain from LayerNorm, the similar model performance implies that the bias and gain have no effect on the final model performance. Furthermore, variance-only LayerNorm removes the mean from simple LayerNorm and still has comparable performance with LayerNorm and simple LayerNorm, which implies that the mean in simple LayerNorm does not contribute to the final better performance. So we can conclude that it is the variance term in variance-only LayerNorm that helps boost the model's performance. The reason LayerNorm and simple LayerNorm also work lies in the fact that they contain the variance.
To understand how the variance influences model performance, we analyze the change of statistics in both the embedding layer and the MLP after applying variance-only LayerNorm on the Criteo dataset (Figure 1). From Figure 1, we can see that the average mean and variance of the feature embedding and the MLP are very small positive numbers if we do not apply normalization on any part of the DNN model. If we use variance-only LayerNorm only on the feature embedding, the variance of the feature embedding greatly increases, and that change pushes the bit values of many feature embeddings to much larger negative numbers (Figure 1 and Figure 2). Through the network connections, these statistical changes of the feature embedding are transferred to the MLP layer, and the corresponding statistics of the MLP show a similar trend even though we did not apply any normalization to it (Figure 1). If we apply variance-only LayerNorm only on the MLP of the DNN model, we see similar changes: the average variance of the MLP neurons greatly increases, and that also pushes the outputs of many neurons to negative numbers (Figure 3). If we utilize variance-only LayerNorm on both the feature embedding and the MLP, we observe a similar trend.

As for the influence of the variance on the MLP, we can see that a large fraction of neuron outputs is pushed from small positive numbers into negative numbers after applying variance-only LayerNorm (Figure 3). This means these neuron responses are removed, because the following non-linear function is ReLU. We deem that this removes much of the noise in the MLP responses and accelerates the training of the model, thanks to the introduction of the variance.
As for the influence of the variance on the feature embedding, we can analyze the effect of normalization from another viewpoint. As we all know, the features in CTR tasks are very sparse, and there are a large number of low-frequency features. This leads to the under-fitting of these long-tail features' embeddings, because there is less training data for them. The under-fitted embeddings may contain noise, which makes it difficult for the feed-forward layers to capture complex feature interactions. We think the correct normalization on the feature embedding can alleviate this situation.
We can derive the derivative of the loss $\ell$ with respect to the input $\mathbf{x}$ after inserting variance-only LayerNorm $\mathbf{y} = \mathbf{x}/\sigma$ as follows. Assume the derivative of $\ell$ with respect to the output $\mathbf{y}$ is given, i.e. $\partial\ell/\partial\mathbf{y}$ is known. Then the derivative with respect to the input $\mathbf{x}$ can be written as:

(9) $\quad \dfrac{\partial \ell}{\partial x_i} = \sum_{j=1}^{H} \dfrac{\partial \ell}{\partial y_j} \dfrac{\partial y_j}{\partial x_i}$

(10) $\quad \dfrac{\partial \ell}{\partial x_i} = \dfrac{1}{\sigma}\dfrac{\partial \ell}{\partial y_i} - \dfrac{x_i - \mu}{H\sigma^{3}} \sum_{j=1}^{H} \dfrac{\partial \ell}{\partial y_j}\, x_j$

where $\mu$ and $\sigma$ are the mean and standard deviation of the input, and $H$ is the size of the input.
We can see that the second sub-item of formula (10) is nearly zero and can be ignored during the beginning phase of model training, because parameter initialization approaches usually set the initial values of parameters to very small random numbers near zero. So it is the first sub-item that mainly influences the derivative of $\ell$ with respect to the input $\mathbf{x}$. From Figure 1, we know that the variance before normalization is a rather small positive number. That is to say, the derivative is made much larger by the introduction of variance-only LayerNorm. This means the loss becomes much more sensitive to small changes of the input because of the existence of variance-only LayerNorm. The variance brings faster convergence for low-frequency feature embeddings and alleviates the under-fitting of these feature embeddings.
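The reconstructed derivative can be checked numerically. The sketch below compares the analytic expression in formula (10) with a finite-difference estimate for a toy input, taking the loss to be the inner product of a fixed upstream gradient with the normalized output:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 8
x = rng.normal(size=H)
g = rng.normal(size=H)                 # given upstream gradient dL/dy

def voln(v):
    return v / v.std()                 # y = x / sigma, sigma computed around the mean

# Analytic gradient from formula (10): a first term g_i / sigma plus a
# correction term involving the sum over (dL/dy_j) * x_j.
mu, sigma = x.mean(), x.std()
grad_analytic = g / sigma - (x - mu) / (H * sigma**3) * np.dot(g, x)

# Finite-difference estimate of d(g . y)/dx_i for comparison.
eps = 1e-6
I = np.eye(H)
grad_numeric = np.array([
    (np.dot(g, voln(x + eps * I[i])) - np.dot(g, voln(x - eps * I[i]))) / (2 * eps)
    for i in range(H)
])
assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```

The check also illustrates the argument in the text: the dominant term scales as $1/\sigma$, so a small pre-normalization variance amplifies the gradient flowing back into the input.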
4. Experimental Results
In this section, we empirically evaluate the effect of various normalization approaches on deep neural networks on three real-world datasets and conduct detailed studies to answer the following research questions:

RQ1 What’s the effect of various normalization approaches applied only on feature embedding part of DNN model?

RQ2 What’s the effect of various normalizations approaches applied only on MLP part of DNN model?

RQ3 What’s the effect of various normalization approaches applied both on feature embedding part and the MLP part of DNN model? Does our proposed varianceonly LayerNorm work?

RQ4 Does categorical feature or numerical feature need specific normalization?

RQ5 Is there a best normalization combination for DNN model?

RQ6 Can we draw a similar conclusion about the effect of normalization in other state-of-the-art models such as DeepFM or xDeepFM?
In the following, we will first describe the experimental settings, followed by answering the above research questions.
4.1. Experiment Setup
4.1.1. Datasets
The following three data sets are used in our experiments:

Criteo Dataset (http://labs.criteo.com/downloads/download-terabyte-click-logs/): As a famous public real-world display-ad dataset with ad display information and corresponding user click feedback, the Criteo dataset is widely used in CTR model evaluation. There are 26 anonymous categorical fields and 13 continuous feature fields in the Criteo dataset.

Malware Dataset (https://www.kaggle.com/c/malware-classification): Malware is a dataset from a Kaggle competition published as the Microsoft Malware Classification Challenge. It is almost half a terabyte when uncompressed and consists of disassembly and bytecode malware files representing a mix of 9 different families. All the feature fields are categorical.

Avazu Dataset (http://www.kaggle.com/c/avazu-ctr-prediction): The Avazu dataset consists of several days of ad click-through data, ordered chronologically. For each click record, there are 24 fields which describe a single ad impression.
We randomly split the instances into training, validation and test sets; Table 1 lists the statistics of the evaluation datasets. For these datasets, a small improvement in prediction accuracy is regarded as practically significant, because it brings a large increase in a company's revenue when the company has a very large user base.
Table 1: Statistics of the evaluation datasets.

| Datasets | #Instances | #Fields | #Features |
| --- | --- | --- | --- |
| Criteo | 45M | 39 (26 Cat; 13 Num) | 30M |
| Malware | 8.92M | 82 (all Cat) | 9.89M |
| Avazu | 40.43M | 24 (all Cat) | 0.64M |
4.1.2. Evaluation Metric
AUC (Area Under the ROC Curve) is used as the evaluation metric in our experiments. This metric is very popular for binary classification tasks. AUC is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It is insensitive to the classification threshold and the positive ratio. AUC's upper bound is 1, and a larger value indicates better performance.

4.1.3. Models for Comparisons
We mainly use the DNN model as the baseline to evaluate the effect of various normalization methods, because it is a commonly used component in many current neural network models. DeepFM and xDeepFM are also regarded as two further baselines to verify the effectiveness of these approaches. Among these baseline models, DNN and DeepFM implicitly capture high-order interactions, while xDeepFM models high-order interactions in an explicit way.
4.1.4. Implementation Details
We implement all models with TensorFlow in our experiments. For optimization we use Adam with a mini-batch size of 1000 and a fixed learning rate. Since our paper focuses on normalization approaches, the dimension of the field embedding is set to the same fixed value for all models. For models with a DNN part, the depth of the hidden layers is set to 3, the number of neurons per layer is 400, and all activation functions are ReLU. We conduct our experiments with Tesla GPUs.

Table 2: Overall AUC performance of DNN with different normalization approaches applied only on the feature embedding.

| Model | Criteo | Avazu | Malware |
| --- | --- | --- | --- |
| DNN | 0.8054 | 0.7820 | 0.7263 |
| +BN | 0.8066 | 0.7847 | 0.7364 |
| +GN-2 | 0.8096 | 0.7835 | 0.7330 |
| +LN | 0.8093 | 0.7848 | 0.7341 |
| +SLN | 0.8093 | 0.7843 | 0.7343 |
| +VO-LN | 0.8094 | 0.7839 | 0.7358 |
4.2. Effectiveness of Normalization on Feature Embedding (RQ1)
To verify the effectiveness of various normalization approaches on DNN models, comparison experiments are conducted on the three evaluation datasets. We add different kinds of normalization either on the embedding layer or the hidden layers of a standard DNN model which has 3 MLP layers with 400 neurons per layer. As for GroupNorm, we find that GN with 2 groups performs best compared with other group-number settings, so we only report the experimental results with this setting (GN-2). We find that normalization on the feature embedding helps boost the DNN model's performance. Table 2 shows the experimental results. From Table 2, we have the following observations:

If we add normalization on the feature embedding, all normalization methods help the model's training on the three datasets, including BatchNorm, GroupNorm, LayerNorm, simple LN and our proposed variance-only LN.

As for the widely used normalization methods, BatchNorm performs best on the Malware dataset but worst on the Criteo dataset. This implies that the performance of BatchNorm depends on the specific dataset, and GroupNorm shows a similar trend. LayerNorm keeps a relatively high performance on all three datasets.

As for the LayerNorm-based normalizations, LayerNorm, simple LN and variance-only LN show comparable performance on all three datasets.
Table 3: Overall AUC performance of DNN with different normalization approaches applied only on the MLP part.

| Model | Criteo | Avazu | Malware |
| --- | --- | --- | --- |
| DNN | 0.8054 | 0.7820 | 0.7263 |
| +BN | 0.8071 | 0.7836 | 0.7388 |
| +GN-2 | 0.8073 | 0.7836 | 0.7388 |
| +LN | 0.8071 | 0.7851 | 0.7378 |
| +SLN | 0.8070 | 0.7847 | 0.7376 |
| +VO-LN | 0.8075 | 0.7849 | 0.7373 |
4.3. Effectiveness of Normalization on MLP Part (RQ2)
We also conduct experiments applying the different kinds of normalization only on the MLP part of the DNN model. The overall performance of the DNN model with different normalizations on the three evaluation datasets is shown in Table 3. From the experimental results, we can see that:

Various normalization approaches show comparable performance on both Criteo and Malware datasets. BatchNorm and GroupNorm slightly underperform LayerNorm based approaches on Avazu dataset.

Compared with normalization only on the feature embedding, normalization only on the MLP part performs better on the Malware dataset and worse on the Criteo dataset on the whole. This may imply that the choice between normalization on the embedding and on the MLP part depends on the specific task.
Table 4: Overall AUC performance of DNN with normalization combinations on both the feature embedding (EMB) and the MLP part.

| EMB | MLP | Criteo | Avazu | Malware |
| --- | --- | --- | --- | --- |
| w/o | w/o | 0.8054 | 0.7820 | 0.7263 |
| +BN | +BN | 0.8068 | 0.7845 | 0.7393 |
| +BN | +LN | 0.8075 | 0.7863 | 0.7393 |
| +BN | +VO-LN | 0.8077 | 0.7869 | 0.7402 |
| +LN | +BN | 0.8094 | 0.7838 | 0.7387 |
| +LN | +LN | 0.8096 | 0.7852 | 0.7372 |
| +LN | +VO-LN | 0.8098 | 0.7857 | 0.7372 |
| +VO-LN | +BN | 0.8092 | 0.7823 | 0.7394 |
| +VO-LN | +LN | 0.8092 | 0.7841 | 0.7376 |
| +VO-LN | +VO-LN | 0.8097 | 0.7850 | 0.7383 |
4.4. Normalization Combination on Both Feature Embedding and MLP (RQ3)
As discussed in Section 4.2, we can apply normalization on both the feature embedding part and the MLP part of the DNN model. Extensive experiments have been conducted, and we find that the following three normalizations perform better when we combine various normalizations in different parts of the DNN model: BatchNorm, LayerNorm and variance-only LN. So we present the experimental results of the 9 combinations in Table 4.
From the results in Table 4, we can see that:

If we choose the correct normalization combination, the DNN model performs better than any model which only uses normalization in one part of the DNN model, either the feature embedding or the MLP part. This means the two are complementary, and it is better to use both in real-life applications.

Compared with a standard DNN model, DNN models with normalization outperform the baseline by a large margin when the correct normalization combination is selected.

If we adopt BatchNorm in the normalization combination, the conclusion that its performance depends on the dataset still holds.

Choosing variance-only LayerNorm for the MLP part, we usually obtain models with relatively higher performance, no matter which normalization is used in the feature embedding part. This tells us that we had better use VO-LN as the normalization in the MLP part when combining normalizations.
4.5. Normalization for Numerical and Categorical Feature (RQ4)
From the experimental results shown in Table 2, we observe that the performance degrades if we adopt BatchNorm instead of the LayerNorm-based approaches on the feature embedding on the Criteo dataset. Considering that only the Criteo dataset contains both categorical and numerical features, we assume that this performance difference is related to the numerical or categorical features. So we design some normalization combination experiments to test this assumption. As discussed in Section 4.4, we fix the normalization used in the MLP to be variance-only LayerNorm and apply different normalizations to the numerical and categorical features. The experimental results can be seen in Table 5.
Table 5: AUC performance on Criteo with different normalizations for numerical (Num) and categorical (Cat) feature embeddings.

| EMB (Num) | EMB (Cat) | MLP | Criteo |
| --- | --- | --- | --- |
| w/o | w/o | w/o | 0.8054 |
| +BN | +BN | +VO-LN | 0.8077 |
| +BN | +LN | +VO-LN | 0.8068 |
| +BN | +VO-LN | +VO-LN | 0.8066 |
| +LN | +LN | +VO-LN | 0.8097 |
| +LN | +BN | +VO-LN | 0.8105 |
| +VO-LN | +BN | +VO-LN | 0.8107 |
| +VO-LN | +VO-LN | +VO-LN | 0.8098 |
From the results in Table 5, we can see that:

For numerical features, the model with LayerNorm or variance-only LayerNorm outperforms the model with BatchNorm. This implies we should utilize LayerNorm-based approaches for numerical features.

If we use a LayerNorm-based normalization for the numerical features and variance-only LayerNorm in the MLP, we can see from Table 5 that the best-performing model uses BatchNorm for the categorical features.
4.6. Performance of NormDNN (RQ5)
Table 6: Overall AUC performance of NormDNN compared with baseline models.

| Model | Criteo | Avazu | Malware |
| --- | --- | --- | --- |
| DNN | 0.8054 | 0.7820 | 0.7263 |
| DeepFM | 0.8056 | 0.7833 | 0.7295 |
| xDeepFM | 0.8063 | 0.7848 | 0.7322 |
| NormDNN | 0.8107 | 0.7869 | 0.7402 |
If we adopt the following unified normalization combination strategy in the DNN ranking model: variance-only LayerNorm or LayerNorm for numerical features, BatchNorm for categorical features and variance-only LayerNorm for the MLP, we obtain the best-performing model on all three datasets, which achieves significantly better performance than complex models such as xDeepFM. We call the normalization-enhanced model with this unified normalization strategy "NormDNN" in this paper.

The experimental results in Table 6 support this observation. It is easy to see that NormDNN is more applicable in many industrial applications because of its better performance and high computational efficiency compared with many state-of-the-art complex neural network models.
Table 7: AUC performance on Criteo when applying normalization to the DNN, DeepFM and xDeepFM models.

| EMB | MLP | DNN | DeepFM | xDeepFM |
| --- | --- | --- | --- | --- |
| w/o | w/o | 0.8054 | 0.8056 | 0.8063 |
| +LN | w/o | 0.8093 | 0.8100 | 0.8100 |
| +VO-LN | w/o | 0.8094 | 0.8099 | 0.8100 |
| w/o | +LN | 0.8071 | 0.8073 | 0.8075 |
| w/o | +VO-LN | 0.8075 | 0.8076 | 0.8073 |
| +LN | +LN | 0.8096 | 0.8099 | 0.8100 |
| +LN | +VO-LN | 0.8098 | 0.8102 | 0.8103 |
| +VO-LN | +VO-LN | 0.8097 | 0.8101 | 0.8101 |
4.7. Normalization on DeepFM and xDeepFM Models (RQ6)
In this part of the paper, we study the impact of normalization on two other popular deep neural network models, DeepFM and xDeepFM. We design normalization experiments to observe whether it also works for these two models. Notice that the input of the FM component in DeepFM is the feature embedding before normalization; the performance of DeepFM degrades if the FM component uses the same normalized feature embedding as the DNN component does.
The results in Table 7 show the impact of the various normalizations on model performance. It can be observed that:

The performance of both models clearly increases when we add normalization to different parts of the model. The experimental results tell us that normalization works for many current state-of-the-art models.

If we select the correct normalization combination for a simple model such as DNN or DeepFM, the model with normalization outperforms a complex model without normalization such as xDeepFM. This means it is more practical to add normalization to simple models in real-life applications.
5. Conclusion
In this paper, we first apply various normalization approaches to the feature embedding part and the MLP part of the DNN model. Extensive experiments are conducted on three real-world datasets, and the results demonstrate that the correct normalization significantly enhances model performance. We also simplify LayerNorm and propose two new and effective normalization methods in this work. Furthermore, we find that the variance of normalization mainly contributes to this positive effect.
References
Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V., and Chi, E. H. (2018). Latent cross: making use of context in recurrent recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 46–54.
Cheng, H.-T., et al. (2016). Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10.
Guo, H., Tang, R., Ye, Y., Li, Z., and He, X. (2017). DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247.
He, X. and Chua, T.-S. (2017). Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 355–364.
He, X., et al. (2014). Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Lian, J., Zhou, X., Zhang, F., Chen, Z., Xie, X., and Sun, G. (2018). xDeepFM: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1754–1763.
McMahan, H. B., et al. (2013). Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, New York, NY, USA, pp. 1222–1230.
Qu, Y., Cai, H., Ren, K., Zhang, W., Yu, Y., Wen, Y., and Wang, J. (2016). Product-based neural networks for user response prediction. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 1149–1154.
Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493.
Shen, S., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. (2020). Rethinking batch normalization in transformers. arXiv preprint arXiv:2003.07845.
Song, W., Shi, C., Xiao, Z., Duan, Z., Xu, Y., Zhang, M., and Tang, J. (2019). AutoInt: automatic feature interaction learning via self-attentive neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1161–1170.
Wang, R., Fu, B., Fu, G., and Wang, M. (2017). Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17, pp. 12.
Wu, Y. and He, K. (2018). Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
Xiao, J., Ye, H., He, X., Zhang, H., Wu, F., and Chua, T.-S. (2017). Attentional factorization machines: learning the weight of feature interactions via attention networks. arXiv preprint arXiv:1708.04617.
Xu, J., Sun, X., Zhang, Z., Zhao, G., and Lin, J. (2019). Understanding and improving layer normalization. In Advances in Neural Information Processing Systems, pp. 4383–4393.
Zhang, W., Du, T., and Wang, J. (2016). Deep learning over multi-field categorical data. In European Conference on Information Retrieval, pp. 45–57.
Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. (2018). Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068.