1. Introduction
Click-through rate (CTR) prediction is crucial in recommender systems, where the task is to predict the probability of the user clicking on the recommended items (e.g., movie, advertisement) (Cheng et al., 2016; Zhou et al., 2018). Many recommendation decisions can then be made based on the predicted CTR. The core of these recommender systems is to extract significant low-order and high-order feature interactions.
Explicit feature interactions can significantly improve the performance of CTR models (Guo et al., 2017; McMahan et al., 2013; He et al., 2014; Beutel et al., 2018). Early collaborative filtering recommendation algorithms, such as matrix factorization (MF) (Koren et al., 2009) and factorization machine (FM) (Rendle, 2010), extract second-order information with a bilinear learning model.
However, not all interactions are conducive to performance. Some tree-based methods have been proposed to find useful interactions automatically. Gradient boosting decision tree (GBDT) (Chen and Guestrin, 2016; He et al., 2014) tries to find the interactions with higher gradients of the loss function. AutoCross (Luo et al., 2019) searches effective interactions in a tree-structured space. But tree models can only explore a small fraction of all possible feature interactions in recommender systems with multi-field categorical data (Qu et al., 2019), so their exploration ability is restricted.
In the meantime, Deep Neural Network (DNN) models (Covington et al., 2016) have been proposed. Their representational ability is stronger and they could explore most of the feature interactions according to the universal approximation property (Hornik et al., 1989). However, there is no guarantee that a DNN naturally converges to any expected function using gradient-based optimization. A recent work proves the insensitive gradient issue of DNN when the target is a large collection of uncorrelated functions (Shalev-Shwartz et al., 2017; Qu et al., 2019). Simple DNN models may not find the proper feature interactions. Therefore, various complicated architectures have been proposed, such as Deep Interest Network (DIN) (Zhou et al., 2018), Deep Factorization Machine (DeepFM) (Guo et al., 2017), Product-based Neural Network (PNN) (Qu et al., 2019), and Wide & Deep (Cheng et al., 2016). Factorization models (specified in Definition 3.1), such as FM, DeepFM, PNN, Attention Factorization Machine (AFM) (Xiao et al., 2017), and Neural Factorization Machine (NFM) (He and Chua, 2017), adopt a feature extractor to explore explicit feature interactions.
However, all these models either simply enumerate all feature interactions or require human effort to identify important feature interactions. The former always brings large memory and computation cost to the model and is difficult to extend to high-order interactions. Besides, useless interactions may bring unnecessary noise and complicate the training process (Shalev-Shwartz et al., 2017). The latter, such as identifying important interactions manually in Wide & Deep (Cheng et al., 2016), has a high labor cost and risks missing some counter-intuitive (but important) interactions.
If useful feature interactions can be identified beforehand in these factorization models, the models can focus on learning over them without having to deal with useless feature interactions. Through removing the useless or even harmful interactions, we would expect the model to perform better with reduced computation cost.
To automatically learn which feature interactions are essential, we introduce a gate (in open or closed status) for each feature interaction to control whether its output should be passed to the next layer. In previous works, the status of the gates is either specified beforehand by expert knowledge (Cheng et al., 2016) or set as all open (Guo et al., 2017; Lian et al., 2018). From a data-driven point of view, whether a gate is open or closed should depend on the contribution of the corresponding feature interaction to the final prediction. Gates of interactions that contribute little should be closed to avoid introducing extra noise to model learning. However, it is an NP-hard problem to find the optimal set of open gates for model performance, as we face an incredibly huge ($2^{C_m^2}$, with $m$ the number of feature fields, if we only consider 2nd-order feature interactions) space to search.
Inspired by the recent work DARTS (Liu et al., 2019b) for neural architecture search, we propose a two-stage method, AutoFIS, for automatically selecting low-order and high-order feature interactions in factorization models. In the search stage, instead of searching over a discrete set of candidate feature interactions, we relax the choices to be continuous by introducing a set of architecture parameters (one for each feature interaction) so that the relative importance of each feature interaction can be learned by gradient descent. The architecture parameters are jointly optimized with neural network weights by the GRDA optimizer (Chao and Cheng, 2019) (an optimizer that tends to produce a sparse solution) so that the training process can automatically abandon unimportant feature interactions (with zero values as the architecture parameters) and keep the important ones. After that, in the retrain stage, we select the feature interactions with nonzero values of the architecture parameters and retrain the model with the selected interactions while keeping the architecture parameters as attention units instead of indicators of interaction importance.
Extensive experiments are conducted on three large-scale datasets (two are public benchmarks, and the other is private). Experimental results demonstrate that AutoFIS can significantly improve the CTR prediction performance of factorization models on all datasets. As AutoFIS can remove about 50%–80% of the 2nd-order feature interactions, the original models also gain in efficiency. We also apply AutoFIS to 3rd-order interaction selection by learning the importance of each 3rd-order feature interaction. Experimental results show that with about 1%–10% of the 3rd-order interactions selected, the AUC of factorization models can be improved by 0.1%–0.2% without introducing much computation cost. The results show a promising direction of using AutoFIS for automatic high-order feature interaction selection. Experiments also demonstrate that important 2nd- and 3rd-order feature interactions, identified by AutoFIS in a factorization machine, can also greatly boost the performance of current state-of-the-art models, which means we can use a simple model for interaction selection and apply the selection results to other models. Besides, we analyze the effectiveness of feature interactions selected by our model on real data and synthetic data. Furthermore, a ten-day online A/B test is performed in a Huawei App Store recommendation service, where the recommendation model produced by AutoFIS improves CTR and CVR by approximately 20% over DeepFM, which contributes significant business revenue growth.
To summarize, the main contributions of this paper can be highlighted as follows:


We empirically verify that removing the redundant feature interactions is beneficial when training factorization models.

We propose a two-stage algorithm AutoFIS to automatically select important low-order and high-order feature interactions in factorization models. In the search stage, AutoFIS can learn the relative importance of each feature interaction via architecture parameters within one full training process. In the retrain stage, with the unimportant interactions removed, we retrain the resulting neural network while keeping architecture parameters as attention units to help the learning of the model.

Offline experiments on three large-scale datasets demonstrate the superior performance of AutoFIS in factorization models. Moreover, AutoFIS can also find a set of important high-order feature interactions to boost the performance of existing models without much computation cost introduced. A ten-day online A/B test shows that AutoFIS improves the DeepFM model by approximately 20% on average in terms of CTR and CVR.
2. Related Work
CTR prediction is generally formulated as a binary classification problem (Liu et al., 2019a). In this section we briefly review factorization models for CTR prediction and AutoML models for recommender systems.
Factorization machine (FM) (Rendle, 2010) projects each feature into a low-dimensional vector and models feature interactions by inner product, which works well for sparse data. Field-aware factorization machine (FFM) (Juan et al., 2016) further enables each feature to have multiple vector representations to interact with features from other fields.
Recently, deep learning models have achieved state-of-the-art performance on some public benchmarks (Liu et al., 2019a; Wang et al., 2017). Several models use MLP to improve FM, such as Attention FM (Xiao et al., 2017) and Neural FM (He and Chua, 2017). Wide & Deep (Cheng et al., 2016) jointly trains a wide model for artificial features and a deep model for raw features. DeepFM (Guo et al., 2017) uses an FM layer to replace the wide component in Wide & Deep. PNN (Qu et al., 2019) uses MLP to model the interaction of the FM layer and feature embeddings, while PIN (Qu et al., 2019) introduces a network-in-network architecture to model pairwise feature interactions with sub-networks rather than the simple inner product operations in PNN and DeepFM. Note that all existing factorization models simply enumerate all 2nd-order feature interactions, which contain many useless and noisy interactions.
Gradient boosting decision tree (GBDT) (Chen and Guestrin, 2016) is a method for feature engineering that searches interactions with a decision tree algorithm. The transformed feature interactions can then be fed into logistic regression (He et al., 2014) or FFM (Juan et al., 2014). In practice, tree models are more suitable for continuous data than for high-dimensional categorical data in recommender systems because of the low usage rate of categorical features (Qu et al., 2019).
In the meantime, there exist some works using AutoML techniques to deal with the problems in recommender systems. AutoCross (Luo et al., 2019) is proposed to search over many subsets of candidate features to identify effective interactions. This requires training the whole model to evaluate the selected feature interactions, but the number of candidate sets is enormous: there are $2^{C_m^2}$ candidate sets for a dataset with $m$ fields for just 2nd-order feature interactions. Thus AutoCross accelerates the search with two approximations: (i) it greedily constructs local-optimal feature sets via beam search in a tree structure, and (ii) it evaluates the newly generated feature sets via field-aware logistic regression. Due to these two approximations, the high-order feature interactions extracted from AutoCross may not be useful for deep models. Compared with AutoCross, our proposed AutoFIS only needs to perform the search stage once to evaluate the importance of all feature interactions, which is much more efficient. Moreover, the learned useful interactions will improve the deep model as they are learned and evaluated in this deep model directly.
Recently, one-shot architecture search methods, such as DARTS (Liu et al., 2019b), have become the most popular neural architecture search (NAS) algorithms to efficiently search network architectures (Bender, 2019). In recommender systems, such methods are utilized to search proper interaction functions for collaborative filtering models (Yao, 2020). The model in (Yao, 2020) focuses on identifying proper interaction functions for feature interactions while our model focuses on searching and keeping important feature interactions. Inspired by the recent work DARTS for neural architecture search, we formulate the problem of searching the effective feature interactions as a continuous searching problem by incorporating architecture parameters. Different from DARTS, which uses two-level optimization to optimize the architecture parameters and the model weights alternately and iteratively with the training set and the validation set, we use one-level optimization to train these two types of parameters jointly with all data as the training set. We analyze their difference theoretically in Section 3.2, and compare their performance in the Experiments section.
3. Methodology
In this section, we describe the proposed AutoFIS, an algorithm to select important feature interactions in factorization models automatically.
3.1. Factorization Model (Base Model)
First, we define factorization models:
Definition 3.1.
Factorization models are the models where the interaction of several embeddings from different features is modeled into a real number by some operation such as inner product or neural network.
We take FM, DeepFM, and IPNN as instances to formulate our algorithm and explore the performance on various datasets. Figure 1 presents the architectures of the FM, DeepFM and IPNN models. FM consists of a feature embedding layer and a feature interaction layer. Besides these two layers, the DeepFM and IPNN models include an extra layer: the MLP layer. The difference between DeepFM and IPNN is that the feature interaction layer and the MLP layer work in parallel in DeepFM, while they are arranged in sequence in IPNN.
In the subsequent subsections, we briefly describe the feature embedding layer and feature interaction layer in FM. To work with the DeepFM and IPNN models, the MLP layer and output layer are also formalized. Then we elaborate on how our proposed AutoFIS works on the feature interaction layer, i.e., selecting important feature interactions based on architecture parameters.
Feature Embedding Layer. In most CTR prediction tasks, data is collected in a multi-field categorical form (features in numerical form are usually transformed into categorical form by bucketing). A typical data preprocessing step is to transform each data instance into a high-dimensional sparse vector via one-hot or multi-hot encoding. A field is represented as a multi-hot encoding vector only when it is multivariate. A data instance can be represented as
$$\mathbf{x} = [\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_m],$$
where $m$ is the number of fields and $\mathbf{x}_i$ is the one-hot or multi-hot encoding vector of the $i$-th field. A feature embedding layer is used to transform an encoding vector into a low-dimensional vector as
$$e_i = V_i \mathbf{x}_i, \tag{1}$$
where $V_i \in R^{d \times n_i}$ is a matrix, $n_i$ is the number of feature values in the $i$-th field and $d$ is the dimension of the low-dimensional vectors.

If $\mathbf{x}_i$ is a one-hot vector with $j$-th element $\mathbf{x}_i[j] = 1$, then the representation of $\mathbf{x}_i$ is $V_i^j$ (the $j$-th column of $V_i$).

If $\mathbf{x}_i$ is a multi-hot vector with $\mathbf{x}_i[j] = 1$ for $j = i_1, i_2, \cdots, i_k$ and the embeddings of these elements are $\{V_i^{i_1}, V_i^{i_2}, \cdots, V_i^{i_k}\}$, then the representation of $\mathbf{x}_i$ is the sum or average of these embeddings (Covington et al., 2016).
The output of the feature embedding layer is then the concatenation of multiple embedding vectors as
$$E = [e_1, e_2, \cdots, e_m].$$
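The embedding lookup described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `embed_field` and `embed_instance`, the row-per-value table layout, and the toy numbers are hypothetical and not taken from the paper's implementation.

```python
from typing import List


def embed_field(table: List[List[float]], active: List[int],
                average: bool = False) -> List[float]:
    """Embed one field; table[j] is the embedding of the j-th feature value.

    `active` lists the positions that equal 1 in the one-hot or multi-hot
    encoding vector (a single position for a one-hot field).
    """
    dim = len(table[0])
    out = [0.0] * dim
    for j in active:
        for k in range(dim):
            out[k] += table[j][k]
    if average and active:
        out = [value / len(active) for value in out]
    return out


def embed_instance(tables, actives) -> List[float]:
    """Concatenate the per-field embeddings into one flat vector."""
    flat: List[float] = []
    for table, active in zip(tables, actives):
        flat.extend(embed_field(table, active))
    return flat


# Toy setup (assumed): two fields, embedding dimension 2.
tables = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # field 0 has 3 feature values
    [[1.0, 0.0], [0.0, 1.0]],              # field 1 has 2 feature values
]
one_hot = embed_field(tables[0], [1])          # picks row 1 of the table
multi_hot = embed_field(tables[1], [0, 1])     # sums rows 0 and 1
concatenated = embed_instance(tables, [[1], [0, 1]])
```

A one-hot field reduces to a single table lookup, while a multi-hot field sums (or, with `average=True`, averages) the selected rows.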
Feature Interaction Layer. After transforming the features to low-dimensional space, the feature interactions can be modeled in such a space with the feature interaction layer. First, the inner product of the pairwise feature interactions is calculated:
$$[\langle e_1, e_2\rangle, \langle e_1, e_3\rangle, \cdots, \langle e_{m-1}, e_m\rangle], \tag{2}$$
where $e_i$ is the feature embedding of the $i$-th field and $\langle\cdot,\cdot\rangle$ is the inner product of two vectors. The number of pairwise feature interactions in this layer is $C_m^2$.
In FM and DeepFM models, the output of the feature interaction layer is:
$$l_{fm} = \langle \mathbf{w}, \mathbf{x}\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m} \langle e_i, e_j\rangle. \tag{3}$$
Here, all the feature interactions are passed to the next layer with equal contribution. As pointed out in Section 1 and as will be verified in Section 4, not all the feature interactions are equally predictive and useless interactions may even degrade the performance. Therefore, we propose the AutoFIS algorithm to select important feature interactions efficiently.
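As a small illustration of the pairwise interaction layer, the sketch below enumerates all field pairs, computes their inner products, and adds them to a linear term with equal weight, as in Equation 3. The function names and the toy embeddings are assumptions made for this example; the linear term is passed in as a precomputed scalar.

```python
from itertools import combinations
from math import comb


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def pairwise_interactions(embeddings):
    """Return {(i, j): <e_i, e_j>} for all field pairs with i < j."""
    return {(i, j): inner(embeddings[i], embeddings[j])
            for i, j in combinations(range(len(embeddings)), 2)}


def fm_interaction_output(linear_term, embeddings):
    """Linear term plus the unweighted sum of all pairwise inner products."""
    return linear_term + sum(pairwise_interactions(embeddings).values())


# Toy embeddings (assumed): three fields, embedding dimension 2.
embeddings = [[1.0, 0.0], [0.5, 2.0], [2.0, 1.0]]
pairs = pairwise_interactions(embeddings)
expected_pairs = comb(len(embeddings), 2)  # number of field pairs
output = fm_interaction_output(0.25, embeddings)
```

Every pair contributes with the same weight here, which is exactly the property the proposed method sets out to relax.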
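To study whether our methods can be used to identify important high-order interactions, we define the feature interaction layer with 3rd-order interactions (i.e., combinations of three fields) as:
$$l_{fm}^{3rd} = \langle \mathbf{w}, \mathbf{x}\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m} \langle e_i, e_j\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m}\sum_{t>j}^{m} \langle e_i, e_j, e_t\rangle. \tag{4}$$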
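A sketch of the three-field interaction terms follows. It assumes that the interaction of three embeddings is the sum over dimensions of their element-wise product, which is one common way to extend the inner product to three vectors; the text does not spell this out, so treat `inner3` as an assumption. All names and numbers are illustrative.

```python
from itertools import combinations


def inner3(u, v, w):
    """Assumed three-way product: sum over dimensions of u[k] * v[k] * w[k]."""
    return sum(a * b * c for a, b, c in zip(u, v, w))


def third_order_terms(embeddings):
    """Return {(i, j, t): interaction} for all field triples with i < j < t."""
    return {(i, j, t): inner3(embeddings[i], embeddings[j], embeddings[t])
            for i, j, t in combinations(range(len(embeddings)), 3)}


# Toy embeddings (assumed): four fields, embedding dimension 2.
embeddings = [[1.0, 2.0], [0.5, 1.0], [2.0, 1.0], [1.0, 1.0]]
triples = third_order_terms(embeddings)
```

With four fields there are only four triples, but the count grows cubically with the number of fields, which is why enumerating all of them is expensive.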
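MLP Layer. The MLP layer consists of several fully connected layers with activation functions, which learn the relationship and combination of features. The output of one such layer is
$$a^{(l+1)} = \text{relu}(W^{(l)} a^{(l)} + b^{(l)}), \tag{5}$$
where $a^{(l)}$, $W^{(l)}$ and $b^{(l)}$ are the input, model weight, and bias of the $l$-th layer. The activation is $\text{relu}(z) = \max(0, z)$. $a^{(0)}$ is the input layer and $\text{MLP}(a^{(0)}) = a^{(h)}$, where $h$ is the depth of the MLP layer.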
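The layer recursion of the MLP can be illustrated with a minimal forward pass. This sketch applies ReLU after every layer, including the last one, following a literal reading of Equation 5; the helper names and the toy weights are assumptions.

```python
def relu(z):
    """Element-wise max(0, z)."""
    return [max(0.0, value) for value in z]


def dense(weights, bias, inputs):
    """Affine map: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def mlp(layers, inputs):
    """Apply each (weights, bias) pair followed by ReLU, layer by layer."""
    activation = inputs
    for weights, bias in layers:
        activation = relu(dense(weights, bias, activation))
    return activation


# Toy network (assumed): 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -2.0]),
    ([[2.0, 1.0]], [0.5]),
]
output = mlp(layers, [3.0, 1.0])
```

The depth of the network is simply the length of `layers`.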
Output Layer. The FM model has no MLP layer and connects the feature interaction layer with the prediction layer directly:
$$\hat{y}_{FM} = \text{sigmoid}(l_{fm}) = \frac{1}{1 + \exp(-l_{fm})}, \tag{6}$$
where $\hat{y}_{FM}$ is the predicted CTR.
DeepFM combines the feature interaction layer and MLP layers in parallel as
$$\hat{y}_{DeepFM} = \text{sigmoid}(l_{fm} + \text{MLP}(E)). \tag{7}$$
In IPNN, the MLP layer is sequential to the feature interaction layer:
$$\hat{y}_{IPNN} = \text{sigmoid}(\text{MLP}([E, l_{fm}])). \tag{8}$$
Note that the MLP layer of IPNN can also serve as a re-weighting of the different feature interactions, to capture their relative importance. This is also the reason that IPNN has a higher capacity than FM and DeepFM. However, with the IPNN formulation, one cannot retrieve the exact value corresponding to the relative contribution of each feature interaction. Therefore, the useless feature interactions in IPNN can be neither identified nor dropped, which brings extra noise and computation cost to the model. We show in the following subsections and in Section 4 how the proposed method AutoFIS can improve IPNN.
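Objective Function. FM, DeepFM, and IPNN share the same objective function, i.e., to minimize the cross-entropy of predicted values and the labels as
$$L(y, \hat{y}_M) = -y \log \hat{y}_M - (1 - y) \log(1 - \hat{y}_M), \tag{9}$$
where $y \in \{0, 1\}$ is the label and $\hat{y}_M \in [0, 1]$ is the predicted probability of $y = 1$.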
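For concreteness, the sigmoid output and the binary cross-entropy objective can be written as below. The clipping constant `eps` is an assumption added here for numerical safety and is not part of the stated objective.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(label, prob, eps=1e-12):
    """Binary cross-entropy for one instance; label is 0 or 1."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0); eps is an assumption
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


p = sigmoid(0.0)            # a score of zero maps to probability 0.5
loss_pos = log_loss(1, p)   # loss when the true label is a click
```

A more confident correct prediction gives a smaller loss, for example `log_loss(1, 0.9)` is below `log_loss(1, 0.6)`.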
3.2. AutoFIS
AutoFIS automatically selects useful feature interactions, which can be applied to the feature interaction layer of any factorization model. In this section, we elaborate on how it works. AutoFIS can be split into two stages: search stage and retrain stage. In the search stage, AutoFIS detects useful feature interactions; while in the retrain stage, the model with selected feature interactions is retrained.
Search Stage. To facilitate the presentation of the algorithm, we introduce the gate operation to control whether to select a feature interaction: an open gate corresponds to selecting a feature interaction, while a closed gate results in a dropped interaction. The total number of gates corresponding to all the 2nd-order feature interactions is $C_m^2$. It is very challenging to find the optimal set of open gates in a brute-force way, as we face an incredibly huge ($2^{C_m^2}$) space to search. In this work, we approach the problem from a different viewpoint: instead of searching over a discrete set of open gates, we relax the choices to be continuous by introducing architecture parameters $\alpha$, so that the relative importance of each feature interaction can be learned by gradient descent. The overview of the proposed AutoFIS is illustrated in Figure 2.
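To give a sense of scale for the brute-force search, the following sketch counts the gates (one per pair of fields) and the number of open/closed configurations. The field count of 24 used in the example is the Avazu value listed in Table 2; the function names are illustrative.

```python
from math import comb


def num_pairwise_gates(num_fields):
    """One gate per unordered pair of fields."""
    return comb(num_fields, 2)


def num_gate_configurations(num_fields):
    """Each gate is either open or closed, so the count doubles per gate."""
    return 2 ** num_pairwise_gates(num_fields)


gates_24 = num_pairwise_gates(24)          # 24 fields, as in Avazu
configs_24 = num_gate_configurations(24)
```

With 24 fields there are already 276 gates, so exhaustive evaluation of every configuration is out of reach.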
This architecture selection scheme by gradient learning is inspired by DARTS (Liu et al., 2019b), where the objective is to select one operation from a set of candidate operations in a convolutional neural network (CNN) architecture.
To be specific, we reformulate the interaction layer in factorization models (shown in Equation 3) as
$$l_{AutoFIS} = \langle \mathbf{w}, \mathbf{x}\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m} \alpha_{(i,j)} \langle e_i, e_j\rangle, \tag{10}$$
where $\alpha = \{\alpha_{(1,2)}, \cdots, \alpha_{(m-1,m)}\}$ are the architecture parameters. In the search stage of AutoFIS, the $\alpha_{(i,j)}$ values are learned in such a way that $\alpha_{(i,j)}$ can represent the relative contribution of each feature interaction to the final prediction. Then, we can decide the gate status of each feature interaction by closing the gates of the unimportant ones (i.e., those with zero $\alpha_{(i,j)}$ values).
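The reformulated interaction layer weights each pairwise inner product by its own architecture parameter. A minimal sketch, with the parameters stored in a dictionary keyed by field pair (the name `alpha` and the storage layout are assumptions for this example):

```python
from itertools import combinations


def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def autofis_interaction_output(linear_term, embeddings, alpha):
    """Linear term plus the sum of pairwise inner products, each scaled by
    the architecture parameter of that pair."""
    total = linear_term
    for i, j in combinations(range(len(embeddings)), 2):
        total += alpha[(i, j)] * inner(embeddings[i], embeddings[j])
    return total


# Toy values (assumed): three fields; the (0, 2) interaction has weight zero.
embeddings = [[1.0, 0.0], [0.5, 2.0], [2.0, 1.0]]
alpha = {(0, 1): 1.0, (0, 2): 0.0, (1, 2): 0.5}
gated = autofis_interaction_output(0.0, embeddings, alpha)
all_open = autofis_interaction_output(
    0.0, embeddings, {pair: 1.0 for pair in alpha})
```

Setting every parameter to 1 recovers the unweighted sum of the base model, while a parameter of 0 removes the corresponding interaction from the output.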
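Batch Normalization. From the viewpoint of the overall neural network, the contribution of a feature interaction is measured by $\alpha_{(i,j)} \cdot \langle e_i, e_j\rangle$ (in Equation 10). Exactly the same contribution can be achieved by rescaling this term as $(\frac{\alpha_{(i,j)}}{\eta}) \cdot (\eta \cdot \langle e_i, e_j\rangle)$, where $\eta$ is a nonzero real number.
Since the value of $\langle e_i, e_j\rangle$ is jointly learned with $\alpha_{(i,j)}$, the coupling of their scale would lead to unstable estimation of $\alpha_{(i,j)}$, such that $\alpha_{(i,j)}$ can no longer represent the relative importance of $\langle e_i, e_j\rangle$. To solve this problem, we apply Batch Normalization (BN) (Ioffe and Szegedy, 2015) on $\langle e_i, e_j\rangle$ to eliminate its scale issue. BN has been adopted in training deep neural networks as a standard approach to achieve fast convergence and better performance. The way that BN normalizes values gives an efficient yet effective way to solve the coupling problem of $\alpha_{(i,j)}$ and $\langle e_i, e_j\rangle$.
The original BN normalizes the activated output with the statistics of a mini-batch. Specifically,
$$\hat{z} = \frac{z_{in} - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \quad \text{and} \quad z_{out} = \theta \cdot \hat{z} + \beta, \tag{11}$$
where $z_{in}$, $\hat{z}$ and $z_{out}$ are the input, normalized and output values of BN; $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mean and standard deviation of $z_{in}$ over a mini-batch $\mathcal{B}$; $\theta$ and $\beta$ are the trainable scale and shift parameters of BN; $\epsilon$ is a constant for numerical stability.
To get a stable estimation of $\alpha_{(i,j)}$, we set the scale and shift parameters to be 1 and 0 respectively. The BN operation on each feature interaction $\langle e_i, e_j\rangle$ is calculated as
$$\langle e_i, e_j\rangle_{BN} = \frac{\langle e_i, e_j\rangle - \mu_{\mathcal{B}}(\langle e_i, e_j\rangle)}{\sqrt{\sigma_{\mathcal{B}}^2(\langle e_i, e_j\rangle) + \epsilon}}, \tag{12}$$
where $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mean and standard deviation of $\langle e_i, e_j\rangle$ in mini-batch $\mathcal{B}$.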
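The effect of normalizing an interaction over a mini-batch, with scale fixed to 1 and shift fixed to 0, can be checked numerically. The sketch below normalizes the values that one interaction takes across a batch and shows that multiplying the raw values by a constant leaves the normalized values essentially unchanged. The function name, the choice of `eps`, and the toy batch are assumptions.

```python
import math


def batch_normalize(values, eps=1e-5):
    """Normalize one interaction's values over a mini-batch.

    Uses the batch mean and (population) variance, with scale 1 and shift 0.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# Values of a single interaction across a toy mini-batch of four instances.
batch = [0.5, 2.0, 3.5, 1.0]
normalized = batch_normalize(batch)
rescaled = batch_normalize([10.0 * v for v in batch])
max_gap = max(abs(a - b) for a, b in zip(normalized, rescaled))
```

Because the normalized values no longer carry the raw scale, a rescaling of the interaction cannot be traded off against the weight that multiplies it.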
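GRDA Optimizer. The generalized regularized dual averaging (GRDA) optimizer (Chao and Cheng, 2019) aims to produce a sparse deep neural network. To update $\alpha$ at each gradient step $t$ with data $Z_t$ we use the following equation:
$$\alpha_{t+1} = \arg\min_{\alpha} \left\{ \alpha^T \left( -\alpha_0 + \gamma \sum_{i=0}^{t} \nabla L(\alpha_i; Z_{i+1}) \right) + g(t, \gamma) \|\alpha\|_1 + \frac{1}{2} \|\alpha\|_2^2 \right\}, \tag{13}$$
where $g(t, \gamma) = c \gamma^{1/2} (t\gamma)^{\mu}$, $\gamma$ is the learning rate, and $c$ and $\mu$ are adjustable hyper-parameters to trade off between accuracy and sparsity.
In the search stage, we use the GRDA optimizer to learn the architecture parameters $\alpha$ and get a sparse solution. The unimportant feature interactions (i.e., those with zero $\alpha_{(i,j)}$ values) are thrown away automatically. Other parameters are learned by the Adam optimizer, as usual.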
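The sparsity produced by GRDA can be made concrete under one assumption: that the update minimizes a linear term in the accumulated gradients plus an L1 penalty and half the squared L2 norm, as in the regularized dual averaging form of Chao and Cheng (2019). Under that assumption the minimizer is an element-wise soft-thresholding of the initial value minus the learning rate times the gradient sum. The threshold schedule `c * sqrt(lr) * (step * lr) ** mu`, the function names, and all numbers below are assumptions for illustration, and the exact step indexing is glossed over.

```python
import math


def soft_threshold(value, threshold):
    """Shrink toward zero by `threshold`; values inside the band become 0."""
    return math.copysign(max(abs(value) - threshold, 0.0), value)


def grda_threshold(step, lr, c, mu):
    """Assumed schedule for the L1 weight: c * sqrt(lr) * (step * lr) ** mu."""
    return c * math.sqrt(lr) * (step * lr) ** mu


def grda_update(alpha_init, grad_sum, step, lr, c, mu):
    """Closed-form minimizer of the assumed objective.

    grad_sum[k] is the running sum of gradients for parameter k.
    """
    threshold = grda_threshold(step, lr, c, mu)
    return [soft_threshold(a0 - lr * s, threshold)
            for a0, s in zip(alpha_init, grad_sum)]


# Toy run (assumed numbers): two architecture parameters after 100 steps.
updated = grda_update(alpha_init=[0.5, 0.5], grad_sum=[48.0, -10.0],
                      step=100, lr=0.01, c=0.5, mu=0.8)
```

The first parameter lands inside the threshold band and becomes exactly zero, which is the mechanism that lets unimportant interactions be dropped; the second is merely shrunk.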
One-level Optimization. To learn the architecture parameters $\alpha_{(i,j)}$ in the search stage of AutoFIS, we propose to optimize $\alpha$ jointly with all the other network weights $v$ (such as $\mathbf{w}$ in Equation 3 and $W^{(l)}$ and $b^{(l)}$ in Equation 5). This is different from DARTS. DARTS treats $\alpha$ as higher-level decision variables and the network weights as lower-level variables, then optimizes them with a bi-level optimization algorithm. In DARTS, it is assumed that the model can select the operation only when the network weights are properly learned so that $\alpha$ can "make its proper decision". In the context of the AutoFIS formulation, this means that we can decide whether a gate should be open or closed only after the network weights are properly trained, which leads us back to the problem of fully training models to make the decision. To avoid this issue, DARTS proposes to approximate the optimal value of the network weights with only one gradient descent step and to train $\alpha$ and $v$ iteratively.
We argue that the inaccuracy of this approximation might degrade the performance. Therefore, instead of using bi-level optimization, we propose to optimize $\alpha$ and $v$ jointly with one-level optimization. Specifically, the parameters $\alpha$ and $v$ are updated together with gradient descent using the training set by descending on $\alpha$ and $v$ based on
$$\partial_v L_{train}(v_{t-1}, \alpha_{t-1}) \quad \text{and} \quad \partial_{\alpha} L_{train}(v_{t-1}, \alpha_{t-1}). \tag{14}$$
In this setting, $\alpha$ and $v$ can explore their design space freely until convergence, and $\alpha$ is learned to serve as the contribution of individual feature interactions. In Section 4, we show the superiority of one-level optimization over two-level optimization.
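The idea of one-level optimization, where both parameter groups are updated in the same step from gradients evaluated at the same point on the same training data, can be sketched on a toy problem. Everything here is an assumption made for illustration: the scalar model `alpha * v * x`, the squared-error loss, and the use of plain gradient descent for both groups (the search stage described above uses GRDA and Adam instead).

```python
def loss(v, alpha, x, y):
    """Squared error of the toy model alpha * v * x."""
    return (alpha * v * x - y) ** 2


def gradients(v, alpha, x, y):
    """Analytic gradients of the toy loss with respect to v and alpha."""
    residual = alpha * v * x - y
    return 2 * residual * alpha * x, 2 * residual * v * x


def one_level_step(v, alpha, x, y, lr):
    """Update both parameters from gradients taken at the same (v, alpha)."""
    grad_v, grad_alpha = gradients(v, alpha, x, y)
    return v - lr * grad_v, alpha - lr * grad_alpha


v, alpha = 1.0, 1.0
initial_loss = loss(v, alpha, 1.0, 2.0)
for _ in range(200):
    v, alpha = one_level_step(v, alpha, x=1.0, y=2.0, lr=0.05)
final_loss = loss(v, alpha, 1.0, 2.0)
```

A bi-level scheme would instead alternate between the two groups and use separate data splits; the joint step above needs neither.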
Retrain Stage. After the training of the search stage, some unimportant interactions are thrown away automatically according to the architecture parameters $\alpha^*$ obtained in the search stage. We use $\mathcal{G}_{(i,j)}$ to represent the gate status of feature interaction $\langle e_i, e_j\rangle$ and set $\mathcal{G}_{(i,j)}$ to $0$ when $\alpha^*_{(i,j)} = 0$; otherwise, we set $\mathcal{G}_{(i,j)}$ to $1$. In the retrain stage, the gates of these unimportant feature interactions are fixed to be closed permanently.
After removing these unimportant interactions, we retrain the new model with $\alpha$ kept in the model. Specifically, we replace the feature interaction layer in Equation 3 with
$$l_{fm}^{re} = \langle \mathbf{w}, \mathbf{x}\rangle + \sum_{i=1}^{m}\sum_{j>i}^{m} \alpha_{(i,j)} \mathcal{G}_{(i,j)} \langle e_i, e_j\rangle. \tag{15}$$
Note that here $\alpha_{(i,j)}$ no longer serves as an indicator for deciding whether an interaction should be included in the model (as in the search stage). Instead, it serves as an attention unit for the architecture to learn the relative importance of the kept feature interactions. In this stage, we do not need to select the feature interactions. Therefore, all parameters are learned by the Adam optimizer.
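The hand-off from the search stage to the retrain stage can be sketched as follows: gates are derived once from the searched architecture parameters, and the retrained interaction layer only sums over interactions whose gate is open, still scaled by a trainable weight. The dictionary layout and all numbers are assumptions for illustration.

```python
def gates_from_alpha(searched_alpha):
    """Gate is 0 where the searched parameter is exactly zero, else 1."""
    return {pair: 0 if value == 0 else 1
            for pair, value in searched_alpha.items()}


def retrain_interaction_output(linear_term, interactions, alpha, gates):
    """Linear term plus weighted interactions, skipping closed gates."""
    return linear_term + sum(alpha[pair] * interactions[pair]
                             for pair in interactions if gates[pair] == 1)


# Toy values (assumed): the search stage zeroed out the (0, 2) interaction.
searched_alpha = {(0, 1): 0.8, (0, 2): 0.0, (1, 2): -0.3}
gates = gates_from_alpha(searched_alpha)
kept = sorted(pair for pair, gate in gates.items() if gate == 1)

interactions = {(0, 1): 0.5, (0, 2): 2.0, (1, 2): 3.0}
output = retrain_interaction_output(0.0, interactions, searched_alpha, gates)
```

Closed interactions are skipped entirely rather than multiplied by zero, which is where the inference-time saving comes from.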
4. Experiments
In this section, we conduct extensive offline experiments (repeatable experiment code: https://github.com/zhuchenxv/AutoFIS) on two benchmark public datasets and a private dataset, as well as an online A/B test, to answer the following questions:


RQ1: Could we boost the performance of factorization models with the selected interactions by AutoFIS?

RQ2: Could interactions selected from simple models be transferred to the state-of-the-art models to reduce their inference time and improve prediction accuracy?

RQ3: Are the interactions selected by AutoFIS really important and useful?

RQ4: Can AutoFIS improve the performance of existing models in a live recommender system?

RQ5: How do different components of our AutoFIS (e.g., BN) contribute to the performance?
Model  Avazu AUC  Avazu log loss  Avazu top  Avazu time (s)  Avazu search + retrain cost (min)  Avazu Rel. Impr.  Criteo AUC  Criteo log loss  Criteo top  Criteo time (s)  Criteo search + retrain cost (min)  Criteo Rel. Impr.
FM  0.7793  0.3805  100%  0.51  0 + 3  0  0.7909  0.5500  100%  0.74  0 + 11  0 
FwFM  0.7822  0.3784  100%  0.52  0 + 4  0.37%  0.7948  0.5475  100%  0.76  0 + 12  0.49% 
AFM  0.7806  0.3794  100%  1.92  0 + 14  0.17%  0.7913  0.5517  100%  1.43  0 + 20  0.05% 
FFM  0.7831  0.3781  100%  0.24  0 + 6  0.49%  0.7980  0.5438  100%  0.49  0 + 39  0.90% 
DeepFM  0.7836  0.3776  100%  0.76  0 + 6  0.55%  0.7991  0.5423  100%  1.17  0 + 16  1.04% 
GBDT+LR  0.7721  0.3841  100%  0.45  8 + 3  -0.92%  0.7871  0.5556  100%  0.62  40 + 10  -0.48% 
GBDT+FFM  0.7835  0.3777  100%  2.66  6 + 21  0.54%  0.7988  0.5430  100%  1.68  9 + 57  1.00% 
AutoFM(2nd)  0.7831*  0.3778*  29%  0.23  4 + 2  0.49%  0.7974*  0.5446*  51%  0.48  14 + 9  0.82% 
AutoDeepFM(2nd)  0.7852*  0.3765*  24%  0.48  7 + 4  0.76%  0.8009*  0.5404*  28%  0.69  22 + 11  1.26% 
FM(3rd)  0.7843  0.3772  100%  5.70  0 + 21  0.64%  0.7965  0.5457  100%  8.21  0 + 72  0.71% 
DeepFM(3rd)  0.7854  0.3765  100%  5.97  0 + 23  0.78%  0.7999  0.5418  100%  13.07  0 + 125  1.14% 
AutoFM(3rd)  0.7860*  0.3762*  25% / 2%  0.33  22 + 5  0.86%  0.7983*  0.5436*  35% / 1%  0.63  75 + 15  0.94% 
AutoDeepFM(3rd)  0.7870*  0.3756*  21% / 10%  0.94  24 + 10  0.99%  0.8010*  0.5404*  13% / 2%  0.86  128 + 17  1.28% 

* denotes statistically significant improvement (measured by t-test with p-value < 0.005) over baselines with the same order. AutoFM compares with FM and AutoDeepFM compares with all baselines.
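The "Rel. Impr." column appears to be the relative AUC change with respect to the FM row of the same dataset; this reading is an inference from the reported numbers rather than something stated in the text. A quick arithmetic check with the Avazu AUC values from the table:

```python
def relative_improvement(auc, baseline_auc):
    """Relative AUC change over a baseline, in percent."""
    return 100.0 * (auc - baseline_auc) / baseline_auc


fm_avazu = 0.7793  # FM AUC on Avazu, taken from the table
deepfm_avazu = relative_improvement(0.7836, fm_avazu)
autodeepfm_avazu = relative_improvement(0.7852, fm_avazu)
gbdt_lr_avazu = relative_improvement(0.7721, fm_avazu)
```

Rounded to two decimals these reproduce the magnitudes in the table, and a model whose AUC is below that of FM, such as GBDT+LR, comes out negative.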
4.1. Datasets
Experiments are conducted for the following two public datasets (Avazu and Criteo) and one private dataset:
Avazu (http://www.kaggle.com/c/avazu-ctr-prediction): Avazu was released in the CTR prediction contest on Kaggle. The randomly shuffled data is split into a portion for training and validation and a held-out portion for testing. Categories that appear fewer than 20 times are removed for dimensionality reduction.
Criteo (http://labs.criteo.com/downloads/download-terabyte-click-logs/): Criteo contains one month of click logs with billions of data samples. We select "day 6-12" as the training and validation set and "day 13" for evaluation. To counter label imbalance, negative down-sampling is applied to keep the positive ratio roughly at 50%. The 13 numerical fields are converted into one-hot features through bucketing, where the features in a given field appearing fewer than 20 times are set as a dummy feature "other".
Private: Private dataset is collected from a game recommendation scenario in Huawei App Store. The dataset contains app features (e.g., ID, category), user features (e.g., user’s behavior history) and context features. Statistics of all the datasets are summarized in Table 2.
Dataset  #instances  #dimension  #fields  pos ratio 
Avazu  24  0.17  
Criteo  39  0.50  
Private  29  0.02 
4.2. Experimental Settings
4.2.1. Baselines and Evaluation Metrics
We apply AutoFIS to FM (Rendle, 2010) and DeepFM (Guo et al., 2017) models to show its effectiveness (denoted as AutoFM and AutoDeepFM, respectively). We compare it with GBDT-based methods (GBDT+LR (He et al., 2014), GBDT+FFM (Juan et al., 2014)) and Factorization Machine models (AFM (Xiao et al., 2017), FwFM (Pan et al., 2018), FFM (Juan et al., 2016), IPNN (Qu et al., 2019)). Due to its huge computational costs and the unavailability of the source code, we do not compare our models with AutoCross (Luo et al., 2019).
The common evaluation metrics for CTR prediction are AUC (Area Under ROC) and Log loss (cross-entropy).
4.2.2. Parameter Settings
To enable anyone to reproduce the experimental results, we have attached all the hyper-parameters for each model in the supplementary material.
4.2.3. Implementation Details
To select 2nd-order feature interactions for AutoFM and AutoDeepFM, in the search stage we first train the model with $\alpha$ and $v$ jointly on all the training data. Then we remove the useless interactions and retrain our model.
To implement AutoFM and AutoDeepFM for 3rd-order feature interaction selection, we reuse the selected 2nd-order interactions in Equation 15 and enumerate all the 3rd-order feature interactions in the search stage to learn their importance. Finally, we retrain our model with the selected 2nd- and 3rd-order interactions.
Note that in the search stage, the architecture parameters are optimized by GRDA optimizer and other parameters are optimized by Adam optimizer. In the retrain stage, all parameters are optimized by Adam optimizer.
Model  AUC  log loss  top  Rel. Impr. 
FM  0.8880  0.08881  100%  0 
FwFM  0.8897  0.08826  100%  0.19% 
AFM  0.8915  0.08772  100%  0.39% 
FFM  0.8921  0.08816  100%  0.46% 
DeepFM  0.8948  0.08735  100%  0.77% 
AutoFM(2nd)  0.8944*  0.08665*  37%  0.72% 
AutoDeepFM(2nd)  0.8979*  0.08560*  15%  1.11% 

* denotes statistically significant improvement (measured by t-test with p-value < 0.005). AutoFM compares with FM and AutoDeepFM compares with all baselines.
4.3. Feature Interaction Selection by AutoFIS (RQ1)
Table 1 summarizes the performance of AutoFM and AutoDeepFM by automatically selecting important 2nd- and 3rd-order interactions on the Avazu and Criteo datasets, and Table 3 reports their performance on the Private dataset. We can observe:


For the Avazu dataset, 71% of the 2nd-order interactions can be removed for FM and 76% for DeepFM. Removing those useless interactions not only makes the model faster at inference time (the inference time of AutoFM(2nd) and AutoDeepFM(2nd) is clearly less than that of FM and DeepFM), but also significantly increases the prediction accuracy: the relative performance improvement of AutoFM(2nd) over FM is 0.49% and that of AutoDeepFM(2nd) over DeepFM is 0.20% in terms of AUC. Similar improvements are observed on the other datasets.

For high-order feature interaction selection, only 2%–10% of all the 3rd-order feature interactions need to be included in the model. The inference time of AutoFM(3rd) and AutoDeepFM(3rd) is much less than that of FM(3rd) and DeepFM(3rd), and is comparable to that of FM and DeepFM. Meanwhile, the accuracy is significantly improved by removing unimportant 3rd-order feature interactions, i.e., the relative performance improvement of AutoFM(3rd) over FM(3rd) is 0.22% and that of AutoDeepFM(3rd) over DeepFM(3rd) is 0.20% in terms of AUC on Avazu. Observations on Criteo are similar.

All such performance boosts can be achieved with marginal time cost (for example, it takes 24 minutes and 128 minutes for AutoDeepFM(3rd) to search important 2nd- and 3rd-order feature interactions in Avazu and Criteo with a single GPU card). The same result might take human engineers many hours or days to achieve by identifying such important feature interactions manually.
Note that directly enumerating the 3rd-order feature interactions in FM and DeepFM increases the inference time by about 7 to 12 times, which is unacceptable in industrial applications.
4.4. Transferability of the Selected Feature Interactions (RQ2)
Model  Avazu AUC  Avazu log loss  Avazu time (s)  Criteo AUC  Criteo log loss  Criteo time (s)
IPNN  0.7868  0.3756  0.91  0.8013  0.5401  1.26 
AutoIPNN(2nd)  0.7869  0.3755  0.58  0.8015  0.5399  0.76 
AutoIPNN(3rd)  0.7885*  0.3746*  0.71  0.8019*  0.5392*  0.86 

* denotes statistically significant improvement (measured by t-test with p-value < 0.005).
In this subsection, we investigate whether the feature interactions learned by AutoFM (which is a simple model) can be transferred to state-of-the-art models such as IPNN to boost their performance. As shown in Table 4, using the 2nd-order feature interactions selected by AutoFM (namely AutoIPNN(2nd)) achieves comparable performance to IPNN, with around 30% and 50% of all the interactions in Avazu and Criteo. Moreover, the performance is significantly improved by using both the 2nd- and 3rd-order feature interactions (namely AutoIPNN(3rd)) selected by AutoFM. Both findings verify the transferability of the feature interactions selected by AutoFM.
4.5. The Effectiveness of Feature Interaction Selected by AutoFIS (RQ3)
In this subsection, we discuss the effectiveness of the feature interactions selected by AutoFIS, using experiments on both real and synthetic data.
4.5.1. The Effectiveness of selected feature interaction on Real Data
We define statistics_AUC to represent the importance of a feature interaction to the final prediction. For a given interaction, we construct a predictor that considers only this interaction: the prediction for a test instance is the statistical CTR of the specified feature interaction in the training set (i.e., the fraction of positive instances among the training instances that contain it). The AUC of this predictor is the statistics_AUC of the given feature interaction. A higher statistics_AUC indicates a more important role of this feature interaction in prediction. We then visualize the relationship between statistics_AUC and the $\alpha$ value.
As shown in Figure 3, most of the feature interactions selected by our model (those with a high absolute $\alpha$ value) have high statistics_AUC, but not all feature interactions with high statistics_AUC are selected. This is because the information in these interactions may also exist in other interactions that are selected by our model.
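The single-interaction predictor described above can be sketched in a few lines. This is an illustrative version under assumptions: instances are `(features, label)` tuples, unseen value pairs fall back to the global positive rate (a choice not stated in the text), and the function names are hypothetical.

```python
from collections import defaultdict

def auc(labels, scores):
    """Rank-based AUC, using average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def interaction_auc(train, test, i, j):
    """AUC of a predictor that uses only the (field i, field j) value pair."""
    positives, shows = defaultdict(int), defaultdict(int)
    for x, y in train:
        shows[(x[i], x[j])] += 1
        positives[(x[i], x[j])] += y
    global_rate = sum(y for _, y in train) / len(train)  # fallback (assumption)
    scores = [positives[(x[i], x[j])] / shows[(x[i], x[j])]
              if (x[i], x[j]) in shows else global_rate
              for x, _ in test]
    return auc([y for _, y in test], scores)

# Tiny hypothetical dataset with two categorical fields.
train = [((0, 0), 1), ((0, 0), 1), ((0, 1), 0),
         ((1, 0), 0), ((1, 1), 1), ((1, 1), 0)]
test = [((0, 0), 1), ((0, 1), 0), ((1, 1), 1), ((1, 0), 0)]
print(interaction_auc(train, test, 0, 1))
```

Running this over every field pair would give one score per interaction, which can then be plotted against the learned importance of the same pair.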
Table 5
Model  AUC  log loss
Selected by statistics_AUC  0.7804  0.3794
Selected by AutoFM  0.7831  0.3778
To evaluate the effectiveness of the interactions selected by our model, we also select the top-$N$ interactions based on statistics_AUC ($N$ is the number of second-order feature interactions selected by our model) and retrain the model with these interactions. As shown in Table 5, our model performs much better than the model that uses the interactions selected by statistics_AUC, at the same computational cost.
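The baseline selection in this comparison amounts to a top-N cut on a per-interaction score. A minimal sketch, with hypothetical scores and a hypothetical helper name:

```python
def top_n_interactions(scores, n):
    """Return the n interactions with the highest score (ties keep input order)."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical per-interaction scores keyed by (field_i, field_j).
scores = {(0, 1): 0.71, (0, 2): 0.55, (1, 2): 0.64, (1, 3): 0.52}
n_selected = 2  # would be set to the number of interactions the selector kept
baseline = top_n_interactions(scores, n_selected)
print(baseline)
```

Matching `n_selected` to the number of interactions kept by the learned selector is what keeps the computational cost of the two retrained models equal.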
4.5.2. The Effectiveness of selected feature interaction on Synthetic Data
In this section, we conduct a synthetic experiment to validate the effectiveness of selected feature interaction.
This synthetic dataset is generated from an incomplete poly2 function, where the bilinear terms are analogous to interactions between categories. Based on this dataset, we investigate (i) whether our model can find the important interactions, and (ii) how our model performs compared with other factorization machine models.
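The input $x$ of this dataset is randomly sampled from $N$ categories of $m$ fields. The output $y$ is a binary label that depends on the sum of the linear terms and a subset of the bilinear terms:
$$y = \delta\Big(\sum_{i=1}^{m} w_i x_i + \sum_{(i,j) \in C} v_{i,j} x_i x_j + \epsilon\Big) \tag{16}$$
$$\delta(z) = \begin{cases} 1, & \text{if } z > \text{threshold} \\ 0, & \text{otherwise} \end{cases} \tag{17}$$
The data distribution $p(x)$, the selected bilinear term set $C$, and the weights $w$ and $v$ are randomly sampled and fixed. The data pairs $(x, y)$ are sampled to build the training and test datasets. We also add a small random noise $\epsilon$ to the sampled data. We use FM and our model to fit the synthetic data, and use AUC to evaluate these models on the test dataset.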
We choose to test the effectiveness of our model. Selected bilinear term sets is randomly initialized as . Figure 4 presents the performance comparison between our model and FM, which demonstrates the superiority of our model. As shown in Figure 5, our model extracts the important interactions precisely. The interactions in the selected bilinear term set $C$ have the highest $\alpha$ values, and some unimportant interactions (with $\alpha$ value 0) have been removed.
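A generator in the spirit of this synthetic setup can be sketched as follows. It is an assumption-laden illustration rather than the authors' generator: the field count, category count, and selected pairs are hypothetical defaults, inputs are sampled uniformly instead of from a randomly drawn distribution, weights are Gaussian, and the threshold is set to the median score so that labels are roughly balanced.

```python
import random

def make_synthetic(num_samples, m=6, n_cat=10,
                   pairs=((0, 1), (2, 5), (3, 4)), noise=0.01, seed=0):
    """Binary labels from linear terms plus a subset of bilinear terms."""
    rng = random.Random(seed)
    # One weight per (field, category) and per (selected pair, category pair).
    w = [[rng.gauss(0, 1) for _ in range(n_cat)] for _ in range(m)]
    v = {p: [[rng.gauss(0, 1) for _ in range(n_cat)] for _ in range(n_cat)]
         for p in pairs}
    xs, zs = [], []
    for _ in range(num_samples):
        x = [rng.randrange(n_cat) for _ in range(m)]     # uniform (assumption)
        z = sum(w[f][x[f]] for f in range(m))            # linear terms
        z += sum(v[(i, j)][x[i]][x[j]] for i, j in pairs)  # selected bilinear terms
        z += rng.gauss(0, noise)                         # small random noise
        xs.append(x)
        zs.append(z)
    threshold = sorted(zs)[len(zs) // 2]  # median score (assumption)
    ys = [1 if z > threshold else 0 for z in zs]
    return xs, ys

xs, ys = make_synthetic(num_samples=1000)
print(len(xs), sum(ys))
```

Since only the pairs in `pairs` contribute bilinear terms, a selector that works as intended should assign its largest importance values to exactly those pairs.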
4.6. Deployment & Online Experiments (RQ4)
Online experiments were conducted in the recommender system of Huawei App Store to verify the superior performance of AutoDeepFM. Huawei App Store has hundreds of millions of daily active users, who generate hundreds of billions of user log events every day in the form of implicit feedback such as browsing, clicking, and downloading apps. In the online serving system, hundreds of candidate apps that are most likely to be downloaded by users are selected by a model from the universal app pool. These candidate apps are then ranked by a fine-tuned ranking model (such as DeepFM or AutoDeepFM) before being presented to users. To guarantee user experience, the overall latency of the above-mentioned candidate selection and ranking is required to be within a few milliseconds. To deploy AutoDeepFM, we use a three-node cluster, where each node has a 48-core Intel Xeon CPU E5-2670 (2.30 GHz), 400 GB RAM, and 2 NVIDIA Tesla V100 GPU cards.
Specifically, a ten-day A/B test is conducted in a game recommendation scenario in the App Store. Our baseline in the online experiments is DeepFM, which is a strong baseline due to its extraordinary accuracy and high efficiency, and which has been deployed in the commercial system for a long time.
For the control group, 5% of users are randomly selected and presented with recommendations generated by DeepFM. For the experimental group, 5% of users are presented with recommendations generated by AutoDeepFM.
Figure 6 and Figure 7 show the improvement of the experimental group over the control group in terms of CTR and CVR, respectively. The system is rather stable: both CTR and CVR fluctuate only within a small range during the A/A testing. Our AutoDeepFM model is launched to the live system on Day 8. From Day 8, we observe a significant improvement over the baseline model with respect to both CTR and CVR. The average improvement of CTR is 20.3% and the average improvement of CVR is 20.1% over the ten days of the A/B test. These results demonstrate the effectiveness of our proposed model. From Day 18, we again conduct an A/A test by replacing our AutoDeepFM model with the baseline model in the experimental group. We observe a sharp drop in the performance of the experimental group, which once more verifies that the improvement in online performance in the experimental group is indeed introduced by our proposed model.
4.7. Ablation Study (RQ5)
4.7.1. Stability of $\alpha$ estimation across different seeds
In this part, we conduct experiments to check whether the trained value of $\alpha$ is stable across different random initializations. A stable estimation of $\alpha$ means that the model's decision on which interactions are important is not affected by the random seed. We run the search stage of AutoFM with different seeds on Avazu. The Pearson correlation of the $\alpha$ values estimated from different seeds is around 0.86, which validates that the estimation of $\alpha$ is stable. Without BN on the feature interactions (which is essentially the FwFM model), this Pearson correlation drops to around 0.65.
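A stability check of this kind can be sketched by correlating the learned interaction-importance vectors from different runs. The values below are hypothetical, and averaging the correlation over all pairs of seeds is an assumption, since the text does not say how multiple seeds are aggregated.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def mean_pairwise_correlation(alphas_by_seed):
    """Average Pearson correlation over all pairs of seeds (assumed aggregation)."""
    pairs = [(i, j) for i in range(len(alphas_by_seed))
             for j in range(i + 1, len(alphas_by_seed))]
    total = sum(pearson(alphas_by_seed[i], alphas_by_seed[j]) for i, j in pairs)
    return total / len(pairs)

# Hypothetical importance values for four interactions under three seeds.
alphas_by_seed = [
    [0.9, 0.0, 0.4, 0.1],
    [0.8, 0.1, 0.5, 0.0],
    [1.0, 0.0, 0.3, 0.2],
]
print(mean_pairwise_correlation(alphas_by_seed))
```

A value close to 1 would mean the runs rank the interactions almost identically, which is the sense in which the estimate is called stable.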
4.7.2. Effectiveness of components in AutoFIS
Table 6
Variants  Search stage: AutoFIS  Search stage: Random  Retrain stage: BN  Retrain stage: $\alpha$
AutoFM  ✓  -  ✓  ✓
AutoFM-BN  ✓  -  -  ✓
AutoFM-BN-$\alpha$  ✓  -  -  -
Random+FM  -  ✓  -  -
Table 7
Model  Avazu AUC  Avazu log loss  Criteo AUC  Criteo log loss
FM  0.7793  0.3805  0.7909  0.5500
AutoFM  0.7831  0.3778  0.7974  0.5446
AutoFM-BN  0.7824  0.3783  0.7971  0.5450
AutoFM-BN-$\alpha$  0.7811  0.3793  0.7946  0.5481
Random+FM  0.7781  0.3809  [missing]  0.5486
To validate the effectiveness of the individual components of AutoFIS, we propose several variants, which are enumerated in Table 6. Recall that AutoFIS has two stages: the search stage and the retrain stage. To verify the effectiveness of the search stage, we compare it with a "Random" strategy, which selects feature interactions randomly. Similarly, in the retrain stage, we validate the advantages of BN and $\alpha$. The relationship between the components in the two stages is presented in Table 6, and the performance of the variants is presented in Table 7. Note that for the "Random" strategy, we choose the same number of interactions as AutoFM, try ten different random selections, and average the results. We draw several conclusions:


Comparing AutoFM-BN-$\alpha$ with Random+FM, we can see that selection by AutoFIS always achieves better performance than random selection with the same number of interactions. This demonstrates that important interactions are identified by AutoFIS in the search stage.

The performance gap between Random+FM and FM on the Criteo dataset indicates that random selection of feature interactions may, under some circumstances, outperform the model that keeps all the feature interactions, which supports our statement that removing some useless feature interactions can improve performance.

The comparison between AutoFM and AutoFM-BN validates the effectiveness of BN in the retrain stage; the reason is stated in the "AutoFIS" section.

The performance gap between AutoFM-BN and AutoFM-BN-$\alpha$ shows that $\alpha$ improves performance, as it differentiates the contributions of different feature interactions in the retrain stage.
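The random baseline used in this ablation (same number of interactions, ten draws, averaged) can be sketched as below. The `evaluate` callback stands in for retraining and scoring a model on a given interaction set; it and the toy scoring function are hypothetical placeholders, not part of the paper.

```python
import random
from itertools import combinations

def random_selection(num_fields, num_selected, seed):
    """Draw a random set of second-order field pairs of a fixed size."""
    all_pairs = list(combinations(range(num_fields), 2))
    return set(random.Random(seed).sample(all_pairs, num_selected))

def average_random_baseline(num_fields, num_selected, evaluate, num_trials=10):
    """Average the metric returned by `evaluate` over several random draws."""
    results = [evaluate(random_selection(num_fields, num_selected, seed))
               for seed in range(num_trials)]
    return sum(results) / len(results)

# Toy stand-in for "retrain and measure": fraction of useful pairs recovered.
useful = {(0, 1), (2, 5), (3, 4)}

def toy_evaluate(selected):
    return len(selected & useful) / len(useful)

avg = average_random_baseline(num_fields=6, num_selected=3, evaluate=toy_evaluate)
print(avg)
```

Fixing the number of selected pairs to that of the learned selector keeps the comparison about which interactions are chosen rather than how many.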
Table 8
Model  Avazu AUC  Avazu log loss  Criteo AUC  Criteo log loss
AutoFM  0.7831  0.3778  0.7974  0.5446
BiAutoFM  0.7816  0.3787  0.7957  0.5464
AutoDeepFM  0.7852  0.3765  0.8009  0.5404
BiAutoDeepFM  0.7843  0.3771  0.8002  0.5412
4.7.3. One-level vs. bi-level optimization
In this section, we compare one-level and bi-level optimization on AutoFM and AutoDeepFM; the results are presented in Table 8. The performance gap between AutoFM and BiAutoFM (and between AutoDeepFM and BiAutoDeepFM) demonstrates the superiority of one-level optimization over bi-level optimization; the reason is stated in the "One-level Optimization" section.
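The difference between the two update schedules can be illustrated on a toy scalar model. This sketch is not AutoFIS: it uses plain gradient descent on a model `y_hat = alpha * w * x`, a first-order alternating scheme for the bi-level case, and hypothetical data, and it only shows which data split each parameter sees, not which schedule performs better.

```python
def grads(alpha, w, data):
    """Gradients of mean squared error for the model y_hat = alpha * w * x."""
    g_alpha = g_w = 0.0
    for x, y in data:
        err = alpha * w * x - y
        g_alpha += 2 * err * w * x / len(data)
        g_w += 2 * err * alpha * x / len(data)
    return g_alpha, g_w

def one_level(train, steps=500, lr=0.02):
    """Update alpha and w together, both on the training data."""
    alpha, w = 0.5, 0.5
    for _ in range(steps):
        g_alpha, g_w = grads(alpha, w, train)
        alpha, w = alpha - lr * g_alpha, w - lr * g_w
    return alpha, w

def bi_level(train, valid, steps=500, lr=0.02):
    """Alternate: w on the training split, alpha on the validation split."""
    alpha, w = 0.5, 0.5
    for _ in range(steps):
        _, g_w = grads(alpha, w, train)
        w -= lr * g_w
        g_alpha, _ = grads(alpha, w, valid)
        alpha -= lr * g_alpha
    return alpha, w

# Hypothetical data generated by y = 2 * x.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
valid = [(1.5, 3.0), (2.5, 5.0)]

a1, w1 = one_level(train)
a2, w2 = bi_level(train, valid)
print(a1 * w1, a2 * w2)
```

In the one-level schedule the importance parameter is treated like any other model weight and sees every training example, whereas the bi-level schedule holds out part of the data for it.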
5. Conclusion
In this work, we proposed AutoFIS to automatically select important 2nd- and 3rd-order feature interactions. The proposed methods are generally applicable to all factorization models, and the selected important interactions can be transferred to other deep learning models for CTR prediction. AutoFIS is easy to implement with marginal search cost, and the performance improvement is significant on two benchmark datasets and one private dataset. The proposed methods have been deployed on the training platform of the Huawei App Store recommendation service, with significant economic profit demonstrated.
References
Understanding and simplifying one-shot architecture search. In CVPR.
Latent cross: making use of context in recurrent recommender systems. In WSDM, pp. 46–54.
A generalization of regularized dual averaging and its dynamics. CoRR abs/1909.10072 (2019).
XGBoost: a scalable tree boosting system. In SIGKDD, pp. 785–794.
Wide & deep learning for recommender systems. In DLRS@RecSys.
Deep neural networks for YouTube recommendations. In RecSys, pp. 191–198.
DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI, pp. 1725–1731.
Neural factorization machines for sparse predictive analytics. In SIGIR, pp. 355–364.
Practical lessons from predicting clicks on ads at Facebook. In ADKDD@KDD, pp. 5:1–5:9.
Multilayer feedforward networks are universal approximators. Neural Networks.
Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456.
Field-aware factorization machines for CTR prediction. In RecSys.
3 idiots’ approach for display advertising challenge. Note: https://www.csie.ntu.edu.tw/ r01922136/kaggle2014criteo.pdf
Matrix factorization techniques for recommender systems. IEEE Computer 42 (8), pp. 30–37.
xDeepFM: combining explicit and implicit feature interactions for recommender systems. In KDD.
Feature generation by convolutional neural network for click-through rate prediction. In WWW, pp. 1119–1129.
DARTS: differentiable architecture search. In ICLR.
AutoCross: automatic feature crossing for tabular data in real-world applications. In KDD, pp. 1936–1945.
Ad click prediction: a view from the trenches. In KDD.
Field-weighted factorization machines for click-through rate prediction in display advertising. In WWW, pp. 1349–1357.
Product-based neural networks for user response prediction over multi-field categorical data. ACM Trans. Inf. Syst. 37 (1), pp. 5:1–5:35.
Factorization machines. In ICDM, pp. 995–1000.
Failures of gradient-based deep learning. In ICML, pp. 3067–3075.
Deep & cross network for ad click predictions. In ADKDD@KDD, pp. 12.
Attentional factorization machines: learning the weight of feature interactions via attention networks. In IJCAI, pp. 3119–3125.
Efficient neural interaction function search for collaborative filtering. In WWW.
Deep interest network for click-through rate prediction. In KDD, pp. 1059–1068.
Appendix A Parameter Settings
For the Avazu and Criteo datasets, the parameters of the baseline models are set following (Qu et al., 2019). For AutoFM and AutoDeepFM, we use the same hyperparameters as the base models (i.e., FM and DeepFM, respectively), except for the extra ones specific to AutoFM and AutoDeepFM.
Model  Avazu  Criteo
General  bs=2000, opt=Adam, lr=1e-3  bs=2000, opt=Adam, lr=1e-3
GBDT+LR  #tree=50, #child=2048  #tree=80, #child=1024
GBDT+FFM  #tree=50, #child=1024  #tree=20, #child=512
FM  k=40  k=20
FwFM  k=40, wt_init=0.7, wt_l1=1e-8, wt_l2=1e-7  k=20, wt_init=0.7, wt_l1=0, wt_l2=1e-7
FFM  k=4  k=4
AFM  k=40, t=1, h=256, l2_a=0  k=20, t=0.01, h=32, l2_a=0.1
DeepFM  k=40, net=[700×5, 1], l2=0, drop=1, BN=True  k=20, net=[700×5, 1], l2=0, drop=1, BN=True
AutoDeepFM  c=0.0005, mu=0.8  c=0.0005, mu=0.8
AutoFM  c=0.005, mu=0.6  c=0.0005, mu=0.8

Note: bs=batch size, opt=optimizer, lr=learning rate, k=embedding size, wt_init=initial value for the interaction weights $\alpha$, wt_l1=L1 regularization on $\alpha$, wt_l2=L2 regularization on $\alpha$, t=softmax temperature, l2_a=L2 regularization on the attention network, net=MLP structure, LN=layer normalization, BN=batch normalization; c and mu are parameters of the GRDA optimizer.