1 Introduction
Data representation learning bengio2013representation is an important yet challenging task in the machine learning community. Recently, deep neural networks (DNNs) have made impressive progress in representation learning for various types of data, including speech, images, and natural language, where dedicated networks such as RNNs gers2002learning ; cho2014learning , CNNs krizhevsky2012imagenet and Transformers vaswani2017attention are effective at learning compact representations that improve predictive performance. However, many real-world applications, such as click-through rate prediction in recommendation and online advertising, still rely on feature engineering to explore the higher-order feature space, where combinatorial features are crafted. For example, a 3-order interactive feature “Gender ⊗ Age ⊗ Income” can be discriminative for the types of recommended commodities. Such high-order interactive features introduce nonlinearity to the lightweight models used in practice, with a negligible decrease in their inference speed. Besides, they offer strong interpretability, which DNNs lack due to their black-box nature.
In practice, generating interactive features requires evaluating a large number of candidates and comparing their performance. Since a brute-force search is intractable, traditional interactive feature generation relies heavily on the experience and knowledge of domain experts, which requires time-consuming and task-specific effort. This motivates automatic feature generation katz2016explorekit ; kaul2017autolearn , one major topic of Automated Machine Learning (AutoML) hutter2019automated , which has attracted increasing attention from both academia and industry.
Recent works on automatic feature generation can be roughly divided into two categories: search-based kanter2015deep ; katz2016explorekit ; luo2019autocross and DNN-based rendle2010factorization ; juan2016field ; shen2016deepcross ; ijcai2017deep ; lian2018xdeepfm ; xu2018how ; li2019fi ; song2019autoint ; liu2020autofis . The search-based methods focus on designing search strategies that prune as many of the candidates to be evaluated as possible, while aiming to keep the most useful interactive features. For example, ExploreKit katz2016explorekit evaluates only the top-ranked candidates, where the feature ranker has been pre-trained on other datasets. AutoCross luo2019autocross incrementally selects the optimal candidate from the pairwise interactions of current features. Although these mechanisms trim the search space to be traversed, their trial-and-error nature means that the time and computing resources they need are usually prohibitive in practice. On the other hand, the DNN-based methods design specific neural architectures to express the interactions among different features. For example, AutoInt song2019autoint and FiGNN li2019fi exploit the self-attention mechanism vaswani2017attention to weight the features in their synthesis, and they achieve leading performance with just a one-shot training course. But this advantage comes at the cost of implicit feature interactions, as it is hard to interpret exactly which interactive features are useful from the attention weights. In fact, explicitly expressed interactive features are in great demand, since they can be used to train lightweight predictive models that satisfy the requirement of real-time inference.
To possess both of their merits, that is, the interpretability of search-based methods and the efficiency of DNN-based methods, we propose a novel method for automatic feature generation, called Feature Interaction Via Edge Search (FIVES). First, we propose to inductively search for an optimal collection of $(k+1)$-order features from the interactions between the generated $k$-order features and the original features. A theoretical analysis (see Proposition 1) is provided to explain the intuition behind this search strategy: informative interactive features tend to come from informative lower-order ones. Then we instantiate this inductive search strategy by modeling the features as a feature graph and expressing the interactions between nodes as propagating the graph signal via a designed graph neural network (GNN). In the defined feature graph, the layer-wise adjacency matrix determines which original features should be selected to interact with which of the features generated by the previous layer. In this way, we formulate the task of interactive feature generation as edge search, i.e., learning the adjacency tensor. Inspired by differentiable neural architecture search (NAS) liu2018darts ; noy2019asap , we solve the edge search problem by alternately updating the adjacency tensor and the predictive model, which transforms the trial-and-error procedure into a differentiable NAS problem. In particular, the learned adjacency tensor can explicitly indicate which interactive features are useful. We also parameterize the adjacency matrices in a recursive way, where the dependencies are conceptually required by our search strategy. Such dependencies are ignored by most NAS methods zhou2019bayesnas . To validate the proposed method, we compare FIVES with both search-based and DNN-based methods on five benchmark datasets and two real-world business datasets. The experimental results demonstrate the advantages of FIVES, as both an end-to-end predictive model and a feature generator for other lightweight predictive models.
2 Methodology
2.1 Preliminary
We consider the ubiquitous tabular data where each column represents a feature and each row represents a sample. Without loss of generality, we assume that numeric features have been discretized and thus all considered features are categorical; for example, the features “user city” and “user age” each take values from a finite set of categories. Based on these, we first define the high-order interactive feature:
Definition 1.
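Given $m$ original features $F = \{f_1, f_2, \dots, f_m\}$, a $k$-order ($1 \le k \le m$) interactive feature $c^k$ can be represented as the Cartesian product of $k$ distinct original features, $c^k = f_{c_1} \otimes f_{c_2} \otimes \cdots \otimes f_{c_k}$, where each feature $f_{c_i}$ is selected from $F$.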
Since the interactive features bring in nonlinearity, e.g., the Cartesian product of two binary features enables a linear model to predict the label of their XOR relation, they are widely adopted to improve the performance of machine learning methods on tabular data. The goal of interactive feature generation is to find a set of such interactive features.
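The XOR remark can be checked with a small experiment. The sketch below is illustrative only: the helper names (`cross`, `train_perceptron`, `accuracy`) are hypothetical, and a plain perceptron stands in for the linear model. It trains the same learner on the two raw binary features and on the one-hot encoding of their Cartesian product.

```python
from itertools import product

def cross(x1, x2):
    """One-hot encode the Cartesian product of two binary features (4 categories)."""
    onehot = [0, 0, 0, 0]
    onehot[2 * x1 + x2] = 1
    return onehot

def train_perceptron(rows, labels, epochs=50):
    """Plain perceptron with a bias term; returns (weights, bias)."""
    w, b = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:
                step = y - pred  # +1 or -1
                w = [wi + step * xi for wi, xi in zip(w, x)]
                b += step
    return w, b

def accuracy(rows, labels, w, b):
    """Fraction of rows classified correctly by the linear rule w.x + b > 0."""
    preds = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0 for x in rows]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

inputs = list(product([0, 1], repeat=2))
labels = [x1 ^ x2 for x1, x2 in inputs]          # XOR target

raw_rows = [list(x) for x in inputs]             # original features only
crossed_rows = [cross(x1, x2) for x1, x2 in inputs]  # interactive feature

raw_acc = accuracy(raw_rows, labels, *train_perceptron(raw_rows, labels))
crossed_acc = accuracy(crossed_rows, labels, *train_perceptron(crossed_rows, labels))
print(f"raw features: {raw_acc:.2f}, crossed feature: {crossed_acc:.2f}")
```

On the raw features no linear rule can classify all four points, whereas the crossed feature gives each input combination its own weight, so the linear model fits XOR exactly.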
As the number of all possible interactions from $m$ original features is $O(2^m)$, it is challenging to search for the optimal set of interactive features in such a huge space, not to mention that evaluating a proposed feature can be costly and noisy. Thus some recent methods model the search procedure as the optimization of designed DNNs, transforming it into one-shot training at the cost of interpretability. In the rest of this section, we present our proposed method, FIVES, which provides explicit interaction results via an efficient differentiable search.
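To give a sense of scale, the size of the candidate space can be tallied directly. This is a rough sketch: the function name `num_interactions` is hypothetical, and it simply counts subsets of at least two distinct features, using the 39 features reported for Criteo in Table 1 as an example.

```python
from math import comb

def num_interactions(m, max_order=None):
    """Count interactive features of order 2..max_order over m original features."""
    max_order = m if max_order is None else max_order
    return sum(comb(m, k) for k in range(2, max_order + 1))

criteo_all = num_interactions(39)        # every order from 2 to 39
criteo_up_to_3 = num_interactions(39, 3) # only 2-order and 3-order candidates
print(criteo_all, criteo_up_to_3)
```

Even restricting the search to low orders leaves thousands of candidates, each of which would need its own training and evaluation run under a brute-force approach.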
2.2 Search Strategy
As mentioned above, exhaustive traversal of the exponentially growing interactive feature space is intractable. By Definition 1, any $k$-order interactive feature can be regarded as the interaction (i.e., Cartesian product) of several lower-order features, with many choices of decomposition. This raises a question: can we solve the task of generating interactive features in a bottom-up manner? To be specific, can we generate interactive features inductively, that is, by searching for a group of informative $(k+1)$-order features from the interactions between the original features and the group of $k$-order features identified in the previous step? We present the following proposition to provide theoretical evidence for discussing this question.
Proposition 1.
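Let $X_1$, $X_2$ and $Y$ be Bernoulli random variables with a joint conditional probability mass function $p_{x_1, x_2 | y} := P(X_1 = x_1, X_2 = x_2 \mid Y = y)$ such that $x_1, x_2, y \in \{0, 1\}$. Suppose further that the mutual information between $X_i$ and $Y$ satisfies $I(X_i; Y) < L$, where $i \in \{1, 2\}$ and $L$ is a non-negative constant. If $X_1$ and $X_2$ are weakly correlated given $y \in \{0, 1\}$, that is, $|\mathrm{Cov}(X_1, X_2 \mid Y = y)| \le \rho$, we have
$$I(X_1 X_2; Y) < 2L + \log(2\rho^2 + 1). \tag{1}$$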
We defer the proof to the supplementary material. Specifically, the random variables $X_i$ and $Y$ stand for the $i$-th feature and the label respectively, and the joint of the $X_i$ stands for their interaction. Recall that: 1) as the considered raw features are categorical, modeling each feature as a Bernoulli random variable does not sacrifice much generality; 2) in practice, the raw features are often preprocessed to remove redundant ones, so the weak correlation assumption holds. Based on these, our proposition indicates that, for small $\rho$, the bound in Eq. (1) is governed by the bound $L$ on the individual features, so the information gain introduced by the interaction of features is at most that of the individuals. This proposition can therefore be interpreted as follows: under practical assumptions, it is unlikely that an informative feature can be constructed from the interaction of uninformative ones.
This proposition supports the bottom-up search strategy, as lower-order features that have not been identified as informative are less likely to be useful building blocks of high-order features. Besides, the identified $k$-order features are recursively constructed from the identified lower-order ones, and thus they are likely to include sufficient information for generating informative $(k+1)$-order features. We also empirically validate this strategy in Section 3.2 and Section 3.3. Although this inductive search strategy cannot guarantee that all useful interactive features are generated, the interactive features generated in this way are likely to be useful, based on the above proposition. This can be regarded as a trade-off between the usefulness of the generated interactive features and their completeness, under the constraint of limited computation resources.
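The intuition can be checked numerically in a special case. The sketch below is not a proof of the proposition: it assumes a hypothetical distribution in which the two features are conditionally independent given the label (zero conditional covariance), and all probabilities and helper names are made up for illustration. In that case the mutual information of the interaction cannot exceed the sum of the individual mutual informations.

```python
from itertools import product
from math import log2

def mutual_information(joint):
    """I(A;B) in bits for a joint distribution given as {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b])) for (a, b), p in joint.items() if p > 0)

def bern(p_one, x):
    """Probability that a Bernoulli(p_one) variable equals x."""
    return p_one if x == 1 else 1.0 - p_one

# Hypothetical, weakly informative features (assumed values, not from the paper).
p_y = {0: 0.5, 1: 0.5}
p_x1_given_y = {0: 0.45, 1: 0.55}  # P(X1 = 1 | Y = y)
p_x2_given_y = {0: 0.40, 1: 0.60}  # P(X2 = 1 | Y = y)

# Joint over (x1, x2, y) with X1 and X2 conditionally independent given Y.
joint = {
    (x1, x2, y): p_y[y] * bern(p_x1_given_y[y], x1) * bern(p_x2_given_y[y], x2)
    for x1, x2, y in product([0, 1], repeat=3)
}

def marginal(key):
    """Collapse the joint onto pairs (key(x1, x2), y)."""
    out = {}
    for (x1, x2, y), p in joint.items():
        k = (key(x1, x2), y)
        out[k] = out.get(k, 0.0) + p
    return out

i_x1 = mutual_information(marginal(lambda x1, x2: x1))
i_x2 = mutual_information(marginal(lambda x1, x2: x2))
i_cross = mutual_information(marginal(lambda x1, x2: (x1, x2)))
print(f"I(X1;Y)={i_x1:.4f}  I(X2;Y)={i_x2:.4f}  I(X1X2;Y)={i_cross:.4f}")
```

Making the individual features less informative (moving the conditional probabilities toward 0.5) shrinks the mutual information of their interaction as well, which is the behavior the bottom-up search relies on.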
2.3 Modeling
To instantiate our inductive search strategy, we conceptually regard the original features as a feature graph and model the interactions among features by a designed GNN.
First, we denote the feature graph as $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where each node $n_i \in \mathcal{N}$ corresponds to a feature $f_i$ and each edge $e_{i,j} \in \mathcal{E}$ indicates an interaction between node $n_i$ and node $n_j$. We use $n_i^{(0)}$ as the initial node representation for node $n_i$, which conventionally takes as its value the embedding looked up by $f_i$ from the feature embedding matrix. It is easy to show that, by applying a vanilla graph convolutional operator to such a graph, the output is capable of expressing the 2-order interactive features. However, gradually propagating the node representations with only one adjacency matrix fails to express higher-order ($k > 2$) interactions. Thus, to generate interactive features of order at most $K$, we extend the feature graph by defining an adjacency tensor $A \in \{0,1\}^{K \times m \times m}$ to indicate the interactions among features at each order, where each slice $A^{(k)} \in \{0,1\}^{m \times m}$, $k = 0, \dots, K-1$, represents a layer-wise adjacency matrix and $m$ is the number of original features (nodes). Once an entry $A^{(k)}_{i,j}$, $k \ge 1$, is active, we intend to generate a $(k+1)$-order feature based on node $n_i$ by $n_i^{(k-1)} \otimes n_j^{(0)}$ and synthesize these into $n_i^{(k)}$. Formally, with an adjacency tensor $A$, our dedicated graph convolutional operator produces the node representations layer by layer, in the following way:
$$n_i^{(k)} = p_i^{(k)} \odot n_i^{(k-1)}, \qquad p_i^{(k)} = \mathrm{MEAN}_{j \,|\, A^{(k)}_{i,j} = 1} \{ W_j n_j^{(0)} \}. \tag{2}$$
Here “MEAN” is adopted as the aggregator and $W_j$ is the transformation matrix for node $n_j$. Assuming that the capacity of our GNN and embedding matrix is sufficient for $n_i^{(k-1)} \odot W_j n_j^{(0)}$ to express $n_i^{(k-1)} \otimes n_j^{(0)}$, we can show that the node representation at the $k$-th layer corresponds to the generated $(k+1)$-order interactive features:
$$n_i^{(k)} \approx \mathrm{MEAN}_{(c_1, \dots, c_k) \,|\, A^{(j)}_{i, c_j} = 1,\ j = 1, \dots, k} \{ f_{c_1} \otimes \cdots \otimes f_{c_k} \otimes f_i \},$$
where the choices of which features should be combined are determined by the adjacency tensor $A$.
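The layer-wise propagation can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that each layer averages the transformed initial representations of the neighbours marked active in that layer's adjacency slice and combines the result with the node's previous representation by an element-wise product. The function name `propagate`, the array shapes, and the toy inputs are all hypothetical.

```python
import numpy as np

def propagate(n0, weights, adjacency):
    """Layer-wise propagation sketch.

    n0:        (m, d) initial node representations (feature embeddings)
    weights:   (m, d, d) one transformation matrix per node
    adjacency: (K, m, m) binary layer-wise adjacency slices
    Returns a list [n^(0), n^(1), ..., n^(K)] of (m, d) arrays.
    """
    # W_j n_j^(0) for every node j
    transformed = np.einsum("jde,je->jd", weights, n0)
    layers, prev = [n0], n0
    for a_k in adjacency:
        degree = a_k.sum(axis=1, keepdims=True)
        # MEAN over active neighbours; nodes without neighbours receive zeros
        mean_msg = (a_k @ transformed) / np.maximum(degree, 1)
        # Assumed combination rule: element-wise product with the previous layer
        prev = prev * mean_msg
        layers.append(prev)
    return layers

# Toy example: 3 features, 2-dimensional embeddings, identity transformations.
n0 = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
weights = np.stack([np.eye(2)] * 3)
adjacency = np.zeros((1, 3, 3))
adjacency[0, 0, 1] = 1.0  # node 0 interacts with feature 1 in the first layer
layers = propagate(n0, weights, adjacency)
print(layers[1])
```

With identity transformations, the first-layer representation of node 0 is simply the element-wise product of the embeddings of features 0 and 1, which is how the sketch stands in for their 2-order interaction.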
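As shown above, the feature graph and the associated GNN are capable of conceptually expressing our inductive search strategy. Thus, from the perspective of the feature graph, the task of generating interactive features is equivalent to learning an optimal adjacency tensor $A$, which we call edge search in our study. In order to evaluate the quality of the generated features, i.e., the learned adjacency tensor $A$, we apply a linear output layer to the concatenation of node representations at each layer:
$$\hat{y}^{(k)} = \sigma\big(W^{(k)} [n_1^{(k)} : \cdots : n_m^{(k)}] + b^{(k)}\big), \tag{3}$$
where $W^{(k)}$ and $b^{(k)}$ are the projection matrix and bias term respectively, and $\sigma(\cdot)$ denotes the sigmoid function. We pack all the parameters as $\Theta$, which comprises the feature embedding matrix, the transformation matrices $W_j$, and the output-layer parameters $W^{(k)}$ and $b^{(k)}$. Then we define a cross-entropy loss function for the joint of $A$ and $\Theta$:
$$\mathcal{L}(\mathcal{D} \mid A, \Theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x_i, y_i) \in \mathcal{D}} \sum_{k=0}^{K-1} \Big( y_i \log \hat{y}_i^{(k)} + (1 - y_i) \log\big(1 - \hat{y}_i^{(k)}\big) \Big), \tag{4}$$
where $\mathcal{D}$ is the considered dataset, $y_i$ denotes the ground truth label of the $i$-th instance from $\mathcal{D}$, and $\hat{y}_i^{(k)}$ denotes its prediction based on node representations from the $k$-th layer.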
Eventually, the edge search task can be formulated as a bilevel optimization problem:
$$\min_{A} \ \mathcal{L}\big(\mathcal{D}_{\mathrm{val}} \mid A, \Theta(A)\big) \quad \text{s.t.} \quad \Theta(A) = \arg\min_{\Theta} \mathcal{L}\big(\mathcal{D}_{\mathrm{train}} \mid A, \Theta\big), \tag{5}$$
where $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{val}}$ denote the training and the validation dataset respectively. Such a nested formulation in Eq. (5) has also been studied recently in differentiable NAS liu2018darts ; cai2018proxylessnas ; noy2019asap , where the architecture parameters ($A$ in our formulation) and network parameters ($\Theta$ in our formulation) are alternately updated during the search procedure. However, most NAS methods ignore the dependencies among architecture parameters zhou2019bayesnas , which are critical for our task as the higher-order interactive features are generated based on the choices of previous layers.
2.4 Differentiable Search
Directly solving the optimization problem in Eq. (5) is intractable because of its bilevel nature and the binary values of $A$. Existing methods AutoInt song2019autoint and FiGNN li2019fi tackle the issue of binary $A$ by calculating it on the fly, that is, the set of features to be aggregated for interaction is determined by a self-attention layer. This solution enables efficient optimization, but the attention weights change dynamically from sample to sample. Thus it is hard to interpret these attention weights to know which interactive features should be generated. On the other hand, there are some straightforward ways to learn a stationary adjacency tensor. To be specific, we can regard the entries of $A$ as Bernoulli random variables parameterized by $H$. Then, in the forward phase, we sample each slice $A^{(k)}$ accordingly; and in the backward phase, we update $H$ based on straight-through estimation (STE) bengio2013estimating . This technique has also been adopted for solving problems like Eq. (5) in some NAS studies cai2018proxylessnas .
Following the search strategy in Section 2.2, the adjacency tensor $A$ should be determined slice by slice from $A^{(0)}$ to $A^{(K-1)}$. In addition, since which $(k+1)$-order features should be generated depends on which $k$-order features have been generated, the optimization of $A^{(k)}$ should be conditioned on $A^{(k-1)}$. Our inductive search strategy is precisely instantiated only when such dependencies are modeled. Thus, we parameterize the adjacency tensor $A$ by $H \in \mathbb{R}^{K \times m \times m}$ in this recursive way:
$$A^{(k)} \triangleq \varphi\big((D^{(k-1)})^{-1} A^{(k-1)} \sigma(H^{(k)})\big), \qquad A^{(0)} \triangleq I, \tag{6}$$
where $\varphi(\cdot)$ is a binarization function with a tunable threshold, and $D^{(k-1)}$ is the degree matrix of $A^{(k-1)}$ serving as a normalizer.
We illustrate the intuition behind Eq. (6) in Figure 1. Specifically, if we regard each $k$-order interactive feature, e.g., $f_{c_1} \otimes f_{c_2} \otimes \cdots \otimes f_{c_k}$, as a $k$-hop path jumping from $f_{c_1}$ to $f_{c_k}$, then $A^{(k)}$ can be treated as a binary sample drawn from the $k$-hop transition matrix, where $A^{(k)}_{i,j}$ indicates the $k$-hop visibility (or accessibility) from $f_i$ to $f_j$. Conventionally, the transition matrix of $k$-hop can be calculated by multiplying that of $(k-1)$-hop with the normalized adjacency matrix. Following the motivation of defining an adjacency tensor such that the topological structures at different layers tend to vary from each other, we design $\sigma(H^{(k)})$ as a layer-wise (unnormalized) transition matrix. In this way, with Eq. (6), we make the adjacency matrix of each layer depend on that of the previous layer, which exactly instantiates our search strategy.
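The recursive parameterization can be illustrated with a short NumPy sketch. It rests on assumptions that should be read as such: the first slice is taken to be the identity matrix, each later slice is obtained by multiplying the degree-normalized previous slice with the sigmoid of that layer's parameters and then thresholding, and the threshold value 0.5, the function names, and the toy parameters are all hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recursive_adjacency(h, threshold=0.5):
    """Build binary adjacency slices recursively from real-valued parameters.

    h: (K, m, m) real-valued parameters, one slice per layer.
    Returns a (K + 1, m, m) array whose slice 0 is the identity matrix.
    """
    m = h.shape[1]
    prev = np.eye(m)
    slices = [prev]
    for h_k in h:
        # Degree normalizer of the previous slice (guarded against empty rows)
        degree = np.maximum(prev.sum(axis=1, keepdims=True), 1.0)
        # Previous (normalized) slice times the layer-wise soft transition matrix
        score = (prev / degree) @ sigmoid(h_k)
        prev = (score > threshold).astype(float)  # binarization
        slices.append(prev)
    return np.stack(slices)

# Toy parameters for 3 features: strongly negative everywhere except two entries.
h = np.full((2, 3, 3), -10.0)
h[0, 0, 1] = 10.0  # layer 1: node 0 reaches feature 1
h[1, 1, 2] = 10.0  # layer 2: a transition from feature 1 to feature 2
adj = recursive_adjacency(h)
print(adj[1], adj[2], sep="\n")
```

In this toy run the second-layer entry for (node 0, feature 2) becomes active only because the first layer connected node 0 to feature 1, even though the layer-2 parameter at that position is strongly negative. This is the dependence between consecutive slices that an independent parameterization would miss.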
Since there is a binarization function in Eq. (6), $H$ cannot be directly optimized in a differentiable way w.r.t. the loss function in Eq. (4). One possible solution is to use policy gradient sutton2000policy , the Gumbel-max trick jang2017categorical , or other approximations nayman2019xnas . However, these can be inefficient when the action space (here, the set of possible interactions) is too large.
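To make the optimization more efficient, we allow a soft $A^{(k)}$ to be used for propagation at the $k$-th layer, while the calculation of $A^{(k)}$ still depends on a binarized $A^{(k-1)}$:
$$A^{(k)} \triangleq (D^{(k-1)})^{-1} \varphi\big(A^{(k-1)}\big) \sigma(H^{(k)}), \qquad A^{(0)} \triangleq I. \tag{7}$$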
However, the entries of an optimized $A^{(k)}$ may still lie near the borderline between the two binary decisions. When we use these borderline values for generating interactive features, the gap between the binary decisions and the learned $H$ cannot be neglected noy2019asap . To fill this gap, we rescale each entry of $H^{(k)}$ by dividing it by a temperature $\tau$ before using it for propagation. As $\tau$ is annealed to a small value along the search phase, the rescaled value becomes close to 0 or 1. We illustrate how this mechanism works in the supplementary material.
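The effect of the temperature can be seen on a single borderline entry. The sketch below assumes that the rescaled value is the sigmoid of the parameter divided by the temperature; the function name, the example parameter values, and the annealing schedule are hypothetical and chosen only for illustration.

```python
import math

def rescaled_sigmoid(h, tau):
    """Sigmoid of a parameter entry h after dividing it by the temperature tau."""
    return 1.0 / (1.0 + math.exp(-h / tau))

borderline = [0.1, -0.1]          # parameters whose sigmoid is close to 0.5
schedule = [1.0, 0.5, 0.1, 0.02]  # hypothetical annealing schedule

# For each temperature, the rescaled values of the two borderline entries.
trace = {tau: [rescaled_sigmoid(h, tau) for h in borderline] for tau in schedule}
for tau, values in trace.items():
    print(f"tau={tau:<4}  " + "  ".join(f"{v:.4f}" for v in values))
```

At temperature 1 both entries sit near 0.5, so a soft propagation would weight the corresponding interactions almost equally. As the temperature shrinks, the sign of the parameter decides the outcome and the values approach 1 and 0 respectively.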
Finally, our modeling allows us to solve the optimization problem in Eq. (5) with gradient descent. The whole optimization algorithm is summarized in Algorithm 1. The model learned by Algorithm 1 can serve as a predictive model on its own. Moreover, we can specify layer-wise thresholds for binarizing the learned adjacency tensor and inductively derive the suggested useful $k$-order ($1 \le k \le K$) interactive features.
3 Experiments
We conduct a series of experiments to demonstrate the effectiveness of the proposed FIVES method, aiming to answer the following questions. Q1: When the learned model of FIVES serves solely as a predictive model, how does it perform compared to state-of-the-art feature generation methods? Q2: Can we boost the performance of lightweight models with the interactive features generated by FIVES? Q3: Are the interactions indicated by the learned adjacency tensor really useful? Q4: How do the different components of FIVES contribute to its performance?
Datasets. We conduct experiments on five benchmark datasets that are widely adopted in related works, and we also include two more real-world business datasets. The statistics of these datasets are summarized in Table 1, where the data are randomly partitioned to ensure a fair comparison.
| Statistic | Employee | Bank | Adult | Credit | Criteo | Business1 | Business2 |
|---|---|---|---|---|---|---|---|
| # Features | 9 | 20 | 42 | 16 | 39 | 53 | 59 |
| # Train | 29,493 | 27,459 | 32,561 | 100,000 | 41,256K | 1,572K | 25,078K |
| # Test | 3,278 | 13,729 | 16,281 | 50,000 | 4,584K | 673K | 12,537K |
Preprocessing. We discretize numeric features into equal-width buckets. Then the numeric features are transformed into one-hot vector representations according to their bucket indices. This follows the multi-granularity discretization proposed in luo2019autocross . All rare feature category values (those whose frequency falls below a threshold) are assigned the same identifier.
Metric. Following existing works, we use AUC to evaluate the predictive performance. A higher AUC indicates better performance. As has been pointed out in previous studies cheng2016wide ; luo2019autocross ; song2019autoint , a small improvement (at the 0.001 level) in offline AUC evaluation can make a significant difference in real-world business predictive tasks such as CTR prediction in advertising.
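For readers less familiar with the metric, AUC equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name `auc` and the toy labels and scores are hypothetical, and this quadratic-time version is meant for illustration rather than for datasets of the sizes in Table 1.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

example_auc = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

Because the metric depends only on the ranking of scores, a difference at the 0.001 level corresponds to a small fraction of positive-negative pairs changing order, which can still matter when the number of such pairs is very large.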
3.1 FIVES as a predictive model (Q1)
As mentioned in Section 2.4, the learned model of FIVES is a predictive model by itself. We adopt the following methods as baselines, including those frequently used in practical recommender systems and state-of-the-art feature generation methods: (1) LR: Logistic Regression with only the original features (more settings of LR are given in the next part). (2) DNN: the standard deep neural network with fully connected layers in cascade and an output layer with a sigmoid function. (3) FM rendle2010factorization : the factorization machine uses the inner product of two original features to express their interactions. (4) Wide&Deep cheng2016wide : this method jointly trains wide linear models and deep neural networks. (5) AutoInt song2019autoint : a DNN-based feature generation method, in which a multi-head self-attentive neural network with residual connections is proposed to model feature interactions. (6) FiGNN li2019fi : it represents the multi-field features as a graph structure for the first time, and models the interactions of features as attentional edge weights. The implementation details of these methods can be found in the supplementary material.
After hyperparameter optimization for all methods (see the supplementary material for details), we use the optimal configuration to run each method 10 times and conduct an independent t-test between the results of FIVES and the strongest baseline method to assess significance. The experimental results are summarized in Table 2.
| Method | Employee | Bank | Adult | Credit | Criteo | Business1 | Business2 |
|---|---|---|---|---|---|---|---|
| LR | 0.8353 | 0.9377 | 0.8836 | 0.8262 | 0.7898 | 0.6912 | 0.7121 |
| DNN | 0.8510 | 0.9435 | 0.8869 | 0.8292 | 0.7779 | 0.6927 | 0.6841 |
| FM | 0.8473 | 0.9434 | 0.8847 | 0.8278 | 0.7836 | 0.6888 | 0.7152 |
| Wide&Deep | 0.8484 | 0.9416 | 0.8870 | 0.8299 | 0.7710 | 0.6941 | 0.7128 |
| AutoInt | 0.8397 | 0.9393 | 0.8869 | 0.8301 | 0.7993 | 0.6960 | 0.7237 |
| FiGNN | 0.7968 | 0.9417 | 0.8813 | 0.8276 | 0.7702 | 0.6882 | 0.7103 |
| FIVES | 0.8536 | 0.9446 | 0.8863 | 0.8307 | 0.8006 | 0.6984 | 0.7276 |
The experimental results demonstrate that FIVES improves the performance over the baseline methods on most datasets. Especially on the large-scale datasets, FIVES outperforms all baseline methods by a considerable margin. As the models are trained offline, we leave the efficiency comparisons to the supplementary material.
3.2 FIVES as a feature generator (Q2)
In practice, the generated interactive features are often used to augment the original features, and then all of them are fed into a lightweight model (e.g., LR) to meet the requirement of inference speed. As mentioned above, we can explicitly derive useful interactive features from the learned adjacency tensor $A$. We call this method FIVES+LR and compare it with the following methods: (1) Random+LR: LR with the original features and randomly selected interactive features; (2) CMI+LR: conditional mutual information (CMI) is used as a filter to select useful interactive features from the candidate interactions; (3) AutoCross+LR luo2019autocross : a recent search-based method, which performs beam search in a tree-structured space. We again run each experiment 10 times and analyze the results with a t-test to draw statistically significant conclusions. The results are summarized in Table 3.
| Method | Employee | Bank | Adult | Credit | Criteo | Business1 | Business2 |
|---|---|---|---|---|---|---|---|
| LR | 0.8353 | 0.9377 | 0.8836 | 0.8262 | 0.7898 | 0.6912 | 0.7121 |
| Random+LR | 0.8255 | 0.9373 | 0.8777 | 0.8258 | 0.7804 | 0.6927 | 0.7137 |
| CMI+LR | 0.8423 | 0.9370 | 0.8780 | 0.8264 | 0.7728 | 0.6941 | 0.7253 |
| AutoCross+LR | 0.8529 | 0.9393 | 0.8771 | 0.8274 | 0.7902 | 0.6916 | 0.7122 |
| FIVES+LR | 0.8532 | 0.9378 | 0.8850 | 0.8274 | 0.7924 | 0.6946 | 0.7257 |
On all the datasets, the interactive features indicated by FIVES consistently improve the performance of LR. The improvements are larger than those of the other search-based method on most datasets, and are particularly significant on the largest one (i.e., Criteo). The degradation of some reference methods may be caused by redundant features that make the optimization of LR harder.
3.3 Usefulness of generated interactive features (Q3)
The above experiments evaluate the usefulness of the interactive features generated by FIVES from the perspective of augmenting the feature set for lightweight models. Here, we further evaluate the usefulness of the generated interactive features from additional perspectives.
First, we can directly calculate the AUC of making predictions with each interactive feature alone. Then we plot the features as points in Fig. 2, with the values of their corresponding entries in the learned adjacency tensor on the x-axis and the corresponding AUC values on the y-axis, which illustrates a positive correlation between the entry value and the feature’s AUC. This correlation confirms that the generated interactive features are indeed useful.
Another way to assess the usefulness of the generated interactive features is to compare different choices of the adjacency tensor $A$. From the view of NAS, $A$ acts as the architecture parameters. If it represents a useful architecture, we should achieve satisfactory performance by tuning the model parameters $\Theta$ w.r.t. it. Specifically, we can fix a well-learned $A$ and learn just the predictive model parameters $\Theta$. We consider the following settings: (1) $\Theta$ is learned from scratch (denoted as LFS); (2) $\Theta$ is fine-tuned with the output of Algorithm 1 as its initialization (denoted as FT). Besides, we also show the performance of a random architecture as the baseline (denoted as Random).
The experimental results in Figure 3 show the advantage of the searched architecture over a random one. Since the architecture parameter $A$ indicates which interactive features should be generated, the effectiveness of the searched architecture confirms the usefulness of the generated interactive features. In contrast, randomly generated interactions can degrade the performance, especially when the original features are relatively scarce, e.g., on the Employee dataset. Meanwhile, the improvement from “FIVES” to “FT” shows that, given a searched architecture (i.e., a learned $A$), the predictive performance can be improved further by fine-tuning $\Theta$.
3.4 Contributions of different components (Q4)
3.4.1 Modeling the dependencies between adjacency matrices
As zhou2019bayesnas pointed out, most existing NAS works ignore the dependencies between architecture parameters, which are indispensable for precisely expressing our inductive search strategy. FIVES models such dependencies by recursively defining the adjacency matrices $A^{(k)}$ (see Eq. (6)). In contrast, we define a variant of FIVES that ignores such dependencies by parameterizing each $A^{(k)}$ with its own $H^{(k)}$ alone. We conduct an ablation study to compare our recursive modeling (Eq. (6)) with this independent modeling, and the results in Figure 4 clearly confirm the effectiveness of the recursive modeling adopted by FIVES.
3.4.2 Filling the gap between differentiable search and hard decision
Since a soft $A^{(k)}$ is used in graph convolution (see Eq. (7)), the learned entries may lie near the borderline, which causes a gap between the binary decisions and the soft $A^{(k)}$ used during the search phase. As noy2019asap observed, this kind of gap hurts the performance of NAS. To tackle this, we rescale the entries with an annealed temperature. Here we present an ablation study to examine the necessity of such rescaling.
From Figure 5, we observe that the rescaling noticeably reduces the performance gap between FIVES and FIVES+LR. Without this rescaling, the interactions evaluated during the search phase consist of softly weighted ones, which are inconsistent with the interactive features fed into LR. By incorporating the rescaling, almost all entries of $A^{(k)}$ are forced to be binary before being taken into consideration by the next layer, so that the decisions made at order $k+1$ are based on the exact $k$-order interactive features that will be fed into LR. In this way, the edge search procedure is guided by an exact evaluation of what is to be generated.
4 Conclusions
Motivated by our theoretical analysis, we propose FIVES, an automatic feature generation method in which an adjacency tensor is designed to indicate which feature interactions should be made. The usefulness of the indicated interactive features is confirmed from different perspectives in our empirical studies. FIVES provides such interpretability by solving for the optimal adjacency tensor in a differentiable manner, which is much more efficient than search-based methods, while also preserving the efficiency benefit of DNN-based methods. Extensive experiments show the advantages of FIVES as both a predictive model and a feature generator.
5 Appendix
5.1 Proof of Proposition 1
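We first restate Proposition 1 of the paper here.
Proposition 1 (restated).
Let $X_1$, $X_2$ and $Y$ be Bernoulli random variables with a joint conditional probability mass function $p_{x_1, x_2 | y} := P(X_1 = x_1, X_2 = x_2 \mid Y = y)$ such that $x_1, x_2, y \in \{0, 1\}$. Suppose further that the mutual information between $X_i$ and $Y$ satisfies $I(X_i; Y) < L$, where $i \in \{1, 2\}$ and $L$ is a non-negative constant. If $X_1$ and $X_2$ are weakly correlated given $y \in \{0, 1\}$, that is, $|\mathrm{Cov}(X_1, X_2 \mid Y = y)| \le \rho$, we have
$$I(X_1 X_2; Y) < 2L + \log(2\rho^2 + 1). \tag{8}$$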
Proof.
As explained in Section 2.2 of the paper, we use $X_1 X_2$ to denote the interaction of $X_1$ and $X_2$. By Definition 1, the interaction is defined to be the Cartesian product of the individual features. In this sense, $X_1 X_2$ could be regarded as a random variable constructed by some bijective mapping from the tuple $(X_1, X_2)$. In our method, the interaction is expressed via the graph convolutional operator. Although we have assumed such a modeling to be expressive enough, for a rigorous analysis it is better to regard $X_1 X_2$ as a possibly non-injective mapping of $(X_1, X_2)$, and thus, by the data processing inequality, we have $I(X_1 X_2; Y) \le I(X_1, X_2; Y)$.
Therefore:
(9) 
where is the so called incremental entropy .
Note that and , we further have
(10) 
To prove the Proposition, it remains to prove that:
(11) 
We start with deriving the mathematical expression for in terms of , . Note the following mathematical relations:
Using the above relations, it is straightforward to show that
(12) 
Next, we derive the expression for :
(13) 
Next, we expand the terms of as follows:
(15) 
Using the concavity of logarithm, we further have
(16) 
where the last inequality follows that . The above inequality holds for all , therefore,
(17) 
5.2 Rescaling Function
Since we allow a soft $A^{(k)}$ to be used for propagation (see Eq. (7) in Section 2.4 of the paper), the gap between the binary decisions and the learned $H$ cannot be neglected noy2019asap . To fill this gap, we rescale each entry of $H^{(k)}$ by dividing it by a temperature $\tau$, which can be formulated as:
$$\sigma\!\left(\frac{H^{(k)}_{i,j}}{\tau}\right) = \frac{1}{1 + \exp\big(-H^{(k)}_{i,j} / \tau\big)}, \tag{18}$$
where $H^{(k)}_{i,j}$ denotes the $(i,j)$-th entry of $H^{(k)}$. As $\tau$ is annealed to a small value along the search phase, the rescaled value becomes close to either 0 or 1. Figure 6 illustrates how this mechanism works with different values of the temperature $\tau$.
5.3 Implementation and HPO Details
For FIVES and all baseline methods, we empirically set the batch size to 128 for small datasets (Employee, Bank, Adult and Credit) and 1024 for large datasets (Criteo, Business1 and Business2). The learning rate of the baseline methods is set to 5e-3. To mitigate overfitting, model parameters are regularized with a strength of 1e-4 and the dropout rate is set to 0.3. We apply grid search for hyperparameter optimization (HPO). After the HPO procedure, we use the optimal configuration to train and evaluate each method 10 times, alleviating the impact of randomness. For better parallelism and economical use of computational resources, we conduct all our experiments on a large cloud platform. The source code will be released later. Other implementation details for each method are described below:
FIVES. The tunable hyperparameters include the highest order of interactive features, the learning rates (5e-3 and {5e-3, 5e-4}), and the embedding dimension of the node representation. The hidden dimension of the GNN is set to be the same as the embedding dimension.
DNN. It is implemented by ourselves. The tunable hyperparameters contain the hidden dimension of fully connected layers and the embedding dimension of node representation .
FM. The cloud platform we used has provided some frequently used machine learning algorithms including FM. The tunable hyperparameters include learning rate and coefficients of regularization .
Wide&Deep. It is implemented by ourselves according to cheng2016wide . The tunable hyperparameters contain the hidden dimension of fully connected layers in the deep component and the embedding dimension of node representation .
AutoInt. It is reproduced by using the source code^{1}^{1}1https://github.com/DeepGraphLearning/RecommenderSystems published by song2019autoint . The tunable hyperparameters contain the number of blocks and the number of attention heads . The hidden dimension of interacting layers is set to 32 as suggested by the original paper and the embedding dimension of node representation .
FiGNN
. It is reproduced by ourselves via TensorFlow according to the source code
^{2}^{2}2https://github.com/CRIPACDIG/Fi_GNN published by li2019fi . The tunable hyperparameters contain the highest order number of interactive features , and the embedding dimension of node representation .AutoCross. We implement a special LR that either updates all the trainable parameters or updates only the parameters of newly added features. Then we implement a scheduler to trigger training and evaluation routines of the LR over different feature spaces. This fails to exploit the “reuse” trick proposed in luo2019autocross , but identically expresses their search strategy.
LR. We use the LR provided by the cloud platform, which is implemented on a parameter-server architecture. We set the regularization strength as 1.0 and regularization as 0. The maximum number of iterations is 100 and the tolerance is 1e-6.
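The AutoCross baseline above relies on an LR that can update either all parameters or only those of newly added features. The following Python sketch illustrates one way such a switch could work, using a per-weight trainable mask. The class `IncrementalLR`, its methods, the toy XOR data, and the SGD settings are all assumptions made for illustration; this is not the implementation used in the experiments, and the scheduler is not sketched.

```python
import math


class IncrementalLR:
    """Logistic regression trained with plain SGD and a per-weight trainable mask.

    Hypothetical helper for illustration. No separate bias term is kept;
    a constant-1 input column plays that role.
    """

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.trainable = [True] * n_features
        self.lr = lr

    def add_features(self, n_new, freeze_old):
        """Append zero-initialised weights for newly generated features.

        freeze_old=True  -> only the new weights are updated afterwards.
        freeze_old=False -> every weight is updated afterwards.
        """
        self.trainable = [not freeze_old] * len(self.w) + [True] * n_new
        self.w += [0.0] * n_new

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def fit(self, X, y, epochs=200):
        for _ in range(epochs):
            for x, t in zip(X, y):
                g = self.predict(x) - t  # derivative of the log loss w.r.t. the logit
                for j, xj in enumerate(x):
                    if self.trainable[j]:
                        self.w[j] -= self.lr * g * xj


# Toy data: columns are [1 (bias), a, b]; the label is a XOR b,
# which a linear model cannot fit without a cross feature.
X = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
y = [0, 1, 1, 0]

model = IncrementalLR(n_features=3)
model.fit(X, y)
old_weights = list(model.w)

# Add the cross feature a*b and train only its weight.
X_crossed = [row + [row[1] * row[2]] for row in X]
model.add_features(n_new=1, freeze_old=True)
model.fit(X_crossed, y)

print("old weights unchanged:", model.w[:3] == old_weights)
print("weight of the new cross feature:", model.w[3])
```

Freezing the old weights makes evaluating a candidate feature cheaper, since only the new weights are trained; calling `add_features(..., freeze_old=False)` instead retrains the whole model over the enlarged feature space.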
5.4 Efficiency Comparisons
To study the efficiency of the proposed method FIVES, we empirically compare it against other DNN-based methods in terms of both convergence rate and runtime per epoch. As Figure 7 shows, although we formulate the edge search task as a bilevel optimization problem, FIVES achieves comparable validation AUC at each number of training steps, which indicates a comparable convergence rate. Meanwhile, the runtime per epoch of all the DNN-based methods is of the same order, although that of both Fi-GNN and FIVES is somewhat larger than that of the others due to the complexity of the graph convolutional operator. In practice, with 4 Nvidia GTX1080Ti GPU cards, FIVES can complete its search phase on Criteo (a traversal of 3 epochs) within hours. In contrast, search-based methods need dozens of hours and hundreds of times more computational resources due to their trial-and-error nature.
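The comparison above rests on two measurements per method: validation AUC as training progresses and wall-clock time per epoch. A minimal Python sketch of how such measurements could be collected is given here. The helpers `auc` and `profile_training` and the dummy callbacks are hypothetical and serve only as illustration; they are not the instrumentation used in the experiments.

```python
import time


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def profile_training(run_epoch, evaluate, n_epochs):
    """Record (epoch, seconds, validation metric) for each training epoch."""
    history = []
    for epoch in range(1, n_epochs + 1):
        start = time.perf_counter()
        run_epoch()  # one pass over the training data
        elapsed = time.perf_counter() - start
        history.append((epoch, elapsed, evaluate()))
    return history


# Dummy callbacks standing in for a real training loop and validation pass.
labels = [1, 0, 1, 0]
scores = [0.9, 0.1, 0.4, 0.6]
history = profile_training(
    run_epoch=lambda: None,
    evaluate=lambda: auc(labels, scores),
    n_epochs=3,
)
for epoch, seconds, val_auc in history:
    print(f"epoch {epoch}: {seconds:.6f}s, validation AUC {val_auc:.2f}")
```

The pairwise AUC is quadratic in the number of examples, so it suits small validation samples; a rank-based computation would be preferable for data at the scale of Criteo.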
5.5 Datasets Availability
Business1 and Business2 are constructed by randomly sampling large-scale search logs. Due to privacy concerns, they are not publicly available at present.
References
 (1) Bengio, Y., A. Courville, P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
 (2) Gers, F. A., N. N. Schraudolph, J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of machine learning research, 3(8):115–143, 2002.

 (3) Cho, K., B. van Merriënboer, C. Gulcehre, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. 2014.
 (4) Krizhevsky, A., I. Sutskever, G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), pages 1097–1105. 2012.
 (5) Vaswani, A., N. Shazeer, N. Parmar, et al. Attention is all you need. In Advances in neural information processing systems (NIPS), pages 5998–6008. 2017.
 (6) Katz, G., E. C. R. Shin, D. Song. Explorekit: Automatic feature generation and selection. In Proceedings of the International Conference on Data Mining (ICDM), pages 979–984. 2016.
 (7) Kaul, A., S. Maheshwary, V. Pudi. Autolearn—automated feature generation and selection. In Proceedings of the International Conference on Data Mining (ICDM), pages 217–226. 2017.
 (8) Hutter, F., L. Kotthoff, J. Vanschoren. Automated Machine Learning. Springer, 2019.
 (9) Kanter, J. M., K. Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. 2015.
 (10) Luo, Y., M. Wang, H. Zhou, et al. Autocross: Automatic feature crossing for tabular data in real-world applications. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1936–1945. 2019.
 (11) Rendle, S. Factorization machines. In Proceedings of the International Conference on Data Mining (ICDM), pages 995–1000. 2010.
 (12) Juan, Y., Y. Zhuang, W.-S. Chin, et al. Field-aware factorization machines for ctr prediction. In Proceedings of the Conference on Recommender Systems (RecSys), pages 43–50. 2016.
 (13) Shan, Y., T. R. Hoens, J. Jiao, et al. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 255–262. 2016.

 (14) Guo, H., R. Tang, Y. Ye, et al. DeepFM: A factorization-machine based neural network for ctr prediction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1725–1731. 2017.
 (15) Lian, J., X. Zhou, F. Zhang, et al. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1754–1763. 2018.
 (16) Xu, K., W. Hu, J. Leskovec, et al. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR). 2019.
 (17) Li, Z., Z. Cui, S. Wu, et al. Fi-GNN: Modeling feature interactions via graph neural networks for ctr prediction. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 539–548. 2019.
 (18) Song, W., C. Shi, Z. Xiao, et al. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1161–1170. 2019.
 (19) Liu, B., C. Zhu, G. Li, et al. Autofis: Automatic feature interaction selection in factorization models for click-through rate prediction. arXiv preprint arXiv:2003.11235, 2020.
 (20) Liu, H., K. Simonyan, Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR). 2019.
 (21) Noy, A., N. Nayman, T. Ridnik, et al. Asap: Architecture search, anneal and prune. arXiv preprint arXiv:1904.04123, 2019.
 (22) Zhou, H., M. Yang, J. Wang, et al. BayesNAS: A Bayesian approach for neural architecture search. In Proceedings of the International Conference on Machine Learning (ICML), vol. 97, pages 7603–7613. 2019.
 (23) Cai, H., L. Zhu, S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR). 2019.
 (24) Bengio, Y., N. Léonard, A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.

 (25) Sutton, R. S., D. A. McAllester, S. P. Singh, et al. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems (NIPS), pages 1057–1063. 2000.
 (26) Jang, E., S. Gu, B. Poole. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR). 2017.
 (27) Nayman, N., A. Noy, T. Ridnik, et al. Xnas: Neural architecture search with expert advice. In Advances in neural information processing systems (NeurIPS), pages 1977–1987. 2019.

 (28) Cheng, H.-T., L. Koc, J. Harmsen, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10. 2016.