Interactive Feature Generation via Learning Adjacency Tensor of Feature Graph

by   Yuexiang Xie, et al.

To automate the generation of interactive features, recent methods have been proposed to either explicitly traverse the interactive feature space or implicitly express the interactions via intermediate activations of specially designed models. These two kinds of methods show that there is essentially a trade-off between feature interpretability and efficient search. To possess the merits of both, we propose a novel method named Feature Interaction Via Edge Search (FIVES), which formulates the task of interactive feature generation as searching for edges on a defined feature graph. We first present theoretical evidence that motivates us to search for interactive features in an inductive manner. Then we instantiate this search strategy by alternately updating the edge structure and the predictive model of a graph neural network (GNN) associated with the defined feature graph. In this way, the proposed FIVES method traverses a trimmed search space and enables explicit feature generation according to the learned adjacency tensor of the GNN. Experimental results on both benchmark and real-world datasets demonstrate the advantages of FIVES over several state-of-the-art methods.


1 Introduction

Data representation learning bengio2013representation is an important yet challenging task in the machine learning community. Recently, deep neural networks (DNNs) have made impressive progress in representation learning for various types of data, including speech, image, and natural language, in which dedicated networks such as RNN gers2002learning ; cho2014learning , CNN krizhevsky2012imagenet and Transformer vaswani2017attention are effective at learning compact representations to improve the predictive performance. However, many real-world applications, such as click-through rate prediction in recommendation and online advertising, still rely on feature engineering to explore the higher-order feature space, where combinatorial features are crafted. For example, a 3-order interactive feature “Gender ⊗ Age ⊗ Income” can be discriminative for the types of recommended commodities. Such high-order interactive features can introduce non-linearity to the lightweight models used in practice, with a negligible decrease in their inference speed. Besides, they offer strong interpretability, which DNNs lack due to their black-box nature.

In practice, generating interactive features requires evaluating a large number of candidates and comparing their performance. Since brute-force enumeration is infeasible, traditional interactive feature generation methods rely heavily on the experience and knowledge of domain experts, which requires time-consuming and task-specific effort. This motivates automatic feature generation katz2016explorekit ; kaul2017autolearn , one major topic of Automated Machine Learning (AutoML) hutter2019automated , which has attracted increasing attention from both academia and industry.

Recent works on automatic feature generation can be roughly divided into two categories: search-based kanter2015deep ; katz2016explorekit ; luo2019autocross and DNN-based rendle2010factorization ; juan2016field ; shen2016deepcross ; ijcai2017deep ; lian2018xdeepfm ; xu2018how ; li2019fi ; song2019autoint ; liu2020autofis . The search-based methods focus on designing search strategies that prune as many of the candidates to be evaluated as possible, while aiming to keep the most useful interactive features. For example, ExploreKit katz2016explorekit evaluates only the top-ranked candidates, where the feature ranker has been pre-trained on other datasets. AutoCross luo2019autocross incrementally selects the optimal candidate from the pairwise interactions of current features. Although these mechanisms can trim the search space to be traversed, their trial-and-error nature means that the required time and computing resources are usually prohibitive in practice. On the other hand, the DNN-based methods design specific neural architectures to express the interactions among different features. For example, AutoInt song2019autoint and Fi-GNN li2019fi exploit the self-attention mechanism vaswani2017attention for weighting the features in their synthesis, and they achieve leading performance with just a one-shot training course. But this advantage comes at the cost of implicit feature interactions, as it is hard to interpret exactly which interactive features are useful from the attention weights. In practice, explicitly expressed interactive features are in great demand, since they can be used to train lightweight predictive models that satisfy the requirement of real-time inference.

To possess both of their merits, that is, the interpretability of search-based methods and the efficiency of DNN-based methods, we propose a novel method for automatic feature generation, called Feature Interaction Via Edge Search (FIVES). First, we propose to inductively search for an optimal collection of $k$-order features from the interactions between the generated $(k-1)$-order features and the original features. A theoretical analysis (see Proposition 1) is provided to explain the intuition behind this search strategy: informative interactive features tend to come from informative lower-order ones. Then we instantiate this inductive search strategy by modeling the features as a feature graph and expressing the interactions between nodes as propagating the graph signal via a designed graph neural network (GNN). In the defined feature graph, the layer-wise adjacency matrix determines which original features should be selected to interact with which of the features generated by the previous layer. In this way, we formulate the task of interactive feature generation as edge search, that is, learning the adjacency tensor. Inspired by differentiable neural architecture search (NAS) liu2018darts ; noy2019asap , we solve the edge search problem by alternately updating the adjacency tensor and the predictive model, which transforms the trial-and-error procedure into a differentiable NAS problem. In particular, the learned adjacency tensor can explicitly indicate which interactive features are useful. We also parameterize the adjacency matrices in a recursive way, where the dependencies are conceptually required by our search strategy. Such dependencies are ignored by most NAS methods zhou2019bayesnas . To validate the proposed method, we compare FIVES with both search-based and DNN-based methods on five benchmark datasets and two real-world business datasets. The experimental results demonstrate the advantages of FIVES, as both an end-to-end predictive model and a feature generator for other lightweight predictive models.

2 Methodology

2.1 Preliminary

We consider the ubiquitous tabular data where each column represents a feature and each row represents a sample. Without loss of generality, we assume that numeric features have been discretized and thus all considered features are categorical; for example, the feature “user city” takes its value from a finite set of cities, and the feature “user age” takes its value from a finite set of age ranges. Based on these, we first define the high-order interactive feature:

Definition 1.

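Given $m$ original features $\mathcal{F} = \{f_1, f_2, \dots, f_m\}$, a $k$-order ($1 \le k \le m$) interactive feature can be represented as the Cartesian product of $k$ distinct original features $f_{c_1} \otimes f_{c_2} \otimes \cdots \otimes f_{c_k}$, where each feature $f_{c_i}$ is selected from $\mathcal{F}$.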

Since the interactive features bring in non-linearity, e.g., the Cartesian product of two binary features enables a linear model to predict the label of their XOR relation, they are widely adopted to improve the performance of machine learning methods on tabular data. The goal of interactive feature generation is to find a set of such interactive features.
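The XOR remark above can be checked with a minimal sketch. The helper names (`cross`, `linear_predict`, `xor_weights`) are illustrative and not from the paper; the point is only that a one-hot encoding of the Cartesian product gives a linear model one weight per value combination, which is enough to represent XOR.

```python
def cross(x1, x2):
    """One-hot encode the Cartesian product of two binary features."""
    onehot = [0, 0, 0, 0]
    onehot[2 * x1 + x2] = 1  # index order: (0,0), (0,1), (1,0), (1,1)
    return onehot


def linear_predict(weights, features):
    """Thresholded linear model without a bias term."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score > 0.5 else 0


# One weight per crossed value; only (0,1) and (1,0) vote for label 1.
xor_weights = [0, 1, 1, 0]


def xor_via_cross(x1, x2):
    """Predict XOR(x1, x2) with a linear model on the crossed feature."""
    return linear_predict(xor_weights, cross(x1, x2))
```

A linear model on the two raw binary features cannot separate XOR, which is why the crossed feature is what makes the lightweight model sufficient here.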

As the number of all possible interactions from $m$ original features is $O(2^m)$, it is challenging to search for the optimal set of interactive features in such a huge space, not to mention that the evaluation of a proposed feature can be costly and noisy. Thus some recent methods model the search procedure as the optimization of designed DNNs, transforming it into a one-shot training at the cost of interpretability. In the rest of this section, we present our proposed method, FIVES, which can provide explicit interaction results via an efficient differentiable search.
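To give a sense of scale, the following sketch counts candidate interactive features, assuming (per Definition 1) that a $k$-order feature is an unordered choice of $k$ distinct original features. The function name `num_interactions` and its `max_order` argument are hypothetical.

```python
from math import comb


def num_interactions(m, max_order=None):
    """Count k-order interactive features (k >= 2) over m original features."""
    top = m if max_order is None else min(max_order, m)
    return sum(comb(m, k) for k in range(2, top + 1))
```

For a dataset with 39 original features (the Criteo column of Table 1), `num_interactions(39)` equals `2**39 - 39 - 1`, while restricting the order to at most 3 still leaves `num_interactions(39, 3)` candidates to evaluate one by one in a brute-force search.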

2.2 Search Strategy

As mentioned above, exhaustive traversal of the exponentially growing interactive feature space is intractable. By Definition 1, any $k$-order interactive feature can be regarded as the interaction (i.e., Cartesian product) of several lower-order features, with many possible decompositions. This raises a question: could we solve the task of generating interactive features in a bottom-up manner? To be specific, could we generate interactive features inductively, that is, by searching for a group of informative $k$-order features from the interactions between the original features and the group of $(k-1)$-order features identified in the previous step? We present the following proposition to provide theoretical evidence for discussing this question.

Proposition 1.

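Let $X_1$, $X_2$ and $Y$ be Bernoulli random variables with a joint conditional probability mass function $p_{x_1, x_2 \mid y} := P(X_1 = x_1, X_2 = x_2 \mid Y = y)$ such that $x_1, x_2, y \in \{0, 1\}$. Suppose further that the mutual information between $X_i$ and $Y$ satisfies $I(X_i; Y) < L$, where $i \in \{1, 2\}$ and $L$ is a non-negative constant. If $X_1$ and $X_2$ are weakly correlated given $y \in \{0, 1\}$, that is, $|\mathrm{Cov}(X_1, X_2 \mid Y = y)| \le \rho$, we have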


We defer the proof to the supplementary material. Specifically, the random variables $X_i$ and $Y$ stand for the features and the label respectively, and the joint of the $X_i$ stands for their interaction. Recall that: 1) as the considered raw features are categorical, modeling each feature as a Bernoulli random variable does not sacrifice much generality; 2) in practice, the raw features are often pre-processed to remove redundant ones, so the weak correlation assumption holds. Based on these, our proposition indicates that, for small $\rho$, the information gain introduced by the interaction of features is at most that of the individual features. The proposition can therefore be interpreted as follows: under practical assumptions, it is unlikely that an informative feature can be constructed from the interaction of uninformative ones.
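The quantities in the proposition can be computed numerically for small cases. The sketch below is not the paper's proof; it only evaluates mutual information (in nats) on two hand-built joint distributions. The names `mutual_information`, `cond_independent_joint`, and `xor_joint` are hypothetical. In the first case the features are conditionally independent given the label (zero conditional covariance), and the interaction carries no more information than the two features together contribute individually. In the second case (label equal to the XOR of two fair coins) the conditional covariance has magnitude 0.25, so the weak correlation assumption fails, and the interaction is informative even though each feature alone is not.

```python
from itertools import product
from math import log


def mutual_information(joint, x_index):
    """I(X; Y) in nats, where X is the tuple of coordinates picked by x_index.

    joint maps (x1, x2, y) -> probability.
    """
    pxy, px, py = {}, {}, {}
    for (x1, x2, y), p in joint.items():
        x = tuple((x1, x2)[i] for i in x_index)
        pxy[(x, y)] = pxy.get((x, y), 0.0) + p
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)


def cond_independent_joint(q0, q1):
    """Y ~ Bernoulli(0.5); X1, X2 i.i.d. Bernoulli(q_y) given Y = y."""
    joint = {}
    for x1, x2, y in product((0, 1), repeat=3):
        q = q1 if y else q0
        joint[(x1, x2, y)] = 0.5 * (q if x1 else 1 - q) * (q if x2 else 1 - q)
    return joint


# XOR label: X1, X2 are independent fair coins and Y = X1 xor X2.
xor_joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product((0, 1), repeat=2)}
```

Calling `mutual_information(joint, (0,))`, `(1,)`, and `(0, 1)` gives $I(X_1; Y)$, $I(X_2; Y)$, and the information of the pair, respectively.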

This proposition supports the bottom-up search strategy, as lower-order features that have not been identified as informative are less likely to be useful building blocks of high-order features. Besides, the identified $(k-1)$-order features are recursively constructed from the identified lower-order ones, and thus they are likely to include sufficient information for generating informative $k$-order features. We also empirically validate this strategy in Section 3.2 and Section 3.3. Although this inductive search strategy cannot guarantee that all useful interactive features are generated, the features generated in this way are likely to be useful, based on the above proposition. This can be regarded as a trade-off between the usefulness and the completeness of the generated interactive features, under the constraint of limited computational resources.

2.3 Modeling

To instantiate our inductive search strategy, we conceptually regard the original features as a feature graph and model the interactions among features by a designed GNN.

First, we denote the feature graph as $\mathcal{G} = (\mathcal{N}, \mathcal{E})$, where each node $n_i \in \mathcal{N}$ corresponds to a feature $f_i$ and each edge $e_{i,j} \in \mathcal{E}$ indicates an interaction between node $n_i$ and node $n_j$. We use $n_i^{(0)}$ as the initial representation of node $n_i$, which conventionally takes as its value the embedding looked up by $f_i$ from the feature embedding matrix. It is easy to show that, by applying a vanilla graph convolutional operator to such a graph, the output is capable of expressing the 2-order interactive features. However, repeatedly propagating the node representations with only one adjacency matrix fails to express higher-order ($k > 2$) interactions. Thus, to generate interactive features of order at most $K$, we extend the feature graph by defining an adjacency tensor $A$ that indicates the interactions among features at each order, where each slice $A^{(k)} \in \{0, 1\}^{m \times m}$ represents a layer-wise adjacency matrix and $m$ is the number of original features (nodes). Once an entry $A^{(k)}_{i,j}$ is active, we intend to generate a $(k+1)$-order feature based on node $n_i$ by $n_i^{(k-1)} \otimes n_j^{(0)}$ and synthesize these into $n_i^{(k)}$. Formally, with an adjacency tensor $A$, our dedicated graph convolutional operator produces the node representations layer by layer, in the following way:


Here “MEAN” is adopted as the aggregator and $W_j$ is the transformation matrix for node $n_j$. Assuming that the capacity of our GNN and embedding matrix is sufficient for the element-wise product of node representations to express the Cartesian product of the corresponding features, we can show that the node representation at the $k$-th layer corresponds to the generated $(k+1)$-order interactive features:

where the choice of which features are combined is determined by the adjacency tensor $A$.
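One plausible reading of the layer-wise operator described above can be sketched as follows. It assumes that each node multiplies its previous representation element-wise by the MEAN of the transformed initial representations of the nodes selected by its row of the layer's adjacency matrix. The function names, the list-based vectors, and the rule for nodes without active edges are assumptions made for illustration, not details confirmed by the text.

```python
def hadamard(u, v):
    """Element-wise product of two equal-length vectors."""
    return [a * b for a, b in zip(u, v)]


def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mean(vectors):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def propagate(n0, n_prev, A_k, W):
    """One layer: combine each node's previous state with its selected neighbours.

    n0[j]     : initial representation of node j
    n_prev[i] : representation of node i after the previous layer
    A_k[i][j] : 1 if original feature j is selected to interact at node i
    W[j]      : transformation matrix of node j
    """
    out = []
    for i, row in enumerate(A_k):
        neigh = [matvec(W[j], n0[j]) for j, a in enumerate(row) if a == 1]
        # Assumption: a node with no active edge keeps its previous representation.
        out.append(hadamard(mean(neigh), n_prev[i]) if neigh else list(n_prev[i]))
    return out
```

Stacking `propagate` calls with a different `A_k` per layer is what lets each layer select its own set of original features, in contrast to reusing one adjacency matrix throughout.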

As shown above, the feature graph and the associated GNN are capable of conceptually expressing our inductive search strategy. Thus, from the perspective of the feature graph, the task of generating interactive features is equivalent to learning an optimal adjacency tensor $A$, which we call edge search in our study. In order to evaluate the quality of the generated features, i.e., the learned adjacency tensor $A$, we apply a linear output layer to the concatenation of node representations at each layer:


where $W^{(k)}$ and $b^{(k)}$ are the projection matrix and bias term respectively, and $\sigma(\cdot)$ denotes the sigmoid function. We pack all the parameters as $\Theta$. Then we define a cross-entropy loss function for the joint of $A$ and $\Theta$:


where $\mathcal{D}$ is the considered dataset, $y_i$ denotes the ground-truth label of the $i$-th instance from $\mathcal{D}$, and $\hat{y}_i^{(k)}$ denotes its prediction based on the node representations from the $k$-th layer.

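Eventually, the edge search task can be formulated as a bilevel optimization problem:

$$\min_{A} \; \mathcal{L}(\mathcal{D}_{val} \mid A, \Theta(A)) \quad \text{s.t.} \quad \Theta(A) = \arg\min_{\Theta} \mathcal{L}(\mathcal{D}_{train} \mid A, \Theta) \tag{5}$$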


where $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$ denote the training and the validation dataset respectively. Such a nested formulation as in Eq. (5) has also been studied recently in differentiable NAS liu2018darts ; cai2018proxylessnas ; noy2019asap , where the architecture parameters ($A$ in our formulation) and the network parameters ($\Theta$ in our formulation) are alternately updated during the search procedure. However, most NAS methods ignore the dependencies among architecture parameters zhou2019bayesnas , which are critical for our task because the higher-order interactive features are generated based on the choices made at previous layers.

2.4 Differentiable Search

Directly solving the optimization problem in Eq. (5) is intractable because of its bilevel nature and the binary values of $A$. Existing methods such as AutoInt song2019autoint and Fi-GNN li2019fi tackle the issue of binary $A$ by calculating it on the fly, that is, the set of features to be aggregated for interaction is determined by a self-attention layer. This solution enables efficient optimization, but the attention weights change dynamically from sample to sample, so it is hard to interpret them to know which interactive features should be generated. On the other hand, there are some straightforward ways to learn a stationary adjacency tensor. To be specific, we can regard the entries of $A$ as Bernoulli random variables parameterized by $H$. Then, in the forward phase, we sample each slice $A^{(k)}$ from its Bernoulli distribution; and in the backward phase, we update $H$ based on straight-through estimation (STE) bengio2013estimating . This technique has also been adopted for solving problems like Eq. (5) in some NAS studies cai2018proxylessnas .

Following the search strategy in Section 2.2, the adjacency tensor $A$ should be determined slice by slice, from the lowest order to the highest. In addition, since which $k$-order features should be generated depends on which $(k-1)$-order features have already been generated, the optimization of $A^{(k)}$ should be conditioned on $A^{(k-1)}$. Our inductive search strategy is precisely instantiated only when such dependencies are modeled. Thus, we parameterize the adjacency tensor $A$ by $H$ in this recursive way:



Here, $\varphi(\cdot)$ is a binarization function with a tunable threshold, and $D^{(k-1)}$ is the degree matrix of $A^{(k-1)}$ serving as a normalizer.

Figure 1: The intuition behind Eq. (6).

We illustrate the intuition behind Eq. (6) in Figure 1. Specifically, if we regard each $k$-order interactive feature, e.g., $f_{c_1} \otimes f_{c_2} \otimes \cdots \otimes f_{c_k}$, as a $k$-hop path jumping from $f_{c_1}$ to $f_{c_k}$, then $A^{(k)}$ can be treated as a binary sample drawn from the $k$-hop transition matrix, where $A^{(k)}_{i,j}$ indicates the $k$-hop visibility (or accessibility) from $f_i$ to $f_j$. Conventionally, the transition matrix of $k$-hop can be calculated by multiplying that of $(k-1)$-hop with the normalized adjacency matrix. Following the motivation of defining an adjacency tensor such that the topological structures at different layers tend to vary from each other, we design a layer-wise (unnormalized) transition matrix parameterized by $H^{(k)}$. In this way, with Eq. (6), we make the adjacency matrix of each layer depend on that of the previous layer, which exactly instantiates our search strategy.

Since there is a binarization function in Eq. (6), $H$ cannot be directly optimized in a differentiable way w.r.t. the loss function in Eq. (4). One possible solution is to use policy gradient sutton2000policy , the Gumbel-max trick jang2017categorical , or other approximations nayman2019xnas . However, these can be inefficient when the action space (here, the set of possible interactions) is too large.

To make the optimization more efficient, we allow the use of a soft $A^{(k)}$ for propagation at the $k$-th layer, while the calculation of $A^{(k)}$ still depends on a binarized $A^{(k-1)}$:


However, the entries of an optimized $A^{(k)}$ may still lie near the borderline (around 0.5). When we use these borderline values for generating interactive features, the gap between the binary decisions and the learned $A^{(k)}$ cannot be neglected noy2019asap . To fill this gap, we re-scale each entry of $A^{(k)}$ by dividing it by a temperature $\tau$ before using it for propagation. As $\tau$ anneals to a small value along the search phase, the re-scaled value becomes close to 0 or 1. We illustrate how this mechanism works in the supplementary material.
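The effect of the temperature can be illustrated with a small sketch. It assumes that the division by the temperature acts on the logit of each soft entry and that a sigmoid maps the result back to the unit interval; the exact form used by FIVES is given in the supplementary material and may differ. The names `sigmoid` and `rescale` are illustrative.

```python
from math import exp, log


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def rescale(a, tau):
    """Sharpen a soft entry a in (0, 1) by dividing its logit by tau."""
    logit = log(a / (1.0 - a))
    return sigmoid(logit / tau)
```

With `tau = 1.0` the entry is returned unchanged up to rounding; as `tau` shrinks, entries above 0.5 move toward 1 and entries below 0.5 move toward 0, while an entry of exactly 0.5 stays put. This is the sense in which annealing closes the gap between the soft values used during search and the binary decisions made afterwards.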

Finally, our modeling allows us to solve the optimization problem in Eq. (5) with gradient descent. The whole optimization procedure is summarized in Algorithm 1. After running Algorithm 1, the learned $A$ and $\Theta$ can by themselves serve as a predictive model. Moreover, we can specify layer-wise thresholds for binarizing the learned $A$ and inductively derive the suggested useful interactive features of each order up to $K$.
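The inductive derivation of explicit features from a binarized adjacency tensor can be sketched as below. This is a hypothetical reading rather than the authors' procedure: the input is assumed to contain one already-thresholded slice per order from 2 upward, and an active entry `[i][j]` is read as "cross every feature currently held at node `i` with original feature `j`". The function name and the representation of features as sorted index tuples are illustrative choices.

```python
def derive_features(A_binary):
    """Enumerate interactive features implied by binarized adjacency slices.

    A_binary[k][i][j] == 1 is read as: cross what node i held after the
    previous layer with original feature j. Features are sorted index tuples.
    Returns a dict mapping order -> set of features of that order.
    """
    m = len(A_binary[0])
    held = [{(i,)} for i in range(m)]  # each node starts with its own feature
    by_order = {}
    for order, A_k in enumerate(A_binary, start=2):
        new_held = []
        for i in range(m):
            crossed = {
                tuple(sorted(f + (j,)))
                for f in held[i]
                for j in range(m)
                if A_k[i][j] == 1 and j not in f
            }
            new_held.append(crossed)
        by_order[order] = set().union(*new_held)
        held = new_held
    return by_order
```

Because a node that selects nothing at one layer holds no features afterwards, it cannot contribute to higher orders, which mirrors the bottom-up strategy of Section 2.2.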

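Algorithm 1 Optimization Algorithm for FIVES
Input: Feature graph $\mathcal{G}$, highest order $K$, learning rates $\alpha_1$ and $\alpha_2$, and number of epochs $T$
Output: Adjacency tensor $A$, parameters of the predictive model $\Theta$
1:  Initialize $H$ and $\Theta$, and split the data into $\mathcal{D}_{train}$ and $\mathcal{D}_{val}$;
2:  for $t = 1, \dots, T$ do
3:     Calculate $A$ according to Eq. (7);
4:     Propagate the graph signal $K$ times according to Eq. (2);
5:     Update $H$ by descending $\alpha_1 \nabla_{H} \mathcal{L}(\mathcal{D}_{val} \mid A, \Theta)$;
6:     Update $\Theta$ by descending $\alpha_2 \nabla_{\Theta} \mathcal{L}(\mathcal{D}_{train} \mid A, \Theta)$;
7:  end for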
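The alternating pattern of the algorithm can be sketched on a toy problem. The code below is a first-order alternating scheme, assuming the architecture-side parameter is stepped on a validation loss and the model parameter on a training loss. The function `alternating_search`, its arguments, and the toy quadratic losses are hypothetical and stand in for the FIVES losses, which are not reproduced here.

```python
def alternating_search(h, theta, grad_h_val, grad_theta_train,
                       lr_h, lr_theta, epochs):
    """First-order alternating updates for a bilevel problem."""
    for _ in range(epochs):
        # Architecture step: descend the validation loss w.r.t. h.
        h = h - lr_h * grad_h_val(h, theta)
        # Model step: descend the training loss w.r.t. theta.
        theta = theta - lr_theta * grad_theta_train(h, theta)
    return h, theta


# Toy losses (illustrative only):
#   L_val(h, theta)   = (h - 1)^2 + (theta - 1)^2
#   L_train(h, theta) = (theta - h)^2
def toy_grad_h_val(h, theta):
    return 2.0 * (h - 1.0)


def toy_grad_theta_train(h, theta):
    return 2.0 * (theta - h)
```

On the toy losses, `alternating_search(0.0, 5.0, toy_grad_h_val, toy_grad_theta_train, 0.1, 0.1, 200)` drives both parameters toward 1. Each step uses only the current value of the other parameter, which is what makes the scheme cheap compared with solving the inner problem to convergence at every outer step.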

3 Experiments

We conduct a series of experiments to demonstrate the effectiveness of the proposed FIVES method, aiming to answer the following questions. Q1: When the model learned by FIVES serves as a predictive model on its own, how does it perform compared to state-of-the-art feature generation methods? Q2: Can we boost the performance of lightweight models with the interactive features generated by FIVES? Q3: Are the interactions indicated by the learned adjacency tensor really useful? Q4: How do the different components of FIVES contribute to its performance?

Datasets. We conduct experiments on five benchmark datasets that are widely adopted in related works, and we also include two more real-world business datasets. The statistics of these datasets are summarized in Table 1, where the data are randomly partitioned to ensure a fair comparison.

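| Statistic | Employee | Bank | Adult | Credit | Criteo | Business1 | Business2 |
|---|---|---|---|---|---|---|---|
| # Features | 9 | 20 | 42 | 16 | 39 | 53 | 59 |
| # Train | 29,493 | 27,459 | 32,561 | 100,000 | 41,256K | 1,572K | 25,078K |
| # Test | 3,278 | 13,729 | 16,281 | 50,000 | 4,584K | 673K | 12,537K |

Table 1: Statistics of datasets.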

Preprocessing. We discretize numeric features into equal-width buckets at several granularities. Then the numeric features are transformed into one-hot vector representations according to their bucket indices. This follows the multi-granularity discretization proposed in luo2019autocross . All rare category values (those whose frequency falls below a threshold) are assigned the same identifier.
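The two preprocessing steps can be sketched as follows. The bucket counts and the rarity threshold are left as parameters because their values are not recoverable from the text; the function names and the `__RARE__` token are hypothetical.

```python
from collections import Counter


def equal_width_bucket(values, num_buckets):
    """Map each numeric value to a bucket index in [0, num_buckets - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / num_buckets
    # The maximum value falls on the right edge, so clamp it into the last bucket.
    return [min(int((v - lo) / width), num_buckets - 1) for v in values]


def merge_rare(categories, min_count, rare_token="__RARE__"):
    """Replace category values seen fewer than min_count times by one shared token."""
    counts = Counter(categories)
    return [c if counts[c] >= min_count else rare_token for c in categories]
```

Multi-granularity discretization then amounts to calling `equal_width_bucket` once per chosen bucket count and treating each result as a separate categorical feature.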

Metric. Following existing works, we use AUC to evaluate the predictive performance. A higher AUC indicates a better performance. As has been pointed out in the previous studies cheng2016wide ; luo2019autocross ; song2019autoint , a small improvement (at 0.001-level) in offline AUC evaluation can make a significant difference in real-world business predictive tasks such as CTR prediction in advertisements.
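For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, with ties counted as one half. The sketch below is a plain quadratic-time implementation for illustration, not the evaluation code used in the experiments; the function name `auc` is an assumption.

```python
def auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On datasets of the size listed in Table 1, a rank-based implementation that sorts the scores once would be the practical choice; the pairwise form is shown only because it makes the meaning of the metric explicit.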

3.1 FIVES as a predictive model (Q1)

As mentioned in Section 2.4, the model learned by FIVES is a predictive model by itself. We adopt the following methods as baselines, including those frequently used in practical recommender systems and state-of-the-art feature generation methods: (1) LR: Logistic Regression with only the original features (more settings of LR are given in the next part). (2) DNN: the standard deep neural network with fully connected layers in cascade and an output layer with a sigmoid function. (3) FM rendle2010factorization : the factorization machine uses the inner product of two original features to express their interactions. (4) Wide&Deep cheng2016wide : this method jointly trains wide linear models and deep neural networks. (5) AutoInt song2019autoint : a DNN-based feature generation method, in which a multi-head self-attentive neural network with residual connections is proposed to model feature interactions. (6) Fi-GNN li2019fi : it represents the multi-field features as a graph structure for the first time, and the interactions of features are modeled as attentional edge weights. The implementation details of these methods can be found in the supplementary material.

After hyperparameter optimization for all methods (see the supplementary material for details), we use the optimal configuration to run each method 10 times and conduct an independent t-test between the results of FIVES and the strongest baseline to assess significance, reported at two levels of the $p$-value. The experimental results are summarized in Table 2.
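The statistic behind such a comparison can be sketched as below, assuming the equal-variance (Student) form of the independent two-sample t-test; whether the experiments used this form or an unequal-variance variant is not stated. The function name is hypothetical, and the sketch returns only the statistic and the degrees of freedom, since converting them to a $p$-value requires the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance


def independent_t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance; returns (t, dof)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2
```

Here `a` and `b` would hold the per-run AUC values of the two methods being compared, ten values each under the protocol described above.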

| Method | Employee | Bank | Adult | Credit | Criteo | Business1 | Business2 |
|---|---|---|---|---|---|---|---|
| LR | 0.8353 | 0.9377 | 0.8836 | 0.8262 | 0.7898 | 0.6912 | 0.7121 |
| DNN | 0.8510 | 0.9435 | 0.8869 | 0.8292 | 0.7779 | 0.6927 | 0.6841 |
| FM | 0.8473 | 0.9434 | 0.8847 | 0.8278 | 0.7836 | 0.6888 | 0.7152 |
| Wide&Deep | 0.8484 | 0.9416 | 0.8870 | 0.8299 | 0.7710 | 0.6941 | 0.7128 |
| AutoInt | 0.8397 | 0.9393 | 0.8869 | 0.8301 | 0.7993 | 0.6960 | 0.7237 |
| Fi-GNN | 0.7968 | 0.9417 | 0.8813 | 0.8276 | 0.7702 | 0.6882 | 0.7103 |
| FIVES | 0.8536 | 0.9446 | 0.8863 | 0.8307 | 0.8006 | 0.6984 | 0.7276 |

Table 2: Performance comparison in terms of AUC.

The experimental results demonstrate that FIVES improves the performance over the baseline methods on most datasets; the exception is Adult, where DNN, Wide&Deep, and AutoInt are marginally ahead. On the large-scale datasets in particular, FIVES outperforms all baseline methods by a considerable margin. As the models are trained offline, we leave the efficiency comparisons to our supplementary material.

3.2 FIVES as a feature generator (Q2)

In practice, the generated interactive features are often used to augment original features, and then all of them are fed into some lightweight models (e.g., LR) to meet the requirement of inference speed. As aforementioned, we can explicitly derive useful interactive features from the learned adjacency tensor . We call this method FIVES+LR and compare it with the following methods: (1) Random+LR: LR with original features and randomly selected interactive features; (2) CMI+LR: conditional mutual information (CMI) as a filter to select useful interactive features from all possible -order interactions; (3) AutoCross+LR luo2019autocross : a recent search-based method, which performs beam search in a tree-structured space. We also run the experiments for 10 times and analyze the results by t-test to draw statistically significant conclusions. The results are summarized in Table 3.

| Method | Employee | Bank | Adult | Credit | Criteo | Business1 | Business2 |
|---|---|---|---|---|---|---|---|
| LR | 0.8353 | 0.9377 | 0.8836 | 0.8262 | 0.7898 | 0.6912 | 0.7121 |
| Random+LR | 0.8255 | 0.9373 | 0.8777 | 0.8258 | 0.7804 | 0.6927 | 0.7137 |
| CMI+LR | 0.8423 | 0.9370 | 0.8780 | 0.8264 | 0.7728 | 0.6941 | 0.7253 |
| AutoCross+LR | 0.8529 | 0.9393 | 0.8771 | 0.8274 | 0.7902 | 0.6916 | 0.7122 |
| FIVES+LR | 0.8532 | 0.9378 | 0.8850 | 0.8274 | 0.7924 | 0.6946 | 0.7257 |

Table 3: Performance comparison in terms of AUC.

On all the datasets, the interactive features indicated by FIVES consistently improve the performance of LR. The improvements are larger than those of the other search-based methods on most datasets, and they are particularly significant on the largest one (i.e., Criteo). The degradation observed for some reference methods may be caused by redundant features that make the optimization of LR harder.

3.3 Usefulness of generated interactive features (Q3)

Figure 2: Correlation between the entries of the learned adjacency tensor and the AUC of the corresponding indicated feature.

The above experiments evaluate the usefulness of the interactive features generated by FIVES from the perspective of augmenting the feature set for lightweight models. Here, we further evaluate their usefulness from additional perspectives.

First, we can directly calculate the AUC obtained by making predictions with each interactive feature alone. Then we plot the features as points in Fig. 2, with the values of their entries in $A$ on the x-axis and the corresponding AUC values on the y-axis, which illustrates a positive correlation between the entry value and the feature’s AUC. This correlation supports the claim that the generated interactive features are indeed useful.

Another way to assess the usefulness of the generated interactive features is to compare different choices of the adjacency tensor $A$. From the view of NAS, $A$ acts as the architecture parameters. If it represents a useful architecture, we should achieve satisfactory performance by tuning the model parameters w.r.t. it. Specifically, we can fix a well-learned $A$ and learn just the predictive model parameters $\Theta$. We consider the following settings: (1) $\Theta$ is learned from scratch (denoted as LFS); (2) $\Theta$ is fine-tuned with the output of Algorithm 1 as its initialization (denoted as FT). Besides, we also show the performance of a random architecture as the baseline (denoted as Random).

Figure 3: Evaluation from the perspective of edge search

The experimental results in Figure 3 show the advantage of the searched architecture over a random one. Since the architecture parameter $A$ indicates which interactive features should be generated, the effectiveness of the searched architecture confirms the usefulness of the generated interactive features. On the contrary, randomly generated interactions can degrade the performance, especially when the original features are relatively scarce, e.g., on the Employee dataset. Meanwhile, the improvement from “FIVES” to “FT” shows that, given a searched architecture (i.e., a learned $A$), further improvement in predictive performance can be achieved by fine-tuning $\Theta$.

3.4 Contributions of different components (Q4)

3.4.1 Modeling the dependencies between adjacency matrices

As zhou2019bayesnas pointed out, most existing NAS works ignore the dependencies between architecture parameters, which are indispensable for precisely expressing our inductive search strategy. FIVES models such dependencies by recursively defining the adjacency matrices (see Eq. (6)). In contrast, we define a variant of FIVES that ignores such dependencies by parameterizing each $A^{(k)}$ with its own $H^{(k)}$ only. We conduct an ablation study to compare our recursive modeling (Eq. (6)) with this independent modeling, and the results in Figure 4 confirm the effectiveness of the recursive modeling that FIVES adopts.

Figure 4: Comparison of w/ and w/o modeling the dependencies between adjacency matrices

3.4.2 Filling the gap between differentiable search and hard decision

Since a soft $A^{(k)}$ is used in graph convolution (see Eq. (7)), the learned $A^{(k)}$ may lie near the borderline, which causes a gap between the binary decisions and the soft $A^{(k)}$ used during the search phase. As noy2019asap observed, this kind of gap hurts the performance of NAS. To tackle this, we re-scale $A^{(k)}$ with an annealed temperature. Here we conduct an ablation study to examine the necessity of this re-scaling.

Figure 5: Ablation study of the re-scaling applied to $A^{(k)}$.


From Figure 5, we observe that the re-scaling noticeably reduces the performance gap between FIVES and FIVES+LR. Without this re-scaling, the interactions evaluated during the search phase consist of softly weighted ones, which are inconsistent with the interactive features fed into LR. With the re-scaling, almost all entries of $A^{(k)}$ are forced to be binary before being taken into consideration by the next layer, so that the decisions made at the $k$-th order are based on the exact $(k-1)$-order interactive features that will be fed into LR. In this way, the edge search procedure is guided by an exact evaluation of what will be generated.

4 Conclusions

Motivated by our theoretical analysis, we propose FIVES, an automatic feature generation method in which an adjacency tensor is designed to indicate which feature interactions should be made. The usefulness of the indicated interactive features is confirmed from different perspectives in our empirical studies. FIVES provides such interpretability by solving for the optimal adjacency tensor in a differentiable manner, which is much more efficient than search-based methods while preserving the efficiency benefit of DNN-based methods. Extensive experiments show the advantages of FIVES as both a predictive model and a feature generator.

5 Appendix

5.1 Proof of Proposition 1

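We first restate Proposition 1 of the paper here.

Proposition 1 (restated).

Let $X_1$, $X_2$ and $Y$ be Bernoulli random variables with a joint conditional probability mass function $p_{x_1, x_2 \mid y} := P(X_1 = x_1, X_2 = x_2 \mid Y = y)$ such that $x_1, x_2, y \in \{0, 1\}$. Suppose further that the mutual information between $X_i$ and $Y$ satisfies $I(X_i; Y) < L$, where $i \in \{1, 2\}$ and $L$ is a non-negative constant. If $X_1$ and $X_2$ are weakly correlated given $y \in \{0, 1\}$, that is, $|\mathrm{Cov}(X_1, X_2 \mid Y = y)| \le \rho$, we have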


As explained in Section 2.2 of the paper, we use $X_1 X_2$ to denote the interaction of $X_1$ and $X_2$. By Definition 1, the interaction is defined to be the Cartesian product of the individual features. In this sense, $X_1 X_2$ could be regarded as a random variable constructed by some bijective mapping from the tuple $(X_1, X_2)$. In our method, the interaction is expressed via the graph convolutional operator. Although we have assumed such a modeling to be expressive enough, for a rigorous analysis it is safer to regard $X_1 X_2$ as a possibly non-injective function of $(X_1, X_2)$, and thus, by the data processing inequality, we have $I(X_1 X_2; Y) \le I(X_1, X_2; Y)$.



where is the so called incremental entropy .

Note that and , we further have


To prove the Proposition, it remains to prove that:


We start with deriving the mathematical expression for in terms of , . Note the following mathematical relations:

Using the above relations, it is straightforward to show that


Next, we derive the expression for :


Combining (12) and (13), we have


Next, we expand the terms of as follows:


Using the concavity of logarithm, we further have


where the last inequality follows that . The above inequality holds for all , therefore,


By inserting (17) into (10), we finish the proof of Proposition 1. ∎
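
The step of the proof that regards the interaction as a non-injective function of the feature pair presumably relies on a standard fact: a deterministic function of a pair of variables cannot carry more information about the label than the pair itself. The following minimal sketch checks this numerically on a random joint distribution over three Bernoulli variables. The helper names (`mutual_information`, `push_forward`, `random_joint`) and the choice of logical AND as the non-injective mapping are illustrative assumptions, not part of the paper.

```python
import itertools
import math
import random


def mutual_information(joint):
    """Mutual information (in nats) of a pmf given as {(a, y): prob}."""
    pa, py = {}, {}
    for (a, y), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log(p / (pa[a] * py[y]))
        for (a, y), p in joint.items()
        if p > 0
    )


def random_joint(rng):
    """Random pmf over (x1, x2, y) in {0, 1}^3."""
    weights = [rng.random() for _ in range(8)]
    total = sum(weights)
    keys = itertools.product((0, 1), repeat=3)
    return {k: w / total for k, w in zip(keys, weights)}


def push_forward(pmf, f):
    """Joint pmf of (f(x1, x2), y) induced by a pmf over (x1, x2, y)."""
    out = {}
    for (x1, x2, y), p in pmf.items():
        key = (f(x1, x2), y)
        out[key] = out.get(key, 0.0) + p
    return out


rng = random.Random(0)
pmf = random_joint(rng)

# Information carried by the full pair (an injective mapping of the tuple).
mi_pair = mutual_information(push_forward(pmf, lambda a, b: (a, b)))
# Information carried by a non-injective mapping (assumed here: logical AND).
mi_and = mutual_information(push_forward(pmf, lambda a, b: a & b))

print(f"I((X1, X2); Y) = {mi_pair:.6f}")
print(f"I(X1 AND X2; Y) = {mi_and:.6f}")
```

Any other non-injective mapping can be passed to `push_forward` in place of the AND; the second quantity never exceeds the first.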

5.2 Re-scaling Function

Figure 6: Re-Scaling Function

Since we allow a soft adjacency tensor to be used for propagation (see Eq. (7) in Section 2.4 of the paper), the gap between the binary decisions and the learned soft values cannot be neglected [21]. To fill this gap, we re-scale each entry of the adjacency tensor through dividing it by a temperature, which can be formulated as:


Here the re-scaling is applied to each entry of the adjacency tensor. As the temperature anneals to a small value along the search phase, the re-scaled value becomes close to either 0 or 1. Figure 6 illustrates how this mechanism works with different values of the temperature.
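
The re-scaling formula itself is not recoverable from this text, so the following is only a sketch under a stated assumption: the entry is mapped to its logit, divided by the temperature, and passed back through a sigmoid. The function name `rescale`, the clipping constant `eps`, and the temperature values are hypothetical. The sketch illustrates the qualitative behaviour described above, namely that a smaller temperature pushes soft values toward binary decisions.

```python
import math


def rescale(a, tau, eps=1e-7):
    """Temperature re-scaling of a soft entry a in (0, 1).

    Assumed form (not taken from the paper): sigmoid(logit(a) / tau).
    """
    a = min(max(a, eps), 1.0 - eps)      # keep the logit finite
    z = math.log(a / (1.0 - a)) / tau    # divide the logit by the temperature
    # Numerically stable sigmoid.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Illustrative annealing schedule; the values are placeholders.
for tau in (1.0, 0.5, 0.1, 0.05):
    print(tau, [round(rescale(a, tau), 4) for a in (0.3, 0.5, 0.7)])
```

With a temperature of 1 the entry is returned unchanged, and an entry of exactly 0.5 stays at 0.5 for every temperature, so only entries that already lean toward one side are sharpened.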

5.3 Implementation and HPO Details

For FIVES and all baseline methods, we empirically set the batch size to 128 for the small datasets (Employee, Bank, Adult and Credit) and 1024 for the large datasets (Criteo, Business1 and Business2). The learning rate of the baseline methods is set to 5e-3. To mitigate overfitting, model parameters are regularized with a strength of 1e-4 and the dropout rate is set to 0.3. We apply grid search for hyperparameter optimization (HPO). After the HPO procedure, we train and evaluate each method 10 times with the optimal configuration to alleviate the impact of randomness. For better parallelism and economical use of computational resources, we conduct all our experiments on a large cloud platform. The source code will be released later. Other implementation details for each method are described below:

FIVES. The tunable hyperparameters contain the highest order number of interactive features , the learning rate 5e-3 and {5e-3, 5e-4}, the embedding dimension of node representation . The hidden dimension of GNN is set to be the same as the embedding dimension.

DNN. It is implemented by ourselves. The tunable hyperparameters contain the hidden dimension of fully connected layers and the embedding dimension of node representation .

FM. The cloud platform we used has provided some frequently used machine learning algorithms including FM. The tunable hyperparameters include learning rate and coefficients of regularization .

Wide&Deep. It is implemented by ourselves according to [28]. The tunable hyperparameters contain the hidden dimension of fully connected layers in the deep component and the embedding dimension of node representation .

AutoInt. It is reproduced using the source code published by [18]. The tunable hyperparameters contain the number of blocks and the number of attention heads . The hidden dimension of interacting layers is set to 32 as suggested by the original paper and the embedding dimension of node representation .


Fi-GNN. It is reproduced by ourselves via TensorFlow according to the source code published by [17]. The tunable hyperparameters contain the highest order number of interactive features , and the embedding dimension of node representation .

AutoCross. We implement a special LR that either updates all the trainable parameters or updates only the parameters of newly added features. Then we implement a scheduler to trigger training and evaluation routines of the LR over different feature spaces. This does not exploit the “reuse” trick proposed in [10], but expresses the identical search strategy.

LR. We use the LR provided by the cloud platform, which is implemented based on a parameter-server architecture. We set the regularization strength as 1.0 and regularization as 0. The maximum number of iterations is 100 and the tolerance is 1e-6.
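
The special LR described for AutoCross above can be illustrated with a small sketch: one gradient step of logistic regression in which an optional index set restricts the update to the newly added features while the remaining weights stay frozen. The function name `lr_step`, the `trainable` argument, and the toy data are hypothetical; this is not the implementation used in the experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lr_step(weights, rows, labels, lr, trainable=None):
    """One full-batch gradient step of logistic regression.

    If `trainable` is None, every weight is updated. Otherwise only the
    indices in `trainable` (e.g. newly added features) are updated.
    """
    grad = [0.0] * len(weights)
    for x, y in zip(rows, labels):
        err = sigmoid(sum(w * v for w, v in zip(weights, x))) - y
        for j, v in enumerate(x):
            grad[j] += err * v
    n = len(rows)
    return [
        w - lr * grad[j] / n if trainable is None or j in trainable else w
        for j, w in enumerate(weights)
    ]


# Toy data: two original binary features plus their cross as a third column.
rows = [(1, 1, 1), (1, 0, 0), (0, 1, 0), (1, 1, 1)]
labels = [1, 0, 0, 1]
weights = [0.5, -0.2, 0.0]  # the last weight belongs to the new cross feature

only_new = lr_step(weights, rows, labels, lr=0.1, trainable={2})
all_params = lr_step(weights, rows, labels, lr=0.1)
print(only_new)
print(all_params)
```

Restricting the update to the new feature makes evaluating a candidate cross cheap, because the weights already fitted on the existing feature space are left untouched.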

5.4 Efficiency Comparisons

To study the efficiency of FIVES, we empirically compare it against other DNN-based methods in terms of both convergence rate and run-time per epoch. As Figure 7 shows, although we formulate the edge search task as a bilevel optimization problem, FIVES achieves comparable validation AUC at each number of training steps, which indicates a comparable convergence rate. Meanwhile, the run-time per epoch of all the DNN-based methods is of the same order, although that of Fi-GNN and FIVES is somewhat larger than that of the others due to the complexity of the graph convolutional operator. In practice, with 4 Nvidia GTX1080Ti GPU cards, FIVES can complete its search phase on Criteo (traversal of 3 epochs) within hours. In contrast, search-based methods need dozens of hours and hundreds of times more computational resources due to their trial-and-error nature.

Figure 7: Efficiency comparisons between FIVES and some DNN-based methods on Criteo.


5.5 Datasets Availability

Business1 and Business2 are constructed by randomly sampling large-scale search logs. Due to privacy concerns, they are not currently publicly available.

References


  • (1) Bengio, Y., A. Courville, P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • (2) Gers, F. A., N. N. Schraudolph, J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of machine learning research, 3(8):115–143, 2002.
  • (3) Cho, K., B. van Merriënboer, C. Gulcehre, et al. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734. 2014.
  • (4) Krizhevsky, A., I. Sutskever, G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (NIPS), pages 1097–1105. 2012.
  • (5) Vaswani, A., N. Shazeer, N. Parmar, et al. Attention is all you need. In Advances in neural information processing systems (NIPS), pages 5998–6008. 2017.
  • (6) Katz, G., E. C. R. Shin, D. Song. Explorekit: Automatic feature generation and selection. In Proceedings of the International Conference on Data Mining (ICDM), pages 979–984. 2016.
  • (7) Kaul, A., S. Maheshwary, V. Pudi. Autolearn—automated feature generation and selection. In Proceedings of the International Conference on Data Mining (ICDM), pages 217–226. 2017.
  • (8) Hutter, F., L. Kotthoff, J. Vanschoren. Automated Machine Learning. Springer, 2019.
  • (9) Kanter, J. M., K. Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA), pages 1–10. 2015.
  • (10) Luo, Y., M. Wang, H. Zhou, et al. Autocross: Automatic feature crossing for tabular data in real-world applications. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1936–1945. 2019.
  • (11) Rendle, S. Factorization machines. In Proceedings of the International Conference on Data Mining (ICDM), pages 995–1000. 2010.
  • (12) Juan, Y., Y. Zhuang, W.-S. Chin, et al. Field-aware factorization machines for ctr prediction. In Proceedings of the Conference on Recommender Systems (RecSys), pages 43–50. 2016.
  • (13) Shan, Y., T. R. Hoens, J. Jiao, et al. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 255–262. 2016.
  • (14) Guo, H., R. Tang, Y. Ye, et al. DeepFM: A factorization-machine based neural network for ctr prediction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1725–1731. 2017.
  • (15) Lian, J., X. Zhou, F. Zhang, et al. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1754–1763. 2018.
  • (16) Xu, K., W. Hu, J. Leskovec, et al. How powerful are graph neural networks? In International Conference on Learning Representations (ICLR). 2019.
  • (17) Li, Z., Z. Cui, S. Wu, et al. Fi-GNN: Modeling feature interactions via graph neural networks for ctr prediction. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 539–548. 2019.
  • (18) Song, W., C. Shi, Z. Xiao, et al. AutoInt: Automatic feature interaction learning via self-attentive neural networks. In Proceedings of the International Conference on Information and Knowledge Management (CIKM), pages 1161–1170. 2019.
  • (19) Liu, B., C. Zhu, G. Li, et al. Autofis: Automatic feature interaction selection in factorization models for click-through rate prediction. arXiv preprint arXiv:2003.11235, 2020.
  • (20) Liu, H., K. Simonyan, Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations (ICLR). 2019.
  • (21) Noy, A., N. Nayman, T. Ridnik, et al. Asap: Architecture search, anneal and prune. arXiv preprint arXiv:1904.04123, 2019.
  • (22) Zhou, H., M. Yang, J. Wang, et al. BayesNAS: A Bayesian approach for neural architecture search. In Proceedings of the International Conference on Machine Learning (ICML), vol. 97, pages 7603–7613. 2019.
  • (23) Cai, H., L. Zhu, S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR). 2019.
  • (24) Bengio, Y., N. Léonard, A. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • (25) Sutton, R. S., D. A. McAllester, S. P. Singh, et al. Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems (NIPS), pages 1057–1063. 2000.
  • (26) Jang, E., S. Gu, B. Poole. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR). 2017.
  • (27) Nayman, N., A. Noy, T. Ridnik, et al. Xnas: Neural architecture search with expert advice. In Advances in neural information processing systems (NeurIPS), pages 1977–1987. 2019.
  • (28) Cheng, H.-T., L. Koc, J. Harmsen, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10. 2016.