AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks

10/29/2018 · by Weiping Song, et al. · Peking University and HEC Montréal

Click-through rate (CTR) prediction, which aims to predict the probability of a user clicking on an ad or an item, is critical to many online applications such as online advertising and recommender systems. The problem is very challenging since (1) the input features (e.g., the user id, user age, item id, item category) are usually sparse and high-dimensional, and (2) effective prediction relies on high-order combinatorial features (a.k.a. cross features), which are very time-consuming to hand-craft by domain experts and impossible to enumerate exhaustively. Therefore, there have been efforts to find low-dimensional representations of the sparse and high-dimensional raw features and their meaningful combinations. In this paper, we propose an effective and efficient algorithm to automatically learn high-order combinations of input features. Our proposed algorithm is very general and can be applied to both numerical and categorical input features. Specifically, we map both the numerical and categorical features into the same low-dimensional space. Afterward, a multi-head self-attentive neural network with residual connections is proposed to explicitly model the feature interactions in the low-dimensional space. With different layers of the multi-head self-attentive neural network, different orders of feature combinations of the input features can be modeled. The whole model can be efficiently fitted on large-scale raw data in an end-to-end fashion. Experimental results on four real-world datasets show that our proposed approach not only outperforms existing state-of-the-art approaches for prediction but also offers good explainability.


1. Introduction

Predicting the probability of a user clicking an ad or an item (a.k.a. click-through rate prediction) is a critical problem for many applications such as online advertising and recommender systems. The performance of the prediction has a direct impact on the final revenue of business providers. Due to its importance, it has attracted growing interest from both the academic and industrial communities.

Machine learning has been playing a key role in click-through rate prediction, which is usually formulated as supervised learning with user profiles and item attributes as input features. The problem is very challenging for several reasons. First, the input features are extremely sparse and high-dimensional (McMahan et al., 2013; Shan et al., 2016a; He and Chua, 2017; Cheng et al., 2016; Guo et al., 2017). In real-world applications, a considerable percentage of users' demographics and items' attributes are usually discrete and/or categorical. To make supervised learning methods applicable, these features are first converted to one-hot encoding vectors, which can easily result in features with millions of dimensions. Taking the well-known CTR prediction dataset Criteo (http://labs.criteo.com/2014/09/kaggle-contest-dataset-now-available-academic-use/) as an example, the feature dimension is approximately 30 million with sparsity over 99.99%. With such sparse and high-dimensional input features, machine learning models easily overfit. Second, as shown in extensive literature (Cheng et al., 2016; Guo et al., 2017; Lian et al., 2018; Shan et al., 2016a), high-order feature combinations are crucial for good performance. For example, it is reasonable to recommend Mario Bros., a famous video game, to David, a ten-year-old boy. In this case, the third-order combinatorial feature <Gender=Male, Age=10, ProductCategory=VideoGame> is very informative for prediction. However, finding such meaningful high-order combinatorial features heavily relies on domain experts. Moreover, it is almost impossible to hand-craft all the meaningful combinations (Rendle, 2010; Cheng et al., 2016). One might suggest enumerating all the possible high-order features and letting machine learning models select the meaningful ones. However, enumerating all the possible high-order features exponentially increases the dimension and sparsity of the input features, leading to even more serious model overfitting. Therefore, there have been extensive efforts in the community to find low-dimensional representations of the sparse and high-dimensional input features while modeling different orders of feature combinations.

For example, Factorization Machines (FM) (Rendle, 2010), which combine polynomial regression models with factorization techniques, were developed to model feature interactions and have been proved effective for various tasks (Rendle et al., 2011, 2010). However, limited by its polynomial fitting time, FM is only effective for modeling low-order feature interactions and is impractical for capturing high-order feature interactions. Recently, many works (He and Chua, 2017; Cheng et al., 2016; Guo et al., 2017; Wang et al., 2017) based on deep neural networks have been proposed to model high-order feature interactions. Specifically, multiple layers of non-linear neural networks are usually used to capture the high-order feature interactions. However, such methods suffer from two limitations. First, fully-connected neural networks have been shown to be inefficient at learning multiplicative feature interactions (Beutel et al., 2018). Second, since these models learn the feature interactions in an implicit way, they lack good explainability as to which feature combinations are meaningful. Therefore, we look for an approach that is able to explicitly model different orders of feature combinations, represent all the features in a low-dimensional space, and meanwhile offer good model explainability.

In this paper, we propose such an approach based on the multi-head self-attention mechanism (Vaswani et al., 2017). Our proposed approach learns effective low-dimensional representations of the sparse and high-dimensional input features and is applicable to both categorical and numerical input features. Specifically, both the categorical and numerical features are first embedded into the same low-dimensional space, which reduces the dimension of the input features and meanwhile allows different types of features to interact with each other via vector arithmetic (e.g., summation and inner product). Afterward, we propose a novel interacting layer to promote interactions between different features. Within each interacting layer, each feature is allowed to interact with all the other features and is able to automatically identify relevant features to form meaningful higher-order features via the multi-head attention mechanism (Vaswani et al., 2017). Moreover, the multi-head mechanism projects a feature into multiple subspaces, and hence it can capture different feature interactions in different subspaces. Such an interacting layer models one-step interactions between the features. By stacking multiple interacting layers, we are able to model different orders of feature interactions. In practice, a residual connection (He et al., 2016) is added to the interacting layer, which allows different orders of feature combinations to be combined. We use the attention mechanism to measure the correlations between features, which offers good model explainability.

To summarize, in this paper we make the following contributions:

  • We propose to study the problem of explicitly learning high-order feature interactions while finding models with good explainability for this problem.

  • We propose a novel approach based on a self-attentive neural network, which can automatically learn high-order feature interactions and efficiently handle large-scale, high-dimensional sparse data.

  • We conducted extensive experiments on several real-world data sets. Experimental results on the task of CTR prediction show that our proposed approach not only outperforms existing state-of-the-art approaches for prediction but also offers good model explainability.

Our work is organized as follows. In Section 2, we summarize the related work. Section 3 formally defines our problem. Section 4 presents the proposed approach to learn feature interactions. In Section 5, we present the experimental results and detailed analysis. We conclude this paper and point out the future work in Section 6.

2. Related work

Our work is relevant to three lines of work: 1) Click-through rate prediction in recommender systems and online advertising, 2) techniques for learning feature interactions, and 3) self-attention mechanism and residual networks in the literature of deep learning.

2.1. Click-through Rate Prediction

Predicting click-through rates is important to many Internet companies, and various systems have been developed by different companies (Richardson et al., 2007; Graepel et al., 2010; McMahan et al., 2013; He et al., 2014; Cheng et al., 2016; Covington et al., 2016; Zhou et al., 2017). For example, Google developed the Wide&Deep (Cheng et al., 2016) learning system for recommender systems, which combines the advantages of both linear shallow models and deep models; the system achieves remarkable performance in app recommendation. The problem also receives a lot of attention from the academic community. For example, Shan et al. (2016b) proposed a context-aware CTR prediction method which factorizes a three-way <user, ad, context> tensor, and Oentaryo et al. (2014) developed a hierarchical importance-aware factorization machine to model the dynamic impacts of ads.

2.2. Learning Feature Interactions

Learning feature interactions is a fundamental problem and therefore extensively studied in the literature. A well-known example is Factorization Machines (FM) (Rendle, 2010), which were proposed to mainly capture the first- and second-order feature interactions and have been proved effective for many tasks in recommender systems (Rendle et al., 2010, 2011). Afterward, different variants of factorization machines have been proposed. For example, Field-aware Factorization Machines (FFM) (Juan et al., 2016) modeled fine-grained interactions between features from different fields. GBFM (Cheng et al., 2014) and AFM (Xiao et al., 2017) considered the importance of different second-order feature interactions. However, all these approaches focus on modeling low-order feature interactions.

There are some recent works that model high-order feature interactions. For example, NFM (He and Chua, 2017) stacks deep neural networks on top of the output of the second-order feature interactions to model higher-order features. Similarly, PNN (Qu et al., 2016), FNN (Zhang et al., 2016), DeepCrossing (Shan et al., 2016a), Wide&Deep (Cheng et al., 2016), and DeepFM (Guo et al., 2017) utilize feed-forward neural networks to model high-order feature interactions. However, all these approaches learn the high-order feature interactions in an implicit way and therefore lack good model explainability. In contrast, there are three lines of work that learn feature interactions in an explicit fashion. First, Deep&Cross (Wang et al., 2017) and xDeepFM (Lian et al., 2018) take the outer product of features at the bit-wise and vector-wise level, respectively. Although they perform explicit feature interactions, it is not trivial to explain which combinations are useful. Second, some tree-based methods (Zhu et al., 2017; Zhao et al., 2017; Wang et al., 2018) combine the power of embedding-based models and tree-based models, but have to break the training procedure into multiple stages. Third, HOFM (Blondel et al., 2016a) proposes efficient training algorithms for high-order factorization machines; however, HOFM requires too many parameters and only its low-order (usually less than 5) form can be used in practice. We compare against all of these methods in our experiments.

2.3. Attention and Residual Networks

Our proposed model makes use of two of the latest techniques in the deep learning literature: attention (Bahdanau et al., 2014) and residual networks (He et al., 2016). Attention was first proposed in the context of neural machine translation (Bahdanau et al., 2014) and has been proved effective in a variety of tasks such as question answering (Sukhbaatar et al., 2015), text summarization (Rush et al., 2015), and recommender systems (Zhou et al., 2017). Vaswani et al. (2017) further proposed multi-head self-attention to model complicated dependencies between words in machine translation.

Residual networks (He et al., 2016) achieved state-of-the-art performance in the ImageNet contest. Since the residual connection, which can be simply formalized as $\mathbf{y} = \mathcal{F}(\mathbf{x}) + \mathbf{x}$, encourages gradient flow through intermediate layers, it has become a popular network structure for training very deep neural networks.

3. Problem Definition

We first formally define the problem of click-through rate (CTR) prediction as follows:

DEFINITION 1. (CTR Prediction) Let $\mathbf{x} \in \mathbb{R}^n$ denote the concatenation of user $u$'s features and item $v$'s features, where categorical features are represented with one-hot encoding and $n$ is the dimension of the concatenated features. The problem of click-through rate prediction aims to predict the probability of user $u$ clicking item $v$ according to the feature vector $\mathbf{x}$.

A straightforward solution for CTR prediction is to treat $\mathbf{x}$ as the input features and deploy off-the-shelf classifiers such as logistic regression. However, since the original feature vector $\mathbf{x}$ is very sparse and high-dimensional, the model will easily overfit. Therefore, it is desirable to represent the raw input features in low-dimensional continuous spaces. Moreover, as shown in the existing literature, it is crucial to utilize higher-order combinatorial features to yield good prediction performance (Rendle, 2010; Cheng et al., 2016; Shan et al., 2016a; Novikov et al., 2016; Guo et al., 2017; Blondel et al., 2016b). Specifically, we define high-order combinatorial features as follows:

DEFINITION 2. (p-order Combinatorial Feature) Given input feature vector $\mathbf{x} \in \mathbb{R}^n$, a p-order combinatorial feature is defined as $g(x_{i_1}, \ldots, x_{i_p})$, where each feature comes from a distinct field, $p$ is the number of involved feature fields, and $g(\cdot)$ can be any combination function, such as multiplication (Rendle, 2010) and outer product (Lian et al., 2018; Wang et al., 2017).
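For concreteness, the cross feature from the introduction can be written as a third-order (p = 3) combinatorial feature with multiplication as the combination function; the rendering below is our own illustrative example, not taken from the paper.

```latex
% Third-order (p = 3) combinatorial feature over three one-hot input entries,
% using multiplication as the combination function g(.):
\[
  g\bigl(x_{\text{Gender=Male}},\, x_{\text{Age=10}},\, x_{\text{Category=VideoGame}}\bigr)
  = x_{\text{Gender=Male}} \cdot x_{\text{Age=10}} \cdot x_{\text{Category=VideoGame}},
\]
% which equals 1 only when all three entries are active, i.e., exactly the
% <Gender=Male, Age=10, ProductCategory=VideoGame> cross feature from Section 1.
```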

Traditionally, meaningful high-order combinatorial features are hand-crafted by domain experts. However, this is very time-consuming and hard to generalize to other domains. Besides, it is almost impossible to hand-craft all meaningful high-order features. Therefore, we aim to develop an approach that is able to automatically discover the meaningful high-order combinatorial features and meanwhile map all these features into low-dimensional continuous spaces. Formally, we define our problem as follows:

DEFINITION 3. (Problem Definition) Given an input feature vector $\mathbf{x} \in \mathbb{R}^n$ for click-through rate prediction, our goal is to learn a low-dimensional representation of $\mathbf{x}$ which models the high-order combinatorial features.

4. Model

In this section, we first give an overview of the proposed approach AutoInt, which can automatically learn feature interactions for CTR prediction. Next, we present a comprehensive description of how to learn a low-dimensional representation that models high-order combinatorial features without manual feature engineering.

Figure 1. Overview of our proposed model AutoInt. The details of embedding layer and interacting layer are illustrated in Figure 2 and Figure 3 respectively.

4.1. Overview

The goal of our approach is to map the original sparse and high-dimensional feature vector into low-dimensional spaces and meanwhile model the high-order feature interactions. As shown in Figure 1, our proposed method takes the sparse feature vector as input, followed by an embedding layer that projects all features (i.e., both categorical and numerical features) into the same low-dimensional space. Next, we feed embeddings of all fields into a novel interacting layer, which is implemented as a multi-head self-attentive neural network. For each interacting layer, high-order features are combined through the attention mechanism, and different kinds of combinations can be evaluated with the multi-head mechanisms, which map the features into different subspaces. By stacking multiple interacting layers, different orders of combinatorial features can be modeled.

The output of the final interacting layer is the low-dimensional representation of the input feature, which models the high-order combinatorial features and is further used for estimating the click-through rate through a sigmoid function. Next, we introduce the details of our proposed method.

4.2. Input Layer

We first represent users' profiles and items' attributes as a sparse vector, which is the concatenation of all fields. Specifically,

$\mathbf{x} = [\mathbf{x}_1; \mathbf{x}_2; \ldots; \mathbf{x}_M]$,    (1)

where $M$ is the total number of feature fields, and $\mathbf{x}_i$ is the feature representation of the $i$-th field. $\mathbf{x}_i$ is a one-hot vector if the $i$-th field is categorical (e.g., $\mathbf{x}_1$ in Figure 2), and $\mathbf{x}_i$ is a scalar value if the $i$-th field is numerical (e.g., $x_M$ in Figure 2).

Figure 2. Illustration of input and embedding layer, where both categorical and numerical fields are represented by low-dimensional dense vectors.

4.3. Embedding Layer

Since the feature representations of categorical features are very sparse and high-dimensional, a common way is to represent them in low-dimensional spaces (e.g., word embeddings). Specifically, we represent each categorical feature with a low-dimensional vector, i.e.,

$\mathbf{e}_i = \mathbf{V}_i \mathbf{x}_i$,    (2)

where $\mathbf{V}_i$ is an embedding matrix for field $i$ and $\mathbf{x}_i$ is a one-hot vector.

To allow interactions between categorical and numerical features, we also represent the numerical features in the same low-dimensional feature space. Specifically, we represent a numerical feature as

$\mathbf{e}_m = \mathbf{v}_m x_m$,    (3)

where $\mathbf{v}_m$ is an embedding vector for field $m$ and $x_m$ is a scalar value.

By doing this, the output of the embedding layer would be a concatenation of multiple embedding vectors, as presented in Figure 2.
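To make Equations 2 and 3 concrete, here is a minimal NumPy sketch of such an embedding layer; the field names, vocabulary sizes, and sample values are illustrative assumptions rather than details from the paper or its released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # embedding dimension, as used later in the experiments (Section 5.1.4)

# Illustrative field setup: two categorical fields and one numerical field.
vocab_sizes = {"user_id": 1000, "item_category": 50}

# Equation (2): one embedding matrix V_i per categorical field.
V = {f: rng.normal(0.0, 0.01, size=(n, d)) for f, n in vocab_sizes.items()}
# Equation (3): one embedding vector v_m per numerical field ("price" is made up).
v_price = rng.normal(0.0, 0.01, size=d)

def embed(sample):
    """Map one raw sample (field -> value) to a stack of d-dimensional embeddings."""
    e_user = V["user_id"][sample["user_id"]]             # e_i = V_i x_i (one-hot lookup)
    e_cat = V["item_category"][sample["item_category"]]  # e_i = V_i x_i
    e_price = v_price * sample["price"]                   # e_m = v_m x_m (scalar scaling)
    return np.stack([e_user, e_cat, e_price])             # shape: (M = 3, d)

E = embed({"user_id": 42, "item_category": 7, "price": 0.35})
print(E.shape)  # (3, 16): all fields now live in the same low-dimensional space
```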

4.4. Interacting Layer

Once the numerical and categorical features live in the same low-dimensional space, we move to model high-order combinatorial features in the space. The key problem is to determine which features should be combined to form meaningful high-order features. Traditionally, this is accomplished by domain experts who create meaningful combinations based on their knowledge. In this paper, we tackle this problem with a novel method, the multi-head self-attention mechanism (Vaswani et al., 2017).

Multi-head self-attentive network (Vaswani et al., 2017) has recently achieved remarkable performance in modeling complicated relations. For example, it shows superiority for modeling arbitrary word dependency in machine translation (Vaswani et al., 2017) and sentence embedding (Lin et al., 2017), and has been successfully applied to capturing node similarities in graph embedding (Velickovic et al., 2017). Here we extend this latest technique to model the correlations between different feature fields.

Specifically, we adopt the key-value attention mechanism (Miller et al., 2016) to determine which feature combinations are meaningful. Taking feature $m$ as an example, next we explain how to identify multiple meaningful high-order features involving feature $m$. We first define the correlation between feature $m$ and feature $k$ under a specific attention head $h$ as follows:

$\alpha_{m,k}^{(h)} = \dfrac{\exp(\psi^{(h)}(\mathbf{e}_m, \mathbf{e}_k))}{\sum_{l=1}^{M} \exp(\psi^{(h)}(\mathbf{e}_m, \mathbf{e}_l))}, \quad \psi^{(h)}(\mathbf{e}_m, \mathbf{e}_k) = \langle \mathbf{W}_{\text{Query}}^{(h)} \mathbf{e}_m, \mathbf{W}_{\text{Key}}^{(h)} \mathbf{e}_k \rangle$,    (4)

where $\psi^{(h)}(\cdot, \cdot)$ is an attention function which defines the similarity between feature $m$ and feature $k$. It can be defined as a neural network or as simply as an inner product, i.e., $\langle \cdot, \cdot \rangle$. In this work, we use the inner product due to its simplicity and effectiveness. $\mathbf{W}_{\text{Query}}^{(h)}, \mathbf{W}_{\text{Key}}^{(h)} \in \mathbb{R}^{d' \times d}$ in Equation 4 are transformation matrices which map the original embedding space $\mathbb{R}^d$ into a new space $\mathbb{R}^{d'}$. Next, we update the representation of feature $m$ in subspace $h$ by combining all relevant features guided by the coefficients $\alpha_{m,k}^{(h)}$:

$\tilde{\mathbf{e}}_m^{(h)} = \sum_{k=1}^{M} \alpha_{m,k}^{(h)} \left(\mathbf{W}_{\text{Value}}^{(h)} \mathbf{e}_k\right)$,    (5)

where $\mathbf{W}_{\text{Value}}^{(h)} \in \mathbb{R}^{d' \times d}$.

Since $\tilde{\mathbf{e}}_m^{(h)}$ is a combination of feature $m$ and its relevant features (under head $h$), it represents a new combinatorial feature learned by our method. Furthermore, a feature is also likely to be involved in different combinatorial features, and we achieve this by using multiple heads, which create different subspaces and learn distinct feature interactions separately. We collect the combinatorial features learned in all subspaces as follows:

$\tilde{\mathbf{e}}_m = \tilde{\mathbf{e}}_m^{(1)} \oplus \tilde{\mathbf{e}}_m^{(2)} \oplus \cdots \oplus \tilde{\mathbf{e}}_m^{(H)}$,    (6)

where $\oplus$ is the concatenation operator and $H$ is the total number of heads.

Figure 3. The architecture of the interacting layer. Combinatorial features are conditioned on attention weights, i.e., $\alpha_{m,k}^{(h)}$.

To preserve previously learned combinatorial features, including raw individual features, we add standard residual connections in our network. Formally,

$\mathbf{e}_m^{\text{Res}} = \mathrm{ReLU}\left(\tilde{\mathbf{e}}_m + \mathbf{W}_{\text{Res}}\, \mathbf{e}_m\right)$,    (7)

where $\mathbf{W}_{\text{Res}} \in \mathbb{R}^{d'H \times d}$ is the projection matrix in case of dimension mismatch (He et al., 2016), and $\mathrm{ReLU}(z) = \max(0, z)$ is a non-linear activation function.

With such an interacting layer, the representation of each feature $\mathbf{e}_m$ is updated into a new feature representation $\mathbf{e}_m^{\text{Res}}$, which is a representation of high-order features. We can stack multiple such layers, with the output of the previous interacting layer as the input of the next. By doing this, we can model arbitrary-order combinatorial features.
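As an illustration of Equations 4-7, here is a minimal NumPy sketch of a single interacting layer with inner-product attention; the shapes, random initialization, and toy inputs are assumptions for demonstration and do not reproduce the authors' TensorFlow implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def interacting_layer(E, params):
    """One interacting layer (Equations 4-7): multi-head key-value self-attention
    over the M field embeddings, followed by a residual connection and ReLU.

    E: (M, d) field embeddings. Each head uses W_Query, W_Key, W_Value of shape
    (d_prime, d); W_Res has shape (H * d_prime, d) for the residual projection.
    """
    heads = []
    for W_q, W_k, W_v in params["heads"]:
        Q, K, V = E @ W_q.T, E @ W_k.T, E @ W_v.T   # project into the head's subspace
        alpha = softmax(Q @ K.T)                    # Eq. (4): inner-product attention
        heads.append(alpha @ V)                     # Eq. (5): weighted sum of values
    E_tilde = np.concatenate(heads, axis=1)         # Eq. (6): concatenate the H heads
    return relu(E_tilde + E @ params["W_Res"].T)    # Eq. (7): residual + non-linearity

# Toy usage with M=3 fields, d=16, H=2 heads, d_prime=8 (shapes are illustrative).
rng = np.random.default_rng(0)
M, d, H, d_prime = 3, 16, 2, 8
params = {
    "heads": [tuple(rng.normal(0.0, 0.1, (d_prime, d)) for _ in range(3)) for _ in range(H)],
    "W_Res": rng.normal(0.0, 0.1, (H * d_prime, d)),
}
out = interacting_layer(rng.normal(size=(M, d)), params)
print(out.shape)  # (3, 16): can be fed into the next interacting layer or the output layer
```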

Time Complexity Analysis. The main cost of one-step feature interaction is two-fold. First, calculating attention weights for one head takes $O(Mdd')$ time. Afterward, forming combinatorial features under one head also takes $O(Mdd')$ time. Because we have $H$ heads, it takes $O(MHdd')$ time altogether. It is therefore efficient because $H$, $d$, and $d'$ are usually small.

4.5. Output Layer

The output of the interacting layer is a set of feature vectors $\{\mathbf{e}_m^{\text{Res}}\}_{m=1}^{M}$, which includes the raw individual features preserved by the residual block and the combinatorial features learned via the multi-head self-attention mechanism. For the final CTR prediction, we simply concatenate all of them and then apply a non-linear projection as follows:

$\hat{y} = \sigma\left(\mathbf{w}^{\top}\left(\mathbf{e}_1^{\text{Res}} \oplus \mathbf{e}_2^{\text{Res}} \oplus \cdots \oplus \mathbf{e}_M^{\text{Res}}\right) + b\right)$,    (8)

where $\mathbf{w} \in \mathbb{R}^{d'HM}$ is a column projection vector which linearly combines the concatenated features, $b$ is the bias, $\sigma(z) = 1/(1 + e^{-z})$, and $\hat{y}$ is the predicted click-through rate.
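A correspondingly small sketch of the output layer in Equation 8, continuing the toy shapes from the previous sketch; the weights and inputs are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_ctr(E_res, w, b):
    """Equation (8): concatenate the M updated field embeddings into one long
    vector and apply a sigmoid-activated linear projection."""
    return sigmoid(E_res.ravel() @ w + b)

# Toy usage continuing the shapes above: M=3 fields, each of dimension d'H=16.
rng = np.random.default_rng(1)
E_res = rng.normal(size=(3, 16))
w, b = rng.normal(size=3 * 16), 0.0
print(predict_ctr(E_res, w, b))  # a scalar in (0, 1): the estimated click-through rate
```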

4.6. Training

Our loss function is the Log loss, which is defined as follows:

$\text{Logloss} = -\dfrac{1}{N}\sum_{j=1}^{N}\left(y_j \log(\hat{y}_j) + (1 - y_j)\log(1 - \hat{y}_j)\right)$,    (9)

where $y_j$ and $\hat{y}_j$ are the ground truth of user clicks and the estimated CTR respectively, $j$ indexes the training samples, and $N$ is the total number of training samples. The parameters to learn in our model are $\{\mathbf{V}_i, \mathbf{v}_m, \mathbf{W}_{\text{Query}}^{(h)}, \mathbf{W}_{\text{Key}}^{(h)}, \mathbf{W}_{\text{Value}}^{(h)}, \mathbf{W}_{\text{Res}}, \mathbf{w}, b\}$, which are updated by minimizing the total Logloss using gradient descent.
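For reference, a minimal NumPy version of the Logloss in Equation 9; the paper minimizes it with gradient descent (Adam in the experiments), which is omitted from this sketch, and the labels and predictions below are made up for illustration.

```python
import numpy as np

def logloss(y_true, y_pred, eps=1e-7):
    """Equation (9): Log loss averaged over the N training samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

# Toy labels and estimated CTRs, purely for illustration.
y = np.array([1.0, 0.0, 1.0, 1.0])
p = np.array([0.9, 0.2, 0.7, 0.6])
print(logloss(y, p))  # ~0.30
```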

5. Experiment

In this section, we move forward to evaluate the effectiveness of our proposed approach. We aim to answer the following questions:

  • RQ1: How does our proposed AutoInt perform on the problem of CTR prediction? Is it efficient for large-scale sparse and high-dimensional data?

  • RQ2: What are the influences of different model configurations?

  • RQ3: What are the dependency structures between different features? Is our proposed model explainable?

  • RQ4: Will integrating implicit feature interactions further improve the performance?

We first describe the experimental settings before answering these questions.

5.1. Experiment Setup

5.1.1. Data Sets

We use four public real-world data sets. The statistics of the data sets are summarized in Table 1.

Criteo (https://www.kaggle.com/c/criteo-display-ad-challenge) This is a benchmark dataset for CTR prediction, which has 45 million users' click records on displayed ads. It contains 26 categorical feature fields and 13 numerical feature fields.

Avazu (https://www.kaggle.com/c/avazu-ctr-prediction) This dataset contains users' mobile behaviors, including whether a displayed mobile ad is clicked by a user or not. It has 23 feature fields spanning from user/device features to ad attributes.

KDD12 (https://www.kaggle.com/c/kddcup2012-track2) This data set was released by KDDCup 2012 and originally aimed to predict the number of clicks. Since our work focuses on CTR prediction rather than the exact number of clicks, we treat it as a binary classification problem (1 for clicks > 0, 0 for no clicks), similar to FFM (Juan et al., 2016).

MovieLens-1M (https://grouplens.org/datasets/movielens/) This dataset contains users' ratings on movies. During binarization, we treat samples with a rating less than 3 as negative samples because a low score indicates that the user does not like the movie. We treat samples with a rating greater than 3 as positive samples and remove neutral samples, i.e., those with a rating equal to 3.


Data Preparation First, we remove infrequent features (appearing in fewer than threshold instances) and treat them as a single feature "<unknown>", where threshold is set to {10, 5, 10} for the Criteo, Avazu, and KDD12 data sets respectively. Second, since numerical features may have large variance and hurt machine learning algorithms, we normalize numerical values by transforming a value $z$ to $\log^2(z)$ if $z > 2$, as proposed by the winner of the Criteo Competition (https://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf). Third, we randomly select 80% of all samples for training and randomly split the rest into validation and test sets of equal size.
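A possible pandas sketch of these three preparation steps; the DataFrame layout, column lists, and helper name are assumptions for illustration and may differ from the authors' actual preprocessing scripts.

```python
import numpy as np
import pandas as pd

def prepare(df, cat_cols, num_cols, threshold=10, seed=42):
    """Sketch of the three preparation steps described above.

    1) Replace categorical values seen fewer than `threshold` times with "<unknown>".
    2) Transform a numerical value z to log(z)**2 when z > 2.
    3) Randomly split 80% / 10% / 10% into train, validation, and test sets.
    """
    df = df.copy()
    for c in cat_cols:
        counts = df[c].value_counts()
        rare = counts[counts < threshold].index
        df[c] = df[c].where(~df[c].isin(rare), "<unknown>")
    for c in num_cols:
        z = df[c].astype(float)
        df[c] = np.where(z > 2, np.log(z) ** 2, z)
    df = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n = len(df)
    return df[: int(0.8 * n)], df[int(0.8 * n): int(0.9 * n)], df[int(0.9 * n):]
```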

Data #Samples #Fields #Features (Sparse)
Criteo 45,840,617 39 998,960
Avazu 40,428,967 23 1,544,488
KDD12 149,639,105 13 6,019,086
MovieLens-1M 739,012 7 3,529
Table 1. Statistics of evaluation data sets.
Model Class | Model | Criteo AUC | Criteo Logloss | Avazu AUC | Avazu Logloss | KDD12 AUC | KDD12 Logloss | MovieLens-1M AUC | MovieLens-1M Logloss
First-order | LR | 0.7820 | 0.4695 | 0.7560 | 0.3964 | 0.7361 | 0.1684 | 0.7716 | 0.4424
Second-order | FM (Rendle, 2010) | 0.7836 | 0.4700 | 0.7706 | 0.3856 | 0.7759 | 0.1573 | 0.8252 | 0.3998
Second-order | AFM (Xiao et al., 2017) | 0.7938 | 0.4584 | 0.7718 | 0.3854 | 0.7659 | 0.1591 | 0.8227 | 0.4048
High-order | DeepCrossing (Shan et al., 2016a) | 0.8012 | 0.4513 | 0.7643 | 0.3889 | 0.7715 | 0.1591 | 0.8453 | 0.3814
High-order | NFM (He and Chua, 2017) | 0.7957 | 0.4562 | 0.7708 | 0.3864 | 0.7515 | 0.1631 | 0.8357 | 0.3883
High-order | CrossNet (Wang et al., 2017) | 0.7907 | 0.4591 | 0.7667 | 0.3868 | 0.7773 | 0.1572 | 0.7968 | 0.4266
High-order | CIN (Lian et al., 2018) | 0.8009 | 0.4517 | 0.7758 | 0.3829 | 0.7800 | 0.1566 | 0.8286 | 0.4108
High-order | HOFM (Blondel et al., 2016a) | 0.8005 | 0.4508 | 0.7701 | 0.3854 | 0.7707 | 0.1586 | 0.8304 | 0.4013
High-order | AutoInt (ours) | 0.8061 | 0.4454 | 0.7752 | 0.3823 | 0.7881 | 0.1545 | 0.8460 | 0.3784
Table 2. Effectiveness Comparison of Different Algorithms. Our proposed model outperforms the baselines on nearly all of the four data sets and both metrics. Further analysis is provided in Section 5.2.

5.1.2. Evaluation Metrics

We use two popular metrics to evaluate the performance of all methods.

AUC Area Under the ROC Curve (AUC) measures the probability that a CTR predictor will assign a higher score to a randomly chosen positive item than a randomly chosen negative item. A higher AUC indicates a better performance.

Logloss Since all models attempt to minimize the Logloss defined by Equation 9, we use it as a straightforward metric.

Note that an improvement in AUC or Logloss at the 0.001 level is regarded as significant for the CTR prediction task, as has also been pointed out in existing works (Cheng et al., 2016; Guo et al., 2017; Wang et al., 2017).
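Both metrics are available off the shelf, for example via scikit-learn; the toy labels and predicted CTRs below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

# Toy ground-truth clicks and predicted CTRs, purely for illustration.
y_true = np.array([1, 0, 0, 1, 1])
y_pred = np.array([0.80, 0.10, 0.60, 0.55, 0.90])

auc = roc_auc_score(y_true, y_pred)  # probability that a positive outranks a negative
ll = log_loss(y_true, y_pred)        # the same objective as Equation (9)
print(f"AUC = {auc:.4f}, Logloss = {ll:.4f}")
```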

5.1.3. Competing Models

We compare the proposed approach with three classes of previous models: (A) the linear approach that only uses individual features; (B) factorization machine-based methods that take second-order combinatorial features into account; and (C) techniques that can capture high-order feature interactions. We indicate the model class after each model name accordingly.

LR (A). LR only models the linear combination of raw individual features.

FM (Rendle, 2010) (B). FM uses factorization techniques to model second-order feature interactions.

AFM (Xiao et al., 2017) (B). AFM (https://github.com/sunchenglong/attentional_factorization_machine) is one of the state-of-the-art models that capture second-order feature interactions. It extends FM by using an attention mechanism to distinguish the different importance of second-order combinatorial features.

DeepCrossing (Shan et al., 2016a) (C). DeepCrossing utilizes deep fully-connected neural networks with residual connections to learn non-linear feature interactions in an implicit fashion.

NFM (He and Chua, 2017) (C). NFM (https://github.com/SyncWorld/neural_factorization_machine) stacks deep neural networks on top of the second-order feature interaction layer. High-order feature interactions are captured implicitly by the non-linear activations of the neural networks.

CrossNet (Wang et al., 2017) (C). Cross Network, the core of the Deep&Cross model, takes the outer product of the concatenated feature vector at the bit-wise level to model feature interactions explicitly.

CIN (Lian et al., 2018) (C). Compressed Interaction Network, the core of the xDeepFM model, takes the outer product of the stacked feature matrix at the vector-wise level.

We will compare with the full models of CrossNet and CIN, i.e., Deep&Cross and xDeepFM, in a joint training setting later.

HOFM (Blondel et al., 2016a) (C). HOFM proposes efficient kernel-based algorithms for training high-order factorization machines. Following the settings in Blondel et al. (2016a) and He and Chua (2017), we build a third-order factorization machine using a public implementation (https://github.com/geffy/tffm).

Figure 4. Efficiency Comparison of Different Algorithms in terms of Run Time on (a) Criteo, (b) Avazu, (c) KDD12, and (d) MovieLens-1M. "DC" and "CN" are short for DeepCrossing and CrossNet, respectively. Since HOFM cannot fit on one GPU card for the KDD12 dataset, the extra communication cost makes it the most time-consuming method. Further analysis is presented in Section 5.2.

5.1.4. Implementation Details

All methods are implemented in TensorFlow (Abadi et al., 2016). (We will release our code upon publication.) We use an embedding dimension of 16 and a batch size of 1024 for all methods. The number of hidden units is set to 32. We use Adam (Kingma and Ba, 2014) to optimize all deep neural network-based models. DeepCrossing has four feed-forward layers, each with 100 hidden units. We use one hidden layer of size 200 on top of the Bi-Interaction layer for NFM, as recommended by their paper. For CN and CIN, we consistently use three interaction layers. To prevent overfitting, we add dropout (Srivastava et al., 2014) with ratio 0.5 for the small MovieLens-1M dataset; we found it unnecessary for the other three large data sets. Unless otherwise specified, we stack three interacting layers and use two attention heads in each layer in the following experiments.
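For convenience, the reported hyperparameters can be collected into a single configuration, as in the sketch below; the key names are our own, and the released code may organize them differently.

```python
# Hyperparameters reported in this subsection, gathered into one place for reference.
CONFIG = {
    "embedding_dim": 16,          # for all methods
    "batch_size": 1024,           # for all methods
    "hidden_units": 32,
    "optimizer": "adam",          # for all deep neural network-based models
    "num_interacting_layers": 3,  # default depth used in the experiments
    "num_attention_heads": 2,     # per interacting layer
    "dropout": {"MovieLens-1M": 0.5, "Criteo": 0.0, "Avazu": 0.0, "KDD12": 0.0},
}
```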

5.2. Quantitative Results (RQ1)

Evaluation of Effectiveness
The performance of different algorithms is summarized in Table 2. We have the following observations:

(1) FM and AFM, which explore second-order feature interactions, consistently outperform LR by a large margin on all datasets, which indicates that individual features are insufficient in CTR prediction.

(2) An interesting observation is the inferiority of some models that capture high-order feature interactions. For example, although DeepCrossing and NFM use deep neural networks as a core component to learn high-order feature interactions, they do not guarantee improvement over FM and AFM. This may be attributed to the fact that they learn feature interactions in an implicit fashion. In contrast, CIN models interactions explicitly and consistently outperforms the low-order models.

(3) HOFM outperforms FM in most cases, except on the KDD12 dataset, which indicates that modeling third-order feature interactions is beneficial to prediction performance.

(4) AutoInt achieves the best performance over all baseline methods on three of the four real-world data sets. On the Avazu data set, CIN performs slightly better than AutoInt in terms of AUC, but AutoInt achieves a lower Logloss. Note that our proposed AutoInt shares the same structure as DeepCrossing except for the interacting layer, which indicates that using the attention mechanism to learn explicit combinatorial features is crucial.

Evaluation of Model Efficiency
We present the runtime results of the different algorithms on the four data sets in Figure 4. Unsurprisingly, LR is the most efficient algorithm due to its simplicity. FM and NFM perform similarly in terms of runtime because NFM only stacks a single feed-forward hidden layer on top of the second-order interaction layer. Among all the listed methods, CIN, which achieves the best prediction performance among the baselines, is much more time-consuming due to its complicated crossing layer, which may make it impractical in industrial scenarios. Note that AutoInt is sufficiently efficient, being comparable to the efficient algorithms DeepCrossing and NFM.

We also compare the sizes of the different models (i.e., the number of parameters) as another criterion for efficiency evaluation. As shown in Table 3, compared to CIN, the best-performing baseline model, AutoInt has far fewer parameters.

To summarize, our proposed AutoInt achieves the best performance among all the compared models. Compared to the most competitive baseline model CIN, AutoInt requires much fewer parameters and is much more efficient during online inference.

Model DC CN CIN NFM AutoInt
#Params
Table 3. Efficiency Comparison of Different Algorithms in terms of Model Size on Criteo data set. “DC” and “CN” are DeepCrossing and CrossNet for short, respectively. We only take the parameters above the embedding layer into account.
Data Set | Model | AUC | Logloss
Criteo | AutoInt | 0.8061 | 0.4454
Criteo | AutoInt (w/o residual) | 0.8033 | 0.4478
Avazu | AutoInt | 0.7752 | 0.3823
Avazu | AutoInt (w/o residual) | 0.7729 | 0.3836
KDD12 | AutoInt | 0.7888 | 0.1545
KDD12 | AutoInt (w/o residual) | 0.7831 | 0.1557
MovieLens-1M | AutoInt | 0.8460 | 0.3784
MovieLens-1M | AutoInt (w/o residual) | 0.8299 | 0.3959
Table 4. Ablation study comparing the performance of AutoInt with and without residual connections. AutoInt is the complete model, while AutoInt (w/o residual) is the model with the residual connections removed.

5.3. Analysis (RQ2)

To further validate our model and gain deeper insights into it, we conduct an ablation study and compare several variants of AutoInt.

5.3.1. Influence of Residual Structure

The standard AutoInt utilizes residual connections, which carry through all learned combinatorial features and therefore allow modeling very high-order combinations. To justify the contribution of the residual units, we remove them from our standard model and keep the other structures as they are. As presented in Table 4, we observe that the performance decreases on all data sets when residual connections are removed. In particular, the full model outperforms the variant by a large margin on the KDD12 and MovieLens-1M data sets, which indicates that residual connections are crucial for modeling high-order feature interactions in our proposed method.

5.3.2. Influence of Network Depths

Our model learns high-order feature combinations by stacking multiple interacting layers (introduced in Section 4). Therefore, we are interested in how the performance changes w.r.t. the number of interacting layers, i.e., the order of combinatorial features. Note that when there is no interacting layer (i.e., the number of layers equals zero), our model takes the weighted sum of the raw individual features as input, i.e., no combinatorial features are considered.

The results are summarized in Figure 5. We can see that when one interacting layer is used, i.e., feature interactions are taken into account, the performance increases dramatically on both data sets, showing that combinatorial features are very informative for prediction. As the number of interacting layers further increases, i.e., higher-order combinatorial features are taken into account, the performance of the model improves further. When the number of layers reaches three, the performance becomes stable, showing that adding extremely high-order features is not informative for prediction.

Model | Criteo AUC | Criteo Logloss | Avazu AUC | Avazu Logloss | KDD12 AUC | KDD12 Logloss | MovieLens-1M AUC | MovieLens-1M Logloss | Avg. AUC Change | Avg. Logloss Change
Wide&Deep (LR) | 0.8026 | 0.4494 | 0.7749 | 0.3824 | 0.7549 | 0.1619 | 0.8300 | 0.3976 | +0.0292 | -0.0213
DeepFM (FM) | 0.8066 | 0.4449 | 0.7751 | 0.3829 | 0.7867 | 0.1549 | 0.8437 | 0.3846 | +0.0142 | -0.0113
Deep&Cross (CN) | 0.8067 | 0.4447 | 0.7731 | 0.3836 | 0.7869 | 0.1550 | 0.8446 | 0.3809 | +0.0199 | -0.0164
xDeepFM (CIN) | 0.8070 | 0.4447 | 0.7768 | 0.3832 | 0.7820 | 0.1560 | 0.8467 | 0.3800 | +0.0068 | -0.0068
AutoInt+ (ours) | 0.8080 | 0.4437 | 0.7771 | 0.3811 | 0.7892 | 0.1544 | 0.8486 | 0.3757 | +0.0019 | -0.0014
Table 5. Results of Integrating Implicit Feature Interactions. We indicate the base model in parentheses after each method. The last two columns are the average changes in AUC and Logloss relative to the corresponding base models ("+": increase, "-": decrease).
Figure 5. Performance (AUC and Logloss) w.r.t. the number of interacting layers. Results on the Criteo and Avazu data sets are similar and hence omitted.

Figure 6. Performance (AUC and Logloss) w.r.t. the number of embedding dimensions. Results on the Criteo and Avazu data sets are similar and hence omitted.

5.3.3. Influence of Different Dimensions

Next, we investigate the performance w.r.t. the parameter $d$, the output dimension of the embedding layer; the results are shown in Figure 6. On the KDD12 dataset, the performance increases continuously as we increase the dimension size, since larger models are used for prediction. The results are different on the MovieLens-1M dataset: when the dimension size reaches 24, the performance begins to decrease. The reason is that this data set is small, and the model overfits when too many parameters are used.

Figure 7. Heat maps of attention weights for both case-level and global-level feature interactions on MovieLens-1M: (a) a single case with Label = 1 and predicted CTR = 0.89; (b) overall feature interactions. The axes represent the feature fields <Gender, Age, Occupation, Zipcode, RequestTime, ReleaseTime, Genre>. We highlight some learned combinatorial features in rectangles.

5.4. Explainable Recommendations (RQ3)

A good recommender system can not only provide good recommendations but also offer good explainability. Therefore, in this part, we present how our AutoInt is able to explain the recommendation results. We take the MovieLens-1M dataset as an example.

Let us look at a case where our algorithm makes a correct prediction, i.e., a user likes an item. Figure 7(a) presents the correlations between the different fields of the input features, obtained from the attention scores. We can see that AutoInt is able to identify the meaningful combinatorial feature <Gender=Male, Age=[18-24), MovieGenre=Action&Thriller> (the red dotted rectangle). This is very reasonable, since young men are likely to prefer action and thriller movies.

We are also interested in the correlations between different feature fields across the whole data set. Therefore, we measure the correlations between the feature fields according to their average attention scores over the entire data and summarize them in Figure 7(b). We can see that <Gender, Genre>, <Age, Genre>, <RequestTime, ReleaseTime>, and <Gender, Age, Genre> (the solid green region) are strongly correlated, which constitute explainable rules for recommendation in this domain.
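A minimal sketch of how such a global correlation map could be computed from the learned attention weights; the array shapes and the random placeholder weights are assumptions for illustration, not the authors' analysis code.

```python
import numpy as np

def global_field_correlations(attention_maps):
    """Average per-sample attention maps into a field-by-field correlation matrix,
    analogous to the global heat map in Figure 7(b). `attention_maps` is assumed to
    have shape (num_samples, M, M), holding the attention weights of Equation (4)
    already averaged over heads and layers."""
    return attention_maps.mean(axis=0)

# Toy usage with the M=7 MovieLens-1M fields; the weights are random placeholders.
fields = ["Gender", "Age", "Occupation", "Zipcode", "RequestTime", "ReleaseTime", "Genre"]
rng = np.random.default_rng(0)
maps = rng.dirichlet(np.ones(len(fields)), size=(1000, len(fields)))  # each row sums to 1
corr = global_field_correlations(maps)
print(corr.shape)  # (7, 7): entry (m, k) is the average attention paid by field m to field k
```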

5.5. Integrating Implicit Interactions (RQ4)

Feed-forward neural networks are capable of modeling implicit feature interactions and have been widely integrated into existing CTR prediction methods (Cheng et al., 2016; Guo et al., 2017; Lian et al., 2018). To investigate whether integrating implicit feature interactions further improves the performance, we combine AutoInt with a two-layer feed-forward neural network by joint training. We name the joint model AutoInt+ and compare it with the following algorithms:

  • Wide&Deep (Cheng et al., 2016). Wide&Deep integrates the outputs of logistic regression and feed-forward neural networks.

  • DeepFM (Guo et al., 2017). DeepFM combines FM and feed-forward neural network, with a shared embedding layer.

  • Deep&Cross (Wang et al., 2017). Deep&Cross is the extension of CrossNet by integrating feed-forward neural networks.

  • xDeepFM (Lian et al., 2018). xDeepFM is the extension of CIN by integrating feed-forward neural networks.

Table 5 presents the evaluation results of the joint-training models. We have the following observations: 1) The performance of our method improves when jointly trained with feed-forward neural networks on all data sets, which indicates that integrating implicit feature interactions indeed boosts the predictive ability of our proposed model; however, as can be seen from the last two columns, the magnitude of the improvement is fairly small compared to the other models, showing that our individual model AutoInt is already quite powerful. 2) After integrating implicit feature interactions, AutoInt+ outperforms all competitive methods and achieves new state-of-the-art performance on the CTR prediction data sets used.

6. Conclusion

In this work, we propose a novel CTR prediction model based on the self-attention mechanism, which can automatically learn high-order feature interactions in an explicit fashion. The key to our method is the newly introduced interacting layer, which allows each feature to interact with the others and determines their relevance through learning. Experimental results on four real-world data sets demonstrate the effectiveness and efficiency of our proposed model. In addition, we provide good model explainability by visualizing the learned combinatorial features. When integrated with the implicit feature interactions captured by feed-forward neural networks, our model achieves better offline AUC and Logloss scores than the previous state-of-the-art methods. In the future, we are interested in incorporating contextual information into our method and improving its performance for online recommender systems.

References