As one of the largest B2C e-commerce platforms in China, JD.com also powers a leading advertising system, serving millions of advertisers with fingertip connection to hundreds of millions of customers. Every day, customers visit JD, click ads and leave billions of interaction logs. These data not only feed the learning system, but also drive technical innovations that keep lifting both user experience and advertisers' profits on JD.com.
In the commonly used cost-per-click (CPC) advertising system, ads are ranked by effective cost per mille (eCPM), the product of the bid price given by advertisers and the CTR predicted by the ad system. Accurate CTR prediction benefits both business effectiveness and user experience. Thus, this topic has attracted widespread interest in both the machine learning academia and the e-commerce industry.
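The ranking rule above can be sketched in a few lines. This is an illustrative toy, not JD.com's production code; the function name and the convention of scaling eCPM per thousand impressions are our own assumptions.

```python
# Toy sketch of CPC ranking: eCPM = bid * predicted CTR * 1000.
# (Names and the per-mille scaling convention are illustrative assumptions.)
def rank_by_ecpm(ads):
    """ads: list of (ad_id, bid, predicted_ctr); returns ad ids sorted by eCPM."""
    return [ad_id for ad_id, bid, ctr in
            sorted(ads, key=lambda a: a[1] * a[2] * 1000, reverse=True)]

ads = [("a", 2.0, 0.01), ("b", 1.0, 0.03), ("c", 5.0, 0.004)]
print(rank_by_ecpm(ads))  # b (eCPM 30) ranks above a and c (both 20)
```

Note how a higher predicted CTR lets a lower bid win the auction, which is why accurate CTR prediction matters to both sides.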
Nowadays, most of the ads on e-commerce platforms are displayed with images, since they are more visually appealing and convey more details than textual descriptions. An interesting observation is that many ads get significantly higher CTR merely by switching to more attractive images. This motivates a variety of emerging studies on extracting expressive visual features for CTR prediction (Chen et al., 2016; Mo et al., 2015). These algorithms adopt various off-the-shelf CNNs to extract visual features and fuse them with non-visual features (e.g. category, user) for the final CTR prediction. With the additional visual features, these algorithms significantly outperform their non-visual counterparts in offline experiments and generalize well to cold and long-tailed ads. Although encouraging progress has been made in offline studies, applying CNNs in real online advertising systems remains non-trivial. The offline end-to-end training with a CNN must be efficient enough to follow the time-varying online distribution, and the online serving needs to meet the low-latency requirements of the advertising system.
Furthermore, we notice that visual feature extraction in e-commerce differs significantly from the image classification setting for which off-the-shelf CNNs were originally proposed. In classification, categories are the target to predict. In e-commerce, by contrast, the categories of ads are clearly labeled; they contain abundant visual priors and should intuitively help visual modeling. Some academic studies have integrated the categorical information by building a category-specific projection matrix on top of the CNN embeddings (He et al., 2016b) or by explicitly decomposing visual features into styles and categories (Liu et al., 2017). These studies share a common architecture: the late fusion of visual and categorical knowledge, which, however, is sub-optimal for CTR prediction. Namely, the image embedding modules seldom take advantage of the categorical knowledge. Unaware of the ad category, the embeddings extracted by these CNNs may contain unnecessary features unrelated to the category, wasting the CNN's limited expressive capacity. In contrast, if the ad category is integrated, the CNN only needs to focus on category-specific patterns, which eases the training process.
To overcome the industrial challenges, we build optimized infrastructure for both efficient end-to-end CNN training and low-latency online serving. Based on this efficient infrastructure, we propose the Category-Specific CNN (CSCNN), designed specifically for the CTR prediction task, to fully utilize the labeled categories in e-commerce. Our key idea is to incorporate the ad category knowledge into the CNN in an early-fusion manner. Inspired by SE-net (Hu et al., 2018b) and CBAM (Woo et al., 2018), which model the inter-dependencies between convolutional features with a light-weighted self-attention module, CSCNN further incorporates the ad category knowledge and performs a category-specific feature recalibration, as shown in Fig. 2. Concretely, we sequentially apply category-specific channel and spatial attention modules to emphasize features that are both important and category-related. These expressive visual features contribute to a significant performance gain in the CTR prediction problem.
In summary, we make the following contributions:
To the best of our knowledge, we are the first to highlight the negative impact of late fusion of visual and non-visual features in visual-aware CTR prediction.
We propose CSCNN, a novel visual embedding module specially for CTR prediction. The key idea is to conduct category-specific channel and spatial self-attention to emphasize features that are both important and category related.
We validate the effectiveness of CSCNN through extensive offline experiments and Online A/B test. We verify that the performance of various self-attention mechanisms and network backbones are consistently improved by plugging CSCNN.
We build highly efficient infrastructure to apply CNNs in a real online e-commerce advertising system. Effective acceleration methods are introduced to accomplish end-to-end training with the CNN on the 10-billion scale real production dataset within 24 hours, and to meet the low-latency requirements of the online system (20ms on CPU). CSCNN has now been deployed in the search advertising system of JD.com, one of the largest B2C e-commerce platforms in China, serving the main traffic of hundreds of millions of active users.
2. Related Work
Our work is closely related to two active research areas: CTR prediction and attention mechanism in CNN.
2.1. CTR Prediction
The aim of CTR prediction is to predict the probability that a user clicks an ad in a given context. Accurate CTR prediction benefits both user experience and advertisers' profits, and is thus of crucial importance to the e-commerce industry.
Pioneering works in CTR or user preference prediction are based on linear regression (LR) (McMahan et al., 2013), matrix factorization (MF) (Wang et al., 2013) and decision trees (He et al., 2014). Recent years have witnessed many successful applications of deep learning to CTR prediction (Cheng et al., 2016; Wang et al., 2017). Early works usually make use of non-visual features only, which is insufficient nowadays: most ads are displayed with images, which contain plentiful visual details and largely affect users' preferences. This motivates many emerging studies on visual-aware CTR prediction (Chen et al., 2016; He and McAuley, 2016; Kang et al., 2017; He et al., 2016b; Liu et al., 2017; Yang et al., 2019; Zhao et al., 2019). They first extract visual features using various off-the-shelf CNNs, and then fuse the visual features with non-visual features, including categories, to build the preference predictor. Unfortunately, this late fusion is sub-optimal, even wasteful, in the e-commerce scenario, where categories are clearly labeled and contain abundant visual priors that could aid visual feature extraction.
In contrast to existing works with late fusion, CSCNN differs fundamentally in early incorporating categorical knowledge into the convolutional layers, allowing easy category-specific inter-channel and inter-spatial dependency learning.
[Table 1: summary of notations — class label, sigmoid, predicted CTR, dataset, real number set, feature dimension, loss, feature vector, prediction function, layer index, non-visual features, hidden layer, visual features, cross layer, cross layer parameter, embedded feature, embedding dictionary, vocabulary size, one/multi-hot coding, ad image, ad category, # layers, spatial category prior, # channels, channel category prior, height/width, original feature map, attention map, refined feature map, tensor sizes, element-wise product.]
2.2. Attention Mechanism in CNN
The attention mechanism is an important feature selection approach that helps a CNN emphasize important parts of the feature map and suppress unimportant ones. Spatial attention tells where to focus (Woo et al., 2018), and channel-wise attention tells what to focus on (Hu et al., 2018b).
In the literature, many works have attempted to learn the attention weights from the feature map itself, termed self-attention. State-of-the-art algorithms include CBAM (Woo et al., 2018) and SE (Hu et al., 2018b), among others (Hu et al., 2018a; Gao et al., 2019). Besides self-attention, the attention weights can also be conditioned on external information, for example natural language. Successful application fields include search by language (Li et al., 2017), image captioning (Xu et al., 2015; Chen et al., 2017) and visual question answering (Yang et al., 2016).
Our work is motivated by the attention mechanism. Rather than vision & language, we design novel architectures to adapt attention mechanism to address an important but long overlooked issue, the sub-optimal late fusion of vision and non-vision features in CTR prediction. We combine the advantages of both self-attention and attention conditioned on external information, namely the ad category. As a result, our image embedding is able to emphasize features that are both important and category related.
3. The CTR Prediction System in JD.com
We first review the background of the CTR prediction in Section 3.1. Then we describe the architecture of our CTR prediction system in Section 3.2. We further dig into details of our novel visual modeling module, the Category-Specific CNN in Section 3.3. Finally, we introduce essential accelerating strategies for online deployment in Section 3.4. The notations are summarized in Table 1.
3.1. Preliminaries

In the online advertising industry, when an ad is shown to a user under some context, this scenario is counted as an impression. The aim of CTR prediction is to predict the probability that a positive feedback, i.e. a click, takes place in an impression (ad, user, contexts). Accurate CTR prediction directly benefits both the user experience and business effectiveness, which makes this task of crucial importance to the whole advertising industry.

CTR prediction is usually formulated as binary classification. Specifically, the goal is to learn a prediction function $f$ from a training set $\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}$, where $\mathbf{x}_i$ is the feature vector of the $i$-th impression and $y_i \in \{0, 1\}$ is the class label that denotes whether a click takes place.
The objective function is defined as the negative log-likelihood:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\right],$$

where $\hat{y}_i$ is the predicted CTR, scaled to $(0, 1)$ by the sigmoid $\sigma$:

$$\hat{y}_i = \sigma\big(f(\mathbf{x}_i)\big) = \frac{1}{1 + \exp\!\big(-f(\mathbf{x}_i)\big)}.$$
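The objective above can be sketched directly; this is a minimal stand-alone illustration of the sigmoid-squashed negative log-likelihood, not the production training loop.

```python
import math

# Minimal sketch of the CTR objective: mean negative log-likelihood of
# clicks, with the model output squashed into (0, 1) by a sigmoid.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll_loss(logits, labels):
    """Mean binary cross-entropy over impressions; labels are 0/1 clicks."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)  # predicted CTR in (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# A maximally uncertain prediction (logit 0, p = 0.5) costs log(2) per label.
print(nll_loss([0.0], [1]))
```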
3.2. The Architecture of CTR Prediction System
We now describe the architecture of our CTR prediction system that is serving on JD.com. Details are shown in Fig 1.
3.2.1. Deep & Cross Network
Deep & Cross Network (DCN) (Wang et al., 2017) has achieved promising performance thanks to its ability to learn effective feature interactions. Here, we modify the DCN to take two inputs: a non-visual feature vector $\mathbf{x}_{nv}$ and a visual feature vector $\mathbf{x}_{v}$.
The visual feature is incorporated into the deep net. In layer 1, we transform the non-visual feature to 1024 dimensions and concatenate it with the visual feature:

$$\mathbf{d}_1 = \left[\,\mathrm{ReLU}(\mathbf{W}^{d}_0\,\mathbf{x}_{nv} + \mathbf{b}^{d}_0)\,;\ \mathbf{x}_{v}\,\right].$$

Two deep layers follow:

$$\mathbf{d}_{i+1} = \mathrm{ReLU}(\mathbf{W}^{d}_i\,\mathbf{d}_i + \mathbf{b}^{d}_i), \quad i \in \{1, 2\}.$$

The cross net is used to process the non-visual feature:

$$\mathbf{c}_{i+1} = \mathbf{c}_0\,\mathbf{c}_i^{\top}\mathbf{w}^{c}_i + \mathbf{b}^{c}_i + \mathbf{c}_i,$$

where the input is $\mathbf{c}_0 = \mathbf{x}_{nv}$, for layers $i \in \{0, 1, 2\}$.

Finally, we combine the outputs to obtain the predicted CTR:

$$\hat{y} = \sigma\!\left(\mathbf{w}^{\top}\left[\,\mathbf{d}_3\,;\ \mathbf{c}_3\,\right] + b\right).$$
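The two-stream forward pass above can be sketched with toy dimensions and random weights. This is a hedged illustration of the structure (deep net on [non-visual; visual], one cross layer on the non-visual vector, combined output), not the production model; all sizes and weight names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Toy sketch of the modified DCN: deep net consumes the concatenated
# [projected non-visual; visual] features, cross net only the non-visual
# vector, and the two streams are combined for the final CTR.
d_nv, d_v, d_h = 8, 4, 16          # toy dims; production projects to 1024
W1, b1 = rng.normal(size=(d_h, d_nv)), np.zeros(d_h)
W2, b2 = rng.normal(size=(d_h, d_h + d_v)), np.zeros(d_h)
w_cross, b_cross = rng.normal(size=d_nv), np.zeros(d_nv)
w_out = rng.normal(size=d_h + d_nv)

def forward(x_nv, x_v):
    # Deep net: project non-visual features, then concat the visual ones.
    h = np.concatenate([relu(W1 @ x_nv + b1), x_v])
    h = relu(W2 @ h + b2)
    # One cross layer: c1 = c0 (c0 . w) + b + c0 (explicit interactions).
    c = x_nv * (x_nv @ w_cross) + b_cross + x_nv
    # Combine both streams and squash to a click probability.
    z = w_out @ np.concatenate([h, c])
    return 1.0 / (1.0 + np.exp(-z))

p = forward(rng.normal(size=d_nv), rng.normal(size=d_v))
assert 0.0 < p < 1.0  # a valid probability
```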
3.2.2. Non-visual Feature Embedding
We now describe the embedding layer that transforms the raw non-visual features of an impression, namely (ad, user, contexts), to the vector $\mathbf{x}_{nv}$.

We assume that all features come in categorical form (after preprocessing, e.g. binning). Usually, a categorical feature is encoded as a one-hot / multi-hot vector $\mathbf{x}_{hot} \in \{0, 1\}^{v}$, where $v$ is the vocabulary size of this feature. A multi-hot example:

TitleWords = [Summer, Dress] → [..., 0, 1, 0, ..., 0, 1, 0, ...]
Unfortunately, this one/multi-hot coding is not applicable to industrial systems due to its extremely high dimensionality and sparsity. We thus adopt a low-dimensional embedding strategy in our system:

$$\mathbf{x}_{emb} = \mathbf{E}\,\mathbf{x}_{hot},$$

where $\mathbf{E} \in \mathbb{R}^{s_e \times v}$ is the embedding dictionary for this specific feature and $s_e$ is the embedding size. We then concatenate the $\mathbf{x}_{emb}$'s of all features to build $\mathbf{x}_{nv}$.

In practice, our system makes use of 95 non-visual features from users (historical clicks/purchases, location etc.), ads (category, title, # reviews etc.) and rich contexts (query words, visit time etc.), with 7 billion vocabularies in total. We further introduce the features and their statistics in Table 8, Appendix B.
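In practice the dense multiplication $\mathbf{E}\,\mathbf{x}_{hot}$ is never materialized; the equivalent sparse lookup is sketched below with toy sizes (the real vocabularies are in the billions).

```python
import numpy as np

# Sketch of the embedding step: instead of multiplying E by a huge
# sparse one/multi-hot vector, sum the embedding rows of the active ids.
rng = np.random.default_rng(0)
vocab_size, emb_dim = 1000, 8               # toy sizes for illustration
E = rng.normal(size=(vocab_size, emb_dim))  # embedding dictionary

def embed(active_ids):
    """Equivalent to E^T @ x_hot for the one/multi-hot vector x_hot."""
    return E[active_ids].sum(axis=0)

# e.g. TitleWords = [Summer, Dress] with ids 17 and 423:
x_emb = embed([17, 423])
assert np.allclose(x_emb, E[17] + E[423])
```

The per-feature embeddings produced this way are concatenated to form $\mathbf{x}_{nv}$.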
3.3. Category-Specific CNN
Conventional CTR prediction systems mostly embed ad images using off-the-shelf CNNs. We say off-the-shelf because they were originally designed for classification, not for CTR prediction: they regard the image category as a target to predict, not as an input. This is a huge waste on e-commerce platforms, where categories are precisely labeled and contain plentiful visual prior knowledge that would help visual modeling.
We address this issue by proposing a novel CNN specifically for CTR prediction, the Category-Specific CNN, which embeds an ad image $\mathbf{m}$, together with the ad category $k$, into the visual feature $\mathbf{x}_{v}$. Specifically, the category prior knowledge is encoded as category embeddings (trained jointly with the CTR model) and incorporated into the CNN through a conditional attention mechanism.

In principle, CSCNN can be applied to any convolutional layer of any network. In our system, we plug CSCNN into ResNet18 (He et al., 2016a) and discuss the adaptability to other networks in the ablation studies.
3.3.1. Framework on A Single Convolutional Layer
For each category $k$ and each convolutional layer $l$, CSCNN learns a tensor $\mathbf{A}^{c}_{k,l}$ that encodes the impact of the category prior knowledge on the channel-wise attention for this layer. We omit the subscript $l$ for conciseness. The framework is shown in Fig 2.
Given an intermediate feature map $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$, the output of convolutional layer $l$, CSCNN first learns a channel attention map $\mathbf{M}_c \in \mathbb{R}^{C \times 1 \times 1}$ conditioned on both the current feature map and the category. The channel-wise attention is then multiplied with the feature map to acquire a refined feature map $\mathbf{F}'$:

$$\mathbf{F}' = \mathbf{M}_c(\mathbf{F}, \mathbf{A}^{c}_{k}) \otimes \mathbf{F},$$

where $\otimes$ denotes the element-wise product, with $\mathbf{M}_c$ broadcasted along the spatial dimensions $H \times W$.
Similarly, CSCNN also learns another tensor $\mathbf{A}^{s}_{k}$ that encodes the category prior knowledge for the spatial attention $\mathbf{M}_s \in \mathbb{R}^{1 \times H \times W}$. These two attention modules are applied sequentially to get a 3D refined feature map $\mathbf{F}''$:

$$\mathbf{F}'' = \mathbf{M}_s(\mathbf{F}', \mathbf{A}^{s}_{k}) \otimes \mathbf{F}',$$

where the spatial attention is broadcasted along the channel dimension before the element-wise product. A practical concern is the large number of parameters in $\mathbf{A}^{s}_{k} \in \mathbb{R}^{H \times W}$, especially in the first few layers. To address this problem, we propose to learn a much smaller tensor $\mathbf{A}^{s'}_{k} \in \mathbb{R}^{H' \times W'}$, where $H' \ll H$ and $W' \ll W$, and then resize it to $\mathbb{R}^{H \times W}$ through linear interpolation. The effects of $H'$ and $W'$ are discussed with the experimental results later. Note that $\mathbf{A}^{c}_{k}$ and $\mathbf{A}^{s}_{k}$ are randomly initialized and learnt during training; no additional category prior knowledge is needed except the category id.
After being refined by both the channel-wise and the spatial attention, $\mathbf{F}''$ is fed to the next layer. Note that CSCNN can be added to any CNN by simply replacing the input of the next layer from $\mathbf{F}$ to $\mathbf{F}''$.
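The sequential recalibration above amounts to two broadcasted element-wise products. A toy sketch, with the attention maps taken as given (their computation is detailed in the next two subsections):

```python
import numpy as np

# Toy sketch of the CSCNN recalibration: channel attention first,
# then spatial attention, each broadcast over the dimensions it lacks.
C, H, W = 4, 6, 6
rng = np.random.default_rng(0)
F = rng.normal(size=(C, H, W))      # intermediate feature map
M_c = rng.uniform(size=(C, 1, 1))   # stand-in for M_c(F, A^c_k)
M_s = rng.uniform(size=(1, H, W))   # stand-in for M_s(F', A^s_k)

F_prime = M_c * F                   # broadcast over H x W
F_refined = M_s * F_prime           # broadcast over channels
assert F_refined.shape == (C, H, W) # same shape: a drop-in replacement
```

Because the refined map has the same shape as the original, the next convolutional layer needs no modification.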
3.3.2. Category-specific Channel-wise Attention
Channel-wise attention tells "what" to focus on. In addition to the inter-channel relationships considered previously, we also exploit the relationship between the category prior knowledge and the features (Fig 3, top).

To gather spatial information, we first squeeze the spatial dimensions of $\mathbf{F}$ through max and average pooling. The advantage of adopting both poolings is supported by the experiments conducted in CBAM. The two squeezed feature maps are each concatenated with the category prior knowledge $\mathbf{A}^{c}_{k}$ and forwarded through a shared two-layer MLP, reducing the dimension of the concatenation back to $C$. Finally, we merge the two branches by element-wise summation:

$$\mathbf{M}_c(\mathbf{F}, \mathbf{A}^{c}_{k}) = \sigma\!\big(\mathrm{MLP}([\mathrm{AvgPool}(\mathbf{F});\ \mathbf{A}^{c}_{k}]) + \mathrm{MLP}([\mathrm{MaxPool}(\mathbf{F});\ \mathbf{A}^{c}_{k}])\big).$$
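A toy sketch of this module follows. The bottleneck width and the category-embedding size are illustrative assumptions; only the structure (dual pooling, concatenation with the category prior, shared MLP, summed branches, sigmoid) mirrors the description above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hedged sketch of category-specific channel attention: squeeze spatial
# dims by avg/max pooling, concat the category embedding, pass both
# branches through a shared two-layer MLP, sum, then sigmoid.
C, H, W, C_cat, C_hidden = 8, 6, 6, 4, 3   # toy sizes (assumptions)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(C_hidden, C + C_cat))  # shared bottleneck layer
W2 = rng.normal(size=(C, C_hidden))          # shared expansion layer

def channel_attention(F, a_cat):
    avg = F.mean(axis=(1, 2))                # (C,) avg-pooled squeeze
    mx = F.max(axis=(1, 2))                  # (C,) max-pooled squeeze
    h_avg = W2 @ np.maximum(W1 @ np.concatenate([avg, a_cat]), 0)
    h_max = W2 @ np.maximum(W1 @ np.concatenate([mx, a_cat]), 0)
    return sigmoid(h_avg + h_max).reshape(C, 1, 1)

M_c = channel_attention(rng.normal(size=(C, H, W)), rng.normal(size=C_cat))
assert M_c.shape == (8, 1, 1)  # one weight in (0, 1) per channel
```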
3.3.3. Category-specific Spatial Attention
Our spatial attention module is illustrated in Fig 3 (bottom). Spatial attention tells "where" to focus by exploiting the inter-spatial relationships of features. Inspired by CBAM, we first aggregate channel-wise information of the feature map $\mathbf{F}'$ by average pooling and max pooling along the channel dimension. To incorporate the category prior knowledge, these two maps are then concatenated with $\mathbf{A}^{s}_{k}$ to form a $3 \times H \times W$ feature map. Finally, this feature map is forwarded through a convolutional filter to get the attention weights:

$$\mathbf{M}_s(\mathbf{F}', \mathbf{A}^{s}_{k}) = \sigma\!\big(f\big([\mathrm{AvgPool}(\mathbf{F}');\ \mathrm{MaxPool}(\mathbf{F}');\ \mathbf{A}^{s}_{k}]\big)\big),$$

where $f$ denotes the convolutional filter.
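A toy sketch of the spatial module, using a naive 3×3 convolution purely for illustration (the actual kernel size is not fixed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hedged sketch of category-specific spatial attention: pool over
# channels, stack the spatial category prior as a third plane, and run
# one convolutional filter (3x3 with zero padding here, an assumption).
C, H, W, k = 8, 6, 6, 3
rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, k, k))          # one filter over 3 planes

def spatial_attention(F, A_s):
    planes = np.stack([F.mean(axis=0), F.max(axis=0), A_s])  # (3, H, W)
    pad = k // 2
    padded = np.pad(planes, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((H, W))
    for i in range(H):                        # naive 2D convolution
        for j in range(W):
            out[i, j] = np.sum(kernel * padded[:, i:i + k, j:j + k])
    return sigmoid(out)[None]                 # (1, H, W) attention map

M_s = spatial_attention(rng.normal(size=(C, H, W)), rng.normal(size=(H, W)))
assert M_s.shape == (1, 6, 6)
```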
|Model||#Params (M)||GFLOPs|
|Res18 + CBAM||18.6936||1.8322|
|Res18 + CSCNN||21.6791||1.8329|
3.3.4. Complexity Analysis
Note that CSCNN is actually a light-weighted module. Specifically, we show the number of parameters and giga floating-point operations (GFLOPs) of Baseline, CBAM and our proposed algorithm in Table 2.
We set the bottleneck reduction ratio to 4 and the number of categories to 3310 (the real production dataset in Table 7). The "Shared FC" in each convolutional layer of CBAM has $2C^2/r$ parameters; CSCNN additionally parameterizes the enlarged FC input and the channel category embeddings, so the number of parameters added over CBAM in the channel attention of one convolutional layer is about 68k. Likewise, the additional parameters in the spatial attention are about 120k per layer. The total increase is therefore (120k + 68k) × 16 layers ≈ 3.0M. The additional parameters introduced by us are acceptable, and the additional computation is only 0.03% compared to CBAM.
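The back-of-envelope total can be checked directly, taking the per-layer increments quoted above (~68k for channel attention, ~120k for spatial attention) as given:

```python
# Sanity-check of the parameter-count arithmetic above, using the
# per-layer increments stated in the text (approximate values).
channel_extra_per_layer = 68_000   # extra channel-attention params/layer
spatial_extra_per_layer = 120_000  # extra spatial-attention params/layer
num_layers = 16

total_extra = (channel_extra_per_layer + spatial_extra_per_layer) * num_layers
print(total_extra / 1e6)  # ≈ 3.0M additional parameters
```

This matches the gap between the two rows of Table 2 (21.68M − 18.69M ≈ 3.0M).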
3.4. System Deployment
We deploy CSCNN for the search advertising system of JD.com, the largest B2C e-commerce company in China, serving the main traffic of hundreds of millions of active users. Fig. 4 depicts the architecture of our online model system.
3.4.1. Offline training
CSCNN is trained jointly with the whole CTR prediction system on our ten-billion scale real production dataset, collected over the last 32 days. In our preliminary investigation, the CNN was the key computational bottleneck during training. With a ResNet18 backbone and 224×224 input images, a single machine with 4 P40 GPUs can only train 177 million images per day. This means that, even considering CSCNN alone and assuming linear speedup in distributed training, we would need 226 P40 GPUs to complete training on the ten billion impressions within 1 day, which is too expensive. To accelerate, we adopt the sampling strategy of (Chen et al., 2016): at most 25 impressions with the same ad are gathered into one batch, so the image embedding of each image is computed only once and broadcasted to the multiple impressions in the batch. With 28 P40 GPUs, training can now be finished in 1 day.
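The grouping step of that sampling strategy can be sketched as follows; the function name and cap handling are illustrative, but the idea (one CNN pass per image, its embedding shared by up to 25 impressions) follows the text.

```python
from collections import defaultdict

# Sketch of the acceleration idea: group impressions that share an ad
# image so each image passes through the CNN once per batch and its
# embedding is broadcast to every impression in the group.
def make_batches(impressions, max_per_ad=25):
    """impressions: list of (ad_id, features); group by ad, cap at 25."""
    groups = defaultdict(list)
    for ad_id, feats in impressions:
        groups[ad_id].append(feats)
    batches = []
    for ad_id, feats in groups.items():
        for i in range(0, len(feats), max_per_ad):
            batches.append((ad_id, feats[i:i + max_per_ad]))
    return batches

imps = [("ad1", f) for f in range(60)] + [("ad2", f) for f in range(10)]
sizes = [len(f) for _, f in make_batches(imps)]
print(sizes)  # ad1 split into 25 + 25 + 10, ad2 kept whole: [25, 25, 10, 10]
```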
3.4.2. Offline inference:

Images and categories are fed into the well-trained CSCNN to infer the visual features. The features are assembled into a lookup table and loaded into the predictor's memory, replacing the CSCNN at serving time. After dimension reduction and frequency control, a 20 GB lookup table can cover over 90% of the next-day impressions.
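The table-building step amounts to one embedding pass per ad; a minimal sketch (the stand-in embedding function below is purely illustrative):

```python
# Sketch of the offline-inference step: run the trained CSCNN once per
# (image, category) pair and freeze the results in a lookup table that
# the online predictor reads instead of running the CNN.
def build_lookup_table(ads, cscnn_embed):
    """ads: iterable of (ad_id, image, category); cscnn_embed stands in
    for the trained model's embedding function."""
    return {ad_id: cscnn_embed(img, cat) for ad_id, img, cat in ads}

# Toy stand-in embedding for illustration only.
table = build_lookup_table(
    [("ad1", "img1", "dress"), ("ad2", "img2", "shoes")],
    lambda img, cat: [float(len(img)), float(len(cat))],
)
print(table["ad1"])  # [4.0, 5.0]
```

Online, a request then reduces to a dictionary lookup keyed by ad id, which is what keeps serving latency low.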
3.4.3. Online serving:
Once a request is received, the visual feature is fetched directly from the lookup table according to the ad id, and the predictor returns an estimated CTR. Under a throughput of over 3 million items per second at traffic peak, the tp99 latency of our CPU online serving system is below 20ms.
4. Experimental Results
We examine the effectiveness of both our proposed visual modeling module, CSCNN, and the whole CTR prediction system. The experiments are organized into two groups:

Ablation studies on CSCNN aim to eliminate interference from the huge system. We thus test the category-specific attention mechanism by plugging it into light-weighted CNNs with a very simple CTR prediction model. We use popular benchmark datasets for repeatability.
We further examine the performance gain of our CTR prediction system from the novel visual modeling module. Experiments include both off-line evaluations on a ten-billion scale real production dataset collected from ad click logs (Table 7), and online A/B testing on the real traffic of hundreds of millions of active users on JD.com.
4.1. Ablation Study Setup
Our ablation study is conducted on the “lightest” model. This helps to eliminate the interference from the huge CTR prediction system and focus on our proposed category-specific attention mechanism.
Specifically, our light-weighted CTR model follows the Matrix Factorization (MF) framework of VBPR (He and McAuley, 2016), since it has achieved state-of-the-art performance in comparison with various light models. The preference score of user $u$ for ad $i$ is predicted as:

$$\hat{y}_{u,i} = \alpha + \beta_u + \beta_i + \boldsymbol{\gamma}_u^{\top}\boldsymbol{\gamma}_i + \boldsymbol{\theta}_u^{\top}\,\mathrm{CNN}(\mathbf{m}_i),$$

where $\alpha$ is an offset and $\beta_u, \beta_i$ are the user and ad biases. $\boldsymbol{\gamma}_u$ and $\boldsymbol{\gamma}_i$ are the latent features of $u$ and $i$, $\boldsymbol{\theta}_u$ encodes the latent visual preference of $u$, and $\mathrm{CNN}(\cdot)$ is the light-weighted CNN applied to the ad image $\mathbf{m}_i$.
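The score decomposition can be sketched with toy values; the stand-in CNN and all dimensions below are illustrative assumptions, not the ablation setup itself.

```python
import numpy as np

# Hedged sketch of the VBPR-style preference score: offset + biases +
# latent MF interaction + visual term from a (light-weighted) CNN.
rng = np.random.default_rng(0)
k = 4                                        # toy latent dimension
alpha, beta_u, beta_i = 0.1, 0.05, -0.02     # offset and biases
gamma_u, gamma_i = rng.normal(size=k), rng.normal(size=k)
theta_u = rng.normal(size=k)                 # latent visual preference

def cnn_embed(image):                        # stand-in for CNN-F / CSCNN
    return rng.normal(size=k)

def score(image):
    return (alpha + beta_u + beta_i
            + gamma_u @ gamma_i              # non-visual MF interaction
            + theta_u @ cnn_embed(image))    # visual preference term

s = score("toy_image")
assert np.isfinite(s)  # a single real-valued preference score
```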
Following VBPR (He and McAuley, 2016), we use CNN-F (Chatfield et al., 2014) as the base CNN , which consists of only 5 convolutional layers and 3 fully connected layers. We plug CSCNN onto layers from conv-2 to conv-5. For comprehensive analysis, we will further test the effect of plugging attention modules on different layers (Figure 5) and our adaptability to other CNN structures (Table 6) in following sections.
[Table 4: AUC of the compared algorithms on All and Cold items, grouped into No Image, With Image, and With Image + Category.] Improvements are assessed under an unpaired t-test. CSCNN outperforms all compared algorithms due to 3 advantages: the additional category knowledge, the early fusion of the category into the CNN, and effective structures to learn category-specific inter-channel and inter-spatial dependencies.
4.2. Benchmark Datasets
The ablation study is conducted on 3 widely used benchmark datasets of products on Amazon.com, introduced in (McAuley et al., 2015). (Many works also use Tradesy (He and McAuley, 2016), which is not suitable here due to the absence of category information.) We follow the identical category-tree preprocessing method used in (He et al., 2016b). The dataset statistics after preprocessing are shown in Table 3.

On all 3 datasets, for each user we randomly withhold one action for validation, another one for testing, and all the others for training, following the same split as (He and McAuley, 2016). We report the test performance of the model with the best AUC on the validation set. When testing, we report performance on two sets of items: All items, and Cold items with fewer than 5 actions in the training set.
4.3. Evaluation Metrics
AUC measures the probability that a randomly sampled positive item receives a higher preference score than a randomly sampled negative one:

$$\mathrm{AUC} = \frac{1}{|\mathcal{I}^{+}|\,|\mathcal{I}^{-}|} \sum_{i \in \mathcal{I}^{+}} \sum_{j \in \mathcal{I}^{-}} \mathbb{1}\!\left(\hat{y}_{i} > \hat{y}_{j}\right),$$

where $\mathbb{1}(\cdot)$ is an indicator function, $i, j$ are indexes for ads, and $\mathcal{I}^{+}$, $\mathcal{I}^{-}$ denote the sets of clicked and non-clicked ads.
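The definition above reduces to counting correctly ordered (positive, negative) score pairs:

```python
# Minimal sketch of the AUC definition: the fraction of
# (positive, negative) ad pairs ranked in the right order.
def auc(pos_scores, neg_scores):
    correct = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return correct / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.7], [0.8, 0.2]))  # 3 of 4 pairs ordered correctly: 0.75
```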
In our ablation studies, algorithms are evaluated on AUC, which is effectively the default off-line evaluation metric in the advertising industry. Empirically, when the CTR prediction model is trained as a binary classifier, off-line AUC directly reflects online performance. At JD.com, every 0.1% increase in off-line AUC brings a 6-million-dollar lift in the overall advertising income.
4.4. Compared Algorithms
The compared algorithms are either (1) representative in covering different levels of available information, or (2) reported to achieve state-of-the-art performance thanks to effective use of the category:
VBPR: BPR + visual. The visual features are extracted from pre-trained and fixed CNN (He and McAuley, 2016).
DVBPR: The visual feature extractor CNN is trained end-to-end together with the whole CTR prediction model (Kang et al., 2017).
DVBPR-C: DVBPR + category. The Category information is late fused into MF by sharing among items from the same category.
DeepStyle: Category embedding is subtracted from the visual feature to obtain style information (Liu et al., 2017).
SCA: This algorithm was originally designed for image captioning (Chen et al., 2017), where captioning features were used in visual attention. To make it a strong baseline for CTR prediction, we slightly modify the algorithm by replacing the captioning features with the category embedding, so that the category information is early-fused into the CNN.
In the literature, some compared algorithms were originally trained with a pair-wise loss (see Appendix A), which however is not suitable for the industrial CTR prediction problem. For CTR prediction, the model should be trained in binary classification mode, with a point-wise loss, so that the scale of $\hat{y}$ directly represents the CTR. For a fair comparison on the CTR prediction problem, all algorithms in this ablation study are trained with the point-wise loss. We also redo all experiments with the pair-wise loss for consistent comparison with the results in the literature (Appendix A).
For fair comparison, all algorithms are implemented in the same environment using TensorFlow, mainly based on the source code of DVBPR (https://github.com/kang205/DVBPR/tree/b91a21103178867fb70c8d2f77afeeb06fefd32c), following their parameter settings, including learning rate, regularization, batch size and latent dimension. We will discuss the effects of the category prior knowledge dimensions and the other hyper-parameters in Fig 5.
4.5. Comparison with State-of-the-arts
We aim to show the performance gain from both the incorporation and the effective utilization of valuable visual and category information. Results are shown in Table 4 (the AUC in the first 3 columns is from (He and McAuley, 2016)).

First, we observe apparent performance gains from additional information along the three groups, from "No Image" to "With Image" to "With Image + Category", especially on cold items. The gain from BPR-MF to group 2 validates the importance of visual features for CTR prediction. The gain from VBPR to DVBPR supports the significance of end-to-end training, which is one of our main motivations. And the gain from group 2 to group 3 validates the importance of category information.
Second, by further comparing AUC within group 3, where all algorithms make use of category information, we find that CSCNN outperforms all. The performance gain lies in the different strategies to use category information. Specifically, Sherlock and DeepStyle incorporate the category into a category-specific linear module used at the end of image embedding. While in DVBPR-C, items from the same category share the latent feature and . All of them late fuse the category information into the model after visual feature extraction. In contrast, CSCNN early incorporates the category prior knowledge into convolutional layers, which enables category-specific inter-channel and inter-spatial dependency learning.
Third, CSCNN also outperforms DVBPR-SCA, a strong baseline we created by modifying an image captioning algorithm. Although DVBPR-SCA also early-fuses the category priors into the convolutional layers through an attention mechanism, it lacks effective structures to learn the inter-channel and inter-spatial relationships. In contrast, the FC and convolutional structures in our channel and spatial attention modules are able to capture this channel-wise and spatial interdependency.
4.6. Adaptation to Various Attentions
Our key strategy is to use the category prior knowledge to guide the attention. In principle, this strategy can improve any self-attention module whose attention weights are originally learnt only from the feature map. To validate the adaptability of CSCNN to various attention mechanisms, we test it on three popular self-attention structures: SE (Hu et al., 2018b), CBAM-Channel and CBAM-All (Woo et al., 2018). Their results with self-attention are shown in Table 5, left. We then slightly modify their attention modules using CSCNN, i.e. incorporating the category prior knowledge. Results are in Table 5, right.
Our proposed attention mechanism with category prior knowledge (right) significantly outperforms their self attention counterparts (left) in all 3 architectures, validating our adaptability to different attention mechanisms.
4.7. Adaptability to Various Network Backbones
As mentioned, CSCNN can be easily adapted to any network backbone by replacing the input of the next layer from the feature map to the refined feature map. We now test the adaptability of CSCNN to Inception V1 (Szegedy et al., 2015); results are shown in Table 6.

CSCNN achieves consistent improvement over CBAM on both CNN-F and Inception V1, which validates our adaptability to different backbones. This improvement also reveals another interesting fact: even for complicated networks (deeper than CNN-F), there is still much room to improve due to the absence of category-specific prior knowledge. This again supports our main motivation.
|Field||# Feats||#Vocab||Feature Example|
|Ad||14||20M||ad id, category, item price, review|
|User||6||400 M||user pin, location, price sensitivity|
|Time||3||62||weekday, hour, date|
|Text||13||40M||query, ad title, query title match|
|History||14||7 Bil||visited brands, categories, shops|
|#Users||#Items||#Interactions||#Categories|
|0.2 bil.||0.02 bil.||15 bil.||3310|
|Model||AUC|
|DCN||0.7441|
|DCN + CNN fixed||0.7463 (+0.0022)|
|DCN + CNN finetune||0.7500 (+0.0059)|
|DCN + CBAM finetune||0.7506 (+0.0065)|
|DCN + CSCNN||0.7527 (+0.0086)|
|Online A/B Test||CTR Gain||CPC Gain||eCPM Gain|
|CSCNN vs. DCN||+3.22%||−0.62%||+2.46%|
4.8. Effects of Hyper-Parameters
We introduced 3 hyper-parameters in CSCNN: the size of the category prior for channel-wise attention; the size of the category prior for spatial attention; and the number of layers equipped with the CSCNN module, i.e. we add CSCNN to the last convolutional layers of CNN-F. We examine their effects in Fig. 5.

When the category prior sizes are small, larger values result in higher AUC. This is because larger priors are able to carry more detailed information about the category, which further supports the significance of exploiting category prior knowledge in building the CNN. But priors that are too large suffer from overfitting; the optimal settings are shown in Fig. 5. Note that these settings are specific to this dataset: for datasets with more categories, we conjecture a larger optimal prior size.

Increasing the number of CSCNN-equipped layers helps: the more layers that use CSCNN, the better the performance. Furthermore, the continuous growth as more layers are equipped indicates that the AUC gains from CSCNN at different layers are complementary. However, adding CSCNN to conv-1 harms the feature extractor. This is because conv-1 learns general, low-level features such as lines, circles and corners, which need no attention and are naturally independent of the category.
4.9. Experiments On Real Production Dataset & Online A/B Testing
Our real production dataset is collected from the ad interaction logs of JD.com. We use logs from the first 32 days for training and sample 0.5 million interactions from the 33rd day for testing. The statistics of our ten-billion scale real production dataset are shown in Table 7.
We present our experimental results in Table 8. The performance gain from the fixed CNN validates the importance of visual features, and the gain from fine-tuning validates the importance of our end-to-end training system. The additional contribution of CBAM demonstrates that emphasizing meaningful features with an attention mechanism is beneficial. Our CSCNN goes one step further by early-incorporating the category prior knowledge into the convolutional layers, which enables easier inter-channel and inter-spatial dependency learning. Note that on the real production data, a 0.1% increase in off-line AUC is significant and brings a 6-million-dollar lift in the overall advertising income of JD.com.
From 2019-Feb-19 to 2019-Feb-26, online A/B testing was conducted in the ranking system of JD. CSCNN contributes 3.22% CTR (Click Through Rate) and 2.46% eCPM (Effective Cost Per Mille) gain compared to the previous DCN model online. Furthermore, CSCNN reduced CPC (Cost Per Click) by 0.62%.
5. Discussion & Conclusion
Apart from the category, what other features could also be adopted to lift the effectiveness of CNN in CTR prediction?
To meet the low-latency requirements of the online system, the CNN must be computed offline. Dynamic features, including the user and the query, therefore cannot be used. Price, sales and praise ratio are also inappropriate due to their lack of visual priors. Visually related item features, including brand, shop id and product words, could be adopted; effectively exploiting them to further lift the effectiveness of CNNs in CTR prediction would be a promising direction.
We proposed Category-specific CNN, specially designed for visual-aware CTR prediction in e-commerce. Our early-fusion architecture enables category-specific feature recalibration and emphasizes features that are both important and category related, which contributes to significant performance gain in CTR prediction tasks. With the help of a highly efficient infrastructure, CSCNN has now been deployed in the search advertising system of JD.com, serving the main traffic of hundreds of millions of active users.
References

- Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
- Deep CTR prediction in display advertising. In Proceedings of the 24th ACM International Conference on Multimedia, pp. 811–820.
- SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. pp. 5659–5667.
- Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10.
- Global second-order pooling convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3024–3033.
- Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- Sherlock: sparse hierarchical embeddings for visually-aware one-class collaborative filtering. arXiv preprint arXiv:1604.05813.
- VBPR: visual Bayesian personalized ranking from implicit feedback. In Thirtieth AAAI Conference on Artificial Intelligence.
- Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9.
- Gather-excite: exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 9401–9411.
- Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
- Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 207–216.
- Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1970–1979.
- DeepStyle: learning user preferences for visual recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 841–844.
- Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52.
- Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1222–1230.
- Image feature learning for cold start problem in display advertising. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
- BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence.
- Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- Online multi-task collaborative filtering for on-the-fly recommender systems. In Proceedings of the 7th ACM Conference on Recommender Systems, pp. 237–244.
- Deep & cross network for ad click predictions. In Proceedings of the ADKDD'17, pp. 12.
- CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19.
- Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
- Learning compositional, visual and relational representations for CTR prediction in sponsored search. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2851–2859.
- Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29.
- What you look matters? Offline evaluation of advertising creatives for cold-start problem. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2605–2613.
Appendix A Additional Experimental Results
In this appendix, we present additional experimental results that support our claims and demonstrate consistency with results reported in related works.
In all our previous ablation studies (Tables 4, 5 and 6) and in training for our online serving system, we used a point-wise loss; that is, each impression is used as an independent training instance for binary classification. Although this is the default setting for the CTR prediction problem, we find that some existing works, also tested on Amazon, use a pair-wise loss (e.g. (Rendle et al.; He and McAuley, 2016; Kang et al., 2017)).
Specifically, a model with pair-wise loss is trained on a dataset of triplets $(u, i, j)$, where each triplet indicates that user $u$ prefers ad $i$ to ad $j$. Following Bayesian Personalized Ranking (BPR), the objective function is defined as

$$\max_{\Theta} \sum_{(u,i,j)} \ln \sigma\left(\hat{y}_{ui} - \hat{y}_{uj}\right) - \lambda_{\Theta} \|\Theta\|^2,$$

where $\sigma$ is the sigmoid function, $\hat{y}_{ui}$ is the predicted preference score of user $u$ for ad $i$, and $\lambda_{\Theta} \|\Theta\|^2$ is the regularization on all parameters $\Theta$.
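The BPR objective can be sketched as a loss to minimize (the negated objective). This is an illustrative implementation, not the paper's training code; `bpr_loss` and its arguments are hypothetical names for the preferred/non-preferred scores and the parameter list.

```python
import numpy as np

def bpr_loss(y_pos, y_neg, params, lam=0.01):
    """Negative BPR objective: y_pos are scores of the preferred ads,
    y_neg of the non-preferred ads, params are model parameters for L2."""
    # ln sigma(y_pos - y_neg), computed as -softplus(-(margin)) for stability
    log_sigmoid = -np.log1p(np.exp(-(y_pos - y_neg)))
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return -np.sum(log_sigmoid) + reg  # minimizing this maximizes BPR

# The loss falls as the preferred ad's score pulls ahead of the other's.
params = [np.array([0.1, -0.3])]
loose = bpr_loss(np.array([0.2]), np.array([0.1]), params)
tight = bpr_loss(np.array([2.0]), np.array([0.1]), params)
assert tight < loose
```

Only the score difference $\hat{y}_{ui} - \hat{y}_{uj}$ enters the loss, which is exactly why pair-wise training fixes the ranking but not the absolute scale of the scores.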
To make direct comparison with existing results of pair-wise loss on the same dataset, here we re-do our ablation studies using pair-wise loss, with all other settings identical to that in Table 4, 5 and 6. The results are shown in Table 9, 10 and 11.
From these results, we can draw several observations. First, comparing our results in Tables 9, 10 and 11 with those reported in related works, we confirm that our reimplementation of the pair-wise loss and of the compared algorithms is comparable to, and sometimes better than, the original literature. This validates our consistency and repeatability. Second, our CSCNN framework outperforms all compared algorithms in Tables 9, 10 and 11, validating its advantages. This superiority indicates that all our previous claims for point-wise loss still hold for pair-wise loss based models: plugging the CSCNN framework into various self-attention mechanisms and network backbones brings consistent improvement. Third, when comparing performance across pair-wise and point-wise losses (Table 9 vs. 4, Table 10 vs. 5 and Table 11 vs. 6), we find that neither loss enjoys absolute superiority over the other in terms of AUC. However, AUC only measures the relative preference between ads, not the absolute scale of the predicted scores. In the practical advertising industry, point-wise loss is usually preferred, since the score trained in binary classification directly reflects the click-through rate.
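The scale-blindness of AUC noted above is easy to demonstrate: AUC depends only on the ranking of scores, so any monotone transform, however badly it distorts the CTR scale, leaves it unchanged. A minimal sketch (the `auc` helper is our own illustrative implementation of pairwise AUC, not a library call):

```python
import numpy as np

def auc(scores, labels):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked
    correctly, counting ties as half a correct pair."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    correct = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (correct + 0.5 * ties) / (len(pos) * len(neg))

labels = np.array([1, 0, 1, 0, 0])
calibrated = np.array([0.9, 0.2, 0.4, 0.5, 0.1])  # scores on the CTR scale
stretched = calibrated ** 3                       # monotone distortion
assert auc(calibrated, labels) == auc(stretched, labels)
```

Both score vectors give identical AUC, yet only the calibrated one could be multiplied by a bid to produce a meaningful eCPM, which is why binary-classification training on point-wise loss is preferred in production.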
Appendix B Statistics of Our Real Production Datasets
In this appendix, we show some specific statistics of our real production datasets; see Figure 6. Most of the features are extremely sparse (e.g. 80% of the queries appear fewer than 6 times in the training set) and follow a long-tail distribution. User pin and price are relatively evenly distributed, but even there 10% of the feature values cover 50% of the training set. As claimed in earlier studies (Yang et al., 2019), visual features contribute more when other features are extremely sparse. These statistics illustrate the difficulty of modeling on the real production dataset and the contribution of our methods.
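A coverage statistic of the kind quoted above (top 10% of feature values covering 50% of impressions) can be computed as below. The data here is a synthetic power-law frequency distribution, not our production data; `coverage_of_top` is a hypothetical helper name.

```python
import numpy as np

def coverage_of_top(counts, top_frac):
    """Share of impressions covered by the top `top_frac` fraction of
    feature values, ranked by frequency."""
    counts = np.sort(counts)[::-1]            # most frequent first
    k = max(1, int(len(counts) * top_frac))
    return counts[:k].sum() / counts.sum()

# Deterministic long-tailed frequencies: count(rank) ~ rank^-1.5.
ranks = np.arange(1, 1001, dtype=float)
counts = 1e6 * ranks ** -1.5
share = coverage_of_top(counts, 0.10)
assert share > 0.9  # a small head of values dominates the impressions
```

For a heavy-tailed feature, a small head of values dominates the sum, while for an evenly distributed feature like user pin the same statistic stays close to `top_frac`, which is the contrast Figure 6 illustrates.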