Category-Specific CNN for Visual-aware CTR Prediction at JD.com

06/18/2020 · by Hu Liu, et al. · JD.com, Inc. and Tsinghua University

As one of the largest B2C e-commerce platforms in China, JD.com also powers a leading advertising system, serving millions of advertisers with fingertip connection to hundreds of millions of customers. In our system, as in most e-commerce scenarios, ads are displayed with images. This makes visual-aware Click Through Rate (CTR) prediction of crucial importance to both business effectiveness and user experience. Existing algorithms usually extract visual features using off-the-shelf Convolutional Neural Networks (CNNs) and late fuse the visual and non-visual features for the final predicted CTR. Despite being extensively studied, this field still faces two key challenges. First, although encouraging progress has been made in offline studies, applying CNNs in real systems remains non-trivial, due to the strict requirements for efficient end-to-end training and low-latency online serving. Second, the off-the-shelf CNNs and late-fusion architectures are sub-optimal. Specifically, off-the-shelf CNNs were designed for classification and thus never take categories as input features, while in e-commerce, categories are precisely labeled and contain abundant visual priors that help visual modeling. Unaware of the ad category, these CNNs may extract unnecessary category-unrelated features, wasting the CNN's limited expressive capacity. To overcome the two challenges, we propose Category-specific CNN (CSCNN), designed specially for CTR prediction. CSCNN early incorporates the category knowledge with a light-weighted attention module on each convolutional layer. This enables CSCNN to extract expressive category-specific visual patterns that benefit the CTR prediction. Offline experiments on benchmark datasets and a 10 billion scale real production dataset from JD, together with an online A/B test, show that CSCNN outperforms all compared state-of-the-art algorithms.




1. Introduction

As one of the largest B2C e-commerce platforms in China, JD.com also powers a leading advertising system, serving millions of advertisers with fingertip connection to hundreds of millions of customers. Every day, customers visit JD, click ads and leave billions of interaction logs. These data not only feed the learning system, but also drive technical innovations that keep lifting both user experience and advertisers' profits on JD.com.

In the commonly used cost-per-click (CPC) advertising system, ads are ranked by effective cost per mile (eCPM), the product of bid price given by advertisers and the CTR predicted by the ad system. Accurate CTR prediction benefits both business effectiveness and user experience. Thus, this topic has attracted widespread interest in both machine learning academia and e-commerce industry.

Nowadays, most of the ads on e-commerce platforms are displayed with images, since they are more visually appealing and convey more details compared to textual descriptions. An interesting observation is that many ads get significantly higher CTR by only switching to more attractive images. This motivates a variety of emerging studies on extracting expressive visual features for CTR prediction (Chen et al., 2016; Mo et al., 2015). These algorithms adopt various off-the-shelf CNNs to extract visual features and fuse them with non-visual features (e.g., category, user) for the final CTR prediction. With the additional visual features, these algorithms significantly outperform their non-visual counterparts in offline experiments and generalize well to cold and long-tailed ads. Although encouraging progress has been made in offline studies, applying CNNs in real online advertising systems remains non-trivial. The offline end-to-end training with a CNN must be efficient enough to follow the time-varying online distribution, and the online serving needs to meet the low latency requirements of the advertising system.

Furthermore, we notice that visual feature extraction in e-commerce is significantly different from the image classification setting where off-the-shelf CNNs were originally proposed. In classification, categories are regarded as the target to predict. While in e-commerce, categories of ads are clearly labeled, which contain abundant visual priors and will intuitively help visual modeling. Some academic studies have integrated the categorical information by building category-specific projecting matrix on top of the CNN embeddings

(He et al., 2016b) and by explicitly decomposing visual features into styles and categories (Liu et al., 2017). These studies share a common architecture: the late fusion of visual and categorical knowledge, which, however, is sub-optimal for CTR prediction. Namely, the image embedding modules seldom take advantage of the categorical knowledge. Unaware of the ad category, the embedding extracted by these CNNs may contain unnecessary features not related to this category, wasting the CNN's limited expressive capacity. In contrast, if the ad category is integrated, the CNN only needs to focus on the category-specific patterns, which eases the training process.

To overcome the industrial challenges, we build optimized infrastructure for both efficient end-to-end CNN training and low latency online serving. Based on this efficient infrastructure, we propose Category-specific CNN (CSCNN) specially for the CTR prediction task, to fully utilize the labeled categories in e-commerce. Our key idea is to incorporate the ad category knowledge into the CNN in an early-fusion manner. Inspired by SE-net (Hu et al., 2018b) and CBAM (Woo et al., 2018), which model the inter-dependencies between convolutional features with a light-weighted self-attention module, CSCNN further incorporates the ad category knowledge and performs a category-specific feature recalibration, as shown in Fig. 2. More specifically, we sequentially apply category-specific channel and spatial attention modules to emphasize features that are both important and category related. These expressive visual features contribute to a significant performance gain in the CTR prediction problem.

In summary, we make the following contributions:

  • To the best of our knowledge, we are the first to highlight the negative impact of late fusion of visual and non-visual features in visual-aware CTR prediction.

  • We propose CSCNN, a novel visual embedding module specially for CTR prediction. The key idea is to conduct category-specific channel and spatial self-attention to emphasize features that are both important and category related.

  • We validate the effectiveness of CSCNN through extensive offline experiments and an online A/B test. We verify that the performance of various self-attention mechanisms and network backbones is consistently improved by plugging in CSCNN.

  • We build highly efficient infrastructure to apply CNNs in a real online e-commerce advertising system. Effective acceleration methods are introduced to accomplish the end-to-end training with the CNN on the 10 billion scale real production dataset within 24 hours, and to meet the low latency requirements of the online system (20ms on CPU). CSCNN has now been deployed in the search advertising system of JD.com, one of the largest B2C e-commerce platforms in China, serving the main traffic of hundreds of millions of active users.

Figure 1. The Architecture of our CTR Prediction System. Bottom left: the proposed CSCNN, which embeds an ad image, together with its category, into a visual feature vector. Note that CSCNN only runs offline; in the online serving system, to meet the low latency requirement, we use an efficient lookup table instead. Bottom right: non-visual feature embedding, from (ad, user, contexts) to a non-visual feature vector. Top: the main architecture, a modified DCN, which takes both the visual feature and the non-visual feature as inputs.

2. Related Work

Our work is closely related to two active research areas: CTR prediction and attention mechanism in CNN.

2.1. CTR Prediction

The aim of CTR prediction is to predict the probability that a user clicks an ad given certain contexts. Accurate CTR prediction benefits both user experience and advertisers' profits, and is thus of crucial importance to the e-commerce industry.

Pioneer works in CTR or user preference prediction are based on linear regression (LR) (McMahan et al., 2013), matrix factorization (MF) (Wang et al., 2013) and decision trees (He et al., 2014). Recent years have witnessed many successful applications of deep learning in CTR prediction (Cheng et al., 2016; Wang et al., 2017). Early works usually make use of non-visual features only, which, however, is insufficient nowadays. Currently, most ads are displayed with images, which contain plentiful visual details and largely affect users' preference. This motivates many emerging studies on visual-aware CTR prediction (Chen et al., 2016; He and McAuley, 2016; Kang et al., 2017; He et al., 2016b; Liu et al., 2017; Yang et al., 2019; Zhao et al., 2019). They first extract visual features using various off-the-shelf CNNs, and then fuse the visual features with non-visual features, including categories, to build the preference predictor. Unfortunately, this late fusion is sub-optimal, or even a waste, in the e-commerce scenario, where categories are clearly labeled and contain abundant visual priors that can help visual feature extraction.

In contrast to existing works with late fusion, CSCNN differs fundamentally in early incorporating categorical knowledge into the convolutional layers, allowing easy category-specific inter-channel and inter-spatial dependency learning.

Table 1. Important Notations Used in Section 3. (The symbol column was lost in extraction.) The entries cover: class label; sigmoid; predicted CTR; dataset; real number set; feature dimension; loss; feature vector; prediction function; layer index; non-visual features; hidden layer; visual features; cross layer and cross layer parameter; concatenation; embedding dimension; embedded feature; embedding dictionary; vocabulary size; one/multi-hot coding; ad image; ad category; # layers; spatial category prior; channel category prior; # channels; width; height; category set; original feature map; refined feature map; attention map; element-wise product.

2.2. Attention Mechanism in CNN

Attention mechanism is an important feature selection approach that helps a CNN to emphasize important parts of feature maps and suppress unimportant ones. Spatial attention tells where to focus (Woo et al., 2018) and channel-wise attention tells what to focus on (Hu et al., 2018b).

In the literature, many works have attempted to learn the attention weights from the feature map itself, termed self-attention. State-of-the-art algorithms include CBAM (Woo et al., 2018) and SE (Hu et al., 2018b), among others (Hu et al., 2018a; Gao et al., 2019). Besides self-attention, the attention weights can also be conditioned on external information, for example natural language. Successful application fields include search by language (Li et al., 2017), image captioning (Xu et al., 2015; Chen et al., 2017) and visual question answering (Yang et al., 2016).

Our work is motivated by the attention mechanism. Rather than vision & language, we design novel architectures to adapt attention mechanism to address an important but long overlooked issue, the sub-optimal late fusion of vision and non-vision features in CTR prediction. We combine the advantages of both self-attention and attention conditioned on external information, namely the ad category. As a result, our image embedding is able to emphasize features that are both important and category related.

3. The CTR Prediction System in JD.com

We first review the background of the CTR prediction in Section 3.1. Then we describe the architecture of our CTR prediction system in Section 3.2. We further dig into details of our novel visual modeling module, the Category-Specific CNN in Section 3.3. Finally, we introduce essential accelerating strategies for online deployment in Section 3.4. The notations are summarized in Table 1.

3.1. Preliminaries

In the online advertising industry, when an ad is shown to a user under some contexts, this scenario is counted as an impression. The aim of CTR prediction is to predict the probability that a positive feedback, i.e. click, takes place in an impression (ad, user, contexts). Accurate CTR prediction directly benefits both the user experience and business effectiveness, which makes this task of crucial importance to the whole advertising industry.

CTR prediction is usually formulated as binary classification. Specifically, the goal is to learn a prediction function f from a training set D = {(x₁, y₁), …, (x_N, y_N)}, where xᵢ is the feature vector of the i-th impression and yᵢ ∈ {0, 1} is the class label that denotes whether a click takes place.

The objective function is defined as the negative log-likelihood:

    loss = −(1/N) Σᵢ [ yᵢ log(ŷᵢ) + (1 − yᵢ) log(1 − ŷᵢ) ],

where ŷᵢ is the predicted CTR, scaled to (0, 1) by the sigmoid function σ:

    ŷᵢ = σ(f(xᵢ)) = 1 / (1 + exp(−f(xᵢ))).
3.2. The Architecture of CTR Prediction System

We now describe the architecture of our CTR prediction system that is serving on JD.com. Details are shown in Fig 1.

Figure 2. Our proposed Category-Specific CNN framework. Note that CSCNN can be added to any single convolutional layer, but for ease of illustration, we only show details of a single layer. Top: a map from the category to the category prior knowledge that affects the channel-wise and spatial attentions. Bottom: F is the output feature map of the current convolutional layer. Refined sequentially by channel-wise and spatial attention, the new feature map F″ is used as the input to the next layer.

3.2.1. Deep & Cross Network

Deep & Cross network (DCN) (Wang et al., 2017) has achieved promising performance thanks to its ability to learn effective feature interactions. Here, we modify the DCN to take two inputs, a non-visual feature vector x_nv and a visual feature vector x_v.

The visual feature is incorporated into the deep net. In layer 1, we transform the non-visual feature to 1024 dimensions and concatenate it with the visual feature:

    h₁ = [ReLU(W₀ x_nv + b₀); x_v].

Two deep layers follow:

    h_{l+1} = ReLU(W_l h_l + b_l),  l ∈ {1, 2}.

The cross net is used to process the non-visual feature:

    x_{l+1} = x₀ x_lᵀ w_l + b_l + x_l,

where the input x₀ = x_nv, for layers l = 0, …, L − 1.

Finally, we combine the two outputs for the predicted CTR:

    ŷ = σ(w_outᵀ [h₃; x_L]).
3.2.2. Non-visual Feature Embedding

We now describe the embedding layer that transforms the raw non-visual features of an impression, namely (ad, user, contexts), to the non-visual feature vector x_nv.

We assume that all features come in categorical form (after preprocessing, e.g. binning). Usually, a categorical feature is encoded as a one-hot / multi-hot vector v ∈ {0, 1}^V, where V is the vocabulary size of this feature. We show two examples below:

WeekDay=Wed          [0,0,0,1,0,0,0]

TitleWords=[Summer, Dress]       […,0,1,0,…,0,1,0…]

Unfortunately, this one/multi-hot coding is not applicable to industrial systems due to the extremely high dimensionality and sparsity. We thus adopt a low dimensional embedding strategy in our system:

    x_emb = E v,

where E is the embedding dictionary for this specific feature, with one column per vocabulary entry and as many rows as the embedding size. We then concatenate the x_emb's of all features to build x_nv.

In practice, our system makes use of 95 non-visual features from users (historical clicks /purchases, location etc.), ads (category, title, # reviews etc.) and rich contexts (query words, visit time etc.) with 7 billion vocabularies in total. Setting , the total dimension is . We will further introduce the features and their statistics in Table 8 Appendix B.
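The embedding-and-concatenate step can be sketched as follows. For a one-hot feature this selects a single column of the dictionary; for a multi-hot feature (e.g. title words) the active columns are sum-pooled, which is an assumption here rather than a detail stated in the text:

```python
import numpy as np

def embed_feature(active_ids, E):
    """Embedding lookup for one categorical feature.

    active_ids: indices of the hot entries (a single id for a one-hot
                feature, several ids for a multi-hot feature).
    E: (embedding_size, vocab_size) embedding dictionary.
    """
    return E[:, active_ids].sum(axis=1)

def embed_impression(feature_ids, dictionaries):
    """Concatenate the per-feature embeddings into x_nv."""
    return np.concatenate([embed_feature(ids, E)
                           for ids, E in zip(feature_ids, dictionaries)])
```

Because only the active columns are touched, the cost is independent of the vocabulary size, which is what makes billion-scale vocabularies tractable.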

Figure 3. Modules of our proposed Category-Specific CNN: Channel-wise Attention (top) and Spatial Attention (bottom).

3.3. Category-Specific CNN

Conventional CTR prediction systems mostly embed ad images using off-the-shelf CNNs. We say off-the-shelf since they were originally designed for classification, not for CTR prediction. They regard the image category as the target to predict, not as inputs. This is actually a huge waste on e-commerce platforms, where categories are precisely labeled and contain plentiful visual prior knowledge that would help visual modeling.

We address this issue by proposing a novel CNN specifically for CTR prediction, the Category-Specific CNN, which embeds an ad image, together with its ad category, into the visual feature vector. Specifically, the category prior knowledge is encoded as category embeddings (trained jointly with the CTR model) and incorporated into the CNN using a conditional attention mechanism.

Theoretically, CSCNN can be applied to any convolutional layer in any network. In our system, we plug CSCNN into ResNet18 (He et al., 2016a) and discuss its adaptability to other networks in the ablation studies.

3.3.1. Framework on A Single Convolutional Layer

For each category and each convolutional layer, CSCNN learns a tensor A_c that encodes the impact of the category prior knowledge on the channel-wise attention for this layer. We omit the layer subscript for conciseness. The framework is shown in Fig 2.

Given an intermediate feature map F, the output of the current convolutional layer, CSCNN first learns a channel attention map M_c conditioned on both the current feature map and the category. The channel-wise attention is then multiplied with the feature map to acquire a refined feature map F′:

    F′ = M_c(F, A_c) ⊗ F,

where ⊗ denotes the element-wise product, with M_c broadcast along the spatial dimensions.

Similarly, CSCNN also learns another tensor A_s that encodes the category prior knowledge for the spatial attention M_s. The two attention modules are applied sequentially to get the 3D refined feature map F″:

    F″ = M_s(F′, A_s) ⊗ F′,

where the spatial attention is broadcast along the channel dimension before the element-wise product. A practical concern is the large number of parameters in A_s, especially on the first few layers. To address this problem, we only learn a much smaller tensor A_s′ of size H′ × W′, with H′ ≪ H and W′ ≪ W, and then resize it to H × W through linear interpolation. The effects of H′ and W′ are discussed with experimental results later. Note that A_c and A_s are randomly initialized and learnt during training; no additional category prior knowledge is needed except the category id.

After being refined by both channel-wise and spatial attention, F″ is fed to the next layer. Note that CSCNN can be added to any CNN by simply replacing the input to the next layer from F to F″.
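The parameter-saving trick of learning a small spatial prior and resizing it to the full feature-map resolution can be sketched with a separable linear interpolation in numpy (shapes and names are illustrative):

```python
import numpy as np

def resize_bilinear(a, out_h, out_w):
    """Linearly interpolate a small (h', w') prior up to (H, W),
    one axis at a time (separable bilinear interpolation)."""
    h, w = a.shape
    rows = np.linspace(0, h - 1, out_h)
    cols = np.linspace(0, w - 1, out_w)
    # Interpolate along columns first, then along rows.
    tmp = np.stack([np.interp(cols, np.arange(w), a[i]) for i in range(h)])
    out = np.stack([np.interp(rows, np.arange(h), tmp[:, j])
                    for j in range(out_w)], axis=1)
    return out
```

Storing only the small tensor and upsampling on the fly keeps the per-layer, per-category parameter count proportional to h′ × w′ rather than H × W.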

3.3.2. Category-specific Channel-wise Attention

Channel-wise attention tells “what” to focus on. In addition to the inter-channel relationship considered previously, we also exploit the relationship between category prior knowledge and features (Fig 3, top).

To gather spatial information, we first squeeze the spatial dimensions of the feature map through max and average pooling. The advantage of adopting both is supported by the experiments in CBAM. The two squeezed feature maps are then concatenated with the category prior knowledge A_c and forwarded through a shared two-layer MLP, reducing the dimension back to the number of channels. Finally, we merge the two branches by element-wise summation:

    M_c(F, A_c) = σ( MLP([AvgPool(F); A_c]) + MLP([MaxPool(F); A_c]) ).
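A minimal numpy sketch of this channel module, assuming CBAM-style pooling with the category embedding concatenated before a shared two-layer MLP; shapes, weight names, and the bottleneck size are illustrative:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
relu = lambda z: np.maximum(z, 0.0)

def channel_attention(F, cat_prior, W1, W2):
    """F: (C, H, W) feature map; cat_prior: (C',) category embedding.

    Squeeze the spatial dims by avg and max pooling, concatenate the
    category prior to each squeezed vector, push both through a shared
    two-layer MLP, and merge by summation before the sigmoid.
    """
    avg = F.mean(axis=(1, 2))            # (C,)
    mx = F.max(axis=(1, 2))              # (C,)
    mlp = lambda v: W2 @ relu(W1 @ np.concatenate([v, cat_prior]))
    return sigmoid(mlp(avg) + mlp(mx))   # (C,) channel weights M_c

def refine(F, m_c):
    # Broadcast M_c along the spatial dimensions: F' = M_c ⊗ F.
    return F * m_c[:, None, None]
```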
3.3.3. Category-specific Spatial Attention

Our spatial attention module is illustrated in Fig 3 (bottom). Spatial attention tells where to focus by exploiting the inter-spatial relationship of features. Inspired by CBAM, we first aggregate channel-wise information of the feature map by average pooling and max pooling along the channel dimension. To incorporate the category prior knowledge, these two maps are then concatenated with A_s to form a stacked feature map. Finally, this feature map is forwarded through a convolutional filter to get the attention weights M_s.
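A simplified numpy sketch of the spatial module; a 1×1 channel-mixing filter stands in for the larger convolution used in practice, and all names are illustrative:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def spatial_attention(F, cat_prior_map, kernel):
    """F: (C, H, W) feature map; cat_prior_map: (k, H, W) spatial prior.

    Aggregate channel info by avg/max pooling along the channel axis,
    stack the category prior on top, and convolve down to a single
    attention weight per spatial location.
    """
    avg = F.mean(axis=0, keepdims=True)                 # (1, H, W)
    mx = F.max(axis=0, keepdims=True)                   # (1, H, W)
    stacked = np.concatenate([avg, mx, cat_prior_map])  # (2+k, H, W)
    # 1x1 convolution expressed as a dot product over the stacked maps.
    return sigmoid(np.tensordot(kernel, stacked, axes=([0], [0])))  # (H, W)
```

The resulting (H, W) map is broadcast along the channel dimension before the element-wise product with the feature map.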

Algorithm # params/M #GFLOPs
Res18 17.9961 1.8206
Res18 + CBAM 18.6936 1.8322
Res18 + CSCNN 21.6791 1.8329
Table 2. # Parameters and # GFLOPs of CSCNN and Baselines. We use ResNet18 as the baseline and as the backbone network for the CBAM and CSCNN modules. Note that CSCNN adds only 0.03% computation over CBAM.

3.3.4. Complexity Analysis

Note that CSCNN is actually a light-weighted module. Specifically, we show the number of parameters and giga floating-point operations (GFLOPs) of Baseline, CBAM and our proposed algorithm in Table 2.

We set the bottleneck reduction ratio to 4, with the # categories as in the real production dataset (Table 7). In the "Shared FC" of each convolutional layer in CBAM, # parameters scales with the channel count; for CSCNN, the FC and the channel category embedding add further parameters. The # params increased compared to CBAM in the channel attention of one conv layer is about 68k, and the # additional params in the spatial attention is 120k. So the total increase is (120k + 68k) × 16 layers = 3.0M. The additional parameters introduced by us are acceptable, and the additional computation is only 0.03% compared to CBAM.

Figure 4. The architecture of the online model system.

3.4. System Deployment

We deploy CSCNN in the search advertising system of JD.com, the largest B2C e-commerce company in China, serving the main traffic of hundreds of millions of active users. Fig. 4 depicts the architecture of our online model system.

3.4.1. Offline training

CSCNN is trained jointly with the whole CTR prediction system, on our ten-billion scale real production dataset collected over the last 32 days. In our preliminary investigation, the CNN is the key computational bottleneck during training. Taking the ResNet18 network with input images sized 224×224, a single machine with 4 P40 GPUs can only train on 177 million images per day. This means that even considering CSCNN alone, and assuming linear speedup in distributed training, we would need 226 P40 GPUs to complete the training on the ten-billion impressions within 1 day, which is too expensive. To accelerate, we adopt the sampling strategy in (Chen et al., 2016). At most 25 impressions with the same ad are gathered in one batch. The image embedding of each image is computed only once and broadcast to the multiple impressions in this batch. Now, with 28 P40 GPUs, training can be finished in 1 day.
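The dedup-and-broadcast batching described above might look like the following numpy sketch, where `embed_fn` stands in for the CNN forward pass and the function names are hypothetical:

```python
import numpy as np

def embed_batch(images, ad_ids, embed_fn):
    """Run the CNN once per distinct ad in the batch and broadcast
    the resulting embedding to every impression of that ad."""
    unique_ids, inverse = np.unique(ad_ids, return_inverse=True)
    # Pick one representative image index per distinct ad id.
    first_idx = [np.where(np.asarray(ad_ids) == u)[0][0] for u in unique_ids]
    unique_emb = np.stack([embed_fn(images[i]) for i in first_idx])
    # Broadcast: each impression gets its ad's embedding.
    return unique_emb[inverse]
```

With up to 25 impressions per ad in a batch, this cuts the number of CNN forward passes by up to 25×, which is where the reported GPU saving comes from.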

3.4.2. Offline inferring:

Images and categories are fed into the well-trained CSCNN to infer visual features. The features are compiled into a lookup table and then loaded into the predictor's memory to replace the CSCNN. After dimension reduction and frequency control, a 20 GB lookup table can cover over 90% of the next day's impressions.
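A minimal sketch of this offline-inference-plus-lookup pattern; the dict-based table and function names are illustrative (the production table is a 20 GB serialized structure, not a Python dict):

```python
def build_lookup_table(ads, embed_fn):
    """Precompute visual features offline; online serving only reads
    the table keyed by ad id and never runs the CNN."""
    return {ad_id: embed_fn(image, category)
            for ad_id, image, category in ads}

def serve_visual_feature(table, ad_id, default):
    # Fall back to a default vector for ads missing from the table
    # (e.g. the <10% of impressions not covered after frequency control).
    return table.get(ad_id, default)
```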

3.4.3. Online serving:

Once a request is received, the visual feature is fetched directly from the lookup table by ad id, and the predictor returns an estimated CTR. Under a throughput of over 3 million items per second at traffic peak, the tp99 latency of our CPU online serving system stays below 20ms.

Dataset #Users #Items # Interact #Category
Fashion 64,583 234,892 513,367 49
Women 97,678 347,591 827,678 87
Men 34,244 110,636 254,870 62
Table 3. Amazon Benchmark Dataset Statistics.

4. Experimental Results

We examine the effectiveness of both our proposed visual modeling module CSCNN and the whole CTR prediction system. The experiments are organized into two groups:

  • Ablation studies on CSCNN aim to eliminate the interference from the huge system. We thus test the category-specific attention mechanism by plugging it into light-weighted CNNs with a very simple CTR prediction model. We use popular benchmark datasets for repeatability.

  • We further examine the performance gain our CTR prediction system acquires from the novel visual modeling module. Experiments include both off-line evaluations on a ten-billion scale real production dataset collected from ad click logs (Table 7), and online A/B testing on the real traffic of hundreds of millions of active users on JD.com.

4.1. Ablation Study Setup

Our ablation study is conducted on the “lightest” model. This helps to eliminate the interference from the huge CTR prediction system and focus on our proposed category-specific attention mechanism.

Specifically, our light-weighted CTR model follows the Matrix Factorization (MF) framework VBPR (He and McAuley, 2016), since it has achieved state-of-the-art performance among light models. The preference score of user u for ad a is predicted as:

    ŷ_{u,a} = α + β_u + β_a + γ_uᵀ γ_a + θ_uᵀ CNN(m_a, k_a),     (12)

where α is an offset, β_u and β_a are the biases, γ_u and γ_a are the latent features of u and a, θ_u encodes the latent visual preference of u, and CNN is the light-weighted CNN that embeds the ad image m_a with its category k_a.

Following VBPR (He and McAuley, 2016), we use CNN-F (Chatfield et al., 2014) as the base CNN, which consists of only 5 convolutional layers and 3 fully connected layers. We plug CSCNN onto layers conv-2 to conv-5. For comprehensive analysis, we further test the effect of plugging attention modules into different layers (Figure 5) and our adaptability to other CNN structures (Table 6) in the following sections.
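The VBPR-style score used in the ablation model can be sketched as follows, assuming the standard VBPR decomposition (offset, user/ad biases, latent interaction, and a visual term); the parameter names are illustrative:

```python
import numpy as np

def preference(alpha, beta_u, beta_a, gamma_u, gamma_a, theta_u, cnn_emb):
    """Preference score of a user for an ad:
    offset + user bias + ad bias + latent match + visual match.

    cnn_emb is the embedding of the ad image (from the light-weighted
    CNN, with or without CSCNN attached).
    """
    return (alpha + beta_u + beta_a
            + gamma_u @ gamma_a       # non-visual latent interaction
            + theta_u @ cnn_emb)      # visual preference term
```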

                 No Image   With Image          With Image + Category
Datasets         BPR-MF     VBPR     DVBPR      DVBPR-C  Sherlock  DeepStyle  DVBPR-SCA  Ours
Fashion   All    0.6147     0.7557   0.8011     0.8022   0.7640    0.7530     0.8032     0.8156
          Cold   0.5334     0.7476   0.7712     0.7703   0.7427    0.7465     0.7694     0.7882
Women     All    0.6506     0.7238   0.7624     0.7645   0.7265    0.7232     0.7772     0.7931
          Cold   0.5198     0.7086   0.7078     0.7099   0.6945    0.7120     0.7273     0.7523
Men       All    0.6321     0.7079   0.7491     0.7549   0.7239    0.7279     0.7547     0.7749
          Cold   0.5331     0.6880   0.6985     0.7018   0.6910    0.7210     0.7048     0.7315
Table 4. Comparison with State-of-the-arts. For all algorithms, we report the mean over 5 runs with different random parameter initializations and instance permutations. The std is below 0.1%, so the improvement is extremely statistically significant under an unpaired t-test. CSCNN outperforms all, due to 3 advantages: the additional category knowledge, the early fusion of the category into the CNN, and effective structures to learn category-specific inter-channel and inter-spatial dependency.

4.2. Benchmark Datasets

The ablation study is conducted on 3 widely used benchmark datasets of Amazon products introduced in (McAuley et al., 2015).¹ We follow the identical category-tree preprocessing method used in (He et al., 2016b). The dataset statistics after preprocessing are shown in Table 3. (¹Many works also use Tradesy (He and McAuley, 2016), which is not suitable here due to the absence of category information.)

On all 3 datasets, for each user, we randomly withhold one action for validation, another for testing, and all the others for training, following the same split used in (He and McAuley, 2016). We report the test performance of the model with the best AUC on the validation set. When testing, we report performance on two sets of items: All items, and Cold items with fewer than 5 actions in the training set.

4.3. Evaluation Metrics

AUC measures the probability that a randomly sampled positive item has a higher preference score than a randomly sampled negative one:

    AUC = (1/|U|) Σ_u (1/|E(u)|) Σ_{(i,j)∈E(u)} 1[ŷ_{u,i} > ŷ_{u,j}],

where 1[·] is an indicator function, i and j are indexes for ads, and E(u) is the set of (positive, negative) ad pairs of user u.
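For a single user, AUC reduces to the fraction of correctly ordered (positive, negative) score pairs; a minimal numpy sketch, with illustrative names:

```python
import numpy as np

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs where the positive item
    receives the higher predicted score."""
    pos = np.asarray(pos_scores)[:, None]   # shape (P, 1)
    neg = np.asarray(neg_scores)[None, :]   # shape (1, N)
    return np.mean(pos > neg)               # mean over all P*N pairs
```

For example, positives scored [0.9, 0.8] against negatives [0.1, 0.95] order 2 of the 4 pairs correctly, giving an AUC of 0.5.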

In our ablation studies, algorithms are evaluated on AUC, which is almost the default off-line evaluation metric in the advertising industry. Empirically, when the CTR prediction model is trained in binary classification, off-line AUC directly reflects the online performance. At JD.com, every 1‰ increase in off-line AUC brings a 6-million-dollar lift in overall advertising income.

4.4. Compared Algorithms

The compared algorithms are either 1). representative in covering different levels of available information, or 2). reported to achieve state-of-the-art performance thanks to the effective use of category:

  • BPR-MF: Bayesian Personalized Ranking (BPR) (Rendle et al., 2009). No visual features; only includes the first 4 terms in Eq (12).

  • VBPR: BPR + visual. The visual features are extracted from pre-trained and fixed CNN (He and McAuley, 2016).

  • DVBPR: The visual feature extractor CNN is trained end-to-end together with the whole CTR prediction model (Kang et al., 2017).

  • DVBPR-C: DVBPR + category. The category information is late fused into MF by sharing latent features among items from the same category.

  • Sherlock: DVBPR + category. The category is used in a linear transform after the visual feature extractor (He et al., 2016b).

  • DeepStyle: Category embedding is subtracted from the visual feature to obtain style information (Liu et al., 2017).

  • SCA: This algorithm was originally designed for image captioning (Chen et al., 2017), where captioning features were used in visual attention. To make it a strong baseline for CTR prediction, we slightly modify this algorithm by replacing the captioning features with category embeddings, so that the category information is early fused into the CNN.

In the literature, some compared algorithms were originally trained with a pair-wise loss (see Appendix A), which, however, is not suitable for the industrial CTR prediction problem. For CTR prediction, the model should be trained in binary classification mode, or with a so-called point-wise loss, so that the scale of the prediction directly represents the CTR. For fair comparison on the CTR prediction problem, in this ablation study, all algorithms are trained with the point-wise loss. We also re-run all experiments with the pair-wise loss for consistent comparison with results in the literature (Appendix A).

For fair comparison, all algorithms are implemented in the same environment using TensorFlow, mainly based on the source code of DVBPR², following their parameter settings, including learning rate, regularization, batch size and latent dimension. We will discuss the effects of the category prior dimension hyper-parameters in Fig 5. (²For details, refer to the DVBPR source code.)

4.5. Comparison with State-of-the-arts

We aim to show the performance gain from both the incorporation and the effective utilization of valuable visual and category information. Results are shown in Table 4 (AUC values in the first 3 columns are from (He and McAuley, 2016)).

First, we observe apparent performance gains from additional information along the three groups, from "No Image" to "With Image" to "With Image + Category", especially on cold items. The gain from BPR-MF to group 2 validates the importance of visual features to CTR prediction. The gain from VBPR to DVBPR supports the significance of end-to-end training, which is one of our main motivations. And the gain from group 2 to group 3 validates the importance of category information.

Second, by further comparing AUC within group 3, where all algorithms make use of category information, we find that CSCNN outperforms all. The performance gain lies in the different strategies to use category information. Specifically, Sherlock and DeepStyle incorporate the category into a category-specific linear module applied at the end of image embedding, while in DVBPR-C, items from the same category share latent features. All of them late fuse the category information into the model after visual feature extraction. In contrast, CSCNN early incorporates the category prior knowledge into the convolutional layers, which enables category-specific inter-channel and inter-spatial dependency learning.

Third, CSCNN also outperforms DVBPR-SCA, a strong baseline we created by modifying an image captioning algorithm. Although DVBPR-SCA also early fuses the category priors into the convolutional layers through an attention mechanism, it lacks effective structures to learn the inter-channel and inter-spatial relationships, while the effective FC and convolutional structures in our channel and spatial attention modules are able to capture this channel-wise and spatial interdependency.

               Original            +CSCNN
               All      Cold       All      Cold
No Attention   0.7491   0.6985     —        —
SE             0.7500   0.6989     0.7673   0.7153
CBAM-Channel   0.7506   0.7002     0.7683   0.7184
CBAM-All       0.7556   0.7075     0.7749   0.7315
Table 5. Adaptability to Various Attention Mechanisms (Amazon Men). Left: self-attention nets. Right: our modified SE, CBAM-Channel and CBAM-All, incorporating the category prior knowledge into the attention using CSCNN. Results with CSCNN (right) significantly outperform the original results (left), validating our effectiveness and adaptability.
Figure 5. Effects of Hyper-Parameters (Amazon Men, all). Shown: the size of the category prior for channel attention, the size of the category prior for spatial attention, and the number of last layers to which CSCNN is applied.

4.6. Adaptation to Various Attentions

Our key strategy is to use the category prior knowledge to guide the attention. Theoretically, this strategy could improve any self-attention module whose attention weights are originally learnt only from the feature map. To validate the adaptability of CSCNN to various attention mechanisms, we test it on three popular self-attention structures: SE (Hu et al., 2018b), CBAM-Channel and CBAM-All (Woo et al., 2018). Their results with self-attention are shown in Table 5, left. We slightly modify their attention modules using CSCNN, i.e., incorporating the category prior knowledge. Results are in Table 5, right.

Our proposed attention mechanism with category prior knowledge (right) significantly outperforms their self attention counterparts (left) in all 3 architectures, validating our adaptability to different attention mechanisms.
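To make the modification concrete, the following is a minimal sketch of a category-conditioned channel attention in the spirit of SE, where the category embedding is concatenated to the pooled channel descriptor before the squeeze-and-excitation FC layers. All weight shapes and names here are illustrative assumptions, not the exact production architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def category_se_attention(feat, cat_emb, w1, w2):
    """Channel attention with early-fused category prior (a sketch).

    feat:    (C, H, W) convolutional feature map
    cat_emb: (d,) learned embedding of the ad category (assumed input)
    w1:      (C + d, C // r) squeeze weights
    w2:      (C // r, C) excitation weights
    """
    pooled = feat.mean(axis=(1, 2))         # global average pool -> (C,)
    z = np.concatenate([pooled, cat_emb])   # early-fuse the category prior
    hidden = np.maximum(z @ w1, 0.0)        # ReLU
    scale = sigmoid(hidden @ w2)            # per-channel gates in (0, 1)
    return feat * scale[:, None, None]      # recalibrated feature map
```

Because the gates lie in (0, 1), the module can only re-weight channels; the same idea carries over to the spatial attention by pooling over channels instead.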

4.7. Adaptability to Various Network Backbones

As mentioned, CSCNN can easily be adopted to any network backbone by replacing the input to the next layer from the raw feature map to the refined feature map. We now test the adaptability of CSCNN to Inception V1 (Szegedy et al., 2015); results are in Table 6.

CSCNN achieves consistent improvement over CBAM on both CNN-F and Inception V1, which validates our adaptability to different backbones. This improvement also reveals another interesting fact: even for complicated networks (deeper than CNN-F), there is still substantial room for improvement due to the absence of category-specific prior knowledge. This again supports our main motivation.
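The plug-in nature of the module can be sketched as a generic wrapper over any backbone exposed as a sequence of conv blocks: after each of the last few blocks, the feature map is replaced by the refined (attended) map before it feeds the next layer. The names and call signatures below are illustrative assumptions.

```python
def apply_cscnn(blocks, attention, x, cat_emb, last_n):
    """Run a backbone given as a list of conv blocks, inserting a
    category-aware attention after each of the last `last_n` blocks.
    `attention(x, cat_emb)` stands in for the CSCNN module (an assumption).
    """
    n = len(blocks)
    for i, block in enumerate(blocks):
        x = block(x)
        if i >= n - last_n:
            # refined feature map replaces the raw one as next-layer input
            x = attention(x, cat_emb)
    return x
```

This is why the same module applies to CNN-F and Inception V1 alike: nothing inside a block changes, only the tensor handed to the next block.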

                    CNN-F   Inception
No Attention  All   0.7491  0.7747
              Cold  0.6985  0.7259
CBAM          All   0.7556  0.7794
              Cold  0.7075  0.7267
CSCNN         All   0.7749  0.7852
              Cold  0.7315  0.7386
Table 6. Adaptability to Different Backbones (Amazon Men). We observe consistent improvement on CNN-F and Inception V1 over both the self-attention CBAM and the no-attention base, validating our adaptability.
Field    # Feats  # Vocab  Feature Example
Ad       14       20M      ad id, category, item price, review
User     6        400M     user pin, location, price sensitivity
Time     3        62       weekday, hour, date
Text     13       40M      query, ad title, query-title match
History  14       7 bil.   visited brands, categories, shops

#Users 0.2 bil.   #Items 0.02 bil.   #Interactions 15 bil.   #Categories 3310

Table 7. Real Production Dataset Statistics. "Bil." and "Feats" are short for billion and features. Besides the features listed, we also perform manual feature interaction, making the total # features = 95.
                       Offline AUC
DCN                    0.7441
DCN + CNN fixed        0.7463 (+0.0022)
DCN + CNN finetune     0.7500 (+0.0059)
DCN + CBAM finetune    0.7506 (+0.0065)
DCN + CSCNN            0.7527 (+0.0086)

Online A/B Test   CTR Gain   CPC Gain   eCPM Gain
DCN               0          0          0
DCN + CSCNN       3.22%      -0.62%     2.46%

Table 8. Experiments on Real Production Dataset.

4.8. Effects of Hyper-Parameters

We introduced three hyper-parameters in CSCNN: the size of the category prior embedding for channel-wise attention, the size of the category prior embedding for spatial attention, and the number of layers with the CSCNN module; namely, we add CSCNN to the last layers of CNN-F, where the number of such layers is a hyper-parameter. We examine their effects in Fig. 5.

When the category prior embeddings are small, larger embedding sizes result in higher AUC. This is because larger embeddings carry more detailed information about the category, which further supports the significance of exploiting category prior knowledge in building the CNN. But an embedding that is too large suffers from overfitting. We find the optimal setting on this dataset (see Fig. 5); note that this setting is restricted to this dataset, and for datasets with more categories we conjecture a larger optimal embedding size.

Increasing the number of layers equipped with CSCNN benefits performance: the more layers that use CSCNN, the better. Furthermore, the continuous growth as more layers are added indicates that the AUC gains from CSCNN at different layers are complementary. However, adding CSCNN to conv1 harms the feature extractor. This is because conv1 learns general, low-level features such as lines, circles, and corners, which need no attention and are naturally independent of category.

4.9. Experiments On Real Production Dataset & Online A/B Testing

Our real production dataset is collected from ad interaction logs at JD. We use logs from the first 32 days for training and sample 0.5 million interactions from the 33rd day for testing. The statistics of our ten-billion scale real production dataset are shown in Table 7.

We present our experimental results in Table 8. The performance gain from the fixed CNN validates the importance of visual features, and the gain from finetuning validates the importance of our end-to-end training system. The additional contribution of CBAM demonstrates that emphasizing meaningful features with an attention mechanism is beneficial. Our CSCNN goes one step further by early incorporating the category prior knowledge into the convolutional layers, which enables easier inter-channel and inter-spatial dependency learning. Note that on the real production data, a 1‰ increase in offline AUC is significant and brings a 6 million dollar lift in the overall advertising income of JD.

From 2019-Feb-19 to 2019-Feb-26, online A/B testing was conducted in the ranking system of JD. Compared to the previous DCN model online, CSCNN contributes a 3.22% CTR (Click Through Rate) gain and a 2.46% eCPM (Effective Cost Per Mille) gain. Furthermore, CSCNN reduces CPC (Cost Per Click) by 0.62%.

5. Discussion & Conclusion

Apart from the category, what other features could also be adopted to lift the effectiveness of CNN in CTR prediction?

To meet the low-latency requirements of online systems, the CNN must be computed offline, so dynamic features including the user and query should not be used. Price, sales, and praise ratio are also inappropriate due to their lack of visual priors. Visually related item features, including brand, shop id, and product words, could be adopted; effectively exploiting them to further lift the effectiveness of CNN in CTR prediction would be a promising direction.
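The offline-computation constraint above can be sketched as a simple precomputation step: every static, item-side input goes through the trained network once, and online serving reduces to a dictionary lookup. `embed_fn` below stands in for the trained CSCNN and is an assumption; the key point is that it may depend only on static item features, never on the user or query.

```python
def build_visual_cache(items, embed_fn):
    """Offline precomputation of visual embeddings (a sketch).

    items:    iterable of (item_id, image, category) tuples
    embed_fn: trained visual model taking only static item-side inputs
    Returns a dict so that online serving is a constant-time lookup.
    """
    return {item_id: embed_fn(image, category)
            for item_id, image, category in items}
```

Any feature fed to `embed_fn` must be known before the request arrives, which is exactly why user and query features are ruled out.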

We proposed Category-specific CNN, specially designed for visual-aware CTR prediction in e-commerce. Our early-fusion architecture enables category-specific feature recalibration and emphasizes features that are both important and category-related, which contributes to significant performance gains in CTR prediction tasks. With the help of a highly efficient infrastructure, CSCNN has now been deployed in the search advertising system of JD, serving the main traffic of hundreds of millions of active users.


  • K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv preprint arXiv:1405.3531. Cited by: §4.1.
  • J. Chen, B. Sun, H. Li, H. Lu, and X. Hua (2016) Deep ctr prediction in display advertising. In Proceedings of the 24th ACM international conference on Multimedia, pp. 811–820. Cited by: §1, §2.1, §3.4.1.
  • L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5659–5667. Cited by: §2.2, 7th item.
  • H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10. Cited by: §2.1.
  • Z. Gao, J. Xie, Q. Wang, and P. Li (2019) Global second-order pooling convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3024–3033. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016a) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.3.
  • R. He, C. Lin, J. Wang, and J. McAuley (2016b) Sherlock: sparse hierarchical embeddings for visually-aware one-class collaborative filtering. arXiv preprint arXiv:1604.05813. Cited by: §1, §2.1, 5th item, §4.2.
  • R. He and J. McAuley (2016) VBPR: visual bayesian personalized ranking from implicit feedback. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: Appendix A, §2.1, 2nd item, §4.1, §4.1, §4.2, footnote 1, footnote 3.
  • X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. (2014) Practical lessons from predicting clicks on ads at facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, pp. 1–9. Cited by: §2.1.
  • J. Hu, L. Shen, S. Albanie, G. Sun, and A. Vedaldi (2018a) Gather-excite: exploiting feature context in convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 9401–9411. Cited by: §2.2.
  • J. Hu, L. Shen, and G. Sun (2018b) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §1, §2.2, §2.2, §4.6.
  • W. Kang, C. Fang, Z. Wang, and J. McAuley (2017) Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE International Conference on Data Mining (ICDM), pp. 207–216. Cited by: Appendix A, §2.1, 3rd item.
  • S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang (2017) Person search with natural language description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1970–1979. Cited by: §2.2.
  • Q. Liu, S. Wu, and L. Wang (2017) DeepStyle: learning user preferences for visual recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 841–844. Cited by: §1, §2.1, 6th item.
  • J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. Cited by: §4.2.
  • H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. (2013) Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1222–1230. Cited by: §2.1.
  • K. Mo, B. Liu, L. Xiao, Y. Li, and J. Jiang (2015) Image feature learning for cold start problem in display advertising. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §1.
  • S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme (2009) BPR: bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. Cited by: Appendix A, 1st item.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §4.7.
  • J. Wang, S. C. Hoi, P. Zhao, and Z. Liu (2013) Online multi-task collaborative filtering for on-the-fly recommender systems. In Proceedings of the 7th ACM conference on Recommender systems, pp. 237–244. Cited by: §2.1.
  • R. Wang, B. Fu, G. Fu, and M. Wang (2017) Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17, pp. 12. Cited by: §2.1, §3.2.1.
  • S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §2.2, §2.2, §4.6.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.2.
  • X. Yang, T. Deng, W. Tan, X. Tao, J. Zhang, S. Qin, and Z. Ding (2019) Learning compositional, visual and relational representations for ctr prediction in sponsored search. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2851–2859. Cited by: Appendix B, §2.1.
  • Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 21–29. Cited by: §2.2.
  • Z. Zhao, L. Li, B. Zhang, M. Wang, Y. Jiang, L. Xu, F. Wang, and W. Ma (2019) What you look matters? offline evaluation of advertising creatives for cold-start problem. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 2605–2613. Cited by: §2.1.

Appendix A Additional Experimental Results

In this appendix, we provide additional experimental results that support our claims and confirm consistency with results reported in related works.

In all our previous ablation studies (Tables 4, 5 and 6) and in training for our online serving system, we used a point-wise loss; namely, each impression is used as an independent training instance for binary classification. Although this is the default setting for the CTR prediction problem, we find that some existing works, also tested on Amazon, use a pair-wise loss (e.g., Rendle et al., 2009; He and McAuley, 2016; Kang et al., 2017).

Specifically, a model with pair-wise loss is trained on a dataset of triplets $(u, i, j)$, where a triplet indicates that user $u$ prefers ad $i$ to ad $j$. Following Bayesian Personalized Ranking (BPR), the objective function is defined as

$$\max_{\Theta} \sum_{(u,i,j)} \ln \sigma\big(\hat{x}_{ui} - \hat{x}_{uj}\big) - \lambda \|\Theta\|^2$$

where $\sigma$ is the sigmoid function, $\hat{x}_{ui}$ is the predicted score of user $u$ for ad $i$, and $\lambda \|\Theta\|^2$ is the regularization on all parameters $\Theta$.
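For reference, a minimal numerical sketch of the (negated) BPR objective, assuming per-pair scores are already computed; the function and parameter names are illustrative, not part of our system.

```python
import numpy as np

def bpr_loss(x_pos, x_neg, params, lam=0.01):
    """Negated pair-wise BPR objective (lower is better).

    x_pos:  scores of the preferred ads, one per triplet
    x_neg:  scores of the non-preferred ads
    params: list of parameter arrays for L2 regularization
    """
    diff = x_pos - x_neg
    log_lik = np.log(1.0 / (1.0 + np.exp(-diff))).sum()  # sum ln sigma(x_ui - x_uj)
    reg = lam * sum(np.sum(p ** 2) for p in params)
    return -log_lik + reg
```

A wider score margin between the preferred and non-preferred ad strictly lowers the loss, which is what drives the pair-wise ranking behavior.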

To make direct comparison with existing results of pair-wise loss on the same dataset, here we re-do our ablation studies using pair-wise loss, with all other settings identical to that in Table 4, 5 and 6. The results are shown in Table 9, 10 and 11.

From these results, we can draw several observations. First, comparing our results in Tables 9, 10 and 11 with those reported in related works, we confirm that our re-implementation of the pair-wise loss and the compared algorithms is comparable to, and sometimes better than, the original literature. This validates our consistency and repeatability. Second, our CSCNN framework outperforms all compared algorithms in Tables 9, 10 and 11, validating its advantages. This superiority indicates that all our previous claims on point-wise loss still hold for pair-wise loss based models: plugging the CSCNN framework into various self-attention mechanisms and network backbones brings consistent improvement. Third, when comparing performance across pair-wise and point-wise losses (Table 9 vs. 4, Table 10 vs. 5 and Table 11 vs. 6), we find that neither loss enjoys absolute superiority over the other in terms of AUC. However, AUC only measures the relative preference between ads, not the absolute scale of the predicted score. In the practical advertising industry, the point-wise loss is usually preferred since the scale of the score trained by binary classification directly reflects the click through rate.
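The remark that AUC is insensitive to the absolute scale of the scores can be seen directly from its pair-counting definition: any strictly increasing transform of the scores leaves AUC unchanged, even though it destroys calibration. A small illustrative sketch (not from the paper):

```python
import numpy as np

def auc(labels, scores):
    """Pair-counting AUC: fraction of (positive, negative) pairs where
    the positive is scored higher; ties count half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

Applying, say, `scores * 10 + 1` changes every predicted value but not a single pair ordering, so AUC is identical, whereas a point-wise binary classifier's output scale would no longer match the click-through rate.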

Appendix B Statistics of Our Real Production Datasets

In this appendix, we show some specific statistics of our real production dataset; see Figure 6. Most of the features are extremely sparse, e.g., 80% of the queries have appeared fewer than 6 times in the training set, and follow a long-tail distribution. User pin and price are relatively evenly distributed, but still 10% of the feature values cover 50% of the training set. As claimed in earlier studies (Yang et al., 2019), visual features contribute more when other features are extremely sparse. These statistics illustrate the difficulty of modeling on the real production dataset and the contribution of our methods.

Figure 6. Feature statistics from the search advertising system of JD, from 2020-01-06 to 2020-02-06 (32 days). For each feature, values are sorted in descending order of frequency.