A Hybrid Bandit Model with Visual Priors for Creative Ranking in Display Advertising

02/08/2021 ∙ by Shiyao Wang, et al. ∙ USTC 9

Creatives play an important role in e-commerce for exhibiting products. Sellers usually create multiple creatives for comprehensive demonstrations, so it is crucial to display the most appealing design to maximize the Click-Through Rate (CTR). For this purpose, modern recommender systems dynamically rank creatives when a product is proposed for a user. However, this task suffers from a more severe cold-start problem than conventional product recommendation. In this paper, we propose a hybrid bandit model with visual priors which first makes predictions with a visual evaluation, and then naturally evolves to focus on the specialities through the hybrid bandit model. Our contributions are three-fold: 1) We present a visual-aware ranking model (called VAM) that incorporates a list-wise ranking loss for ordering the creatives according to their visual appearance. 2) Regarding visual evaluations as a prior, the hybrid bandit model (called HBM) is proposed to evolve consistently and make better posterior estimations by taking more observations into consideration in online scenarios. 3) A first large-scale creative dataset, CreativeRanking, is constructed, which contains over 1.7M creatives of 500k products as well as their real impression and click data. Extensive experiments have also been conducted on both our dataset and the public Mushroom dataset, demonstrating the effectiveness of the proposed method.




1. Introduction

Online display advertising is a rapidly growing business and has become an important source of revenue for Internet service providers. The advertisements are delivered to customers through various online channels, e.g. e-commerce platforms. Image ads are the most widely used format since they are more compact, intuitive and comprehensible (Chen et al., 2016). In Figure 1, each row comprises several ad images that describe the same product for comprehensive demonstrations. These images are called creatives. Although the creatives represent the same product, they may have largely different CTRs due to their visual appearance. Thus it is crucial to display the most appealing design to attract potentially interested customers and maximize the Click-Through Rate (CTR).

Figure 1. Some examples of ad creatives. Each row presents creatives that display the product in multiple ways. The corresponding CTRs at the bottom row indicate the large CTR gap among creatives.

In order to explore the most appealing creative, all of the candidates should be displayed to customers. Meanwhile, to ensure the overall performance of advertising, we prefer to display the creative that has the highest predicted CTR so far. This procedure can be modeled as a typical multi-armed bandit (MAB) problem. It not only focuses on maximizing cumulative rewards (clicks) but also balances the exploration-exploitation (E&E) trade-off under a limited exploration budget, so that the overall CTR is taken into account. Epsilon-greedy (François-Lavet et al., 2018), Thompson sampling (Russo and Roy, 2014) and Upper Confidence Bounds (UCB) approaches (Auer et al., 2002) are widely used strategies to deal with the bandit problem. However, creatives potentially change more frequently than products, and most of them cannot obtain sufficient impression opportunities to get reliable CTRs throughout their lifetime. So conventional bandit models may suffer from the cold-start problem in the initial random exploration period, which severely hurts the online performance. One potential solution to this problem is incorporating visual prior knowledge to facilitate a better exploration. Prior works (Azimi et al., 2012; Cheng et al., 2012; Mo et al., 2015; Chen et al., 2016) consider the visual features extracted by deep convolutional networks and make deterministic selections for product recommendation. These deep models are computationally heavy and cannot be flexibly updated online. Besides, the deterministic and greedy strategy may result in suboptimal solutions due to the lack of exploration. Consequently, how to combine expressive visual representations with a flexible bandit model remains a challenging problem.

In this paper, we propose an elegant method which incorporates visual prior knowledge into a bandit model to facilitate better exploration. It is based on a framework called NeuralLinear (Riquelme et al., 2018), which considers approximate Bayesian neural networks in a Thompson sampling framework to utilize both the learning ability of neural networks and the posterior sampling method. By adopting this general framework, we first present a novel convolutional network with a list-wise ranking loss function to select the most attractive creative. The ranking loss concentrates on capturing the visual patterns related to attractiveness, and the learned representations are treated as contextual information for the bandit model. Second, in terms of the bandit model, we make two major improvements: 1) Instead of randomly setting the prior hyperparameters of the candidate arms, we use the weights of the neural network to initialize the bandit parameters, which further enhances the performance in the cold-start phase. 2) To fit industrial-scale data, we extend the linear regression model of NeuralLinear to a hybrid model which adopts two individual sets of parameters, i.e. product-wise ones and creative-specific ones. The two components are adaptively combined during the exploration period. Last but not least, because creative ranking is a novel problem, it lacks real-world data for further study and comparison. To this end, we contribute a large-scale creative dataset¹ from the Alibaba display advertising platform that comprises more than 500k products and 1.7M ad creatives.

¹ The data and code are publicly available at https://github.com/alimama-creative/A_Hybrid_Bandit_Model_with_Visual_Priors_for_Creative_Ranking.git

In summary, the contributions of this paper include:

- We present a visual-aware ranking model (called VAM) that is capable of evaluating new creatives according to the visual appearance.

- Regarding the learned visual predictions as a prior, the improved hybrid bandit model (called HBM) is proposed to make better posteriori estimations by taking more observations into consideration.

- We construct a novel large-scale creative dataset named CreativeRanking¹. Extensive experiments have been conducted on both our dataset and the public Mushroom dataset, demonstrating the effectiveness of the proposed method.

2. Preliminaries and Related Work

2.1. Preliminaries

Problem Statement. Given a product, the goal is to determine which creative is the most attractive one and should be displayed. Meanwhile, we need to estimate the uncertainty of the predictions so as to maximize the cumulative reward in the long run.

In the online advertising system, when an ad is shown to a user by displaying a candidate creative, this scenario is counted as an impression. Suppose there are $N$ products, denoted as $\{I^1, \dots, I^n, \dots, I^N\}$, and each product $I^n$ comprises a group of $M$ creatives, indicated as $\{C^n_1, \dots, C^n_m, \dots, C^n_M\}$. For product $I^n$, the objective is to find the creative that satisfies:

$$C^n = \arg\max_{c \in \{C^n_1, \dots, C^n_M\}} \mathrm{CTR}(c) \tag{1}$$

where $\mathrm{CTR}(\cdot)$ denotes the CTR of a given creative. An empirical way to estimate the CTR is to accumulate the current clicks and impressions, and compute the click ratio as:

$$\widehat{\mathrm{CTR}}(C^n_m) = \frac{\mathrm{click}(C^n_m)}{\mathrm{impression}(C^n_m)} \tag{2}$$

where $\mathrm{click}(\cdot)$ and $\mathrm{impression}(\cdot)$ indicate the click and impression numbers of the creative $C^n_m$. But this estimate may suffer from insufficient impressions, especially for cold-start creatives. Another way is to learn a prediction function $\mathcal{N}(\cdot)$ from all the historical data by considering the contextual information (i.e. the image content) as:

$$\widehat{\mathrm{CTR}}(C^n_m) = \mathcal{N}(C^n_m) \tag{3}$$

where $\mathcal{N}(\cdot)$ takes the image content of creative $C^n_m$ as input, and learns from the historical data. The sequential data can be represented as

$$\mathcal{D} = \{(C_1, y_1), \dots, (C_t, y_t), \dots, (C_{|\mathcal{D}|}, y_{|\mathcal{D}|})\} \tag{4}$$

where $y_t \in \{0, 1\}$ is the label that denotes whether a click is received. We take both the statistical data and the content information into consideration. Subsection 2.2 reviews some product recommendation methods that take visual content as auxiliary information, and subsection 2.3 introduces typical bandit models to estimate uncertainty. Both kinds of methods serve as our strong baselines.
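The empirical click ratio and the selection rule above can be illustrated with a minimal Python sketch. The function names, the example counts, and the choice to return 0.0 for a creative without impressions are assumptions made for illustration, not part of the original method.

```python
def empirical_ctr(clicks, impressions):
    """Click ratio; returns 0.0 for a creative with no impressions (assumption)."""
    return clicks / impressions if impressions > 0 else 0.0


def best_creative(stats):
    """Return the creative id with the highest empirical CTR.

    stats maps a creative id to a (clicks, impressions) pair.
    """
    return max(stats, key=lambda cid: empirical_ctr(*stats[cid]))


# Hypothetical counts for three creatives of one product.
stats = {"c1": (12, 400), "c2": (30, 500), "c3": (1, 1)}
winner = best_creative(stats)
```

With these counts the creative that received a single impression and a single click is selected, which illustrates why the plain click ratio is unreliable for cold-start creatives.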

2.2. Visual-aware Recommendation Methods

CTR prediction of image ads is a core task of online display advertising systems. Owing to recent advances in computer vision, visual features are employed to further enhance recommendation models (Azimi et al., 2012; Cheng et al., 2012; Mo et al., 2015; Chen et al., 2016; Wang et al., 2018; Ge et al., 2018; Yu et al., 2018; Capelo et al., 2019; Liu et al., 2020). Some works (Azimi et al., 2012; Cheng et al., 2012) quantitatively study the relationship between handcrafted visual features and creative online performance. Different from fixed handcrafted features, other works (Ge et al., 2018; Yu et al., 2018; Capelo et al., 2019) apply "off-the-shelf" visual features extracted by deep convolutional neural networks (Simonyan and Zisserman, 2015). Further works (Mo et al., 2015; Chen et al., 2016; Wang et al., 2018) extend these methods by training the CNNs in an end-to-end manner. Liu et al. (2020) integrate category information on top of the CNN embedding to help visual modeling. The above works focus on improving product ranking by considering visual information while neglecting the great potential of creative ranking. Few works so far address this topic. idealo.de (a portal of the German e-commerce market) adopts an aesthetic model (Esfandarani and Milanfar, 2018) to select the most attractive image for each recommended hotel. They believe that photos can be just as important for bookings as reviews. PEAC (Zhao et al., 2019) resembles our method the most, as it aims to rank ad creatives based on their visual content. But it is an offline evaluation model that cannot flexibly update the ranking strategy when receiving online observations. Besides, none of the above methods models uncertainty, so they may lack exploration ability.

2.3. Multi-armed Bandit Methods

The multi-armed bandit (MAB) problem is a typical sequential decision-making process that is also treated as an online decision-making problem (Yang et al., 2020). A wide range of real-world applications can be modeled as MAB problems, such as online recommendation systems (Glowacka, 2019), online advertising (Schwartz et al., 2017) and information retrieval (Glowacka, 2017). Epsilon-greedy (François-Lavet et al., 2018), Thompson sampling (Russo and Roy, 2014) and UCB (Auer et al., 2002) are classic context-free algorithms. They use the reward/cost from the environment to update their E&E policy without contextual information. It is difficult for such models to quickly adjust to new creatives (arms) since web content undergoes frequent changes. Other works (Li et al., 2010; Agrawal and Goyal, 2013; Riquelme et al., 2018) extend these context-free methods by considering side information like user/content representations. They assume that the expected payoff of an arm is linear in its features. The main problem linear algorithms face is their lack of representational power, which they complement with accurate uncertainty estimates. A natural attempt at getting the best of both representation learning ability and accurate uncertainty estimation consists in placing a linear payoff model on top of a neural network. NeuralLinear (Riquelme et al., 2018) presents a Thompson sampling based framework that simultaneously learns a data representation through neural networks and quantifies the uncertainty through Bayesian linear regression. Inspired by this framework, we further improve both the neural network and the bandit method to benefit our creative ranking problem.

3. Dataset Construction

In order to promote further research and comparison on creative ranking, we contribute a large-scale creative dataset to the research community. It comprises creative images and sequential impression data which can be used for evaluating both visual predictions and E&E strategies. In this section, we first describe how the creatives and online feedback are collected in subsection 3.1. Then we provide a statistical analysis of the dataset in subsection 3.2.

3.1. Data Collection

Figure 2. Statistical Analysis of the Creative Ranking dataset. (a) summarizes some basic information, while (b) shows the number of products in terms of product categories. (c) conducts CTR analysis by comparing poor and good creatives.

We collect a large and diverse set of creatives from the Alibaba display advertising platform from July 1, 2020 to August 1, 2020. The total number of impressions is approximately 215 million. There are 500,827 products with 1,707,733 ad creatives. We will make this dataset publicly available for further research and comparison. The creative and online feedback collection is subject to the following constraints:

Randomized logging policy. The online system adopts a randomized logging policy so that the creatives are randomly drawn to collect an unbiased dataset. Bandit algorithms learn policies through interaction data. Training or evaluation on offline data may suffer from exposure bias, known as the "off-policy evaluation problem" (Precup, 2000). Li et al. (2010) demonstrate that if the logging policy chooses each arm uniformly at random, the estimation of bandit algorithms is unbiased. Thus, for each impression of a product, the policy randomly chooses a candidate creative and records whether it is clicked.

Aligned creative lifetime. Due to the complexity of the online environment, CTRs vary across time periods, even for the same creative. In practice, creatives are newly designed or deleted, which results in inconsistent exposure times (as in Figure 3(a)). In order to avoid the noise brought by the different time intervals, we only collect the overlap period among the candidate creatives (see Figure 3(b)). Besides, the overlap should be within 5 to 14 days, which covers the creative lifetime from the cold-start stage to a relatively stable stage. All the filtered creatives are gathered to build the sequential data.
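The lifetime alignment can be sketched as an interval intersection. The following Python sketch is illustrative only: the function names and example dates are hypothetical, and counting the overlap in inclusive days is an assumption, since the text does not state how the 5 to 14 day window is measured.

```python
from datetime import date


def overlap_period(lifetimes):
    """Intersect (start, end) date ranges; return None if they do not overlap."""
    start = max(s for s, _ in lifetimes)
    end = min(e for _, e in lifetimes)
    return (start, end) if start <= end else None


def keep_product(lifetimes, min_days=5, max_days=14):
    """Keep a product only if its creatives overlap for 5 to 14 days."""
    period = overlap_period(lifetimes)
    if period is None:
        return False
    days = (period[1] - period[0]).days + 1  # inclusive day count (assumption)
    return min_days <= days <= max_days


# Hypothetical lifetimes of three creatives of one product.
lifetimes = [
    (date(2020, 7, 1), date(2020, 7, 20)),
    (date(2020, 7, 5), date(2020, 7, 12)),
    (date(2020, 7, 3), date(2020, 7, 31)),
]
period = overlap_period(lifetimes)
kept = keep_product(lifetimes)
```

Here the three creatives overlap from July 5 to July 12, so only impressions inside that window would enter the sequential data.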

Figure 3. Aligned creative lifetime.

Train/Validation/Test split. We randomly split the 500,827 products into 300,242 training, 100,240 validation and 100,345 test samples, with 1,026,378/340,449/340,906 creatives respectively. We treat each product as a sample, and aim to select the best creative among its candidates. The proposed VAM is learned from the training set, while the bandit model HBM is deployed on the validation/test data. This setting is used to verify the effectiveness of visual predictions on unseen products/creatives, and whether the policy can make better posterior estimations by using online observations.

3.2. Statistical Analysis

The proposed dataset is collected from ad interaction logs across 32 days. Figure 2(a) gives a summary of our CreativeRanking dataset. It consists of 500,827 products, covering 124 categories. The minimum and maximum numbers of candidate creatives for a product are 3 and 11, while the average number is 3.4. In fact, the number of candidates in real-world scenarios far exceeds 3.4, but the offline dataset is constrained by the conditions introduced in subsection 3.1. Figure 2(b) shows the number of products for the top 20 categories, namely Women's tops, Women's bottoms, Men's, Women's shoes, Furniture, and so on. In Figure 2(c), we further analyze the creatives of these categories. Supposing we know the CTR of each creative, we select the poorest and best creatives for each product, and accumulate their overall performance, which is visualized as the grey and (grey+blue+orange) bins. We find that the CTR of a product can be lifted substantially by selecting a good creative. Specifically, a good creative is capable of lifting CTR by 148% to 285% compared to the poorest candidates, and by 41.5% to 72.5% compared to the averaged performance of all candidates (grey+blue bins).

By proposing this CreativeRanking dataset, we would like to draw more attention to this topic which benefits both the research community and website’s user experience.

Figure 4. (Better viewed in color) The overall framework of the proposed Hybrid Bandit Model with Visual Priors. It receives several candidate creatives (shown in one column on the left) and tries to find the most attractive one through both the Visual-aware Ranking Model (VAM) and the Hybrid Bandit Model (HBM). (a) VAM develops a CNN model that can evaluate creatives based on their visual content. (b) According to the visual priors, HBM aims to estimate the posterior and correct the ranking strategy.

4. Method

4.1. Overview

We briefly overview the entire pipeline. The main notations used in this paper are summarized in the right panel of Figure 4. First, as shown in Figure 4(a), the feature extraction network $\mathcal{N}_{feat}$ simultaneously receives the $M$ creatives of the $n$-th product as input, and produces the $d$-dimensional intermediate features $\{f^n_1, \dots, f^n_M\}$. Then, a fully connected layer is employed to calculate the scores for them, indicated as $\{s^n_1, \dots, s^n_M\}$.

Second, the list-wise ranking loss and the auxiliary regression loss are introduced to guide the learning procedure. Such a multi-objective optimization helps the model not only focus on creative ranking, but also take into account the numerical range of CTR, which is beneficial for the subsequent bandit model. In addition, since data noise is a common problem in real-world applications, we provide several practical solutions to mitigate casual and malicious noise. Details are described in Subsection 4.2.

After the above steps, the model can evaluate the quality of a creative directly from its visual content, even for a newly uploaded one without any history information. Then we propose a hybrid bandit model that incorporates the learned features $f^n_m$ as contextual information, and updates the policy by interacting with online observations. As shown in Figure 4(b), the hybrid model combines both product-wise and creative-specific predictions, which is more flexible for complex industrial data. The detailed formulations are given in Subsection 4.3.

4.2. VAM: Visual-aware Ranking Model

Given a product $I^n$, we use the feature extraction network $\mathcal{N}_{feat}$ to extract high-level visual representations $f^n_m$ of its creatives. A linear layer is then adopted to produce the attractiveness score of the $m$-th creative of the $n$-th product:

$$s^n_m = {f^n_m}^{\top} w \tag{6}$$

where $w$ are the learnable parameters of the linear layer.

List-wise Ranking Loss. To learn the relative order of creatives, we need to map a list of predicted scores and ground-truth CTRs to permutation probability distributions, respectively, and then take a metric between these distributions as the loss function. The mapping strategy and evaluation metric should guarantee that candidates with higher scores are ranked higher. Cao et al. (2007) proposed the permutation probability and top-$k$ probability definitions. Inspired by this work, we simplify the probability of a creative being ranked at the top-1 position as

$$p(s^n_m) = \frac{\exp(s^n_m)}{\sum_{i=1}^{M} \exp(s^n_i)} \tag{7}$$

where $\exp(\cdot)$ is an exponential function. The exponential function based top-1 probability is both scale invariant and translation invariant. The corresponding labels are

$$p(\widehat{\mathrm{CTR}}(C^n_m)) = \frac{\exp(\widehat{\mathrm{CTR}}(C^n_m), T)}{\sum_{i=1}^{M} \exp(\widehat{\mathrm{CTR}}(C^n_i), T)} \tag{8}$$

where $\exp(\cdot, T)$ is an exponential function with temperature $T$. Since $\widehat{\mathrm{CTR}}$ is only about a few percent, we use $T$ to adjust the scale of the value so that the probability of the top-1 sample is close to 1. With cross entropy as the metric, the loss for product $I^n$ becomes

$$L^n_{rank} = -\sum_{m=1}^{M} p(\widehat{\mathrm{CTR}}(C^n_m)) \log p(s^n_m) \tag{9}$$

Through such an objective function, the model focuses on comparing the creatives within the same product. We concentrate on the top-1 probability since it is consistent with real scenarios, which display only one creative for each impression. Besides, end-to-end training makes full use of the learning ability of deep CNNs and boosts the extraction of visual prior knowledge.

Point-wise Auxiliary Regression Loss. In addition to the list-wise ranking loss, we expect the point-wise regression to force the model to produce more accurate predictions. The ranking loss function only constrains the order of the outputs, leaving their numerical scale unconstrained. Since the learned representations will be adopted as prior knowledge for the bandit model in Subsection 4.3, making the outputs close to the real CTRs will significantly stabilize the bandit learning procedure. Thus we add the point-wise regression as a regularizer. The formulation is

$$L^n_{reg} = \sum_{m=1}^{M} \left\| \widehat{\mathrm{CTR}}(C^n_m) - s^n_m \right\|_2 \tag{10}$$

where $\|\cdot\|_2$ denotes the $L_2$ norm. Finally, we add up the ranking loss and the auxiliary loss to form the final loss:

$$L^n = L^n_{rank} + \gamma L^n_{reg} \tag{11}$$

where $\gamma$ is 0.5 in our experiments.
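The two loss terms can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the temperature is applied by multiplying the CTR before the exponential, its default value of 100 is hypothetical, and the regression term is read as the sum of per-creative absolute differences (the $L_2$ norm of a scalar). Only the weight 0.5 comes from the text.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_top1_loss(scores, ctrs, temperature=100.0):
    """Cross entropy between top-1 distributions of CTR labels and scores."""
    pred = softmax(scores)
    # Assumption: the temperature rescales the small CTR values.
    target = softmax([temperature * c for c in ctrs])
    return -sum(t * math.log(p) for t, p in zip(target, pred))


def regression_loss(scores, ctrs):
    """Point-wise term pulling each score towards its CTR."""
    return sum(abs(c - s) for s, c in zip(scores, ctrs))


def total_loss(scores, ctrs, reg_weight=0.5, temperature=100.0):
    """Ranking loss plus the weighted auxiliary regression loss."""
    rank = listwise_top1_loss(scores, ctrs, temperature)
    return rank + reg_weight * regression_loss(scores, ctrs)


# Hypothetical CTRs of three creatives and two candidate score lists.
ctrs = [0.05, 0.02, 0.01]
aligned = listwise_top1_loss([2.0, 0.0, -1.0], ctrs)
misordered = listwise_top1_loss([-1.0, 0.0, 2.0], ctrs)
```

Scores that agree with the CTR ordering give a smaller ranking loss than scores in the opposite order, while the regression term is zero only when the scores match the CTR values themselves.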

Noise Mitigation. In both list-wise ranking and point-wise regression, $\widehat{\mathrm{CTR}}$ should be estimated as the ground-truth target. But in real-world datasets, some creatives do not have sufficient impression opportunities, and the estimation may suffer from huge variance. For example, if a creative gets only one impression, and a click is accidentally recorded from this impression, its $\widehat{\mathrm{CTR}}$ will be set to 1, which is clearly unreliable. To mitigate the problem, we provide two practical solutions, namely label smoothing and weighted sampling.

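Label smoothing is an empirical Bayes method that is used to smooth the CTR estimation (Wang et al., 2011). Suppose the clicks follow a binomial distribution and the CTR follows a Beta prior distribution:

$$\mathrm{click}(C^n_m) \sim \mathrm{Binomial}\big(\mathrm{impression}(C^n_m), \mathrm{CTR}(C^n_m)\big), \qquad \mathrm{CTR}(C^n_m) \sim \mathrm{Beta}(\alpha, \beta) \tag{12}$$

where $\mathrm{Beta}(\alpha, \beta)$ can be regarded as the prior distribution of CTRs. After observing more clicks, the conjugacy between the Binomial and Beta distributions allows us to obtain the posterior distribution and the smoothed $\widehat{\mathrm{CTR}}$ as

$$\widehat{\mathrm{CTR}}(C^n_m) = \frac{\mathrm{click}(C^n_m) + \alpha}{\mathrm{impression}(C^n_m) + \alpha + \beta} \tag{13}$$

where $\alpha$ and $\beta$ can be obtained by maximum likelihood estimation over all the historical data (Wang et al., 2011). Compared to the original estimate, the smoothed $\widehat{\mathrm{CTR}}$ has smaller variance and benefits the training.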
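The effect of the Beta-Binomial smoothing can be seen in a short Python sketch. The prior parameters below are hypothetical placeholders (a prior mean of 2%); in the text they are estimated by maximum likelihood from historical data.

```python
def smoothed_ctr(clicks, impressions, alpha, beta):
    """Posterior mean of the CTR under a Beta(alpha, beta) prior."""
    return (clicks + alpha) / (impressions + alpha + beta)


# Hypothetical prior: alpha / (alpha + beta) = 2% average CTR.
ALPHA, BETA = 2.0, 98.0

# One impression with one accidental click: the raw ratio is 1.0.
raw = 1 / 1
smoothed = smoothed_ctr(1, 1, ALPHA, BETA)

# With many impressions the estimate approaches the raw click ratio.
well_observed = smoothed_ctr(500, 10000, ALPHA, BETA)
```

The single-impression creative is pulled close to the prior mean instead of being labeled with a CTR of 1, whereas a creative with 10,000 impressions keeps an estimate near its observed 5%.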

Weighted sampling is a sampling strategy for the training process. Instead of treating each product equally, we pay more attention to the products whose impressions are adequate and whose CTRs are therefore more reliable. The sampling weights are produced by

$$p^n = \frac{g(\mathrm{impression}(I^n))}{\sum_{i=1}^{N} g(\mathrm{impression}(I^i))} \tag{14}$$

where $g(\cdot)$ is set to the logarithm of the impressions and $p^n$ denotes the sampling weight of product $I^n$.
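A minimal Python sketch of the log-impression sampling weights follows. The impression counts, the fixed seed, and the use of `random.choices` to draw a batch are illustrative assumptions; the sketch also assumes every product has more than one impression so that all logarithms are positive.

```python
import math
import random


def sampling_weights(impressions):
    """Normalize log-impression counts into sampling probabilities."""
    logs = [math.log(n) for n in impressions]  # assumes each count > 1
    total = sum(logs)
    return [value / total for value in logs]


# Hypothetical impression counts of three products.
impressions = [100, 1000, 10000]
weights = sampling_weights(impressions)

# Draw product indices for one training batch according to the weights.
rng = random.Random(0)
batch = rng.choices(range(len(impressions)), weights=weights, k=8)
```

Because the weights grow with the logarithm of the impressions, a product with 100 times more impressions is sampled only twice as often in this example, which favors reliable products without ignoring the rest.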

All above modules are integrated in a unified framework and the visual-aware ranking model focuses on learning the general visual patterns about display performance. And then the informative representations are applied as prior knowledge for the bandit algorithm.

4.3. HBM: Hybrid Bandit Model

In this section, we provide an elegant and efficient strategy that tackles the E&E dilemma by utilizing the visual priors and updating the posterior via the hybrid bandit model. Based on the NeuralLinear framework (Riquelme et al., 2018), we build a Bayesian linear regression on the extracted visual representations. Assume the online feedback data is generated as follows:

$$y = f^{\top} \tilde{w} + \epsilon \tag{15}$$

where $y \in \{0, 1\}$ represents clicked/non-clicked data and $f$ is the visual representation extracted by VAM. Different from the deterministic weights in Equation 6, we need to learn a weight distribution $\tilde{w}$ whose uncertainty benefits the E&E decision making. The noise terms $\epsilon$ are independent and identically normally distributed random variables:

$$\epsilon \sim \mathcal{N}(0, \sigma^2) \tag{16}$$
According to Bayes' theorem, if the prior distribution of $\tilde{w}$ and $\sigma^2$ is conjugate to the data's likelihood function, the posterior probability distributions can be derived analytically. Thompson Sampling, also known as Posterior Sampling, is then able to tackle the E&E dilemma by maintaining the posterior over models and selecting creatives in proportion to the probability that they are optimal. We model the prior joint distribution of $\tilde{w}$ and $\sigma^2$ as:

$$\pi(\tilde{w}, \sigma^2) = \pi(\tilde{w} \mid \sigma^2)\, \pi(\sigma^2), \qquad \sigma^2 \sim \mathrm{IG}(a_0, b_0), \qquad \tilde{w} \mid \sigma^2 \sim \mathcal{N}(\mu_0, \sigma^2 \Lambda_0^{-1}) \tag{17}$$

where $\pi(\sigma^2)$ is an Inverse Gamma distribution with prior hyperparameters $a_0$ and $b_0$, and $\pi(\tilde{w} \mid \sigma^2)$ is a Gaussian distribution with initial parameters $\mu_0$ and $\Lambda_0$. Note that $\mu_0$ is initialized as the learned weights $w$ of VAM in Equation 6. This provides better prior hyperparameters that further enhance the performance in the cold-start phase. We call it VAM-Warmup and the results are shown in Figure 5(b).

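Because we have chosen a conjugate prior, the posterior at time $t$ can be derived as

$$\begin{aligned}
\Lambda(t) &= X^{\top} X + \Lambda_0, \qquad \Sigma(t) = \Lambda(t)^{-1}, \\
\mu(t) &= \Sigma(t) \left( \Lambda_0 \mu_0 + X^{\top} Y \right), \\
a(t) &= a_0 + t/2, \\
b(t) &= b_0 + \tfrac{1}{2} \left( Y^{\top} Y + \mu_0^{\top} \Lambda_0 \mu_0 - \mu(t)^{\top} \Lambda(t)\, \mu(t) \right)
\end{aligned} \tag{18}$$

where $X$ is a matrix that contains the content features of the previous impressions and $Y$ is the vector of feedback rewards. After updating the above parameters at the $t$-th impression, we obtain the weight distributions with uncertainty estimation. We draw the weights $\tilde{w}$ from the learned distribution and select the best creative for product $I^n$ as

$$C^n = \arg\max_{c \in \{C^n_1, \dots, C^n_M\}} f_c^{\top} \tilde{w}$$

where $f_c$ is the visual representation of creative $c$.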
The above model makes the weight distributions shared by all the products. This simple linear assumption works well for small datasets, but becomes inferior when dealing with industrial data. For example, bright and vivid colors will be more attractive for women’s top while concise colors are more proper for 3C digital accessories. In addition to this product-wise characteristic, a creative may contain a unique designed attribute that is not expressed by the shared weights. Hence, it is helpful to have weights that have both shared and non-shared components.
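The shared-weight Bayesian linear regression and its Thompson sampling step, as described before the hybrid extension, can be sketched with NumPy. This is a minimal sketch of the standard Normal-Inverse-Gamma conjugate update, not the authors' code: the class name, the hyperparameter defaults, and the synthetic data are hypothetical, and the demo uses a zero prior mean where the text would plug in the VAM weights.

```python
import numpy as np


class BayesianLinearRegression:
    """Linear regression with a Normal-Inverse-Gamma conjugate prior."""

    def __init__(self, mu0, prior_precision=1.0, a0=2.0, b0=2.0):
        # mu0 would be the VAM weights in the text; the defaults are placeholders.
        self.mu0 = np.asarray(mu0, dtype=float)
        self.prec0 = prior_precision * np.eye(self.mu0.size)
        self.a0, self.b0 = a0, b0
        self.mu = self.mu0.copy()
        self.cov = np.linalg.inv(self.prec0)
        self.a, self.b = a0, b0

    def update(self, X, Y):
        """Recompute the posterior from all features X and rewards Y so far."""
        X = np.asarray(X, dtype=float)
        Y = np.asarray(Y, dtype=float)
        prec = X.T @ X + self.prec0
        self.cov = np.linalg.inv(prec)
        self.mu = self.cov @ (self.prec0 @ self.mu0 + X.T @ Y)
        self.a = self.a0 + len(Y) / 2.0
        self.b = self.b0 + 0.5 * (
            Y @ Y + self.mu0 @ self.prec0 @ self.mu0 - self.mu @ prec @ self.mu
        )

    def sample_weights(self, rng):
        """Draw the noise variance, then the weights given that variance."""
        sigma2 = self.b / rng.gamma(self.a)  # Inverse-Gamma(a, b) draw
        return rng.multivariate_normal(self.mu, sigma2 * self.cov)


def select_creative(features, weights):
    """Index of the creative with the highest sampled linear score."""
    return int(np.argmax(np.asarray(features, dtype=float) @ weights))


# Synthetic check: recover known weights from noisy linear rewards.
rng = np.random.default_rng(0)
true_w = np.array([0.5, -0.2, 0.1])
X = rng.normal(size=(200, 3))
Y = X @ true_w + 0.01 * rng.normal(size=200)
model = BayesianLinearRegression(mu0=np.zeros(3))
model.update(X, Y)
choice = select_creative(np.eye(3), model.sample_weights(rng))
```

With few observations the sampled weights stay close to the prior mean and vary widely, which drives exploration; as observations accumulate, the posterior covariance shrinks and the selection becomes nearly greedy.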

We extend Equation 15 to the following hybrid model by combining product-wise and creative-specific linear terms. For creative $C^n_m$, it can be formulated as

$$y^n_m = {f^n_m}^{\top} w^n + {f^n_m}^{\top} w^n_m$$

where $w^n$ and $w^n_m$ are product-wise and creative-specific parameters, and they are disjointly optimized by Equation 18. Furthermore, we propose a fusion strategy to adaptively combine these two terms instead of the simple addition:

$$y^n_m = (1 - \rho)\, {f^n_m}^{\top} w^n + \rho\, {f^n_m}^{\top} w^n_m$$

where the fusion weight $\rho$ is a sigmoid function of the impression count with offset and rescale. We find that if the impressions are inadequate, the product-wise parameters are learned better because they make use of the knowledge shared among the candidate creatives. Otherwise, the creative-specific term outperforms the shared one thanks to the sufficient feedback observations. The above procedure is shown in Algorithm 1. Because our hybrid model updates the parameters of each product independently, we take product $I^n$ as an example and adopt $w^n$ and $w^n_m$ to represent the shared and specific parameters.

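Input: prior hyperparameters, product $I^n$, visual representations $\{f^n_1, \dots, f^n_M\}$ of candidate creatives
1 Initialize the shared parameters and the creative-specific parameters;
2 ;
3 ;
4 for $t = 1, \dots, T$ do
5       ;
6       for $m = 1, \dots, M$ do
7             Sample $\sigma^2$ from the Inverse Gamma posterior of product $I^n$;
8             Sample $w^n$ from the Gaussian posterior of product $I^n$;
9             Sample $\sigma_m^2$ from the Inverse Gamma posterior of creative $C^n_m$;
10             Sample $w^n_m$ from the Gaussian posterior of creative $C^n_m$;
11             $y^n_m = (1 - \rho)\, {f^n_m}^{\top} w^n + \rho\, {f^n_m}^{\top} w^n_m$;
13       end for
14      $c = \arg\max_m y^n_m$;
15       Display the creative $c$, and get the reward;
16       Update the shared parameters by the historical data of product $I^n$ and Equation 18;
17       Update the creative-specific parameters by the historical data of creative $c$ and Equation 18;
18       Set the other parameters of time $t+1$ to be the same as at the previous time $t$;
19       ;
21 end for
Algorithm 1 Hybrid Bandit Model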

The distributions describe the uncertainty in the weights, which is related to the number of impressions: if there is little data, the model relies more on the visual evaluation results; otherwise, the likelihood reduces the effect of the prior so that the estimate converges to the observed data. In order to fit complex industrial data, we extend the shared linear model to the hybrid version, which considers both product-level knowledge and creative-specific information, fused by empirical attention weights.
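The fusion of the product-wise and creative-specific terms can be sketched as below. The exact sigmoid parametrization is not given in the text, so the `offset` and `scale` values, the function names, and the choice to let the weight of the creative-specific term grow with the impression count are assumptions made for illustration.

```python
import math


def fusion_weight(impressions, offset=500.0, scale=100.0):
    """Sigmoid of the impression count with a hypothetical offset and rescale."""
    return 1.0 / (1.0 + math.exp(-(impressions - offset) / scale))


def dot(u, v):
    """Inner product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))


def hybrid_score(feature, w_product, w_creative, impressions):
    """Blend the shared and the creative-specific linear predictions."""
    rho = fusion_weight(impressions)
    return (1.0 - rho) * dot(feature, w_product) + rho * dot(feature, w_creative)


# Hypothetical sampled weights for one creative with a 2-d feature.
feature = [1.0, 0.0]
w_product, w_creative = [1.0, 0.0], [3.0, 0.0]
cold = hybrid_score(feature, w_product, w_creative, impressions=0)
balanced = hybrid_score(feature, w_product, w_creative, impressions=500)
warm = hybrid_score(feature, w_product, w_creative, impressions=1000)
```

Under these assumptions a cold-start creative is scored almost entirely by the product-wise weights, and the creative-specific weights take over once enough feedback has been collected.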

5. Experiments

5.1. Dataset preparation and Evaluation Metrics

Dataset Preparation. The description of the CreativeRanking data is presented in Section 3.1. We split the data into 300,242 training, 100,240 validation and 100,345 test products, respectively. The original images, product categories and impression rewards for each creative are provided in the order of display. For VAM, we aggregate the numbers of impressions and clicks to produce $\widehat{\mathrm{CTR}}$ by Equation 13 on the training set, and train the VAM using the loss function in Equation 11. For HBM, we update the policy by providing the visual representations extracted by VAM and the impression data as in Equation 4. Note that the interaction and policy updating procedure of HBM (see Algorithm 1) is conducted on the test set to simulate the online situation. We record the sequential interactions and rewards to measure the performance (see Algorithm 2 and Equation 21). The validation set is used for hyperparameter tuning.

In addition to our CreativeRanking data, we also evaluate the methods on a public dataset, called Mushroom. Since there is no public dataset for creative ranking yet, we test the proposed hybrid bandit model on this dataset. The Mushroom Dataset (Schlimmer, 1981) contains 22 attributes for each mushroom, and two categories: poisonous and safe. Eating a safe mushroom yields a reward of $+5$, while eating a poisonous mushroom delivers a reward of $+5$ with probability $1/2$ and a reward of $-35$ otherwise. Not eating provides no reward. We follow the protocols in (Riquelme et al., 2018), and interact for 50000 rounds.
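The Mushroom reward scheme can be written as a small Python function. The concrete reward values (+5 for a safe mushroom, +5 or -35 with equal probability for a poisonous one, 0 for not eating) are an assumption taken from the commonly used Mushroom bandit protocol that the text says it follows; the function name and seed are hypothetical.

```python
import random


def mushroom_reward(eat, poisonous, rng):
    """Reward of one round of the Mushroom bandit (assumed standard values)."""
    if not eat:
        return 0  # not eating gives no reward
    if not poisonous:
        return 5  # eating a safe mushroom
    # Eating a poisonous mushroom: lucky half of the time, costly otherwise.
    return 5 if rng.random() < 0.5 else -35


rng = random.Random(0)
rewards = [mushroom_reward(True, True, rng) for _ in range(1000)]
```

Under these values the expected reward of eating a poisonous mushroom is negative, so the optimal policy eats only safe mushrooms, and regret measures how often a policy deviates from that choice.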

Evaluation Metrics. For the CreativeRanking data, we present two evaluation metrics to measure the performance, named simulated CTR (sCTR) and cumulative regret (Regret), respectively.

Simulated CTR (sCTR) is a practical metric which is quite close to the online performance. The details are shown in Algorithm 2. It replays the recorded impression data for all products. For each product, the policy plays $T$ rounds by receiving the recorded data $(C, y)$, and selects the best creative according to the predicted scores. If the selected creative is the same as the recorded creative $C$, the impression number, click number and the policy itself are updated (see line 3 to 14 in Algorithm 2).

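Input: impression data $\mathcal{D}$, policy $\pi$
1 impressions $\leftarrow 0$;
2 clicks $\leftarrow 0$;
3 for $n = 1, \dots, N$ do
4        for $t = 1, \dots, T$ do
5               Get next impression (C, y);
6               Get predicted scores by policy $\pi$;
7               $c \leftarrow$ the creative with the highest predicted score;
8               if $c = C$ then
9                      impressions $\leftarrow$ impressions $+ 1$;
10                      clicks $\leftarrow$ clicks $+ y$;
11                      update policy $\pi$ by data (C, y);
13               end if
15        end for
17 end for
Algorithm 2 Evaluation Metrics - sCTR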

Taking HBM as an example, Algorithm 1 shows the online update process. To test HBM on offline data, we can change the action "display and update" (line 14 to 18 in Algorithm 1) to the conditioned version in line 8 to 12 of Algorithm 2.
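The replay procedure can be sketched in Python as follows. The `GreedyPolicy` class is a hypothetical stand-in for any policy with `select` and `update` methods, the logged events are invented for illustration, and returning the ratio of clicks to matched impressions as the sCTR is an assumption based on the metric's name.

```python
class GreedyPolicy:
    """Toy policy: pick the creative with the highest empirical CTR so far."""

    def __init__(self):
        self.stats = {}  # creative id -> [clicks, impressions]

    def _ctr(self, creative):
        clicks, impressions = self.stats.get(creative, (0, 0))
        return clicks / impressions if impressions else 0.0

    def select(self, candidates):
        # Ties are broken in favor of the first candidate.
        return max(candidates, key=self._ctr)

    def update(self, creative, reward):
        entry = self.stats.setdefault(creative, [0, 0])
        entry[0] += reward
        entry[1] += 1


def replay_sctr(logged, policy):
    """Replay logged (candidates, shown, reward) events against a policy."""
    impressions = clicks = 0
    for candidates, shown, reward in logged:
        if policy.select(candidates) == shown:
            # Only matching events count and are revealed to the policy.
            impressions += 1
            clicks += reward
            policy.update(shown, reward)
    return clicks / impressions if impressions else 0.0


# Hypothetical log of one product with two creatives.
logged = [
    (["a", "b"], "a", 1),
    (["a", "b"], "b", 0),
    (["a", "b"], "a", 0),
    (["a", "b"], "a", 1),
]
sctr = replay_sctr(logged, GreedyPolicy())
```

Events where the policy disagrees with the logged creative are skipped, which is what keeps the offline estimate unbiased when the log was collected by a uniformly random policy.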

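Cumulative regret (Regret) is commonly used for evaluating bandit models. It is defined as

$$\mathrm{Regret} = \mathbb{E}[r^{*}] - \mathbb{E}[r]$$

where $r^{*}$ is the cumulative reward of the optimal policy, i.e., the policy that always selects the action with the highest expected reward given the context (Riquelme et al., 2018), and $r$ is the cumulative reward of the evaluated policy. Specifically, we select the optimal creative of each product in our dataset, and calculate the Regret as the gap between the reward obtained by always displaying the optimal creative and the reward obtained by the evaluated policy, where the sCTR should be produced by Algorithm 2 first. The optimal creative is selected by calculating $\widehat{\mathrm{CTR}}$ in Equation 2 on the test set.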

For Mushroom, we follow the definition of cumulative regret in (Riquelme et al., 2018) to evaluate the models.

5.2. Implementation details

The model was implemented with PyTorch (Paszke et al., 2017). We adopt a deep residual network (ResNet-18) (He et al., 2016) pretrained on ImageNet classification (Deng et al., 2009) as the backbone, and the model is fine-tuned on the creative ranking task. For VAM, we use stochastic gradient descent (SGD) with a mini-batch of 64 per GPU. The learning rate is initially set to 0.01 and then gradually decreased. The training process lasts 30 epochs on the datasets. For HBM, we extract the feature representations from VAM, and update the weight distributions of the product-wise and creative-specific parameters by using Bayesian regression.

5.3. Comparison with State-of-the-art Systems

In this subsection, we show the performance of the related methods in Table 1 and Figure 5. The methods are divided into several groups: a uniform strategy, context-free bandit models, linear bandit models, neural bandit models and our proposed methods. Table 1 presents the sCTR and Regret of all the above models on both the Mushroom and CreativeRanking datasets, and our methods (NN/VAM-HBM) exhibit state-of-the-art results compared to the related models. We also conduct further analysis by showing the reward tendency over 15 consecutive days in Figure 5. Daily sCTR evaluates the model for each day independently, showing the flexibility of the policy when interacting with the feedback. Cumulative sCTR presents the cumulative rewards up to a specific day, which is used to measure the overall performance.

Uniform: The baseline strategy that randomly selects an action (eat/not eat for Mushroom and one creative for CreativeRanking). Because this strategy has neither prior knowledge nor abilities of learning from the data, it gets poor performance on the test sets.

Context-free Bandit Models: Epsilon-greedy (François-Lavet et al., 2018), Thompson sampling (Russo and Roy, 2014) and Upper Confidence Bounds (UCB) approaches (Auer et al., 2002) are simple yet effective strategies for the bandit problem. They rely on historical impression data (click/non-click) and keep updating their strategies. However, in the cold-start stage they may choose a creative at random, much like the “Uniform” strategy (orange lines in Figure 5(c) in the first few days). We find that their curves rise gradually, but without prior information their overall performance is inferior to that of the other models.
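
To make the cold-start behaviour of a context-free policy concrete, here is a minimal sketch of Thompson sampling with a Beta posterior over each creative's click probability. It is a generic textbook variant, not the exact baseline configuration used in the experiments; the class name, the Beta(1, 1) prior and the simulated click rates are assumptions made for illustration.

```python
import random


class BetaThompson:
    """Context-free Thompson sampling with a Beta(1, 1) prior per creative."""

    def __init__(self, n_arms, seed=0):
        self.clicks = [0] * n_arms       # observed clicks per creative
        self.impressions = [0] * n_arms  # observed impressions per creative
        self.rng = random.Random(seed)

    def select(self):
        # Draw one plausible CTR per creative from its posterior and
        # display the creative with the largest draw.
        samples = [
            self.rng.betavariate(1 + c, 1 + n - c)
            for c, n in zip(self.clicks, self.impressions)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, clicked):
        self.impressions[arm] += 1
        self.clicks[arm] += int(clicked)


def simulate(true_ctrs, rounds=5000, seed=0):
    """Run the policy against simulated users; returns impressions per creative."""
    bandit = BetaThompson(len(true_ctrs), seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, env.random() < true_ctrs[arm])
    return bandit.impressions


# Hypothetical click rates, exaggerated so the effect is easy to see.
print(simulate([0.1, 0.2, 0.6]))
```

With no observations every posterior is identical, so the first selections are effectively random, which is the cold-start weakness noted above; the policy concentrates on the best creative only after enough feedback has accumulated.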

Linear Bandit Models: The linear bandit model extends the context-free method by incorporating contextual information. For Mushroom, we adopt the 22 attributes that describe a mushroom, such as shape, color and so on. The Regret is reduced when this side information is included. For CreativeRanking, we use the color distribution (Azimi et al., 2012) to represent a creative, and update the linear payoff functions. From the results in Table 1(d), the linear models achieve better results than the context-free methods, but they still lack representational power.
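
The idea of a linear payoff function with an upper confidence bound can be sketched as follows. This is a simplified LinUCB-style model with a single weight vector shared by all creatives, written in plain Python; the class name, the `alpha` and `ridge` parameters and the one-hot toy features are assumptions for illustration and are not the configuration evaluated in Table 1. The inverse design matrix is maintained with the Sherman-Morrison rank-one update so that no explicit matrix inversion is needed.

```python
import math


class SharedLinUCB:
    """LinUCB-style bandit with one weight vector shared across creatives."""

    def __init__(self, dim, alpha=1.0, ridge=1.0):
        self.alpha = alpha  # exploration strength
        # Inverse of A = ridge * I + sum(x x^T), updated incrementally.
        self.a_inv = [
            [(1.0 / ridge if i == j else 0.0) for j in range(dim)]
            for i in range(dim)
        ]
        self.b = [0.0] * dim  # running sum of reward * x

    def _matvec(self, v):
        return [sum(r * vj for r, vj in zip(row, v)) for row in self.a_inv]

    def score(self, x):
        theta = self._matvec(self.b)  # ridge-regression estimate A^{-1} b
        a_inv_x = self._matvec(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(max(sum(xi * v for xi, v in zip(x, a_inv_x)), 0.0))
        return mean + self.alpha * width  # optimistic payoff estimate

    def select(self, features):
        scores = [self.score(x) for x in features]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, x, reward):
        # Sherman-Morrison: (A + x x^T)^{-1} = A^{-1} - (A^{-1}x)(A^{-1}x)^T / (1 + x^T A^{-1} x)
        a_inv_x = self._matvec(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


# Two creatives described by hypothetical 2-d feature vectors.
features = [[1.0, 0.0], [0.0, 1.0]]
bandit = SharedLinUCB(dim=2)
first = bandit.select(features)
bandit.update(features[first], reward=0.0)  # no click on the first pick
print(bandit.select(features))               # the unexplored creative is tried next
```

The confidence width shrinks for feature directions that have been observed often, so the policy shifts from exploration to exploitation as impressions accumulate.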

Neural Bandit Models: The neural bandit models add a linear regression on top of a neural network. In Table 1, “NN” denotes the fully connected layers that are used for extracting mushroom representations. For CreativeRanking, all these neural models use our VAM as the feature extractor and adopt different E&E policies. Figure 5(a) reveals some interesting observations: (1) The orange and blue lines represent E-Greedy and VAM-Greedy, respectively. With the visual priors, VAM-Greedy achieves much better performance at the beginning (about 5% CTR lift), which demonstrates the effectiveness of the visual evaluation. (2) Because VAM-Greedy is a greedy strategy that lacks exploration, it becomes mediocre in the long run. When the E&E model (HBM) is incorporated, our VAM-HBM outperforms the other baselines by a significant margin. Besides, we also use Dropout as a Bayesian approximation (Gal and Ghahramani, 2016), but it is not able to estimate the uncertainty as accurately as the other policies.

Our Methods: We propose VAM-Warmup, which initializes the parameters of the bandit model with the weights learned in VAM. By comparing the red and blue dashed lines in Figure 5(b), we find that initializing the parameters with prior distributions improves the overall CTR by 1.7%. In addition, we extend the model by adding creative-specific parameters, named VAM-HBM, which further enhances the model capacity and achieves the state-of-the-art result, especially once the impressions for creatives become adequate (see the solid red line in Figure 5(b)(c)(d)). For the Mushroom dataset, in order to demonstrate the idea, we cluster the data into 2 groups by the attribute “bruises”, each maintaining its individual parameters. When combining the individual and shared parameters by the fusion weights in Equation 21, the model reduces the Regret to 1.93. Note that we use the default hyperparameters provided by NeuralLinear without careful tuning.
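
The effect of starting the bandit from an informative prior can be illustrated with a one-dimensional conjugate Gaussian update. This is a deliberately simplified stand-in for the Bayesian regression used in HBM: a single scalar is estimated instead of a weight vector, the noise variance is assumed known, and all numbers are made up.

```python
def gaussian_posterior(prior_mean, prior_var, observations, noise_var):
    """Posterior mean and variance of a scalar under a Gaussian prior and likelihood."""
    prior_prec = 1.0 / prior_var
    post_prec = prior_prec + len(observations) / noise_var
    post_mean = (prior_prec * prior_mean + sum(observations) / noise_var) / post_prec
    return post_mean, 1.0 / post_prec


# Hypothetical prior: the visual model predicts 0.03 for this creative.
prior_mean, prior_var, noise_var = 0.03, 1e-4, 1e-2

# With no feedback, the estimate is exactly the prior.
print(gaussian_posterior(prior_mean, prior_var, [], noise_var))

# As feedback accumulates, the estimate moves toward the observed value 0.05.
for n in (10, 100, 1000):
    mean, var = gaussian_posterior(prior_mean, prior_var, [0.05] * n, noise_var)
    print(n, round(mean, 4), var)
```

The prior dominates while observations are scarce and is gradually overridden by the data, which is the qualitative behaviour that the warm-up initialization relies on.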

Method | Mushroom Regret (%) | CreativeRanking Regret (%) | CreativeRanking sCTR (%)
Uniform | 100 | 100 | 2.950
Context-free Bandit Models (orange lines)
E-Greedy (François-Lavet et al., 2018) | 52.99 | 87.22 | 3.166
Thompson Sampling (Russo and Roy, 2014) | 52.49 | 87.69 | 3.158
UCB (Auer et al., 2002) | 52.42 | 87.04 | 3.169
Linear Bandit Models (green lines)
LinGreedy (Riquelme et al., 2018) | 14.28 | 91.72 | 3.090
LinThompson (Riquelme et al., 2018) | 2.37 | 85.68 | 3.192
LinUCB (Li et al., 2010) | 10.27 | 85.50 | 3.195
Neural Bandit Models (blue lines)
NN/VAM-Greedy | 6.68 | 84.11 | 3.219
NN/VAM-Thompson (Riquelme et al., 2018) | 2.22 | 83.02 | 3.237
NN/VAM-UCB | 7.51 | 83.91 | 3.222
NN/VAM-Dropout (Gal and Ghahramani, 2016) | 5.57 | 84.32 | 3.215
Our Methods (red lines)
VAM-Warmup | - | 79.70 | 3.293
NN/VAM-HBM | 1.93 | 78.11 | 3.320
Table 1. Performance comparison with state-of-the-art systems on both the Mushroom and CreativeRanking test sets. Regret is normalized with respect to the performance of Uniform.
Figure 5. Reward tendency of consecutive 15 days on CreativeRanking.
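Methods | Base | (a) | (b) | (c) | (d)
Point-wise Loss? | - | ✓ | - | ✓ | ✓
List-wise Loss? | - | - | ✓ | ✓ | ✓
Noise Mitigation? | - | - | - | - | ✓
sCTR (%) | 2.950 | 3.140 | 3.167 | 3.194 | 3.219
Table 2. Ablation study for each component in the VAM. Experiments are performed on the CreativeRanking test set, and lift is calculated by (sCTR − sCTR_Base) / sCTR_Base.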

5.4. Ablation Study

In this subsection, we conduct an ablation study on the CreativeRanking dataset to validate the effectiveness of each component in the VAM, including the list-wise ranking loss, the point-wise auxiliary regression loss and noise mitigation. We also compare our VAM with “learning-to-rank” visual models (including aesthetic models). The results in Table 2 and Table 3 show consistent improvements.

Base in Table 2 stands for the baseline result. We adopt the “uniform” strategy that randomly chooses a creative among the candidates. The baseline sCTR is 2.950%.

Method (a) and (b): Methods (a) and (b) use the point-wise loss (Equation 10) and the list-wise loss (Equation 9) as the objective function, respectively. Although the model has never seen the products/creatives in the test set before, it has learned general patterns to identify more attractive creatives. Moreover, the ranking loss concentrates on learning the top-1 probability, which is more suitable for our scenario than the point-wise objective. The simple version (b) already improves the sCTR from the 2.950% baseline to 3.167%.

Method (c): Method (c) combines the point-wise auxiliary regression loss with the ranking objective. It learns not only the relative order of creative quality but also the absolute CTRs. We find it is good at fitting the real CTR distributions and achieves a better sCTR of 3.194% (8.3% lift).

Method (d): Method (d) adds label smoothing and a weighted sampler, both of which are designed to mitigate label noise. The weighted sampler makes the model pay more attention to samples with sufficient impressions, while label smoothing aims to improve label reliability. These two practical methods further improve the sCTR to 3.219%, a 9.1% lift in total.
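
The lift figures quoted above can be checked with a few lines of arithmetic. The sketch assumes that lift is the relative improvement of sCTR over the Base column of Table 2; this assumption reproduces the 8.3% and 9.1% values stated in the text for methods (c) and (d).

```python
def lift(sctr, base_sctr):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (sctr - base_sctr) / base_sctr


BASE = 2.950  # sCTR (%) of the uniform strategy, from Table 2
for name, sctr in [("(a)", 3.140), ("(b)", 3.167), ("(c)", 3.194), ("(d)", 3.219)]:
    print(f"{name}: {lift(sctr, BASE):.1f}% lift")
```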

Ranking Loss sCTR (%)
Pairwise Hinge Loss (Chandakkar et al., 2017) 3.170
Aesthetics Ranking Loss (Kong et al., 2016) 3.167
Triplet Loss (Schwarz et al., 2018) 3.115
Pairwise (Zhao et al., 2019) 3.188
VAM (Ours) 3.219
Table 3. Comparison with other “learn-to-rank” visual models. All above models adopt ResNet-18 as backbone.

Related Loss Functions: Pair-wise and triplet losses are typical loss functions for learning-to-rank problems. Chandakkar et al. (2017), Kong et al. (2016) and Schwarz et al. (2018) adopt a hinge loss, which is used for “maximum-margin” classification between the better candidate and the other one. It only requires the better creative to score higher than the other one by a pre-defined margin, without considering the exact difference. Our loss functions in Equations 9 and 10 estimate the CTR gaps and produce more accurate differences. Zhao et al. (2019) employ RankNet (Burges et al., 2005) as their pair-wise framework. In contrast to pair-wise learning, we treat one product as a training sample and use its list of creatives as instances. This is more efficient and better matches real scenarios, in which the best creative is displayed for each impression. Thus, our method obtains the leading sCTR.
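
The difference between a margin-based pairwise objective and a list-wise top-1 objective can be sketched as below. The list-wise function is a ListNet-style top-1 cross-entropy and may differ in detail from Equation 9; in particular, the `temperature` used to turn CTRs into a target distribution is a hypothetical parameter introduced only for this illustration.

```python
import math


def pairwise_hinge_loss(score_better, score_worse, margin=1.0):
    """Zero as soon as the better creative outscores the other by `margin`."""
    return max(0.0, margin - (score_better - score_worse))


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_top1_loss(scores, ctrs, temperature=0.01):
    """Cross-entropy between top-1 distributions induced by CTRs and by scores."""
    target = softmax([c / temperature for c in ctrs])
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))


# The hinge loss cannot tell a gap of 2 from a gap of 10: both give zero loss.
print(pairwise_hinge_loss(3.0, 1.0), pairwise_hinge_loss(11.0, 1.0))

# The list-wise loss depends on how well the whole score list matches the CTRs.
ctrs = [0.05, 0.03, 0.01]  # made-up CTRs of three creatives of one product
print(listwise_top1_loss([5.0, 3.0, 1.0], ctrs))  # scores agree with the CTR order
print(listwise_top1_loss([1.0, 3.0, 5.0], ctrs))  # scores reverse the CTR order
```

Because the list-wise target carries the relative magnitudes of the CTRs, the loss keeps providing a training signal after the ordering is already correct, whereas the hinge loss saturates once the margin is met.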

In summary, the proposed list-wise method enables the model to focus on learning creative quality and obtains better generalizability. Incorporating point-wise regression and noise mitigation techniques enhances the model's capacity to fit real-world data.

Figure 6. Two typical cases that present the changing of strategies. The horizontal axis shows different creatives while the vertical axis is the probability of being displayed for creatives. “Proper priors” indicates that VAM provides right predictions and “Incorrect priors” otherwise.

5.5. Hyperparameter Settings

Weight of the point-wise loss in Equation 11. We tune hyperparameters on the validation set. The weight in Equation 11 controls the contribution of the point-wise auxiliary loss. According to the validation results (see Table 4), we set it to 0.5, which gives the best validation sCTR. This is consistent with our hypothesis that the ranking loss should play a more important role in the creative ranking task.

Weight in Equation 11 | 0.0 | 0.1 | 0.5 | 1.0 | 2.0
Validation sCTR (%) | 3.15 | 3.15 | 3.17 | 3.16 | 3.13
Test sCTR (%) | 3.17 | 3.19 | 3.22 | 3.18 | 3.15
Table 4. Validation/test sCTR with different weights of the point-wise auxiliary loss in Equation 11.
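Slope \ Offset | 125 | 150 | 175
30 | 3.27%(3.32%) | 3.28%(3.33%) | 3.28%(3.31%)
50 | 3.27%(3.31%) | 3.28%(3.32%) | 3.27%(3.32%)
100 | 3.27%(3.31%) | 3.27%(3.31%) | 3.27%(3.31%)
Table 5. x%(x%) denotes the validation (test) sCTR for different slope and offset hyperparameters of the fusion weight in Equation 21.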

Hyperparameters of the fusion weight in Equation 21. Two hyperparameters control the slope and offset of the fusion weight in Equation 21. The optimal values vary across real-world platforms (e.g., the offset is set to 150, around the mean impression number of each creative). We find that the final performance is not sensitive to these hyperparameters (see Table 5).
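
To give an intuition for what a slope and an offset control, the sketch below uses a logistic function of the impression count as the fusion weight. The exact form of Equation 21 is not reproduced here, so the logistic shape, the function names and the `scale` parameter (playing the role of the slope hyperparameter) are assumptions made purely for illustration.

```python
import math


def fusion_weight(impressions, offset=150.0, scale=50.0):
    """Assumed logistic weight: near 0 for few impressions, near 1 for many."""
    return 1.0 / (1.0 + math.exp(-(impressions - offset) / scale))


def fuse(shared_estimate, specific_estimate, impressions, offset=150.0, scale=50.0):
    """Blend a shared estimate with a creative-specific one."""
    w = fusion_weight(impressions, offset, scale)
    return (1.0 - w) * shared_estimate + w * specific_estimate


# Hypothetical estimates: the shared model predicts 0.03, the creative-specific one 0.05.
for n in (0, 150, 600):
    print(n, round(fusion_weight(n), 3), round(fuse(0.03, 0.05, n), 4))
```

Under this assumed form the offset marks the impression count at which both estimates are weighted equally, and the scale determines how quickly the creative-specific estimate takes over.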

5.6. Case Study

Figure 7. Visualization of the learned VAM. The model pays attention to different regions adaptively, including products, models and the text on the creative.

Strategy Visualization. We show two typical cases that exhibit how the strategies change. We believe that the best creative should have the largest display probability among the candidates. If this expectation is satisfied, a blue bar is shown; otherwise, orange bars are shown. Figure 6(a) shows a case in which the HBM receives a proper prior: it grants most impression opportunities to creative C-5 from the first day, while the other two methods spend 2 days finding the best creative. For the case in Figure 6(b), which receives an incorrect prior, the HBM adjusts its decision by considering the online feedback. The interactions help to revise the prior knowledge and fit the real-world feedback. From this comparison, we find that the HBM makes good use of visual priors and adjusts flexibly according to the feedback signals.

CNN Visualization. Besides ranking performance, we would like to gain further insight into the learned VAM. To this end, we visualize the response of our VAM according to the activations on the high-level feature maps; the resulting visualization is shown in Figure 7. By learning from creative ranking, the CNN pays attention to different regions adaptively, including products, models and the text on the creative. As shown in the second row of Figure 7, the VAM draws higher attention to the models than to the products. This may be because products endorsed by celebrities are more attractive than products displayed on their own. Besides, some textual information, such as descriptions and discount information, can also attract customers.

6. Conclusions

In this paper, we propose a hybrid bandit model with visual priors. To the best of our knowledge, this is the first work that formulates creative ranking as an E&E problem with visual priors. The VAM adopts a list-wise ranking loss function for ordering creatives by quality based only on their contents. In addition to the ability of visual evaluation, we extend the model, called HBM, so that it is updated when receiving feedback from online scenarios. Last but not least, we construct and release a novel large-scale creative dataset named CreativeRanking. We would like to draw more attention to this topic, which benefits both the research community and the user experience of websites. We carried out extensive experiments, including performance comparison, ablation study and case study, demonstrating the solid improvements of the proposed model.


References

  • S. Agrawal and N. Goyal (2013) Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, JMLR Workshop and Conference Proceedings, Vol. 28, pp. 127–135.
  • P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47 (2-3), pp. 235–256.
  • J. Azimi, R. Zhang, Y. Zhou, V. Navalpakkam, J. Mao, and X. Z. Fern (2012) The impact of visual appearance on user response in online display advertising. In Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, April 16-20, 2012 (Companion Volume), A. Mille, F. L. Gandon, J. Misselis, M. Rabinovich, and S. Staab (Eds.), pp. 457–458.
  • C. J. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. N. Hullender (2005) Learning to rank using gradient descent. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7-11, 2005, L. D. Raedt and S. Wrobel (Eds.), ACM International Conference Proceeding Series, Vol. 119, pp. 89–96.
  • Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20-24, 2007, Z. Ghahramani (Ed.), ACM International Conference Proceeding Series, Vol. 227, pp. 129–136.
  • M. Capelo, K. Aggarwal, and P. Yadav (2019) Combining text and image data for product recommendability modeling. In 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, December 9-12, 2019, pp. 5992–5994.
  • P. S. Chandakkar, V. Gattupalli, and B. Li (2017) A computational approach to relative aesthetics. CoRR abs/1704.01248.
  • J. Chen, B. Sun, H. Li, H. Lu, and X. Hua (2016) Deep CTR prediction in display advertising. In Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, A. Hanjalic, C. Snoek, M. Worring, D. C. A. Bulterman, B. Huet, A. Kelliher, Y. Kompatsiaris, and J. Li (Eds.), pp. 811–820.
  • H. Cheng, R. van Zwol, J. Azimi, E. Manavoglu, R. Zhang, Y. Zhou, and V. Navalpakkam (2012) Multimedia features for click prediction of new ads in display advertising. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, Q. Yang, D. Agarwal, and J. Pei (Eds.), pp. 777–785.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp. 248–255.
  • H. T. Esfandarani and P. Milanfar (2018) NIMA: neural image assessment. IEEE Trans. Image Process. 27 (8), pp. 3998–4011.
  • V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, and J. Pineau (2018) An introduction to deep reinforcement learning. Found. Trends Mach. Learn. 11 (3-4), pp. 219–354.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1050–1059.
  • T. Ge, L. Zhao, G. Zhou, K. Chen, S. Liu, H. Yi, Z. Hu, B. Liu, P. Sun, H. Liu, et al. (2018) Image matters: visually modeling user behaviors using advanced model server. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2087–2095.
  • D. Glowacka (2017) Bandit algorithms in interactive information retrieval. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2017, Amsterdam, The Netherlands, October 1-4, 2017, J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.), pp. 327–328.
  • D. Glowacka (2019) Bandit algorithms in recommender systems. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 574–575.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • S. Kong, X. Shen, Z. L. Lin, R. Mech, and C. C. Fowlkes (2016) Photo aesthetics ranking network with attributes and content adaptation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Lecture Notes in Computer Science, Vol. 9905, pp. 662–679.
  • L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26-30, 2010, M. Rappa, P. Jones, J. Freire, and S. Chakrabarti (Eds.), pp. 661–670.
  • H. Liu, J. Lu, H. Yang, X. Zhao, S. Xu, H. Peng, Z. Zhang, W. Niu, X. Zhu, Y. Bao, et al. (2020) Category-specific CNN for visual-aware CTR prediction at JD.com. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2686–2696.
  • K. Mo, B. Liu, L. Xiao, Y. Li, and J. Jiang (2015) Image feature learning for cold start problem in display advertising. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Q. Yang and M. J. Wooldridge (Eds.), pp. 3728–3734.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch.
  • D. Precup (2000) Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80.
  • C. Riquelme, G. Tucker, and J. Snoek (2018) Deep Bayesian bandits showdown: an empirical comparison of Bayesian deep networks for Thompson sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
  • D. Russo and B. V. Roy (2014) Learning to optimize via posterior sampling. Math. Oper. Res. 39 (4), pp. 1221–1243.
  • J. Schlimmer (1981) Mushroom records drawn from the Audubon Society field guide to North American mushrooms. GH Lincoff (Pres), New York.
  • E. M. Schwartz, E. T. Bradlow, and P. S. Fader (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science 36 (4), pp. 500–522.
  • K. Schwarz, P. Wieschollek, and H. P. Lensch (2018) Will people like your image? Learning the aesthetic space. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2048–2057.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
  • X. Wang, W. Li, Y. Cui, R. Zhang, and J. Mao (2011) Click-through rate estimation for rare events in online advertising. In Online Multimedia Advertising: Techniques and Technologies, pp. 1–12.
  • Y. Wang, J. Xu, A. Wu, M. Li, Y. He, J. Hu, and W. P. Yan (2018) Telepath: understanding users from a human vision perspective in large-scale recommender systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • M. Yang, Q. Li, Z. Qin, and J. Ye (2020) Hierarchical adaptive contextual bandits for resource constraint based recommendation. In Proceedings of The Web Conference 2020, pp. 292–302.
  • W. Yu, H. Zhang, X. He, X. Chen, L. Xiong, and Z. Qin (2018) Aesthetic-based clothing recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, P. Champin, F. L. Gandon, M. Lalmas, and P. G. Ipeirotis (Eds.), pp. 649–658.
  • Z. Zhao, L. Li, B. Zhang, M. Wang, Y. Jiang, L. Xu, F. Wang, and W. Ma (2019) What you look matters?: offline evaluation of advertising creatives for cold-start problem. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3-7, 2019, W. Zhu, D. Tao, X. Cheng, P. Cui, E. A. Rundensteiner, D. Carmel, Q. He, and J. X. Yu (Eds.), pp. 2605–2613.