1. Introduction
Online display advertising is a rapidly growing business and has become an important source of revenue for Internet service providers. Advertisements are delivered to customers through various online channels, e.g. e-commerce platforms. Image ads are the most widely used format since they are more compact, intuitive and comprehensible (Chen et al., 2016). In Figure 1, each row comprises several ad images that describe the same product for comprehensive demonstration. These images are called creatives. Although the creatives represent the same product, their click-through rates (CTRs) may differ greatly due to their visual appearance. Thus it is crucial to display the most appealing design to attract potentially interested customers and maximize the Click-Through Rate (CTR).
In order to explore the most appealing creative, all of the candidates should be displayed to customers. Meanwhile, to ensure the overall performance of advertising, we prefer to display the creative that has the highest predicted CTR so far. This procedure can be modeled as a typical multi-armed bandit (MAB) problem. It not only focuses on maximizing cumulative rewards (clicks) but also balances the exploration-exploitation (E&E) trade-off within a limited exploration budget so that the CTR is taken into account. Epsilon-greedy (François-Lavet et al., 2018), Thompson sampling (Russo and Roy, 2014) and Upper Confidence Bounds (UCB) approaches (Auer et al., 2002) are widely used strategies to deal with the bandit problem. However, creatives potentially change more frequently than products, and most of them cannot receive enough impressions to obtain reliable CTRs throughout their lifetime. Conventional bandit models may therefore suffer from the cold-start problem in the initial random exploration period, severely hurting online performance. One potential solution to this problem is to incorporate visual prior knowledge to facilitate better exploration. (Azimi et al., 2012; Cheng et al., 2012; Mo et al., 2015; Chen et al., 2016) consider the visual features extracted by deep convolutional networks and make deterministic selections for product recommendation. These deep models are computationally heavy and cannot be flexibly updated online. Besides, a deterministic and greedy strategy may result in a suboptimal solution due to the lack of exploration. Consequently, how to combine expressive visual representations with a flexible bandit model remains a challenging problem.
In this paper, we propose an elegant method which incorporates visual prior knowledge into the bandit model to facilitate better exploration. It is based on a framework called NeuralLinear (Riquelme et al., 2018), which considers approximate Bayesian neural networks in a Thompson sampling framework to utilize both the learning ability of neural networks and the posterior sampling method. By adopting this general framework, we first present a novel convolutional network with a listwise ranking loss function to select the most attractive creative. The ranking loss concentrates on capturing the visual patterns related to attractiveness, and the learned representations are treated as contextual information for the bandit model. Second, in terms of the bandit model, we make two major improvements: 1) Instead of randomly setting the prior hyperparameters of candidate arms, we use the weights of the neural network to initialize the bandit parameters, which further enhances the performance in the cold-start phase. 2) To fit industrial-scale data, we extend the linear regression model of NeuralLinear to a hybrid model which adopts two individual sets of parameters, i.e. product-wise ones and creative-specific ones. The two components are adaptively combined during the exploration period. Last but not least, because creative ranking is a novel problem, it lacks real-world data for further study and comparison. To this end, we contribute a large-scale creative dataset^{1} from the Alibaba display advertising platform that comprises more than 500k products and 1.7M ad creatives.
^{1} The data and code are publicly available at https://github.com/alimamacreative/A_Hybrid_Bandit_Model_with_Visual_Priors_for_Creative_Ranking.git
In summary, the contributions of this paper include:
 We present a visual-aware ranking model (called VAM) that is capable of evaluating new creatives according to their visual appearance.
 Regarding the learned visual predictions as a prior, the improved hybrid bandit model (called HBM) is proposed to make better posterior estimations by taking more observations into consideration.
 We construct a novel large-scale creative dataset named CreativeRanking^{1}. Extensive experiments have been conducted on both our dataset and the public Mushroom dataset, demonstrating the effectiveness of the proposed method.
2. Preliminaries and Related Work
2.1. Preliminaries
Problem Statement. Given a product, the goal is to determine which creative is the most attractive and should be displayed. Meanwhile, we need to estimate the uncertainty of the predictions so as to maximize the cumulative reward in the long run.
In the online advertising system, when an ad is shown to a user by displaying a candidate creative, this scenario is counted as an impression. Suppose there are $N$ products, denoted as $\{I^{1}, I^{2}, \dots, I^{N}\}$, and each product $I^{n}$ comprises a group of creatives, indicated as $\{C^{n}_{1}, C^{n}_{2}, \dots, C^{n}_{M}\}$. For product $I^{n}$, the objective is to find the creative that satisfies:
$$C^{n} = \arg\max_{c \in \{C^{n}_{1}, \dots, C^{n}_{M}\}} \mathrm{CTR}(c) \tag{1}$$
where $\mathrm{CTR}(\cdot)$ denotes the CTR of a given creative. An empirical way to produce the CTR is to accumulate the current clicks and impressions and compute the click ratio as:
$$\hat{\mathrm{CTR}}(C^{n}_{m}) = \frac{\mathrm{click}(C^{n}_{m})}{\mathrm{impression}(C^{n}_{m})} \tag{2}$$
where $\mathrm{click}(\cdot)$ and $\mathrm{impression}(\cdot)$ indicate the click and impression numbers of the creative $C^{n}_{m}$. But this estimate may suffer from insufficient impressions, especially for cold-start creatives. Another way is to learn a prediction function $\mathcal{N}(\cdot)$ from all the historical data by considering the contextual information (i.e. the image content) as:
$$\hat{\mathrm{CTR}}(C^{n}_{m}) = \mathcal{N}(C^{n}_{m}) \tag{3}$$
where $\mathcal{N}(\cdot)$ takes the image content of the creative as input, and learns from the historical data. The sequential data can be represented as
$$\mathcal{D} = \{(C_{1}, y_{1}), \dots, (C_{t}, y_{t}), \dots\} \tag{4}$$
where $y_{t} \in \{0, 1\}$ is the label that denotes whether a click is received. We take both the statistical data and the content information into consideration. Subsection 2.2 reviews some product recommendation methods that take visual content as auxiliary information, and subsection 2.3 introduces typical bandit models that estimate uncertainty. Both kinds of methods serve as our strong baselines.
2.2. Visual-aware Recommendation Methods
CTR prediction of image ads is a core task of online display advertising systems. Owing to recent advances in computer vision, visual features are employed to further enhance recommendation models (Azimi et al., 2012; Cheng et al., 2012; Mo et al., 2015; Chen et al., 2016; Wang et al., 2018; Ge et al., 2018; Yu et al., 2018; Capelo et al., 2019; Liu et al., 2020). (Azimi et al., 2012; Cheng et al., 2012) quantitatively study the relationship between handcrafted visual features and creative online performance. Different from fixed handcrafted features, (Ge et al., 2018; Yu et al., 2018; Capelo et al., 2019) apply “off-the-shelf” visual features extracted by deep convolutional neural networks (Simonyan and Zisserman, 2015). (Mo et al., 2015; Chen et al., 2016; Wang et al., 2018) extend these methods by training the CNNs in an end-to-end manner. (Liu et al., 2020) integrate the category information on top of the CNN embedding to help visual modeling. The above works focus on improving product ranking by considering visual information while neglecting the great potential of creative ranking. Few works so far address this topic. idealo.de (a portal of the German e-commerce market) adopts an aesthetic model (Esfandarani and Milanfar, 2018) to select the most attractive image for each recommended hotel. They believe that photos can be just as important for bookings as reviews. PEAC (Zhao et al., 2019) resembles our method the most, and it aims to rank ad creatives based on their visual content. But it is an offline evaluation model that cannot flexibly update the ranking strategy when receiving online observations. Besides, none of the above methods model uncertainty, so they may lack exploration ability.
2.3. Multi-armed Bandit Methods
The multi-armed bandit (MAB) problem is a typical sequential decision making process that is also treated as an online decision making problem (Yang et al., 2020). A wide range of real-world applications can be modeled as MAB problems, such as online recommendation systems (Glowacka, 2019), online advertising (Schwartz et al., 2017) and information retrieval (Glowacka, 2017). Epsilon-greedy (François-Lavet et al., 2018), Thompson sampling (Russo and Roy, 2014) and UCB (Auer et al., 2002) are classic context-free algorithms. They use the reward/cost from the environment to update their E&E policy without contextual information. It is difficult for such models to quickly adjust to new creatives (arms) since the web content undergoes frequent changes. (Li et al., 2010; Agrawal and Goyal, 2013; Riquelme et al., 2018) extend these context-free methods by considering side information such as user/content representations. They assume that the expected payoff of an arm is linear in its features. The main problem linear algorithms face is their lack of representational power, which they complement with accurate uncertainty estimates. A natural attempt at getting the best of both representation learning ability and accurate uncertainty estimation consists of fitting a linear payoff model on top of a neural network. NeuralLinear (Riquelme et al., 2018) presents a Thompson sampling based framework that simultaneously learns a data representation through neural networks and quantifies the uncertainty through Bayesian linear regression. Inspired by this framework, we further improve both the neural network and the bandit method to benefit our creative ranking problem.
3. Dataset Construction
In order to promote further research and comparison on creative ranking, we contribute a large-scale creative dataset to the research community. It comprises creative images and sequential impression data which can be used for evaluating both visual predictions and E&E strategies. In this section, we first describe how the creatives and online feedback are collected in subsection 3.1. Then we provide a statistical analysis of the dataset in subsection 3.2.
3.1. Data Collection
We collect a large and diverse set of creatives from the Alibaba display advertising platform from July 1, 2020 to August 1, 2020. The total number of impressions is approximately 215 million. There are 500,827 products with 1,707,733 ad creatives. We will make this dataset publicly available for further research and comparison. The creative and online feedback collection is subject to the following constraints:
Randomized logging policy. The online system adopts a randomized logging policy so that the creatives are randomly drawn to collect an unbiased dataset. Bandit algorithms learn policies through interaction data. Training or evaluation on offline data may suffer from exposure bias, known as the “off-policy evaluation problem” (Precup, 2000). (Li et al., 2010) demonstrate that if the logging policy chooses each arm uniformly at random, the estimation of bandit algorithms is unbiased. Thus, for each impression of a product, the policy randomly chooses a candidate creative and records whether it is clicked.
Aligned creative lifetime. Due to the complexity of the online environment, CTRs vary across time periods, even for the same creative. In practice, creatives are newly designed or deleted over time, which results in inconsistent exposure periods (see Figure 3(a)). In order to avoid the noise introduced by different time intervals, we only collect the overlap period among the candidate creatives (see Figure 3(b)). Besides, the overlap should be within 5 to 14 days, which covers the creative lifetime from the cold-start stage to the relatively stable stage. All the filtered creatives are gathered to build the sequential data.
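The lifetime alignment described above can be sketched as an interval intersection. This is only a minimal illustration under stated assumptions: the function names `overlap_period` and `keep_product`, the inclusive day count, and the choice to drop (rather than truncate) products whose overlap falls outside the 5 to 14 day window are hypothetical and not taken from the released code.

```python
from datetime import date


def overlap_period(lifetimes):
    """Return the (start, end) dates shared by all creatives, or None if there is no overlap."""
    start = max(s for s, _ in lifetimes)  # latest first-exposure date
    end = min(e for _, e in lifetimes)    # earliest last-exposure date
    return (start, end) if start <= end else None


def keep_product(lifetimes, min_days=5, max_days=14):
    """Keep a product only if its creatives overlap for 5 to 14 days (assumed inclusive)."""
    period = overlap_period(lifetimes)
    if period is None:
        return False
    days = (period[1] - period[0]).days + 1  # assumption: both end points count
    return min_days <= days <= max_days


# Three hypothetical creatives of one product with different exposure periods.
lifetimes = [
    (date(2020, 7, 1), date(2020, 7, 20)),
    (date(2020, 7, 5), date(2020, 7, 25)),
    (date(2020, 7, 3), date(2020, 7, 12)),
]
print(overlap_period(lifetimes), keep_product(lifetimes))
```

Only impressions that fall inside the shared period would then be kept for that product, so every candidate is compared under the same traffic conditions.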
Train/Validation/Test split. We randomly split the 500,827 products into 300,242 training, 100,240 validation and 100,345 test samples, with 1,026,378/340,449/340,906 creatives respectively. We treat each product as a sample, and aim to select the best creative among its candidates. The proposed VAM is learned from the training set, while the bandit model HBM is deployed on the validation/test data. This setting is used to evaluate the effectiveness of visual predictions on unseen products/creatives, and whether the policy can make better posterior estimations by using online observations.
3.2. Statistical Analysis
The proposed dataset is collected from ad interaction logs across 32 days. Figure 2(a) gives a summary of our CreativeRanking dataset. It consists of 500,827 products, covering 124 categories. The minimum and maximum numbers of candidate creatives for a product are 3 and 11, while the average number is 3.4. In fact, the number of candidates in real-world scenarios far exceeds 3.4, but the offline dataset is constrained by the conditions introduced in subsection 3.1. Figure 2(b) shows the number of products for the top 20 categories, namely Women’s tops, Women’s bottoms, Men’s, Women’s shoes, Furniture, and so on. In Figure 2, we make a further analysis of the creatives for these categories. Supposing we know the CTR of each creative, we select the poorest and the best creative for each product, and accumulate their overall performance, which is visualized as the grey and (grey+blue+orange) bins. We find that the CTR of a product can be lifted substantially by selecting a good creative. Specifically, a good creative is capable of lifting CTR by 148% to 285% compared to the poorest candidates, and by 41.5% to 72.5% compared to the averaged performance of all candidates (grey+blue bins).
By proposing this CreativeRanking dataset, we would like to draw more attention to this topic which benefits both the research community and website’s user experience.
4. Method
4.1. Overview
We briefly overview the entire pipeline. The main notations used in this paper are summarized in the right panel of Figure 4. First, as shown in Figure 4(a), the feature extraction network $\mathcal{N}_{feat}$ simultaneously receives the creatives of the $n$-th product as input, and produces the $d$-dimensional intermediate features $\{f^{n}_{1}, \dots, f^{n}_{M}\}$. Then, a fully connected layer is employed to calculate the scores for them, indicated as $\{s^{n}_{1}, \dots, s^{n}_{M}\}$.
Second, the listwise ranking loss and the auxiliary regression loss are introduced to guide the learning procedure. Such a multi-objective optimization helps the model not only focus on creative ranking, but also take into account the numerical range of the CTR, which benefits the subsequent bandit model. In addition, since data noise is a common problem in real-world applications, we provide several practical solutions to mitigate casual and malicious noise. Details are described in Subsection 4.2.
After the above steps, the model can evaluate the quality of a creative directly from its visual content, even for a newly uploaded one without any history information. Then we propose a hybrid bandit model that incorporates the learned features as contextual information, and updates the policy by interacting with online observations. As shown in Figure 4(b), the hybrid model combines both product-wise and creative-specific predictions, which is more flexible for complex industrial data. The detailed formulations are given in Subsection 4.3.
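4.2. VAM: Visual-aware Ranking Model
Given a product $I^{n}$, we use the feature extraction network $\mathcal{N}_{feat}$ to extract high-level visual representations of its creatives. A linear layer is then adopted to produce the attractiveness score for the $m$-th creative $C^{n}_{m}$ of the $n$-th product:
$$f^{n}_{m} = \mathcal{N}_{feat}(C^{n}_{m}) \tag{5}$$
$$s^{n}_{m} = (f^{n}_{m})^{\top} w \tag{6}$$
where $w$ are the learnable parameters of the linear layer.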
Listwise Ranking Loss. To learn the relative order of creatives, we need to map a list of predicted scores and a list of ground-truth CTRs to permutation probability distributions, respectively, and then take a metric between these distributions as the loss function. The mapping strategy and evaluation metric should guarantee that the candidates with higher scores are ranked higher. (Cao et al., 2007) proposed the permutation probability and top-$k$ probability definitions. Inspired by this work, we simplify the probability of a creative being ranked at the top-1 position as
$$p_{s^{n}_{m}} = \frac{\exp(s^{n}_{m})}{\sum_{i=1}^{M} \exp(s^{n}_{i})} \tag{7}$$
where $\exp(\cdot)$ is the exponential function and $s^{n}_{m}$ is the predicted score of the $m$-th creative of the $n$-th product. The exponential function based top-1 probability is both scale invariant and translation invariant. The corresponding labels are
$$p_{y^{n}_{m}} = \frac{\exp(\mathrm{CTR}(C^{n}_{m}), T)}{\sum_{i=1}^{M} \exp(\mathrm{CTR}(C^{n}_{i}), T)} \tag{8}$$
where $\exp(\cdot, T)$ is the exponential function with temperature $T$. Since the CTR is only about a few percent, we use $T$ to adjust the scale of the value so that the probability of the top-1 sample is close to 1. With cross entropy as the metric, the loss for product $I^{n}$ becomes
$$\mathcal{L}^{n}_{rank} = -\sum_{m=1}^{M} p_{y^{n}_{m}} \log p_{s^{n}_{m}} \tag{9}$$
Through such an objective function, the model focuses on comparing the creatives within the same product. We concentrate on the top-1 probability since it is consistent with real scenarios, in which only one creative is displayed for each impression. Besides, the end-to-end training manner fully utilizes the learning ability of deep CNNs and boosts the extraction of visual prior knowledge.
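As a rough illustration of the top-1 listwise objective, the sketch below computes the cross entropy between the softmax of the ground-truth CTRs and the softmax of the predicted scores for one product. It is written in plain Python rather than PyTorch so that it runs without extra dependencies. Two points are assumptions, not details confirmed by the text: the temperature is applied as a multiplicative factor on the CTRs, and the value `temperature=100.0` is purely illustrative.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_top1_loss(scores, ctrs, temperature=100.0):
    """Cross entropy between top-1 label probabilities and top-1 predicted probabilities.

    Assumption: the temperature rescales the small CTR values by multiplication.
    """
    p_pred = softmax(scores)
    p_true = softmax([temperature * c for c in ctrs])
    return -sum(pt * math.log(pp) for pt, pp in zip(p_true, p_pred))


ctrs = [0.05, 0.02, 0.01]                            # hypothetical ground-truth CTRs
good = listwise_top1_loss([3.0, 1.0, 0.0], ctrs)     # scores agree with the CTR order
bad = listwise_top1_loss([0.0, 1.0, 3.0], ctrs)      # scores reverse the CTR order
print(good < bad)
```

Scores that rank the highest-CTR creative first receive a lower loss, while adding a constant to all scores leaves the loss unchanged, which is why an additional regression term is needed to pin down the numerical scale.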
Pointwise Auxiliary Regression Loss. In addition to the listwise ranking loss, we expect the pointwise regression to push the model towards more accurate predictions. The ranking loss function only constrains the order of the outputs, leaving their numerical scale unconstrained. Since the learned representations will be adopted as prior knowledge for the bandit model in Subsection 4.3, making the outputs close to the real CTRs significantly stabilizes the bandit learning procedure. Thus we add the pointwise regression as a regularizer. The formulation is
$$\mathcal{L}^{n}_{reg} = \sum_{m=1}^{M} \left\| \mathrm{CTR}(C^{n}_{m}) - s^{n}_{m} \right\| \tag{10}$$
where $\|\cdot\|$ denotes the norm. Finally, we add up the ranking loss and the auxiliary loss to form the final loss:
$$\mathcal{L}^{n} = \mathcal{L}^{n}_{rank} + \gamma \mathcal{L}^{n}_{reg} \tag{11}$$
where $\gamma$ is 0.5 in our experiments.
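A small sketch of how the auxiliary term changes the objective is given below. It takes an already computed ranking loss as input and adds the weighted regression term with the weight 0.5 stated above. The absolute error is used because, for scalar scores, any norm of the difference reduces to it; treating the weight as a multiplier on the regression term, and the function names, are assumptions.

```python
def regression_loss(scores, ctrs):
    """Sum of absolute differences between predicted scores and CTR targets."""
    return sum(abs(s - c) for s, c in zip(scores, ctrs))


def total_loss(rank_loss, scores, ctrs, gamma=0.5):
    """Ranking loss plus the weighted auxiliary regression term (assumed weighting)."""
    return rank_loss + gamma * regression_loss(scores, ctrs)


ctrs = [0.05, 0.02, 0.01]
calibrated = [0.05, 0.02, 0.01]      # same order, same scale as the CTRs
shifted = [10.05, 10.02, 10.01]      # same order, but far from the CTR scale

# A softmax-based ranking loss is identical for both score lists, so we reuse
# one hypothetical value for it and only the regression term differs.
rank = 0.3
print(total_loss(rank, calibrated, ctrs), total_loss(rank, shifted, ctrs))
```

Both score lists induce the same ranking, yet only the calibrated one keeps the total loss low, which matches the motivation of keeping the outputs close to real CTRs for the bandit prior.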
Noise Mitigation. In both listwise ranking and pointwise regression, the CTR should be estimated as the ground-truth target. But in a real-world dataset, some creatives do not have sufficient impression opportunities, and the estimation may suffer from huge variance. For example, if a creative gets only one impression, and a click is accidentally recorded from this impression, the $\hat{\mathrm{CTR}}$ will be set to 1, which is inevitably unreliable. To mitigate the problem, we provide two practical solutions, namely label smoothing and weighted sampling.
Label smoothing is an empirical Bayes method that is utilized to smooth the CTR estimation (Wang et al., 2011). Suppose the clicks are drawn from a binomial distribution and the CTR follows a prior distribution as
$$\mathrm{click}(C^{n}_{m}) \sim \mathrm{Binomial}\big(\mathrm{impression}(C^{n}_{m}), \mathrm{CTR}(C^{n}_{m})\big), \quad \mathrm{CTR}(C^{n}_{m}) \sim \mathrm{Beta}(\alpha, \beta) \tag{12}$$
where $\mathrm{Beta}(\alpha, \beta)$ can be regarded as the prior distribution of CTRs. After observing more clicks, the conjugacy between Binomial and Beta allows us to obtain the posterior distribution and the smoothed $\hat{\mathrm{CTR}}$ as
$$\hat{\mathrm{CTR}}(C^{n}_{m}) = \frac{\mathrm{click}(C^{n}_{m}) + \alpha}{\mathrm{impression}(C^{n}_{m}) + \alpha + \beta} \tag{13}$$
where $\alpha$ and $\beta$ can be obtained by maximum likelihood estimation over all the historical data (Wang et al., 2011). Compared to the original estimate, the smoothed $\hat{\mathrm{CTR}}$ has smaller variance and benefits the training.
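The effect of the Beta-Binomial smoothing can be seen with a short numerical sketch. The prior parameters `alpha=3` and `beta=97` (a prior mean of 3%) are hypothetical values chosen for illustration; in the paper they are estimated from historical data.

```python
def smoothed_ctr(clicks, impressions, alpha, beta):
    """Posterior mean of the CTR under a Beta(alpha, beta) prior and binomial clicks."""
    return (clicks + alpha) / (impressions + alpha + beta)


alpha, beta = 3.0, 97.0  # hypothetical prior, mean = 3 / (3 + 97) = 3%

# One accidental click on a single impression: raw CTR is 1.0, smoothed stays near the prior.
print(smoothed_ctr(1, 1, alpha, beta))

# With many impressions the observations dominate and the estimate approaches 500 / 10000.
print(smoothed_ctr(500, 10000, alpha, beta))
```

With few impressions the estimate is pulled towards the prior mean, and with many impressions it converges to the empirical click ratio, which is the variance reduction described above.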
Weighted sampling is a sampling strategy for the training process. Instead of treating each product equally, we pay more attention to the products whose impressions are adequate and whose CTRs are more reliable. The sampling weights can be produced by
$$p^{n} = \frac{g(\mathrm{impression}(I^{n}))}{\sum_{i=1}^{N} g(\mathrm{impression}(I^{i}))} \tag{14}$$
where $g(\cdot)$ is set to the logarithm of the impressions and $p^{n}$ denotes the sampling weight of product $I^{n}$.
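A minimal sketch of impression-weighted sampling follows. It assumes the weights are the normalized logarithms of the per-product impression counts and uses the standard library `random.choices` to draw training products; the helper names and the requirement that every product has more than one impression (so the logarithm is positive) are assumptions of this sketch.

```python
import math
import random


def sampling_weights(impressions):
    """Normalized log-impression weights; assumes every count is greater than 1."""
    g = [math.log(n) for n in impressions]
    total = sum(g)
    return [v / total for v in g]


def sample_products(product_ids, impressions, k, seed=0):
    """Draw k products (with replacement) in proportion to their sampling weights."""
    rng = random.Random(seed)
    return rng.choices(product_ids, weights=sampling_weights(impressions), k=k)


weights = sampling_weights([10, 100, 1000])
print(weights)  # the logarithm compresses a 100x impression gap into a 3x weight gap
batch = sample_products(["p1", "p2", "p3"], [10, 100, 1000], k=8)
print(batch)
```

The logarithm keeps heavily exposed products from dominating training while still favouring products whose CTR labels are more reliable.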
All of the above modules are integrated in a unified framework, and the visual-aware ranking model focuses on learning general visual patterns related to display performance. The informative representations are then applied as prior knowledge for the bandit algorithm.
4.3. HBM: Hybrid Bandit Model
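In this section, we provide an elegant and efficient strategy that tackles the E&E dilemma by utilizing the visual priors and updating the posterior via the hybrid bandit model. Based on the NeuralLinear framework (Riquelme et al., 2018), we build a Bayesian linear regression on the extracted visual representations. Assume the online feedback data is generated as follows:
$$y = f^{\top} \tilde{w} + \epsilon \tag{15}$$
where $y$ represents the clicked/non-clicked data and $f$ is the visual representation extracted by VAM. Different from the deterministic weights in Equation 6, we need to learn a weight distribution $\tilde{w}$ with uncertainty, which benefits the E&E decision making. The noise terms $\epsilon$ are independent and identically normally distributed random variables:
$$\epsilon \sim \mathcal{N}(0, \sigma^{2}) \tag{16}$$
According to Bayes' theorem, if the prior distribution of $\tilde{w}$ and $\sigma^{2}$ is conjugate to the data’s likelihood function, the posterior probability distributions can be derived analytically. Then Thompson Sampling, also known as Posterior Sampling, is able to tackle the E&E dilemma by maintaining the posterior over models and selecting creatives in proportion to the probability that they are optimal. We model the prior joint distribution of $\tilde{w}$ and $\sigma^{2}$ as:
$$\pi(\tilde{w}, \sigma^{2}) = \pi(\tilde{w} \mid \sigma^{2})\, \pi(\sigma^{2}), \quad \sigma^{2} \sim \mathrm{IG}(a_{0}, b_{0}), \quad \tilde{w} \mid \sigma^{2} \sim \mathcal{N}(\mu_{0}, \sigma^{2} \Lambda_{0}^{-1}) \tag{17}$$
where $\mathrm{IG}(\cdot)$ is an Inverse Gamma distribution with prior hyperparameters $a_{0}$ and $b_{0}$, and $\mathcal{N}(\cdot)$ is a Gaussian distribution with the initial parameters $\mu_{0}$ (mean) and $\Lambda_{0}$ (precision). Note that $\mu_{0}$ is initialized as the learned weights of VAM in Equation 6. This provides better prior hyperparameters that further enhance the performance in the cold-start phase. We call it VAM-Warmup and the results are shown in Figure 5(b).
Because we have chosen a conjugate prior, the posterior at time $t$ can be derived as
$$\Lambda(t) = F^{\top} F + \Lambda_{0}, \quad \mu(t) = \Lambda(t)^{-1} \big(\Lambda_{0} \mu_{0} + F^{\top} Y\big), \quad a(t) = a_{0} + \frac{t}{2}, \quad b(t) = b_{0} + \frac{1}{2} \big(Y^{\top} Y + \mu_{0}^{\top} \Lambda_{0} \mu_{0} - \mu(t)^{\top} \Lambda(t) \mu(t)\big) \tag{18}$$
where $F \in \mathbb{R}^{t \times d}$ is a matrix that contains the content features of the previous impressions and $Y \in \mathbb{R}^{t}$ contains the feedback rewards. After updating the above parameters at the $t$-th impression, we obtain the weight distributions with uncertainty estimation. We draw the weights $\tilde{w}(t)$ from the learned distribution and select the best creative for product $I^{n}$ as
$$C^{n} = \arg\max_{c \in \{C^{n}_{1}, \dots, C^{n}_{M}\}} \mathcal{N}_{feat}(c)^{\top} \tilde{w}(t) \tag{19}$$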
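The conjugate update and the posterior sampling step can be sketched with NumPy as below. This is a generic Bayesian linear regression with a Normal-Inverse-Gamma prior in the spirit of NeuralLinear, not the authors' implementation: the class name `BayesianLinearTS`, the isotropic prior precision `lambda0 * I`, and the default values `lambda0=0.25`, `a0=6.0`, `b0=6.0` are illustrative assumptions. The prior mean `mu0` plays the role of the VAM weights used for warm-up.

```python
import numpy as np


class BayesianLinearTS:
    """Bayesian linear regression with a Normal-Inverse-Gamma prior and Thompson sampling."""

    def __init__(self, mu0, lambda0=0.25, a0=6.0, b0=6.0, seed=0):
        self.mu0 = np.asarray(mu0, dtype=float)   # prior mean, e.g. weights learned offline
        d = self.mu0.shape[0]
        self.prec0 = lambda0 * np.eye(d)          # prior precision (assumed isotropic)
        self.a0, self.b0 = a0, b0
        # Sufficient statistics of all observed (feature, reward) pairs.
        self.ftf = np.zeros((d, d))
        self.fty = np.zeros(d)
        self.yty = 0.0
        self.t = 0
        self.rng = np.random.default_rng(seed)

    def update(self, f, y):
        """Add one impression with feature vector f and reward y (1 = click, 0 = no click)."""
        f = np.asarray(f, dtype=float)
        self.ftf += np.outer(f, f)
        self.fty += y * f
        self.yty += y * y
        self.t += 1

    def posterior(self):
        """Closed-form posterior parameters of the weights and the noise variance."""
        prec = self.ftf + self.prec0
        cov = np.linalg.inv(prec)
        mu = cov @ (self.prec0 @ self.mu0 + self.fty)
        a = self.a0 + self.t / 2.0
        b = self.b0 + 0.5 * (self.yty + self.mu0 @ self.prec0 @ self.mu0 - mu @ prec @ mu)
        return mu, cov, a, b

    def select(self, features):
        """Sample weights from the posterior and return the index of the best candidate."""
        mu, cov, a, b = self.posterior()
        sigma2 = b / self.rng.gamma(a)            # draw from Inverse-Gamma(a, b)
        w = self.rng.multivariate_normal(mu, sigma2 * cov)
        return int(np.argmax(np.asarray(features, dtype=float) @ w))


# Usage: two candidate creatives with 2-dimensional features (hypothetical values).
model = BayesianLinearTS(mu0=[0.0, 0.0])
model.update([1.0, 0.0], 1.0)
print(model.select([[1.0, 0.0], [0.0, 1.0]]))
```

Before any feedback arrives the sampled weights scatter around `mu0`, so a good prior mean directly shapes early selections; as impressions accumulate the covariance shrinks and the choice becomes data-driven.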
The above model makes the weight distribution shared by all products. This simple linear assumption works well for small datasets, but becomes inferior when dealing with industrial data. For example, bright and vivid colors are more attractive for women’s tops, while concise colors are more suitable for 3C digital accessories. In addition to this product-wise characteristic, a creative may contain a uniquely designed attribute that is not expressed by the shared weights. Hence, it is helpful to have weights that have both shared and non-shared components.
We extend Equation 15 to the following hybrid model by combining product-wise and creative-specific linear terms. For creative $C^{n}_{m}$, it can be formulated as
$$y^{n}_{m} = (f^{n}_{m})^{\top} w^{n} + (f^{n}_{m})^{\top} w^{n}_{m} \tag{20}$$
where $w^{n}$ and $w^{n}_{m}$ are the product-wise and creative-specific parameters, and they are disjointly optimized by Equation 18. Furthermore, we propose a fusion strategy to adaptively combine these two terms instead of the simple addition:
$$y^{n}_{m} = (1 - \lambda)\, (f^{n}_{m})^{\top} w^{n} + \lambda\, (f^{n}_{m})^{\top} w^{n}_{m} \tag{21}$$
where $\lambda$ is a sigmoid function of the impression number with offset and rescale. We find that if the impressions are inadequate, the product-wise parameters are learned better because they make use of the knowledge shared among candidate creatives. Otherwise, the creative-specific term outperforms the shared one thanks to the sufficient feedback observations. The above procedure is shown in Algorithm 1. Because our hybrid model updates the parameters of each product independently, we take product $I^{n}$ as an example and adopt $w^{n}$ and $w^{n}_{m}$ to represent the shared and specific parameters.
The distributions describe the uncertainty in the weights, which is related to the number of impressions: if there is little data, the model relies more on the visual evaluation results; otherwise, the likelihood reduces the effect of the prior so as to converge to the observed data. In order to fit complex industrial data, we extend the shared linear model to the hybrid version, which considers both product-level knowledge and creative-specific information, fused by empirical attention weights.
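The adaptive fusion can be illustrated with a few lines of Python. The sketch assumes that the fusion weight is a sigmoid of the creative's impression count and that it multiplies the creative-specific term, which is consistent with the observation that the shared term is more reliable when impressions are scarce. The constants `offset=100.0` and `scale=20.0` are hypothetical, as are the function names.

```python
import math


def fusion_weight(impressions, offset=100.0, scale=20.0):
    """Sigmoid with offset and rescale; the constants are illustrative assumptions."""
    return 1.0 / (1.0 + math.exp(-(impressions - offset) / scale))


def hybrid_score(features, w_product, w_creative, impressions):
    """Blend the product-wise and creative-specific linear predictions."""
    shared = sum(f * w for f, w in zip(features, w_product))
    specific = sum(f * w for f, w in zip(features, w_creative))
    lam = fusion_weight(impressions)
    return (1.0 - lam) * shared + lam * specific


features = [1.0, 2.0]
w_product = [0.1, 0.2]    # shared prediction: 0.5
w_creative = [0.4, 0.3]   # creative-specific prediction: 1.0

for n in (0, 100, 1000):
    print(n, round(hybrid_score(features, w_product, w_creative, n), 4))
```

With no impressions the score is essentially the product-wise prediction, at the offset the two terms are weighted equally, and with many impressions the creative-specific prediction takes over.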
5. Experiments
5.1. Dataset preparation and Evaluation Metrics
Dataset Preparation. The description of the CreativeRanking data is presented in Section 3.1. We split the data into 300,242 training, 100,240 validation and 100,345 test products, respectively. The original images, product categories and impression rewards for each creative are provided in the order of display. For VAM, we aggregate the numbers of impressions and clicks to produce the smoothed CTR by Equation 13 on the training set, and train the VAM using the loss function in Equation 11. For HBM, we update the policy by providing the visual representations extracted by VAM and the impression data in the form of Equation 4. Note that the interaction and policy updating procedure of HBM (see Algorithm 1) is conducted on the test set to simulate the online situation. We record the sequential interactions and rewards to measure the performance (see Algorithm 2 and Equation 21). The validation set is used for hyperparameter tuning.
In addition to our CreativeRanking data, we also evaluate the methods on a public dataset called Mushroom. Since there is no public dataset for creative ranking yet, we test the proposed hybrid bandit model on this dataset. The Mushroom Dataset (Schlimmer, 1981) contains 22 attributes for each mushroom, and two categories: poisonous and safe. Eating a safe mushroom receives a reward of +5, while eating a poisonous mushroom delivers a reward of +5 with probability 1/2 and a reward of -35 otherwise. Not eating provides no reward. We follow the protocols in (Riquelme et al., 2018), and interact for 50,000 rounds.
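The Mushroom reward protocol can be written down in a few lines. The sketch assumes the standard reward values of the protocol in (Riquelme et al., 2018): +5 for a safe mushroom, +5 or -35 with equal probability for a poisonous one, and 0 for not eating. The function name and signature are hypothetical.

```python
import random


def mushroom_reward(is_poisonous, eat, rng=random):
    """Reward of one round of the Mushroom bandit under the assumed standard protocol."""
    if not eat:
        return 0                                  # declining to eat gives no reward
    if not is_poisonous:
        return 5                                  # safe mushroom
    return 5 if rng.random() < 0.5 else -35       # poisonous: +5 or -35 with equal chance


rng = random.Random(0)
rewards = [mushroom_reward(True, True, rng) for _ in range(10000)]
print(sum(rewards) / len(rewards))  # close to the expected value (5 - 35) / 2 = -15
```

Because eating a poisonous mushroom has an expected reward of -15, the optimal policy eats exactly the safe mushrooms, which is what the cumulative regret is measured against.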
Evaluation Metrics. For the CreativeRanking data, we present two evaluation metrics to measure the performance, named simulated CTR (sCTR) and cumulative regret (Regret), respectively.
Simulated CTR (sCTR) is a practical metric which is quite close to the online performance. The details are shown in Algorithm 2. It replays the recorded impression data for all products. For each product, the policy plays $T$ rounds by receiving the recorded data $(C_{t}, y_{t})$, i.e. the displayed creative and its click label, and selects the best creative according to the predicted scores. If the selected one is the same as the recorded creative $C_{t}$, the impression number, the click number and the policy itself are updated (see lines 3 to 14 in Algorithm 2).
Taking HBM as an example, Algorithm 1 shows the online update process. To test HBM on offline data, we can change the action “display and update” (lines 14 to 18 in Algorithm 1) to the conditioned version in lines 8 to 12 of Algorithm 2.
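The replay procedure can be sketched as follows. This is a simplified reading of the description, not a transcription of Algorithm 2: the log is flattened into a single list of `(product_id, shown_creative, clicked)` tuples, and the policy interface (`select` and `update`) and the `FixedPolicy` helper are hypothetical.

```python
def simulated_ctr(policy, logged_events):
    """Replay logged impressions and count only those where the policy agrees with the log.

    logged_events: iterable of (product_id, shown_creative, clicked) tuples collected
    under a uniformly random logging policy.
    """
    impressions = 0
    clicks = 0
    for product_id, shown, clicked in logged_events:
        chosen = policy.select(product_id)
        if chosen != shown:
            continue                      # the log cannot tell us the outcome; skip
        impressions += 1
        clicks += clicked
        policy.update(product_id, shown, clicked)
    return clicks / impressions if impressions else 0.0


class FixedPolicy:
    """Toy policy that always shows a predetermined creative per product."""

    def __init__(self, choice):
        self.choice = choice

    def select(self, product_id):
        return self.choice[product_id]

    def update(self, product_id, creative, clicked):
        pass                              # a learning policy would update its posterior here


events = [("p1", "a", 1), ("p1", "b", 0), ("p1", "a", 0), ("p1", "b", 1), ("p1", "a", 1)]
print(simulated_ctr(FixedPolicy({"p1": "a"}), events))
```

Since the log was collected uniformly at random, the matched impressions form an unbiased sample of what the evaluated policy would have observed online.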
Cumulative regret (Regret) is commonly used for evaluating bandit models. It is defined as
$$\mathrm{Regret} = \mathbb{E}[r^{*}] - \mathbb{E}[r] \tag{22}$$
where $r^{*}$ is the cumulative reward of the optimal policy, i.e., the policy that always selects the action with the highest expected reward given the context, and $r$ is the cumulative reward of the evaluated policy (Riquelme et al., 2018). Specifically, we select the optimal creative for our dataset, and calculate the Regret as
(23) 
where the sCTR should be produced by Algorithm 2 first, and the optimal creative is selected by calculating the empirical CTR in Equation 2 on the test set.
For Mushroom, we follow the definition of cumulative regret in (Riquelme et al., 2018) to evaluate the models.
5.2. Implementation details
The model was implemented with PyTorch (Paszke et al., 2017). We adopt a deep residual network (ResNet-18) (He et al., 2016) pretrained on ImageNet classification (Deng et al., 2009) as the backbone, and the model is fine-tuned on the creative ranking task. For VAM, we use stochastic gradient descent (SGD) with a mini-batch of 64 per GPU. The learning rate is initially set to 0.01 and is then gradually decreased. The training process lasts 30 epochs on the dataset. For HBM, we extract the feature representations from VAM, and update the weight distributions of the product-wise and creative-specific parameters by using Bayesian regression.
5.3. Comparison with State-of-the-art Systems
In this subsection, we show the performance of the related methods in Table 1 and Figure 5. The methods are divided into several groups: a uniform strategy, context-free bandit models, linear bandit models, neural bandit models and our proposed methods. Table 1 presents the sCTR and Regret of all the above models on both the Mushroom and CreativeRanking datasets, and our methods (NN/VAM-HBM) exhibit state-of-the-art results compared to the related models. We also conduct further analysis by showing the reward tendency over 15 consecutive days in Figure 5. The daily sCTR evaluates the model for each day independently, showing the flexibility of the policy when interacting with the feedback. The cumulative sCTR presents the cumulative rewards up to a specific day, which is used to measure the overall performance.
Uniform: The baseline strategy that randomly selects an action (eat/not eat for Mushroom and one creative for CreativeRanking). Because this strategy has neither prior knowledge nor the ability to learn from the data, it performs poorly on the test sets.
Context-free Bandit Models: Epsilon-greedy (François-Lavet et al., 2018), Thompson sampling (Russo and Roy, 2014) and Upper Confidence Bounds (UCB) approaches (Auer et al., 2002) are simple yet effective strategies to deal with the bandit problem. They rely on historical impression data (click/non-click) and keep updating their strategies. However, in the cold-start stage, they may choose a creative randomly like the “Uniform” strategy (orange lines in Figure 5(c) in the first few days). We find that their curves rise gradually, but without prior information, the overall performance is inferior to the other models.
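To make the cold-start behaviour of context-free bandits concrete, the following sketch implements a textbook Beta-Bernoulli Thompson sampler over the creatives of one product. It is a generic illustration, not the baseline code used in the experiments; the uniform Beta(1, 1) prior, the class name and the simulated click rates are assumptions.

```python
import random


class BetaBernoulliTS:
    """Context-free Thompson sampling with one Beta posterior per arm (creative)."""

    def __init__(self, n_arms, seed=0):
        self.successes = [1] * n_arms   # Beta(1, 1) prior: no preference between arms
        self.failures = [1] * n_arms
        self.rng = random.Random(seed)

    def select(self):
        """Sample a CTR for every arm from its posterior and play the largest sample."""
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.successes, self.failures)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, clicked):
        if clicked:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


# Simulated environment with hypothetical true CTRs for three creatives.
true_ctr = [0.02, 0.05, 0.20]
env = random.Random(42)
bandit = BetaBernoulliTS(n_arms=3, seed=7)
pulls = [0, 0, 0]
for _ in range(3000):
    arm = bandit.select()
    pulls[arm] += 1
    bandit.update(arm, env.random() < true_ctr[arm])
print(pulls)
```

With identical priors the first selections are effectively uniform, which mirrors the random early behaviour noted above; only after enough clicks are observed does the sampler concentrate on the best creative.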
Linear Bandit Models: The linear bandit model is an extension of the context-free method that incorporates contextual information. For Mushroom, we adopt the 22 attributes to describe a mushroom, such as shape, color and so on. The Regret is reduced when combining the side information. For CreativeRanking, we use the color distribution (Azimi et al., 2012) to represent a creative, and update the linear payoff functions. From the results in Table 1(d), the linear models achieve better results than the context-free methods, but they still face the problem of lacking representational power.
Neural Bandit Models: The neural bandit models add a linear regression on top of the neural network. In Table 1, “NN” denotes the fully connected layers that are used for extracting mushroom representations. For CreativeRanking, all these neural models use our VAM as the feature extractor, and adopt different E&E policies. Figure 5(a) reveals some interesting observations: (1) The orange and blue lines represent E-greedy and VAM-Greedy, respectively. With the visual priors, VAM-Greedy achieves much better performance at the beginning (about 5% CTR lift), which demonstrates the effectiveness of the visual evaluation. (2) Because VAM-Greedy is a greedy strategy that lacks exploration, it becomes mediocre in the long run. When incorporating the E&E model HBM, our VAM-HBM outperforms the other baselines by a significant margin. Besides, we also use Dropout as a Bayesian approximation (Gal and Ghahramani, 2016), but it is not able to estimate the uncertainty as accurately as the other policies.
Our Methods: We propose VAM-Warmup, which initializes the prior mean of the weight distribution in the bandit model with the learned weights of VAM. By comparing the red and blue dashed lines in Figure 5(b), we find that the parameters with prior distributions improve the overall CTR by 1.7%. In addition, we extend the model by adding creative-specific parameters, named VAM-HBM, which further enhances the model capacity and achieves the state-of-the-art result, especially when the impressions for creatives become adequate (see the solid red line in Figure 5(b)(c)(d)). For the Mushroom dataset, in order to demonstrate the idea, we cluster the data into 2 groups by the attribute “bruises”, each maintaining its individual parameters. When combining the individual and shared parameters by the fusion weights in Equation 21, the model reduces the Regret to 1.93. Note that we use the default hyperparameters provided by NeuralLinear without careful tuning.
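| Methods | Base | (a) | (b) | (c) | (d) |
|---|---|---|---|---|---|
| Pointwise Loss? | | ✓ | | ✓ | ✓ |
| Listwise Loss? | | | ✓ | ✓ | ✓ |
| Noise Mitigation? | | | | | ✓ |
| sCTR (%) | 2.950 | 3.140 | 3.167 | 3.194 | 3.219 |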
5.4. Ablation Study
In this subsection, we conduct an ablation study on CreativeRanking dataset so as to validate the effectiveness of each component in the VAM, including listwise ranking loss, pointwise auxiliary regression loss and noise mitigation. Besides, we also compare our VAM with “learningtorank” visual models (including aesthetic models). We show the results in Table 2 and Table 3 to demonstrate the consistent improvements.
Base in Table 2 stands for the baseline result. We adopt “uniform” strategy that randomly choose a creative among the candidates. The baseline is 2.950% for .
Method (a) and (b): Methods (a) and (b) utilize the pointwise loss (Equation 10) and the listwise loss (Equation 9) as the objective function, respectively. Although the model has never seen the products/creatives in the test set before, it has learned general patterns to identify more attractive creatives. Moreover, the ranking loss concentrates on learning the top-1 probability, which is more suitable than the pointwise objective for our scenario. The simple version (b) improves the sCTR from 2.950% to 3.167% (Table 2).
Method (c): Method (c) combines the pointwise auxiliary regression loss with the ranking objective. It learns not only the relative order of creative quality, but also the absolute CTRs. We find it is good at fitting the real CTR distributions and achieves a better sCTR of 3.194% (8.3% lift).
Method (d): Method (d) adds label smoothing and a weighted sampler, both of which are designed for mitigating the label noise. The weighted sampler makes the model pay more attention to the samples whose impression numbers are sufficient, while label smoothing aims to improve the label reliability. These two practical methods further improve the sCTR to 3.219%, a lift of 9.1% in total.
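The relative lifts quoted for each ablation step follow directly from the sCTR row of Table 2. The short computation below reproduces them; the helper name `relative_lift` is ours and only the sCTR values come from the table.

```python
# sCTR values (%) copied from Table 2.
sctr = {"Base": 2.950, "(a)": 3.140, "(b)": 3.167, "(c)": 3.194, "(d)": 3.219}

def relative_lift(value, baseline):
    """Relative improvement of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0

lifts = {
    name: round(relative_lift(value, sctr["Base"]), 1)
    for name, value in sctr.items()
    if name != "Base"
}
for name, lift in lifts.items():
    print(f"{name}: sCTR {sctr[name]:.3f}%, lift {lift:.1f}% over Base")
```

With these inputs the lifts for (c) and (d) come out at 8.3% and 9.1%, matching the figures stated in the text.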
Ranking Loss  sCTR (%) 

Pairwise Hinge Loss (Chandakkar et al., 2017)  3.170 
Aesthetics Ranking Loss (Kong et al., 2016)  3.167 
Triplet Loss (Schwarz et al., 2018)  3.115 
Pairwise (Zhao et al., 2019)  3.188 
VAM (Ours)  3.219 
Related Loss Functions: Pairwise and triplet losses are typical loss functions for learning-to-rank problems. (Chandakkar et al., 2017; Kong et al., 2016; Schwarz et al., 2018) adopt a hinge loss, which is used for “maximum-margin” classification between the better candidate and the other one. It only requires the better creative to score higher than the other one by a predefined margin, without considering the exact difference. Our loss functions in Equations 9 and 10 estimate the CTR gaps and produce more accurate differences. (Zhao et al., 2019) employ the method of (Burges et al., 2005) as their pairwise framework. In contrast to pairwise learning, we treat one product as a training sample and use its list of creatives as instances. This is more efficient and better suited to real scenarios, which display the best creative for each impression. Thus, our method obtains the leading sCTR in Table 3.
In summary, the proposed listwise method enables the model to focus on learning creative quality and obtains better generalizability. Incorporating pointwise regression and noise mitigation techniques enhances the model's capacity to fit real-world data.
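The difference between a margin-based pairwise loss and a listwise top-1 objective can be made concrete with a small sketch. Equations 9 and 10 are not reproduced in this section, so the listwise loss below is a generic top-1 cross-entropy in the spirit of the listwise approach cited in §4.2; the temperature scaling of CTRs, the function names, and the example numbers are assumptions for illustration only.

```python
import math

def pairwise_hinge(score_better, score_worse, margin=1.0):
    """Hinge loss: zero as soon as the better item leads by `margin`."""
    return max(0.0, margin - (score_better - score_worse))

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_top1_loss(scores, ctrs, temperature=0.01):
    """Cross-entropy between top-1 distributions of CTRs and scores.

    The target distribution is softmax(ctr / temperature); the
    temperature is a hypothetical choice, not taken from the paper.
    """
    target = softmax([c / temperature for c in ctrs])
    predicted = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

ctrs = [0.03, 0.02, 0.01]         # hypothetical CTRs of three creatives
calibrated = [3.0, 2.0, 1.0]      # scores whose gaps mirror the CTR gaps
exaggerated = [30.0, 20.0, 10.0]  # same order, much larger gaps

for scores in (calibrated, exaggerated):
    hinge = sum(pairwise_hinge(scores[i], scores[i + 1]) for i in range(2))
    print(hinge, round(listwise_top1_loss(scores, ctrs), 3))
```

Both score lists rank the creatives identically and satisfy the margin, so the hinge loss is zero for each; the listwise loss is lower for the list whose score gaps mirror the CTR gaps, which is the sense in which it accounts for the exact differences.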
5.5. Hyperparameter Settings
Loss weight in Equation 11. We tune the hyperparameters on the validation set. The weight in Equation 11 controls the contribution of the pointwise auxiliary loss. According to the validation results (see Table 4), we take 0.5, which gives the highest validation sCTR. This is consistent with our hypothesis that the ranking loss should play a more important role in the creative ranking task.
Loss weight in Equation 11  0.0  0.1  0.5  1.0  2.0 

Validation sCTR(%)  3.15  3.15  3.17  3.16  3.13 
Test sCTR(%)  3.17  3.19  3.22  3.18  3.15 
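The selection rule described above, choosing the weight by validation sCTR and only then reading off the test sCTR, can be written out directly from the entries of Table 4. The variable names are ours.

```python
# Values copied from Table 4.
weights = [0.0, 0.1, 0.5, 1.0, 2.0]
validation_sctr = [3.15, 3.15, 3.17, 3.16, 3.13]
test_sctr = [3.17, 3.19, 3.22, 3.18, 3.15]

# Select the weight using validation results only.
best_index = max(range(len(weights)), key=lambda i: validation_sctr[i])
best_weight = weights[best_index]

print(f"selected weight: {best_weight}")
print(f"validation sCTR: {validation_sctr[best_index]}%")
print(f"test sCTR at the selected weight: {test_sctr[best_index]}%")
```

In this table the weight selected on the validation set also yields the highest test sCTR, and the weight of 0.0 (no pointwise term) and the weight of 2.0 (pointwise term dominating) both perform worse.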
Slope \ Offset (Equation 21)  125  150  175 

30  3.27%(3.32%)  3.28%(3.33%)  3.28%(3.31%) 
50  3.27%(3.31%)  3.28%(3.32%)  3.27%(3.32%) 
100  3.27%(3.31%)  3.27%(3.31%)  3.27%(3.31%) 
Slope and offset of the fusion weight in Equation 21. Two hyperparameters control the slope and offset of the fusion weight in Equation 21. The optimal values vary across real-world platforms (e.g., the offset is set to 150, around the mean impression number of each creative). We find that the final performance is not sensitive to these hyperparameters (see Table 5). We choose and in our experiments.
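To give some intuition for how a slope and an offset can shape a fusion weight, the sketch below uses a logistic function of the impression count. Equation 21 is not reproduced in this section, so the logistic form, the function names, and the default values are assumptions for illustration and may differ from the actual definition.

```python
import math

def fusion_weight(impressions, slope=50.0, offset=150.0):
    """Hypothetical fusion weight in (0, 1) that grows with impressions.

    `offset` is the impression count at which the weight equals 0.5;
    `slope` controls how quickly the weight moves from 0 towards 1.
    """
    return 1.0 / (1.0 + math.exp(-(impressions - offset) / slope))

def fused_score(shared_score, specific_score, impressions,
                slope=50.0, offset=150.0):
    """Blend shared and creative-specific predictions by the weight."""
    lam = fusion_weight(impressions, slope, offset)
    return lam * specific_score + (1.0 - lam) * shared_score

for n in (0, 150, 1000):
    print(n, round(fusion_weight(n), 3))
```

Under this assumed form, a creative with few impressions relies mostly on the shared parameters, and the creative-specific parameters take over once its impression count passes the offset.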
5.6. Case Study
Strategy Visualization. We show two typical cases that exhibit how the strategies change over time. Figure 6(a) shows a case in which HBM receives a proper prior. We expect the best creative to have the largest display probability among the candidates. If this expectation is satisfied, a blue bar is shown; otherwise, an orange bar is shown. HBM grants most impression opportunities to creative C5 from the first day, while the other two methods spend 2 days finding the best creative. In the other case, which receives an incorrect prior (Figure 6(b)), HBM adjusts its decision by considering the online feedback. The interactions help to revise the prior knowledge and fit the real-world feedback. From this comparison, we find that HBM makes good use of visual priors and adjusts flexibly according to the feedback signals.
CNN Visualization. Besides ranking performance, we would like to gain further insight into the learned VAM. To this end, we visualize the response of our VAM according to the activations on the high-level feature maps, and the resulting visualization is shown in Figure 7. We find that, by learning from the creative ranking, the CNN pays attention to different regions adaptively, including products, models and the text on the creative. As shown in the second row of Figure 7, the VAM draws more attention to the models than to the products. This may be because products endorsed by celebrities are more attractive than products displayed on their own. Besides, some textual information, such as descriptions and discount information, can also attract customers.
6. Conclusions
In this paper, we propose a hybrid bandit model with visual priors. To the best of our knowledge, this is the first work that formulates creative ranking as an E&E problem with visual priors. The VAM adopts a listwise ranking loss function for ordering creatives by quality based only on their contents. In addition to the ability of visual evaluation, we extend the model so that it can be updated when receiving feedback from online scenarios; we call this extension HBM. Last but not least, we construct and release a novel large-scale creative dataset named CreativeRanking. We would like to draw more attention to this topic, which benefits both the research community and the user experience of websites. We carried out extensive experiments, including performance comparison, ablation study and case study, demonstrating the solid improvements of the proposed model.
References

Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16–21 June 2013, JMLR Workshop and Conference Proceedings, Vol. 28, pp. 127–135.
Finite-time analysis of the multi-armed bandit problem. Mach. Learn. 47 (2–3), pp. 235–256.
The impact of visual appearance on user response in online display advertising. In Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, April 16–20, 2012 (Companion Volume), A. Mille, F. L. Gandon, J. Misselis, M. Rabinovich, and S. Staab (Eds.), pp. 457–458.
Learning to rank using gradient descent. In Machine Learning, Proceedings of the Twenty-Second International Conference (ICML 2005), Bonn, Germany, August 7–11, 2005, L. D. Raedt and S. Wrobel (Eds.), ACM International Conference Proceeding Series, Vol. 119, pp. 89–96.
Learning to rank: from pairwise approach to listwise approach. In Machine Learning, Proceedings of the Twenty-Fourth International Conference (ICML 2007), Corvallis, Oregon, USA, June 20–24, 2007, Z. Ghahramani (Ed.), ACM International Conference Proceeding Series, Vol. 227, pp. 129–136.
Combining text and image data for product recommendability modeling. In 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, December 9–12, 2019, pp. 5992–5994.
A computational approach to relative aesthetics. CoRR abs/1704.01248.
Deep CTR prediction in display advertising. In Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15–19, 2016, A. Hanjalic, C. Snoek, M. Worring, D. C. A. Bulterman, B. Huet, A. Kelliher, Y. Kompatsiaris, and J. Li (Eds.), pp. 811–820.
Multimedia features for click prediction of new ads in display advertising. In The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, Q. Yang, D. Agarwal, and J. Pei (Eds.), pp. 777–785.
ImageNet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20–25 June 2009, Miami, Florida, USA, pp. 248–255.
NIMA: neural image assessment. IEEE Trans. Image Process. 27 (8), pp. 3998–4011.
An introduction to deep reinforcement learning. Found. Trends Mach. Learn. 11 (3–4), pp. 219–354.
Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1050–1059.
Image matters: visually modeling user behaviors using advanced model server. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2087–2095.
Bandit algorithms in interactive information retrieval. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR 2017, Amsterdam, The Netherlands, October 1–4, 2017, J. Kamps, E. Kanoulas, M. de Rijke, H. Fang, and E. Yilmaz (Eds.), pp. 327–328.
Bandit algorithms in recommender systems. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 574–575.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Photo aesthetics ranking network with attributes and content adaptation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Lecture Notes in Computer Science, Vol. 9905, pp. 662–679.
A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 26–30, 2010, M. Rappa, P. Jones, J. Freire, and S. Chakrabarti (Eds.), pp. 661–670.
Category-specific CNN for visual-aware CTR prediction at JD.com. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2686–2696.
Image feature learning for cold start problem in display advertising. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25–31, 2015, Q. Yang and M. J. Wooldridge (Eds.), pp. 3728–3734.
Automatic differentiation in PyTorch.
Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80.
Deep Bayesian bandits showdown: an empirical comparison of Bayesian deep networks for Thompson sampling. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
Learning to optimize via posterior sampling. Math. Oper. Res. 39 (4), pp. 1221–1243.
Mushroom records drawn from the Audubon Society field guide to North American mushrooms. GH Lincoff (Pres), New York.
Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science 36 (4), pp. 500–522.
Will people like your image? Learning the aesthetic space. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2048–2057.
Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
Click-through rate estimation for rare events in online advertising. In Online multimedia advertising: Techniques and technologies, pp. 1–12.
Telepath: understanding users from a human vision perspective in large-scale recommender systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
Hierarchical adaptive contextual bandits for resource constraint based recommendation. In Proceedings of The Web Conference 2020, pp. 292–302.
Aesthetic-based clothing recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23–27, 2018, P. Champin, F. L. Gandon, M. Lalmas, and P. G. Ipeirotis (Eds.), pp. 649–658.
What you look matters?: offline evaluation of advertising creatives for cold-start problem. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM 2019, Beijing, China, November 3–7, 2019, W. Zhu, D. Tao, X. Cheng, P. Cui, E. A. Rundensteiner, D. Carmel, Q. He, and J. X. Yu (Eds.), pp. 2605–2613.