1. Introduction
With the rapid development of the Internet, online ecommerce platforms such as Amazon (Linden et al., 2003) and Taobao (Zhu et al., 2017) have become the main places where people buy and sell products (Evans, 2008, 2009; Goldfarb and Tucker, 2011). Ecommerce advertising is a critical marketing tool for advertisers to efficiently deliver products to potential buyers, and advertising revenue has become the main source of income for ecommerce platforms (Lahaie et al., 2007; Evans, 2008, 2009; Goldfarb and Tucker, 2011).
As the largest ecommerce market in China, Taobao has established one of the most advanced online advertising systems in the world (Zhu et al., 2017), which displays tens of millions of online ads through realtime bidding (RTB). Besides traditional manual bidding, the Taobao display advertising system provides advertisers with convenient advertising channels and effective optimization tools for auto bidding (Zhu et al., 2017; Wu et al., 2018; Yang et al., 2019). This is possible because the Taobao platform simultaneously plays the roles of the demand-side platform (DSP) and the supply-side platform (SSP) in the online advertising system (Zhu et al., 2017), and such a closed-loop digital ecosystem enables the ecommerce platform to collect historical information about users and advertisements at the same time. Because the advertising results achieved by the optimization tools vary with the preset optimization goals, the platform should first confirm advertisers’ marketing demands and target crowds before recommending detailed strategies (Zhang et al., 2012, 2014, 2016; Cai et al., 2017; Ren et al., 2017; Kitts et al., 2017; Maehara et al., 2018; Wu et al., 2018; Yang et al., 2019). Therefore, in the actual advertising process, advertisers are required to set up their ad units before authorizing the realtime bidding system to conduct online auctions. The setup process, shown in Figure 1, includes three steps: (1) Set advertising demands, such as the number of clicks, gross merchandise volume (GMV), etc. (2) Select targeted crowds. A targeted crowd is a group of customers for whose impressions the advertiser decides to participate in the online auction. (3) Set up bidding strategies. Advertisers can either bid manually by setting a fixed price for all the impressions in a specific crowd or apply optimization tools provided by the platform to optimize for their specific marketing demands. After the setup stage, the ad units participate in realtime bidding.
However, according to our observations on the Taobao display advertising platform, such a self-service advertising process introduces obstacles due to the information asymmetry between advertisers and advertising platforms. On the one hand, manual-bidding advertisers have no access to the realtime information of online transactions and cannot properly set their bids and target crowds, leading to poor advertising performance and wasted budget. On the other hand, for auto-bidding advertisers, the lack of information collection channels and the overly simple modeling of advertisers’ demands mean that the advertising platform cannot accurately identify and fulfill the advertiser’s real advertising demands with the optimization tools. A recent survey reveals that a large number of new advertisers have left the advertising platform because their marketing demands were not well satisfied (Guo et al., 2020). To solve the above problems, we propose a recommender system toward advertisers in this work and initiate a new research direction: designing appropriate recommender systems for advertisers.
An intuitive way to implement the advertiser-side recommender system is to migrate the work on user-side recommendation (Sarwar et al., 2001; Basilico and Hofmann, 2004; Schafer et al., 2007; Pazzani and Billsus, 2007; Lops et al., 2011; Lu et al., 2015; Cheng et al., 2016; Covington et al., 2016; Guo et al., 2017; Zhou et al., 2018; Zhu et al., 2018), which focuses on matching products and users based on the learned correlation between the products in the entire corpus and the visiting users. However, simply migrating this work to the advertiser side would encounter the following challenges: (1) Users are shown the complete information of the products and can express their interests and preferences explicitly by clicking or purchasing products. In existing advertising platforms, by contrast, there is no forecast of realtime strategy performance, so advertisers cannot accurately distinguish good strategies from bad ones, which makes it hard for them to express their demands. Without the ground truth of advertiser demand, we need to find a way to model and learn advertisers’ demands other than the standard supervised learning methods used in user-side recommendation. (2) Most research on the user side is based on a discrete corpus, namely the commodity library. However, the candidate corpus on the advertiser side is a complex, high-dimensional continuous space, because advertisers can bid for any visiting users at different prices, generating different advertising performance. (3) There is no data yet for designing a recommender system for advertisers. Moreover, the cost for advertisers to adopt recommendations is high, so it is expensive for the recommender system to explore advertisers’ marketing demands.
To overcome the above difficulties, we first design a prototype system to help advertisers optimize their bids and crowds. Online evaluations verify that the prototype system improves advertisers’ advertising performance and further increases the revenue of the platform. Building on the prototype system, we further design an advertising strategy recommender system driven by advertisers’ demands. To address the first challenge, we show the predicted advertising performance to advertisers through simulated bidding, so that advertisers know the predicted advertising performance in advance and can express their preferences through adoption behaviors. For the second challenge, we model advertisers’ demands and utility formally. The modeled demand not only reflects the advertiser’s preference for advertising performance indicators but also serves as an objective function for the advertising platform to optimize the advertiser’s advertising performance. For the last challenge, to effectively learn advertisers’ demands from the advertiser’s profile and interaction information, we model the strategy recommendation problem as a contextual bandit, using the advertiser’s adoption behavior as environmental feedback. To balance exploration and exploitation, we use Dropout in a neural network to learn the demands of advertisers dynamically. We summarize the contributions of this work as follows:
As far as we know, we are the first to consider the problem of strategy recommendation in the advertiser scenario. To fill the gap between understanding and optimizing advertisers’ demands, we model advertisers’ demands and their marketing optimization goals explicitly, and model the strategy recommendation problem based on advertisers’ demands as an advertiser’s personalized advertising performance optimization problem. To learn the demands of advertisers, we use advertisers’ adoption behavior as an indicator to measure the effectiveness of recommendations, which can also be used as the advertisers’ preference information for the demand learning algorithm.
To efficiently recommend strategies for advertisers, we not only need to use the existing information to optimize the advertiser’s adoption rate but also need to properly explore the advertiser’s potential demands. Therefore, we model the problem as a contextual bandit problem. We regard advertisers on the platform as an overall environment, and use a neural network as the agent to estimate the advertiser’s demand, which is treated as the action space. Advertiser adoption behavior is used as environmental feedback. The uncertainty of the neural network is expressed through Dropout, which can balance exploration and exploitation during strategy recommendation.
We deploy a system for bid price and crowd optimization on the Taobao platform. The online A/B test demonstrates the effectiveness of deploying a recommender system toward advertisers, which improves the advertiser’s advertising performance and increases the revenue of the platform. We then build an offline simulation environment based on Alibaba online advertising bidding data and conduct a series of experiments, showing that the strategy recommender system we design can effectively learn the demands and improve the recommendation adoption rate of advertisers. We verify through comparative experiments that Dropout better balances exploiting existing demand information against exploring the potential demands of advertisers.
The rest of the paper is organized as follows. In Section 2, we introduce related work on user-side recommendation and realtime bidding algorithms. In Section 3, we introduce the prototype system and the new strategy recommender system designed on top of it. In Section 4, we present the detailed algorithm design. The experimental results and the conclusion are given in Section 5 and Section 6, respectively.
2. Related Work
2.1. Recommender System
In recent years, the recommendation problem has been extensively studied in both academia and industry. Different recommender systems have been deployed in various ecommerce applications (Cheng et al., 2016; Covington et al., 2016; Guo et al., 2017; Zhou et al., 2018; Zhu et al., 2018). The recommender system can solve the problem of information overload, help users find useful information quickly, and provide personalized services (Lu et al., 2015). Typical recommendation technologies include collaborative filtering, content-based filtering, and hybrid filtering. Collaborative filtering (Sarwar et al., 2001; Schafer et al., 2007) is a classic method widely used in designing recommender systems. It finds the correlation between recommended products and then makes recommendations based on the correlation. Content-based filtering (Pazzani and Billsus, 2007; Lops et al., 2011) depends on the user’s historical choices, user profiles, and product information. It recommends products to users based on the similarity between users and products. Hybrid filtering (Basilico and Hofmann, 2004) combines collaborative filtering and content-based filtering to improve the accuracy of recommendations.
There are two main differences between the advertiser’s recommendation problem and the user’s recommendation problem. First, user recommendation frameworks are usually based on a unified product library for feature extraction and correlation analysis. In an ecommerce platform, all commodities recommended to users constitute the platform’s commodity library. However, in the advertiser-side recommendation problem, the recommended product is an abstract bidding strategy, which is hard to analyze simply through feature extraction. Second, the bidding environment information that advertisers can obtain is very scarce compared with what the advertising platform has, and the recommendation problem for advertisers should not only solve the problem of information overload but also help advertisers optimize their advertising performance. Therefore, the advertiser’s demand learning and the design of optimization algorithms should be combined with existing realtime bidding research.
2.2. Realtime Bidding
In the realtime bidding scenario, researchers have studied in depth the problem of auto bidding to optimize preset goals (Zhang et al., 2012; Ren et al., 2017; Maehara et al., 2018). The classic realtime bidding problem is an online optimization problem with a single optimization goal, e.g. GMV, under a budget constraint (Zhang et al., 2014, 2016; Cai et al., 2017; Wu et al., 2018). Zhang et al. proposed an offline optimal bidding strategy that bids $v/\lambda$ for each impression, where $v$ is the value of the current impression and $\lambda$ is a fixed value decided by the user traffic environment (Zhang et al., 2016). Wu et al. further applied a model-free reinforcement learning method to adjust the value of $\lambda$ in a timely manner in the real online scenario, aiming to pace the spending of the budget based on the realtime environment (Wu et al., 2018). Besides, subsequent research efforts attempt to extend the single budget constraint problem to problems with various optimization goals under multiple constraints (Zhang et al., 2016; Kitts et al., 2017; Yang et al., 2019). Zhang et al. used feedback control methods to adjust bidding strategies based on realtime performance, and met the auction winning rate (AWR) and pay per click (PPC) constraints (Zhang et al., 2016). Yang et al. modeled the GMV optimization problem with budget and PPC constraints as an online linear programming problem (Yang et al., 2019).
Based on the current research, the auto bidding tools provided by the platform can handle the optimization problem well as long as the optimization goal is known. However, in a display advertising system, advertisers often have a variety of specific demands for different ads, leading to different optimization goals. For example, advertisers may be more inclined to maximize the number of product impressions for new products in a store, while they would like to maximize the GMV of products in promotional activities. Therefore, when using realtime bidding algorithms for auto bidding, a key step is to figure out the advertiser’s marketing demands and the corresponding optimization goals, which will be further introduced in the next section.
3. Preliminaries
In this section, we first introduce our current advertiser recommender system, and formally define the advertiser’s demand and its optimization goals. We then introduce the designed strategy recommender system based on the existing system, and use the contextual bandit to model the strategy recommendation problem based on advertisers’ demands.
3.1. Prototype System
To reduce the high trial-and-error costs for advertisers and to explore potential marketing opportunities, we deployed a recommender system for them, as shown in Figure 2. The system consists of two parts: a bid optimization module and a crowd optimization module. The bid optimization module optimizes the advertiser’s bids based on her selected crowds, while the crowd optimization module explores potential users for the advertiser’s ad unit. Our current system is based on a bid simulator, which estimates the advertising performance of possible strategies through simulated bidding on the advertiser’s historical bidding logs.
Bid Optimization Module. The bid optimization module recommends proper bids for advertisers’ selected crowds. To search for high-quality bids for a crowd, the bid optimization module first samples some possible bids for the crowd. Then we predict the advertising performance of each (crowd, bid) pair through the bid simulator, generating candidates for the recommendation. Finally, the bid optimization module ranks the candidate items by some predesigned rules, such as the relative increment in CTR (or CVR) compared with the average CTR (or CVR). The top K (crowd, bid) candidates of each ad unit are recommended to the advertiser, together with their profile, historical advertising performance, current bids, suggested bids, and predicted advertising performance.
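The sample, simulate, rank, and select steps described above can be sketched as follows. This is a minimal illustration rather than the deployed implementation: the names `Candidate`, `recommend_bids`, and `toy_simulator` are hypothetical, the toy simulator is an invented stand-in for the bid simulator, and the relative CTR increment is used as the only ranking rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    crowd: str
    bid: float
    predicted_ctr: float
    relative_lift: float  # (predicted CTR - average CTR) / average CTR


def recommend_bids(crowd: str,
                   sampled_bids: List[float],
                   simulate_ctr: Callable[[str, float], float],
                   average_ctr: float,
                   k: int) -> List[Candidate]:
    """Rank sampled bids for one crowd by relative CTR increment, keep the top k."""
    candidates = []
    for bid in sampled_bids:
        ctr = simulate_ctr(crowd, bid)            # query the (stand-in) bid simulator
        lift = (ctr - average_ctr) / average_ctr  # predesigned ranking rule
        candidates.append(Candidate(crowd, bid, ctr, lift))
    candidates.sort(key=lambda c: c.relative_lift, reverse=True)
    return candidates[:k]


def toy_simulator(crowd: str, bid: float) -> float:
    # Hypothetical response curve, NOT the real simulator: CTR peaks at bid 1.0.
    return 0.05 - 0.02 * (bid - 1.0) ** 2


top = recommend_bids("example_crowd", [0.5, 0.8, 1.0, 1.5],
                     toy_simulator, average_ctr=0.04, k=2)
for c in top:
    print(c.crowd, c.bid, round(c.predicted_ctr, 4), round(c.relative_lift, 3))
```

In a real system the ranking rule would be one of several predesigned rules (for example CVR-based), and the simulator would replay historical bidding logs instead of evaluating a closed-form curve.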
Crowd Optimization Module. The crowd optimization module explores potential user crowds for advertisers. It first merges similar crowds (crowds of the same type for products in the same category) and calculates the average historical KPIs of similar crowds through the bid simulator. Then the crowd optimization module ranks these merged crowds based on some human-designed indicators (e.g. ). For each ad unit, the top K crowds are recommended. For each recommended crowd, the crowd optimization module predicts the number of unique visitors (UV) covered based on bidding logs. It also calculates a suggested bid based on the average bid price in similar crowds, and shows the crowd information, suggested bid, and UV to the advertiser.
The prototype system increases the ad platform’s average revenue per user (ARPU) by in our online A/B test in May 2020. The implementation of the system, the experimental settings, and the results of the online evaluation are given in Section 5.1. The current system can be improved in two respects: (1) The system ranks the candidate items based on deterministic metrics (i.e. relative increment in some KPIs or a fixed weighted average). However, advertisers have different marketing demands, and ranking based on advertisers’ demands could better fulfill their personalized needs. (2) The current system gives recommendations based on a constant bid. In realtime bidding scenarios, dynamic bidding strategies (Zhang et al., 2014, 2016; Cai et al., 2017; Wu et al., 2018) replace the constant bid strategy for better advertising performance, so the system should be able to recommend such dynamic bidding strategies.
3.2. Strategy Recommender System
To further improve our current prototype system, we first formulate our optimization problem, and design a new strategy recommender system. Most advertisers on Taobao are merchants who not only advertise but also sell products, and the Taobao display advertising system can collect user and advertisement data at the same time to help advertisers conduct simulated bidding and provide recommendations. We define the performance of an ad unit under a budget constraint (or additional constraints) as $\mathbf{p} = [p_1, p_2, \dots, p_n]$, where $n$ represents the number of KPIs, and $p_i$ represents the value of the $i$th KPI. Since advertisers have different preferences for different KPIs, we define an advertiser’s demand as a demand weight vector $\mathbf{w} = [w_1, w_2, \dots, w_n]$, where $w_i$ is the advertiser’s preference weight for the $i$th KPI. For a specific ad unit, assuming that the advertiser’s demand vector is $\mathbf{w}$, an intuitive optimization goal of the advertiser is $\mathbf{w}^\top \mathbf{p}$, which is a weighted sum of advertising performance (Guo et al., 2020). By defining $\mathbf{b}$ as the advertiser’s bidding strategy, for advertisers whose demands are $\mathbf{w}$, recommending the optimal bidding strategy is to solve the following optimization problem:
$$\max_{\mathbf{b}} \ \mathbf{w}^\top \mathbf{p}(\mathbf{b}) \qquad (1)$$
where the resulting advertising performance $\mathbf{p}^*$ is the optimal result that the realtime bidding algorithm can achieve based on $\mathbf{w}$. To solve optimization problem (1), an intuitive solution is to use realtime bidding algorithms (Zhang et al., 2014, 2016; Cai et al., 2017; Wu et al., 2018) based on $\mathbf{w}$. Therefore, the demand vector $\mathbf{w}$ in fact represents the optimization goals of the advertiser, and guides the advertising platform to obtain the most satisfactory bidding strategies for advertisers. However, due to the information asymmetry between the platform and advertisers, we cannot directly know $\mathbf{w}$.
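To make the role of the demand vector concrete, the sketch below scores candidate strategies by the weighted sum of their KPIs and picks the best one. It is an illustration under stated assumptions: the continuous bidding space is replaced by a small finite candidate set, the KPI values are made-up normalized numbers standing in for simulated bidding results, and the names `utility` and `best_strategy` are hypothetical.

```python
from typing import Dict, List


def utility(w: List[float], p: List[float]) -> float:
    """Weighted sum of KPI values under demand weights w."""
    return sum(wi * pi for wi, pi in zip(w, p))


def best_strategy(w: List[float], simulated: Dict[str, List[float]]) -> str:
    """Return the candidate strategy whose simulated performance maximizes utility."""
    return max(simulated, key=lambda name: utility(w, simulated[name]))


# Hypothetical simulated performance, KPIs ordered as [clicks, GMV], both normalized.
simulated_performance = {
    "strategy_a": [0.2, 0.9],
    "strategy_b": [0.8, 0.3],
}

click_oriented = [0.9, 0.1]  # advertiser who mostly cares about clicks
gmv_oriented = [0.1, 0.9]    # advertiser who mostly cares about GMV

print(best_strategy(click_oriented, simulated_performance))
print(best_strategy(gmv_oriented, simulated_performance))
```

The same candidate set yields different recommendations for the two advertisers, which is why the platform needs an estimate of the demand weights before it can choose a strategy.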
To overcome the above obstacles, we design a new recommender system framework, shown in Figure 3, to learn advertisers’ demands from the interaction between advertisers and the recommender system. For each advertiser request, the system proceeds as follows: (1) The recommendation module receives the advertiser’s request and obtains the characteristics of the advertiser’s ad units. (2) For each candidate ad unit, the recommendation module generates an estimated demand vector and sends constraint information such as the ad unit budget to the bidding module, inquiring about the advertising performance of simulated bidding. (3) The bidding module generates a specific bidding strategy based on the estimated demand and constraint information, and returns the result of simulated bidding obtained from the bid simulator to the recommendation module. (4) The recommendation module displays the result of the simulated bidding to the advertiser, which is easier to understand than detailed bidding strategy settings. (5) The advertiser chooses whether to adopt the recommendation, and the recommendation module collects this feedback. If the advertiser adopts the recommended strategy, the bidding module conducts realtime bidding based on it. (6) The bidding module periodically sends realtime bidding results to the advertiser after the adoption.
According to the system process, when the platform recommends a bidding strategy to an advertiser, the advertiser adopts the recommendation if they are satisfied with the expected performance of the strategy, and rejects it otherwise. The adoption behaviors reflect advertisers’ satisfaction with and interest in the recommended items, and the platform improves the recommendation service by maximizing the adoption rate. However, for a newly developed system, a significant number of advertisers are likely to be entirely new, without any historical adoption record, which is known as the cold-start problem (Park et al., 2006). It is therefore indispensable to explore new advertisers’ interests in their bidding space. In the ecommerce platform, acquiring such information might reduce advertiser satisfaction in the short term and could be expensive, which raises the problem of balancing two competing goals: maximizing advertiser satisfaction in each round, and actively gathering information about advertisers’ demands for the long-term development of the recommender system.
The above problem is well known as a feature-based exploitation/exploration problem. In this work, we formulate the problem of optimizing the advertiser’s adoption rate as a contextual bandit problem, a principled approach in which an agent recommends items sequentially based on contextual information about the ad units while simultaneously adapting its action selection strategy based on the advertisers’ feedback. We illustrate our modeling in detail in the next subsection.
3.3. Contextual Bandit Modeling
In this subsection, we formally define the contextual bandit problem, and show how it can model the advertising strategy recommendation problem. In this problem, the agent is the strategy recommender system, and all advertisers visiting the system constitute the environment that the agent faces. When an advertiser enters the recommendation web page, the agent needs to estimate the appropriate demand vector $\mathbf{w}$ for each of the advertiser’s ad units to be recommended, based on the advertiser’s profile and historical behaviors observed by the agent. With the advertiser’s demand, we can easily derive the bidding strategy and simulate the corresponding advertising performance. Then, for each ad unit, the agent displays the predicted advertising performance to the advertiser. The advertiser’s adoption or rejection of a recommendation can be regarded as the reward that the environment returns to the agent. We summarize the state $s$, action $a$, and reward $r$ of the contextual bandit problem as follows:

State $s$: Information related to the ad unit to be recommended that the platform can observe, such as the advertiser profile (current balance, historical advertising performance, etc.), the ad unit profile (product category, product price, etc.), the advertiser’s historical adoption-related information (recommended strategy information, adoption or not, etc.), and scenario information (recommendation scenario, time, etc.). We use the feature vector $\mathbf{x}$ to represent the state $s$.

Action $a$: Estimate the advertiser’s demand vector $\mathbf{w}$. The action space of the agent is a high-dimensional continuous space. The agent then sends a request to the realtime bidding system according to the constraint information in the state and its selected action, to obtain the predicted advertising performance.

Reward $r$: We set the reward to be $1$ when the advertiser adopts the recommended strategy, and $0$ when the advertiser does not adopt it. The expected reward $\mathbb{E}[r \mid s, a]$ represents the expected adoption rate under state $s$ and action $a$.
Based on the above modeling, the agent continuously recommends strategies for visiting advertisers. We number the discrete recommendation rounds as $t = 1, 2, \dots, T$. In the $t$th round, the contextual bandit algorithm works as follows:

The agent observes the feature vector $\mathbf{x}_t$ of the current ad unit.

Based on $\mathbf{x}_t$, the agent predicts a demand vector $\mathbf{w}_t$ and transmits the constraint information in $\mathbf{x}_t$ together with $\mathbf{w}_t$ to the realtime bidding system. The realtime bidding system performs simulated bidding and returns the result $\mathbf{p}_t$ to the agent. The agent then recommends $\mathbf{p}_t$ to the advertiser and gets reward $r_{t,\mathbf{w}_t}$.

The observation of the current round is stored to help the agent update its future strategy.
In the above process, the T-trial payoff for $T$ rounds of recommendations is $\sum_{t=1}^{T} r_{t,\mathbf{w}_t}$. Similarly, we define the optimal expected T-trial payoff for $T$ rounds of recommendations as $\mathbb{E}\big[\sum_{t=1}^{T} r_{t,\mathbf{w}_t^*}\big]$, where the demand vector $\mathbf{w}_t^*$ is the demand of the advertiser in the $t$th round, and the optimal bidding strategy based on $\mathbf{w}_t^*$ yields the results with maximum utility for the advertiser in the bidding space. Our goal in designing the agent is to maximize the expected T-trial payoff $\mathbb{E}\big[\sum_{t=1}^{T} r_{t,\mathbf{w}_t}\big]$ of the $T$ rounds of recommendations. Equivalently, we can regard the goal as minimizing the expected T-trial regret for $T$ rounds of recommendations:
$$R(T) = \mathbb{E}\Big[\sum_{t=1}^{T} r_{t,\mathbf{w}_t^*}\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} r_{t,\mathbf{w}_t}\Big] \qquad (2)$$
The expected T-trial regret $R(T)$ can be interpreted as the gap between the expected number of adoptions under the optimal recommendation strategy and under the actual recommendation strategy over $T$ rounds of recommendations, and $R(T)/T$ is the corresponding gap between the expected adoption rates of the optimal and the actual recommendation strategies.
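The round-by-round interaction and the regret can be illustrated with a toy simulation. Everything in the environment below is an assumption made for the sketch: the hidden demand has two components, the feature vector simply equals the hidden demand (so that an oracle policy is trivial to write), `simulate_bidding` and `adoption_prob` are invented stand-ins for the bidding system and the advertiser, and expected regret is accumulated from the known adoption probabilities.

```python
import math
import random


def simulate_bidding(w):
    # Hypothetical bidding system: the best unit-norm performance vector for demand w.
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w]


def adoption_prob(w_true, p):
    # Hypothetical advertiser: adoption is more likely when utility w_true . p is high.
    u = sum(a * b for a, b in zip(w_true, p))
    return 1.0 / (1.0 + math.exp(-(4.0 * u - 2.0)))


def run_bandit(policy, T, seed=0):
    rng = random.Random(seed)
    payoff, regret, history = 0, 0.0, []
    for _ in range(T):
        g = rng.random()
        w_true = [g, 1.0 - g]          # hidden demand of this round's advertiser
        x = list(w_true)               # toy assumption: features reveal the demand
        w_hat = policy(x)              # agent predicts a demand vector
        p = simulate_bidding(w_hat)    # simulated bidding result shown to advertiser
        prob = adoption_prob(w_true, p)
        r = 1 if rng.random() < prob else 0   # reward: adopted (1) or rejected (0)
        history.append((x, w_hat, p, r))      # stored for later model updates
        payoff += r
        best = adoption_prob(w_true, simulate_bidding(w_true))
        regret += best - prob          # expected per-round gap to the optimal action
    return payoff, regret, history


uniform_payoff, uniform_regret, uniform_history = run_bandit(lambda x: [0.5, 0.5], T=500)
oracle_payoff, oracle_regret, _ = run_bandit(lambda x: x, T=500)
print(uniform_payoff, round(uniform_regret, 3), oracle_payoff, oracle_regret)
```

A policy that ignores the context accumulates positive regret, while the oracle that recovers the true demand has zero regret; a learned agent sits between the two and should move toward the oracle as feedback accumulates.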
4. Algorithm Design
In the classic contextual bandit algorithm, the agent learns the expected reward of each arm (i.e. each optional action), and chooses which arm to pull based on some strategy. In our problem, the reward of the action is the advertiser’s adoption behavior, and we need to maximize the reward by recommending the strategy most likely to be adopted by the advertiser. In classic contextual bandit algorithms, the action set is usually discrete and finite, while the action space of strategy recommendation is a high-dimensional continuous space, so we cannot output the estimated reward of every action at the same time. This makes it difficult to directly apply existing bandit algorithms such as $\epsilon$-greedy or upper confidence bound. To solve this problem, we decompose the action reward learning process into two steps: (1) Set up the action selection strategy based on observable information. (2) Establish the relationship between (estimated action, simulated bidding result) and the advertiser’s adoption rate.
To model the relationship between the advertiser’s information and the bandit agent’s selected action, we first build the connection between advertiser information and their advertising demand. More specifically, based on the action selection strategy, we obtain the demand vector $\mathbf{w}$ under the state $s$ that the platform can observe as:
$$\mathbf{w} = f(\mathbf{x}) \qquad (3)$$
where the function $f$ is the mapping between the environmental state $s$ and the demand $\mathbf{w}$, the input $\mathbf{x}$ of $f$ is the representation of the state $s$, and the output of the network is $\mathbf{w}$. We use a multilayer perceptron (MLP) to build this mapping, and the detailed action selection strategy based on the model is introduced later. In our problem, the label of the network (i.e. the reward of the action) is the adoption behavior of the advertiser. Intuitively speaking, let $\mathbf{p}^*$ be the optimal bidding result under $\mathbf{w}$; then the value of $\mathbf{w}^\top \mathbf{p}^*$ reflects the utility of the advertising performance that an advertiser with demand $\mathbf{w}$ can obtain on the advertising platform. Since the advertiser’s adoption rate is positively correlated with $\mathbf{w}^\top \mathbf{p}^*$, we can model the relation between $\mathbf{w}$ and the adoption rate $\hat{y}$ as follows:
$$\hat{y} = \sigma(\mathbf{w}^\top \mathbf{p}^*) \qquad (4)$$
where $\sigma$ is the sigmoid function with value range $(0, 1)$, and the optimal bidding result $\mathbf{p}^*$ based on $\mathbf{w}$ is also part of the model input. In practice, the bidding result can be normalized by the advertiser’s budget level, since we only care about the relative values in $\mathbf{w}$ and $\mathbf{p}^*$.
Based on the above method, the estimation of the action value of the network can be updated through gradient descent. In each iteration, the model first observes the environmental feature $\mathbf{x}$ and estimates the demand $\mathbf{w}$, gets the bidding result $\mathbf{p}^*$ according to $\mathbf{w}$, and predicts the adoption rate $\hat{y}$. Then, we update the parameters of the model through the following loss function:
$$L = -\frac{1}{N} \sum_{(\mathbf{x}, y) \in D} \big[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \big] \qquad (5)$$
where the set $D$ is the data set of size $N$ in this update iteration, and the label $y$ is the advertiser’s real adoption behavior.
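The forward pass described above (features to demand vector, demand vector and bidding result to predicted adoption rate, then a loss against the observed adoption label) can be sketched in plain Python. Several choices here are assumptions made for illustration and are not taken from the text: the network has one hidden ReLU layer, its output is normalized with a softmax, the loss is binary cross-entropy, the weights and the bidding results in `batch` are made-up numbers, and all function names are hypothetical. In a full system the bidding result would come from simulated bidding under the estimated demand, not from a fixed list.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def demand_network(x, W1, b1, W2, b2):
    """Tiny MLP mapping features x to a demand vector (softmax output is an assumption)."""
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, row)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(hi * wi for hi, wi in zip(hidden, row)) + b
              for row, b in zip(W2, b2)]
    return softmax(logits)


def predicted_adoption(w, p_star):
    """Sigmoid of the demand-weighted utility of the bidding result."""
    return sigmoid(sum(wi * pi for wi, pi in zip(w, p_star)))


def cross_entropy(labels, preds, eps=1e-12):
    """Mean binary cross-entropy between adoption labels and predicted adoption rates."""
    total = 0.0
    for y, y_hat in zip(labels, preds):
        y_hat = min(max(y_hat, eps), 1.0 - eps)
        total += y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat)
    return -total / len(labels)


# Hypothetical parameters: 3 features -> 2 hidden units -> 2 KPI weights.
W1, b1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]
W2, b2 = [[1.0, -1.0], [-0.5, 0.5]], [0.0, 0.0]

# Each sample: (features x, normalized bidding result p*, adoption label y); values invented.
batch = [([1.0, 0.0, 2.0], [0.6, 0.4], 1),
         ([0.0, 1.0, 1.0], [0.2, 0.9], 0)]

demands = [demand_network(x, W1, b1, W2, b2) for x, _, _ in batch]
preds = [predicted_adoption(w, p) for w, (_, p, _) in zip(demands, batch)]
loss = cross_entropy([y for _, _, y in batch], preds)
print(demands, preds, loss)
```

Gradient-based training would differentiate this loss with respect to the network parameters; that step is omitted here to keep the sketch dependency-free.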
Action selection strategy. In the context of advertising strategy recommendation, exploration refers to recommending strategies based on new possible demands in order to discover the potential interests of advertisers, while exploitation refers to recommending strategies based on the current demand inference, which is the output of the model. Thompson Sampling is a popular way to solve the exploitation and exploration problem (Thompson, 1933). Generally speaking, Thompson Sampling requires a Bayesian treatment of the model parameters. In each step, Thompson Sampling samples a new set of model parameters and then selects actions based on this set of parameters. This can be seen as a kind of random hypothesis test: the more likely parameters are sampled more frequently and are thus rejected or confirmed more quickly. Specifically, the process of Thompson Sampling is as follows:

(1) Sample a new set of parameters for the model.

(2) Select the action with the highest expected reward based on the sampled parameters.

(3) Update the model and go back to (1).
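As a generic illustration of these three steps, the sketch below runs Thompson Sampling on a simple Bernoulli bandit with a finite set of arms and Beta posteriors. This is the textbook discrete setting, chosen only because it fits in a few lines; it is not the continuous-action, neural-network setting of this paper, and the adoption rates passed to `thompson_sampling` are made-up values.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling over a finite set of arms."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms   # observed adoptions per arm
    failures = [0] * n_arms    # observed rejections per arm
    pulls = [0] * n_arms
    for _ in range(rounds):
        # Step 1: sample one plausible adoption rate per arm from its Beta posterior.
        samples = [rng.betavariate(1 + successes[a], 1 + failures[a])
                   for a in range(n_arms)]
        # Step 2: act greedily with respect to the sampled parameters.
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < true_rates[arm] else 0
        # Step 3: update the posterior of the chosen arm, then repeat.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls, successes, failures


pulls, successes, failures = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)
print(pulls)
```

Early on, wide posteriors make every arm likely to be tried; as evidence accumulates, the samples concentrate and the best arm is selected most of the time, which is the exploration and exploitation balance described above.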
Performing Thompson Sampling in a neural network requires a description of the network’s uncertainty. Bayesian models (MacKay, 1992; Neal, 1995; Blundell et al., 2015) offer a mathematically grounded framework for representing model uncertainty, but usually come with a prohibitive computational cost (Gal and Ghahramani, 2016). To solve this problem, we apply Dropout, a simple but efficient technique for preventing overfitting in neural networks (Srivastava et al., 2014), as a Bayesian approximation method to express the uncertainty of the model, since it has been shown that a nonlinear neural network of any depth with Dropout applied is mathematically equivalent to an approximation of the deep Gaussian process (Gal and Ghahramani, 2016). By implementing Thompson Sampling through Dropout, we can balance exploitation and exploration well during action selection.
5. Experiments
In this section, we first evaluate our current prototype system through online evaluation. We then design a simulation environment and conduct detailed evaluations of the modeling in Section 3 and the algorithm design in Section 4. We summarize our experimental results as follows: (1) The online evaluation indicates the effectiveness of introducing a recommender system toward advertisers, which not only increases the revenue of the ad platform but also improves the advertiser’s advertising performance. (2) We verify the effectiveness of the neural network in the task of advertiser adoption rate optimization, which requires the agent to accurately predict the advertiser’s demand. (3) We verify through comparison experiments that Dropout better balances exploiting existing demand information against exploring the potential demands of advertisers. (4) We verify the generalization ability of the bandit model through an ablation study.
5.1. System Online Evaluation
Metric  Result  Metric  Result 
ARPU  RPM  
Click Number  Payment Number  
CTR  Payment Amount  
Cost  CVR  
ROI  PV  
System Implementation. We have deployed our prototype system mentioned in Section 3.1 in the Taobao display advertising platform since February 2020. The detailed implementation of the system includes three main components: First, we maintain two individual databases for bid optimization module and crowd optimization module that store (crowd, bid) items to be recommended for different ads, the recommended items are updated each day. Second, we implement a bid simulator, which leverages the database about advertisers’ bidding logs sampled from online auctions, to predict the advertising performance using a specific bidding strategy. Finally, we maintain a database that stores the realtime advertising performance of the advertisers, this database also shows the realtime advertising performance to advertisers in the dashboard of the ad platform.
When an advertiser visits the recommendation page, the recommender system first retrieves the ad’s profile and its current advertising performance. The system then selects the recommended strategies (i.e., (crowd, bid) pairs) for each ad and uses the bid simulator to predict the advertising performance of these strategies. Finally, the system shows these recommendations to the advertiser. The advertiser can directly adopt a recommendation, and the system then puts the adopted recommendation into effect.
Evaluation Results. In our online experiment between May 14, 2020 and May 27, 2020, we carefully select advertisers with the same consumption level based on their historical Average Revenue per User (ARPU); the daily cost of advertisers in our experiments constitutes about of the total daily cost of all the advertisers in the platform. Note that none of the advertisers in our experiment had used the recommender system before May 14, 2020. In the A/B test, half of the advertisers using the recommender system in the evaluation period form the test group, and the others form the control group. The average open rate of the bid optimization module and crowd optimization module in the test group is and respectively, indicating that advertisers are willing to use our system. The average adoption rate of ad units in bid recommendation and crowd recommendation is and respectively. We find that advertisers select and adopt only a few suggestions from the recommendation list, carefully and cautiously, which reveals an opportunity to introduce a learning module for personalized recommendation. The ARPU in the test group increases by compared with the control group, indicating that advertisers using the recommender system are more willing to invest in advertising, which in turn increases the revenue of the ad platform. Moreover, as shown in Table 1, the overall advertising performance of ad units in the test group is better than that of ad units in the control group, which verifies that the recommender system can optimize advertisers’ advertising performance.
5.2. Simulation Settings
Compared with machine learning in a standard supervised setting, evaluating bandit algorithms is very difficult (Li et al., 2010). Moreover, bandit algorithms usually require many interactions with the environment, which makes experiments in a real production environment costly because of trial and error (Shi et al., 2019). Therefore, most previous studies verify the effectiveness of an algorithm by building a simulation environment (Ie et al., 2019; Hao et al., 2020). In this section, we introduce the design of our simulation environment, which consists of the advertiser module and the bidding module in the system structure of Figure 3. The advertiser module simulates the interaction between the advertisers and the recommendation module in the real advertising system. The bidding module obtains the expected advertising performance based on ad units’ bidding logs.
Bidding module: The bidding module optimizes the objective function under the budget constraint. The advertising indicators in the simulation environment are PV, Click Number and GMV. We use offline linear programming to derive the optimal bidding strategy (Zhang et al., 2016; Wu et al., 2018) based on the ad unit’s bidding logs, and calculate the corresponding advertising performance.
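As an illustration of the budget-constrained optimization performed by the bidding module, the following is a minimal sketch rather than the authors' implementation. It assumes each logged auction opportunity comes with a known winning cost and a vector of expected indicator values (PV, Click Number, GMV), and that the objective is the demand-weighted sum of the indicators. Under these assumptions the linear program reduces to a fractional knapsack, which a greedy pass over value-per-cost ratios solves exactly. The function name and the log format are hypothetical.

```python
from typing import List, Sequence, Tuple


def optimal_allocation(
    logs: Sequence[Tuple[float, Sequence[float]]],
    demand: Sequence[float],
    budget: float,
) -> Tuple[List[float], List[float]]:
    """Solve the LP relaxation of budget-constrained impression selection.

    logs   -- (cost, indicator_vector) per logged opportunity (hypothetical format)
    demand -- weights over the indicators, e.g. (PV, Click Number, GMV)
    Returns the win fraction per opportunity and the total performance vector.
    """
    def value(indicators: Sequence[float]) -> float:
        return sum(w * x for w, x in zip(demand, indicators))

    # Greedy by value per unit cost is optimal for the fractional knapsack LP.
    # Opportunities with non-positive cost are skipped in this sketch.
    order = sorted(
        (i for i, (cost, _) in enumerate(logs) if cost > 0),
        key=lambda i: value(logs[i][1]) / logs[i][0],
        reverse=True,
    )
    fractions = [0.0] * len(logs)
    performance = [0.0] * len(demand)
    remaining = budget
    for i in order:
        cost, indicators = logs[i]
        if remaining <= 0 or value(indicators) <= 0:
            break  # budget exhausted, or nothing valuable is left
        take = min(1.0, remaining / cost)
        fractions[i] = take
        remaining -= take * cost
        performance = [p + take * x for p, x in zip(performance, indicators)]
    return fractions, performance
```

Calling the function with different demand vectors on the same logs yields different performance vectors, which is the behavior the simulation relies on when it compares recommendations derived from the true and the estimated demand.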
Advertiser module: The advertiser module is designed to (1) verify the strategy recommendation ability of the algorithm and (2) verify the generalization ability of the algorithm. In the simulation environment, each ad unit has a budget and a demand vector . We first introduce the demand vector generation process. Since we cannot know in advance the relationship between advertisers’ demands and the information that the platform can observe in the production environment, to verify the generalization of the algorithm, we generate the advertiser’s demand vector as , where is a function to generate based on a parameter set, and the parameters in the parameter set represent advertiser characteristics observed by the real advertising platform. In the experiment, the model can observe all or part of the parameters in the parameter set. Specifically, we design the following scheme: given the feature matrix composed of the typical demand vectors of a given advertiser, , and the advertiser’s feature vector , where the value is the weight of the th typical demand in the feature matrix, the demand vector of the advertiser is simply obtained by . In simulation, the agent can observe the feature matrix and the feature vector , which represent the generalized information. It is worth noting that other relations can also be designed and tested, but we cannot know the relation in the production environment in advance.
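The demand generation scheme can be sketched as follows. This is a minimal illustration under an explicit assumption: the demand vector is taken to be the weighted combination of the typical demand vectors (the rows of the feature matrix), with the weights given by the advertiser's feature vector and normalized to sum to one. The normalization and all names are assumptions made for the example.

```python
from typing import List, Sequence


def generate_demand(
    typical_demands: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Combine typical demand vectors (rows) using one weight per row."""
    if len(typical_demands) != len(weights):
        raise ValueError("one weight is required per typical demand vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    dim = len(typical_demands[0])
    demand = [0.0] * dim
    for row, w in zip(typical_demands, weights):
        for k in range(dim):
            # Normalizing by the total is an assumption of this sketch.
            demand[k] += (w / total) * row[k]
    return demand
```

For instance, with three one-hot typical demands (pure PV, pure clicks, pure GMV) and weights (0.5, 0.25, 0.25), the generated demand puts half of its weight on PV and a quarter each on clicks and GMV.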
Then we model the advertiser’s adoption behavior. We leverage the conditional logit model that is widely used in user selection modeling (Louviere et al., 2000) to simulate the adoption behavior of advertisers. In the conditional logit model, the probability that user selects item in the recommended set is , where is the utility of the commodity to the user . The utility of the user selecting a “null” commodity (not selecting any commodity) is usually expressed by a constant (Ie et al., 2019). Therefore, we design the advertiser’s probability to adopt the recommended item in the simulation environment as:
(6)
where the vector represents the advertiser’s demand, and the vector represents the advertising performance obtained through bidding according to the advertiser’s inner demand . The vector represents the advertising performance obtained through bidding under the model’s estimated demand , and the value represents the utility of the advertiser’s rejection. The function is the utility of the advertiser’s adoption; we design , where is the adjustment factor for adjusting the overall adoption rate of advertisers in the simulation environment. In the utility function, the term indicates the reciprocal of , which is the relative utility gap between the recommended advertising performance and the best advertising performance under the advertiser’s demand . To prevent division-by-zero errors, we require the denominator in the above formulas to be greater than or equal to a small positive number . It can be seen from Formula 6 that the closer the utility of the recommended strategy is to the optimal utility, the higher the advertiser’s adoption rate, which makes it convenient to compare the effectiveness of different models.
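The adoption model described above can be illustrated with a short sketch. It is not a transcription of Formula 6; it assumes a two-option conditional logit (adopt versus reject) in which the adoption utility is an adjustment factor divided by the relative utility gap, with the gap clipped from below by a small positive number. The parameter names `scale`, `reject_utility` and `eps` are hypothetical.

```python
import math
from typing import Sequence


def weighted_sum(weights: Sequence[float], values: Sequence[float]) -> float:
    return sum(w * v for w, v in zip(weights, values))


def adoption_probability(
    demand: Sequence[float],
    best_perf: Sequence[float],
    rec_perf: Sequence[float],
    scale: float = 1.0,
    reject_utility: float = 1.0,
    eps: float = 1e-6,
) -> float:
    """Probability of adopting a recommendation under a two-option logit."""
    best = weighted_sum(demand, best_perf)  # utility under the true demand
    rec = weighted_sum(demand, rec_perf)    # utility of the recommendation
    # Relative utility gap, clipped so that the reciprocal stays finite.
    gap = max((best - rec) / max(best, eps), eps)
    utility = scale / gap
    # exp(u) / (exp(u) + exp(u_reject)), written in a numerically stable form.
    z = reject_utility - utility
    if z > 700.0:  # guard against overflow in math.exp
        return 0.0
    return 1.0 / (1.0 + math.exp(z))
```

With this form, a recommendation whose performance matches the best achievable performance is adopted with probability close to one, and the probability falls as the relative gap widens.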
In the initialization phase of the advertiser module, several ad units and corresponding demands are generated, and the visit event of the recommender system is randomly triggered. The ad units in the simulation environment share historical bidding logs.
Evaluation metrics and training parameters: The optimization goal of the contextual bandit algorithm is the Expected T-trial Regret in rounds shown in Formula 2. To evaluate the performance of the model, we compute the accumulated expected regret through and the accumulated adoption rate through , where indicates that the experiment has conducted rounds of interactions. The value represents the adoption rate of recommendation based on the advertiser’s inner demand in the th round, and represents the adoption rate of recommendation based on given by the action selection algorithm in the th round.
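A minimal sketch of how these two metrics could be tracked over rounds is given below. It assumes the accumulated expected regret is the running sum of the per-round difference between the adoption rate under the true demand and the adoption rate under the estimated demand, and that the accumulated adoption rate is the running mean of the latter. Both definitions and the function name are assumptions of the sketch.

```python
from typing import List, Sequence, Tuple


def accumulate_metrics(
    optimal_rates: Sequence[float],
    achieved_rates: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return running regret and running mean adoption rate per round."""
    regrets: List[float] = []
    rates: List[float] = []
    regret = 0.0
    adopted = 0.0
    for t, (p_star, p) in enumerate(zip(optimal_rates, achieved_rates), start=1):
        regret += p_star - p   # expected loss of this round
        adopted += p
        regrets.append(regret)
        rates.append(adopted / t)
    return regrets, rates
```

Once the model's recommendations match the optimal ones, the per-round regret is zero and the regret curve flattens, while the running adoption rate keeps rising toward the optimal rate.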
In the simulation experiments, the input features of the model are demand-related features and the advertiser’s adoption behaviors related to the ad unit. We concatenate the demand-related features into one of the inputs of the model. For the advertiser’s adoption behaviors, we apply average pooling to the corresponding model outputs over the historical adoption behaviors related to the ad. We use mini-batch SGD to train the model while it interacts with the environment, and we leverage Adam (Kingma and Ba, 2014) as the optimizer. To prevent the imbalance between positive and negative samples from affecting the model performance, we set the ratio of positive to negative samples in each training batch to 1:1. The source code is available online at https://github.com/liyiG/recsys_for_advertiser.
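The 1:1 batch composition can be sketched as below. This is an illustrative helper, not code from the released repository; it assumes positives (adopted recommendations) and negatives (rejected ones) are kept in separate pools and sampled with replacement so that the minority class can always fill its half of the batch.

```python
import random
from typing import Any, List, Optional, Sequence, Tuple


def balanced_batch(
    positives: Sequence[Any],
    negatives: Sequence[Any],
    batch_size: int,
    rng: Optional[random.Random] = None,
) -> List[Tuple[Any, int]]:
    """Draw a shuffled batch with an equal share of positive and negative samples."""
    if not positives or not negatives:
        raise ValueError("both pools must be non-empty")
    rng = rng or random.Random()
    half = batch_size // 2
    # Sampling with replacement lets a small pool fill its half of the batch.
    batch = [(x, 1) for x in rng.choices(positives, k=half)]
    batch += [(x, 0) for x in rng.choices(negatives, k=batch_size - half)]
    rng.shuffle(batch)
    return batch
```

Sampling with replacement is a design choice of this sketch; down-sampling the majority class without replacement is an equally plausible alternative.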
5.3. Experimental Results
Exploration of the advertising performance space: First, we briefly explore the advertising performance space of advertisers with different demands in the Taobao online display ad auction environment. In the experiment, we select three typical demands, representing three typical types of advertisers in the platform: maximizing total impressions, maximizing total clicks, and maximizing GMV. When the indicators in the advertising performance report are PV, Click Number and GMV in turn, the demand vectors of these three types of advertisers are , and , respectively. We conduct optimal bidding under the budget constraint based on the above three demand vectors and the historical bidding logs of the same ad unit. The experimental results are shown in Table 2. To make the results easier to read, we normalize the results of each indicator in Table 2 by the maximum value of that indicator. For example, relative PV of the th row is calculated by . Table 2 shows that each type of advertiser achieves the best value of its target indicator under the corresponding demand vector, with clearly lower values under the other two demand vectors, which shows the importance of understanding advertisers’ demands.
Demand Type  PV  Click Number  GMV 

Optimize PV  1.0000  0.7747  0.6751 
Optimize Click Number  0.7825  1.0000  0.8269 
Optimize GMV  0.7377  0.8454  1.0000 
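The per-indicator normalization used for Table 2 can be written as a short helper. This is a minimal sketch that assumes all indicator values are positive; the raw numbers in the usage note are made up for illustration and are not experimental data.

```python
from typing import List, Sequence


def normalize_by_column_max(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Divide every entry by the maximum of its column (indicator)."""
    n_cols = len(rows[0])
    col_max = [max(row[j] for row in rows) for j in range(n_cols)]
    return [[row[j] / col_max[j] for j in range(n_cols)] for row in rows]
```

For example, the made-up rows (200, 31, 54), (100, 40, 60) and (150, 20, 108) normalize to (1.0, 0.775, 0.5), (0.5, 1.0, 0.556) and (0.75, 0.5, 1.0) after rounding, so each column's best row reads 1.0, as in Table 2.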
Model  AER  AAR 
Random Demand (without learning module)  1.0000  0.7606 
No Dropout  0.7429  0.8239 
Dropout rate (no demandrelated info)  0.7195  0.8535 
Dropout rate  0.5384  0.9144 
Dropout rate  0.4154  0.9725 
Dropout rate  0.4078  0.9817 
Dropout rate  0.3881  1.0000 
Comparison experimental results: We verify the effectiveness of our contextual bandit algorithm through comparison experiments, in which we compare models with different Dropout rates and a model without Dropout. We also implement a random demand recommendation strategy, which does not apply any demand estimation algorithm, as a weak baseline. In each experiment, the agent interacts with the environment for rounds and periodically updates the accumulated expected regret and the accumulated adoption rate; the experimental results after the interactions are shown in Table 3. In Table 3, we find that the random demand recommender causes a huge degradation in advertiser satisfaction, which shows the necessity of considering advertiser demand when recommending strategies. The curves of accumulated expected regret and accumulated adoption rate for different Dropout ratios are shown in Figure 4. In the experiment, we found that different algorithms converge to different local optimal solutions and that the expected regret increases approximately linearly after the model converges. To better show the performance difference after convergence, we preprocess the accumulated expected regret by , and normalize the experimental results to draw the curves. By observing the curves in Figure 4 and analyzing the real-time accumulated expected regret and accumulated adoption rate during the experiments, we find that, for all models, the growth speed of the accumulated expected regret first decreases and then converges, while the accumulated adoption rate first increases gradually and then converges. These observations show that different models converge to different local optimal solutions, but they can all learn the demands of advertisers to a certain extent and improve the performance of the recommender system. For example, from Table 3, even the model without Dropout can reduce the accumulated expected regret by compared to the random demand recommendation strategy (without learning module).
In the experiments, the action selection algorithm that uses Dropout for action exploration is more effective than the one that does not. This is because action sampling with Dropout can be viewed as an approximation of Thompson Sampling (Gal and Ghahramani, 2016), which helps the model converge to a better local optimal solution and increases the AAR by 18.6%. In comparison experiments with the Dropout ratio of , , , and , we observe that as the Dropout ratio increases, the performance of the model first improves and then decreases. A possible explanation is that when the Dropout ratio is low, the model adopts a conservative exploration strategy and is more likely to converge to a relatively poor local optimal solution, whereas when the Dropout ratio is high, the model explores the action space so frequently that it cannot make full use of the learned knowledge, resulting in performance degradation. From Figure 4(b), it can be seen that in the initial stage of the interaction, the accumulated adoption rate may decrease, which may be caused by the large uncertainty of the model in the initial training stage. After analyzing the real-time accumulated expected regret and accumulated adoption rate during the experiments, we find that when the accumulated adoption rate decreases, the growth speed of the accumulated expected regret decreases significantly at the same time, indicating that exploration helps the model better learn the advertisers’ demands.
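The exploration mechanism discussed above can be illustrated with a small sketch of Dropout-based Thompson Sampling. It is a toy example under stated assumptions, not the model used in the experiments: a one-hidden-layer ReLU network with fixed, hypothetical weights scores each candidate action, one Dropout mask is sampled per decision (playing the role of one posterior sample), and the highest-scoring candidate under that mask is selected.

```python
import random
from typing import List, Sequence


def sample_mask(n_hidden: int, drop_rate: float, rng: random.Random) -> List[float]:
    """Sample one inverted-dropout mask; it stands in for one posterior sample."""
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")
    keep = 1.0 - drop_rate
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(n_hidden)]


def forward(
    x: Sequence[float],
    w_hidden: Sequence[Sequence[float]],
    w_out: Sequence[float],
    mask: Sequence[float],
) -> float:
    """Score one candidate with a ReLU hidden layer and the given mask."""
    score = 0.0
    for weights, w_o, m in zip(w_hidden, w_out, mask):
        h = max(0.0, sum(w * xi for w, xi in zip(weights, x)))
        score += w_o * m * h
    return score


def select_action(
    candidates: Sequence[Sequence[float]],
    w_hidden: Sequence[Sequence[float]],
    w_out: Sequence[float],
    drop_rate: float,
    rng: random.Random,
) -> int:
    """Sample one mask, score all candidates under it, return the best index."""
    mask = sample_mask(len(w_hidden), drop_rate, rng)
    scores = [forward(x, w_hidden, w_out, mask) for x in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

With a Dropout rate of zero the selection is purely greedy. A higher rate makes the sampled scores vary more between decisions, so candidates that are not currently the best are selected more often, which matches the trade-off between conservative and overly frequent exploration described above.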
To verify the generalization ability of the model, we conduct another control experiment. In this experiment, the experimental group is a model with a Dropout ratio of , and the control group is the same model without the demand-related information (only historical adoption information can be observed). We preprocess the experimental results in the same way as for Figure 4, and show the results in Figure 5 and Table 3. According to Figure 5 and Table 3, the model with demand-related information outperforms the model without this information in both accumulated expected regret and accumulated adoption rate. This indicates that the model can better learn the demands of advertisers through understanding information related to , thus verifying the generalization performance of the model.
6. Conclusion
In this work, we study the problem of advertising strategy recommendation based on the understanding of advertisers’ demands. We first design a prototype system for bid price and crowd recommendation, and demonstrate its effectiveness through an online A/B test. Based on the prototype system, we further design a dynamic bidding strategy recommender system, which is modeled as a contextual bandit problem. After modeling advertisers’ demands and utility, we leverage a deep model to estimate advertisers’ demands, and derive the optimal bidding strategy and the corresponding advertising result based on the advertiser demand through simulated bidding. By displaying the advertising performance to advertisers, we can observe their adoption behavior, which is treated as the reward of the contextual bandit. By optimizing the advertiser adoption rate, we can improve the ability of demand estimation and strategy recommendation. To balance exploration and exploitation during action selection in the contextual bandit, we use Dropout to represent the uncertainty of the neural network and to implement Thompson Sampling. Experiments in a simulation environment verify the effectiveness of the system and of Dropout in the task of adoption rate optimization. Through ablation experiments, we also verify that demand-related information helps the agent better learn the demands of advertisers. The prototype system has been deployed to recommend bids and crowds for advertisers on the display advertising platform at Alibaba, optimizing their advertising performance and the platform’s revenue.
References
 Basilico and Hofmann (2004) Justin Basilico and Thomas Hofmann. 2004. Unifying collaborative and content-based filtering. In ICML. 9.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. In ICML. 1613–1622.
 Cai et al. (2017) Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-time bidding by reinforcement learning in display advertising. In WSDM. 661–670.
 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In DLRS. 7–10.
 Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In RecSys. 191–198.
 Evans (2008) David S Evans. 2008. The economics of the online advertising industry. Review of Network Economics 7, 3 (2008), 1–33.
 Evans (2009) David S Evans. 2009. The online advertising industry: Economics, evolution, and privacy. Journal of Economic Perspectives 23, 3 (2009), 37–60.
 Gal and Ghahramani (2016) Y Gal and Z Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML. 1651–1660.
 Goldfarb and Tucker (2011) Avi Goldfarb and Catherine Tucker. 2011. Online display advertising: Targeting and obtrusiveness. Marketing Science 30, 3 (2011), 389–404.
 Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI. 1725–1731.
 Guo et al. (2020) Liyi Guo, Rui Lu, Haoqi Zhang, Junqi Jin, Zhenzhe Zheng, Fan Wu, Jin Li, Haiyang Xu, Han Li, Wenkai Lu, et al. 2020. A Deep Prediction Network for Understanding Advertiser Intent and Satisfaction. In CIKM. 2501–2508.
 Hao et al. (2020) Xiaotian Hao, Zhaoqing Peng, Yi Ma, Guan Wang, Junqi Jin, Jianye Hao, Shan Chen, Rongquan Bai, Mingzhou Xie, Miao Xu, et al. 2020. Dynamic knapsack optimization towards efficient multi-channel sequential advertising. In ICML. 4060–4070.
 Ie et al. (2019) Eugene Ie, Vihan Jain, Jing Wang, Sanmit Narvekar, Ritesh Agarwal, Rui Wu, Heng-Tze Cheng, Tushar Chandra, and Craig Boutilier. 2019. SLATEQ: a tractable decomposition for reinforcement learning with recommendation sets. In IJCAI. 2592–2599.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
 Kitts et al. (2017) Brendan Kitts, Michael Krishnan, Ishadutta Yadav, Yongbo Zeng, Garrett Badeau, Andrew Potter, Sergey Tolkachov, Ethan Thornburg, and Satyanarayana Reddy Janga. 2017. Ad Serving with Multiple KPIs. In SIGKDD. 1853–1861.

 Lahaie et al. (2007) Sébastien Lahaie, David M Pennock, Amin Saberi, and Rakesh V Vohra. 2007. Sponsored search auctions. Algorithmic Game Theory 1 (2007), 699–716.
 Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In WWW. 661–670.
 Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing 7, 1 (2003), 76–80.
 Lops et al. (2011) Pasquale Lops, Marco De Gemmis, and Giovanni Semeraro. 2011. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook. 73–105.
 Louviere et al. (2000) Jordan J Louviere, David A Hensher, and Joffre D Swait. 2000. Stated choice methods: analysis and applications. Cambridge University Press.
 Lu et al. (2015) Jie Lu, Dianshuang Wu, Mingsong Mao, Wei Wang, and Guangquan Zhang. 2015. Recommender system application developments: a survey. Decision Support Systems 74 (2015), 12–32.

 MacKay (1992) David JC MacKay. 1992. A practical Bayesian framework for backpropagation networks. Neural Computation 4, 3 (1992), 448–472.
 Maehara et al. (2018) Takanori Maehara, Atsuhiro Narita, Jun Baba, and Takayuki Kawabata. 2018. Optimal bidding strategy for brand advertising. In IJCAI. 424–432.
 Neal (1995) Radford M Neal. 1995. Bayesian learning for neural networks. Ph.D. Dissertation. University of Toronto.
 Park et al. (2006) Seung-Taek Park, David Pennock, Omid Madani, Nathan Good, and Dennis DeCoste. 2006. Naïve filterbots for robust cold-start recommendations. In SIGKDD. 699–705.
 Pazzani and Billsus (2007) Michael J Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. In The Adaptive Web. 325–341.
 Ren et al. (2017) Kan Ren, Weinan Zhang, Ke Chang, Yifei Rong, Yong Yu, and Jun Wang. 2017. Bidding machine: Learning to bid for directly optimizing profits in display advertising. IEEE Transactions on Knowledge and Data Engineering 30, 4 (2017), 645–659.
 Sarwar et al. (2001) Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In WWW. 285–295.
 Schafer et al. (2007) J Ben Schafer, Dan Frankowski, Jon Herlocker, and Shilad Sen. 2007. Collaborative filtering recommender systems. In The Adaptive Web. 291–324.
 Shi et al. (2019) Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. 2019. Virtual-Taobao: Virtualizing real-world online retail environment for reinforcement learning. In AAAI. 4902–4909.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
 Thompson (1933) William R Thompson. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 3/4 (1933), 285–294.
 Wu et al. (2018) Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. 2018. Budget constrained bidding by model-free reinforcement learning in display advertising. In CIKM. 1443–1451.
 Yang et al. (2019) Xun Yang, Yasong Li, Hao Wang, Di Wu, Qing Tan, Jian Xu, and Kun Gai. 2019. Bid optimization by multivariable control in display advertising. In SIGKDD. 1966–1974.
 Zhang et al. (2016) Weinan Zhang, Yifei Rong, Jun Wang, Tianchi Zhu, and Xiaofan Wang. 2016. Feedback control of real-time display advertising. In WSDM. 407–416.
 Zhang et al. (2014) Weinan Zhang, Shuai Yuan, and Jun Wang. 2014. Optimal real-time bidding for display advertising. In SIGKDD. 1077–1086.
 Zhang et al. (2012) Weinan Zhang, Ying Zhang, Bin Gao, Yong Yu, Xiaojie Yuan, and Tie-Yan Liu. 2012. Joint optimization of bid and budget allocation in sponsored search. In SIGKDD. 1177–1185.
 Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In SIGKDD. 1059–1068.
 Zhu et al. (2017) Han Zhu, Junqi Jin, Chang Tan, Fei Pan, Yifan Zeng, Han Li, and Kun Gai. 2017. Optimized cost per click in taobao display advertising. In SIGKDD. 2191–2200.
 Zhu et al. (2018) Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning tree-based deep model for recommender systems. In SIGKDD. 1079–1088.