Shouldering: RTB
The majority of online display ads are served through real-time bidding (RTB): each ad display impression is auctioned off in real time when it is just being generated from a user visit. To place an ad automatically and optimally, it is critical for advertisers to devise a learning algorithm that bids cleverly on an ad impression in real time. Most previous works consider the bid decision as a static optimization problem of either treating the value of each impression independently or setting a bid price to each segment of ad volume. However, the bidding for a given ad campaign would repeatedly happen during its life span before the budget runs out. As such, each bid is strategically correlated by the constrained budget and the overall effectiveness of the campaign (e.g., the rewards from generated clicks), which is only observed after the campaign has completed. Thus, it is of great interest to devise an optimal bidding strategy sequentially so that the campaign budget can be dynamically allocated across all the available impressions on the basis of both the immediate and future rewards. In this paper, we formulate the bid decision process as a reinforcement learning problem, where the state space is represented by the auction information and the campaign's real-time parameters, while an action is the bid price to set. By modeling the state transition via auction competition, we build a Markov Decision Process framework for learning the optimal bidding policy to optimize the advertising performance in the dynamic real-time bidding environment. Furthermore, the scalability problem from the large real-world auction volume and campaign budget is well handled by state value approximation using neural networks.
The increased availability of big data and the improved computational power have advanced machine learning and artificial intelligence for various prediction and decision making tasks. In particular, the successful application of reinforcement learning in certain settings such as gaming control [19] has demonstrated that machines not only can predict, but also have the potential to achieve human-level control and decision making. In this paper, we study machine bidding in the context of display advertising. Auctions, particularly real-time bidding (RTB), have been a major trading mechanism for online display advertising [30, 7, 23]. Unlike the keyword-level bid decision in sponsored search [1], the advertiser needs to make the impression-level bid decision in RTB, i.e., bidding for every single ad impression in real time when it is just being generated by a user visit [18, 30]. Machine-based bid decision, i.e., calculating the strategic amount that the advertiser would like to pay for an ad opportunity, constitutes a core component that drives the campaigns' ROI [11, 32]. By calculating an optimal bid price for each ad auction (also considering the remaining budget and the future availability of relevant ad impressions in the ad exchange) and then observing the auction result and user feedback, the advertiser would be able to refine their bidding strategy and better allocate the campaign budget across the online page view volume.

A straightforward bidding strategy in RTB display advertising is mainly based on the truthfulness of second-price auctions [6], which means the bid price for each ad impression should be equal to its true value, i.e., the action value (e.g., click value) multiplied by the action rate (e.g., click-through rate) [14]. However, for budgeted bidding in repeated auctions, the optimal bidding strategy may not be truthful but depends on the market competition, auction volume and campaign budget [31]. In [18, 32], researchers have proposed to seek the optimal bidding function that directly maximizes the campaign's key performance indicator (KPI), e.g., total clicks or revenue, based on the static distribution of input data and market competition models.
Nevertheless, such static bid optimization frameworks may still not work well in practice because the RTB market competition is highly dynamic and it is fairly common that the true data distribution heavily deviates from the one assumed during model training [4], which requires an additional control step such as budget pacing [28] to constrain the budget spending.
In this paper, we solve the issue by considering bidding as a sequential decision, and formulate it as a reinforcement learning to bid (RLB) problem. From an advertiser's perspective, the whole ad market and Internet users are regarded as the environment. At each timestep, the advertiser's bidding agent observes a state, which consists of the campaign's current parameters, such as the remaining lifetime and budget, and the bid request for a specific ad impression (containing the information about the underlying user and their context). With such a state (and context), the bidding agent makes a bid action for its ad. After the ad auction, the winning results with the cost and the corresponding user feedback will be sent back to the bidding agent, which forms the reward signal of the bidding action. Thus, the bid decision aims to derive an optimal bidding policy for each given bid request.
With the above settings, we build a Markov Decision Process (MDP) framework for learning the optimal bidding policy to optimize the advertising performance. The value of each state will be calculated by performing dynamic programming. Furthermore, to handle the scalability problem for the real-world auction volume and campaign budget, we propose to leverage a neural network model to approximate the value function. Besides directly generalizing the neural network value function, we also propose a novel coarse-to-fine episode segmentation model and state mapping models to overcome the large-scale state generalization problem.
In our empirical study, the proposed solution has achieved 16.7% and 7.4% performance gains against the state-of-the-art methods on two large-scale real-world datasets. In addition, our proposed system has been deployed into a commercial RTB platform. We have performed online A/B testing, where a 44.7% improvement in click performance was observed against the most widely used method in the industry.
Reinforcement Learning. An MDP provides a mathematical framework which is widely used for modelling the dynamics of an environment under different actions, and is useful for solving reinforcement learning problems [21]. An MDP is defined by the tuple $(S, A, P, R)$. The sets of all states and actions are represented by $S$ and $A$ respectively. The reward and transition probability functions are given by $R$ and $P$. Dynamic programming is used in cases where the environment's dynamics, i.e., the reward function and transition probabilities, are known in advance. Two popular dynamic programming algorithms are policy iteration and value iteration. For large-scale situations, it is difficult to experience the whole state space, which leads to the use of function approximation that constructs an approximator of the entire function [8, 22]. In this work, we use value iteration for small-scale situations, and further build a neural network approximator to solve the scalability problem.

RTB Strategy. In the RTB process [32], the advertiser receives the bid request of an ad impression with its real-time information and the very first thing to do is to estimate the utility, i.e., the user's response on the ad if winning the auction. The distribution of the cost, i.e., the market price [1, 5], which is the highest bid price from other competitors, is also forecasted by the bid landscape forecasting component. Utility estimation and bid landscape forecasting are described below. Given the estimated utility and cost factors, the bidding strategy [30] decides the final bid price with access to the information of the remaining budget and auction volume. Thus it is crucial to optimize the final bidding strategy considering the market and bid request information with budget constraints. A recent comprehensive study on the data science of RTB display advertising is presented in [24].

Utility Estimation. For advertisers, the utility is usually defined based on the user response, i.e., click or conversion, and can be modeled as a probability estimation task [14]. Much work has been proposed for user response prediction, e.g., click-through rate (CTR) [16], conversion rate (CVR) [14] and post-click conversion delay patterns [3]. For modeling, linear models such as logistic regression [14] and nonlinear models such as boosted trees [10] and factorization machines [17] are widely used in practice. There are also online learning models that immediately perform updating when observing each data instance, such as Bayesian probit regression [9] and FTRL learning in logistic regression [16]. In our paper, we follow [14, 32] and adopt the most widely used logistic regression for utility estimation to model the reward on agent actions.

Bid Landscape Forecasting. Bid landscape forecasting refers to modeling the market price distribution for auctions of specific ad inventory, and its c.d.f. is the winning probability given each specific bid price [5]. The authors in [15, 32, 5] presented some hypothetical winning functions and learned the parameters. For example, a log-normal market price distribution with the parameters estimated by gradient boosting decision trees was proposed in [5]. Since advertisers only observe the winning impressions, the problem of censored data [1, 33] is critical. The authors in [27] proposed to leverage censored linear regression to jointly model the likelihood of observed market prices in winning cases and censored ones with losing bids. Recently, the authors in [25] proposed to combine survival analysis and decision tree models, where each tree leaf maintains a non-parametric survival model to fit the censored market prices. In this paper, we follow [1, 33] and use a non-parametric method to model the market price distribution.

Bid Optimization. As has been discussed above, bidding strategy optimization is the key component within the decision process for the advertisers [18]. Auction theory [12] proves that truthful bidding is the optimal strategy under the second-price auction. However, truthful bidding may perform poorly when considering multiple auctions and the budget constraint [32]. In real-world applications, the linear bidding function [18] is widely used. The authors in [32] empirically showed that there existed non-linear bidding functions better than the linear ones under various budget constraints. When the data changes, however, the heuristic model [18] or hypothetical bidding functions [31, 32] cannot depict the real data distribution well. The authors in [1, 29] proposed model-based MDPs to derive the optimal policy for bidding in sponsored search or ad selection in contextual advertising, where the decision is made on the keyword level. In our work, we investigate the most challenging impression-level bid decision problem in RTB display advertising, which is totally different from [1, 29]. We also tackle the scalability problem, which remains unsolved in [1], and demonstrate the efficiency and effectiveness of our method in a variety of experiments.

In an RTB ad auction, each bidding agent acts on behalf of its advertiser and generates bids to achieve the campaign's specific target. Our main goal is to derive the optimal bidding policy in a reinforcement learning fashion. For most performance-driven campaigns, the optimization target is to maximize the user responses on the displayed ads if the bid leads to auction winning. Without loss of generality, we consider clicks as our targeted user response objective, while other KPIs can be adopted similarly. The diagram of interactions between the bidding agent and the environment is shown in Figure 1.
Mathematically, we consider bidding in display advertising as an episodic process [1]; each episode comprises $T$ auctions which are sequentially sent to the bidding agent. Each auction is represented by a high-dimensional feature vector $x$, which is indexed via one-hot binary encoding. Each entry of $x$ corresponds to a category in a field, such as the category London in the field City, and the category Friday in the field Weekday. The fields consist of the campaign's ad information (e.g., ad creative ID and campaign ID) and the auctioned impression contextual information (e.g., user cookie ID, location, time, publisher domain and URL).

At the beginning, the agent is initialized with a budget $B$, and the advertising target is set to acquire as many clicks as possible during the following $T$ auctions. Three main pieces of information are considered by the agent: (i) the remaining auction number $t \in \{0, \dots, T\}$; (ii) the unspent budget $b \in \{0, \dots, B\}$; and (iii) the feature vector $x$. During the episode, each auction will be sent to the agent sequentially and for each of them the agent needs to decide the bid price according to the current information $t$, $b$ and $x$.
The agent maintains the remaining number of auctions $t$ and the remaining budget $b$. At each timestep, the agent receives an auction $x \in X$ (the feature vector space), and determines its bid price $a$. We denote the market price p.d.f. given $x$ as $m(\delta, x)$, where $\delta$ is the market price and $m$ is its probability. If the agent bids at price $a \ge \delta$, then it wins the auction and pays $\delta$, and the remaining budget changes to $b - \delta$. In this case, the agent can observe the user response and the market price later. Alternatively, if losing, the agent gets nothing from the auction. We take the predicted CTR (pCTR) $\theta(x)$ as the expected reward, to model the action utility. After each auction, the remaining number of auctions changes to $t - 1$. When the auction flow of this episode runs out, the current episode ends and both the remaining auction number and budget are reset.
A Markov Decision Process (MDP) provides a framework that is widely used for modeling agent-environment interactions [21]. Our notations are listed in Table 1. An MDP can be represented by a tuple $(S, \{A_s\}, \{P^a_{ss'}\}, \{R^a_{ss'}\})$, where $S$ and $A_s$ are the sets of all states and of all possible actions in state $s \in S$, respectively. $P^a_{ss'}$ represents the state transition probability from state $s \in S$ to another state $s' \in S$ when taking action $a \in A_s$, which is denoted by $\mu(a, s, s')$. Similarly, $R^a_{ss'}$ is the reward function, denoted by $r(a, s, s')$, that represents the reward received after taking action $a$ in state $s$ and then transiting to state $s'$.
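Table 1: A summary of our notations.

| Notation | Description |
|---|---|
| $x$ | The feature vector that represents a bid request. |
| $X$ | The whole feature vector space. |
| $p_x(x)$ | The probability density function of $x$. |
| $\theta(x)$ | The predicted CTR (pCTR) if winning the auction of $x$. |
| $m(\delta, x)$ | The p.d.f. of market price $\delta$ given $x$. |
| $m(\delta)$ | The p.d.f. of market price $\delta$. |
| $V(t, b, x)$ | The expected total reward with starting state $(t, b, x)$, taking the optimal policy. |
| $V(t, b)$ | The expected total reward with starting state $(t, b)$, taking the optimal policy. |
| $a(t, b, x)$ | The optimal action in state $(t, b, x)$. |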
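We consider $(t, b, x_t)$ as a state $s$ (for simplicity, we slightly abuse the notation by including $t$ in the state) and assume the feature vector $x_t$ is drawn i.i.d. from the probability density function $p_x(x)$. The full state space is $S = \{0, \dots, T\} \times \{0, \dots, B\} \times X$. If $t = 0$, the state is regarded as a terminal state, which means the end of the episode. The set of all actions available in state $(t, b, x_t)$ is $A_{(t,b,x_t)} = \{0, \dots, b\}$, corresponding to the bid price. Furthermore, in state $(t, b, x_t)$ where $t > 0$, the agent, when bidding $a$, can transit to $(t-1, b-\delta, x_{t-1})$ with probability $p_x(x_{t-1})\, m(\delta, x_t)$ where $\delta \in \{0, \dots, a\}$ and $x_{t-1} \in X$. That is the case of winning the auction and receiving a reward $\theta(x_t)$. The agent may instead lose the auction and transit to $(t-1, b, x_{t-1})$ with probability $p_x(x_{t-1}) \sum_{\delta=a+1}^{\infty} m(\delta, x_t)$, where $x_{t-1} \in X$. All other transitions are impossible because of the auction process. In summary, the transition probabilities and reward function can be written as:

$$
\begin{aligned}
\mu\big(a, (t, b, x_t), (t-1, b-\delta, x_{t-1})\big) &= p_x(x_{t-1})\, m(\delta, x_t), \\
\mu\big(a, (t, b, x_t), (t-1, b, x_{t-1})\big) &= p_x(x_{t-1}) \sum_{\delta=a+1}^{\infty} m(\delta, x_t), \\
r\big(a, (t, b, x_t), (t-1, b-\delta, x_{t-1})\big) &= \theta(x_t),
\end{aligned}
\tag{1}
$$

where $\delta \in \{0, \dots, a\}$, $x_{t-1} \in X$ and $t > 0$. Specifically, the first equation is the transition when giving a bid price $a \ge \delta$, while the second equation is the transition when losing the auction.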
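To make the transition rule concrete, the following is a minimal Python sketch of a single step of the bidding process for one realised market price. The function name `step` and its argument names are illustrative assumptions, not part of the original work; the sketch only encodes the win/lose rule described above (win when the bid is at least the market price, pay the market price, receive the pCTR as expected reward).

```python
def step(t, b, bid, market_price, pctr):
    """One transition of the bidding process for a realised market price.

    t: remaining auctions, b: remaining budget (integer price units),
    bid: chosen bid price, market_price: highest competing bid,
    pctr: predicted CTR used as the expected reward when winning.
    Returns (next_t, next_b, reward).
    """
    if t <= 0:
        raise ValueError("terminal state: no auctions remain")
    bid = min(bid, b)  # a bid can never exceed the remaining budget
    if bid >= market_price:
        # Win: pay the market price (second-price rule) and collect the reward.
        return t - 1, b - market_price, pctr
    # Lose: the budget is untouched and nothing is received.
    return t - 1, b, 0.0
```

For example, `step(3, 100, 50, 40, 0.01)` moves to two remaining auctions with 60 units of budget left, while a bid of 30 against the same market price leaves the budget at 100.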
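A deterministic policy $\pi$ is a mapping from each state $s \in S$ to action $a \in A_s$, i.e., $a = \pi(s)$, which corresponds to the bidding strategy in RTB display advertising. According to the policy $\pi$, we have the value function $V^{\pi}(s)$: the expected sum of rewards upon starting in state $s$ and obeying policy $\pi$. This satisfies the Bellman equation with the discount factor $\gamma = 1$, since in our scenario the total click number is the optimization target, regardless of the click time:

$$
V^{\pi}(s) = \sum_{s' \in S} \mu\big(\pi(s), s, s'\big)\Big(r\big(\pi(s), s, s'\big) + V^{\pi}(s')\Big). \tag{2}
$$

The optimal value function is defined as $V^{*}(s) = \max_{\pi} V^{\pi}(s)$. We also have the optimal policy as:

$$
\pi^{*}(s) = \operatorname*{argmax}_{a \in A_s} \Big\{ \sum_{s' \in S} \mu(a, s, s')\big(r(a, s, s') + V^{*}(s')\big) \Big\}, \tag{3}
$$

which gives the optimal action at each state $s$, and $V^{*}(s) = V^{\pi^{*}}(s)$. The optimal policy $\pi^{*}(s)$ is exactly the optimal bidding strategy we want to find. For notational simplicity, in later sections, we use $V(s)$ to represent the optimal value function $V^{*}(s)$.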
One may consider the possibility of model-free approaches [26, 20] to directly learn the bidding policy from experience. However, such model-free approaches may suffer from the problems of transition dynamics of the enormous state space, the sparsity of the reward signals and the highly stochastic environment. Since there are many previous works on modeling the utility (reward) and the market price distribution (transition probability) as discussed in Section 2, we take advantage of them and propose our model-based solution for this problem.
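At a small scale, Eq. (3) can be solved using a dynamic programming approach. As defined, we have the optimal value function $V(t, b, x)$, where $(t, b, x)$ represents the state $s$. Meanwhile, we consider the situations where we do not observe the feature vector $x$; so another optimal value function is $V(t, b)$: the expected total reward upon starting in $(t, b)$ without observing the feature vector $x$ when the agent takes the optimal policy. It satisfies $V(t, b) = \int_X p_x(x)\, V(t, b, x)\, dx$. We also have the optimal policy $\pi^{*}$ and express it as the optimal action $a(t, b, x)$.
From the definition, we have $V(0, b, x) = V(0, b) = 0$ as the agent gets nothing when there are no remaining auctions. Combined with the transition probability and reward function described in Eq. (1) and the definition of $V(t, b)$, $V(t, b, x)$ can be expressed with $V(t-1, \cdot)$ as

$$
\begin{aligned}
V(t, b, x_t) &= \max_{0 \le a \le b} \Big\{ \sum_{\delta=0}^{a} \int_X m(\delta, x_t)\, p_x(x_{t-1}) \big(\theta(x_t) + V(t-1, b-\delta, x_{t-1})\big)\, dx_{t-1} \\
&\qquad\qquad + \sum_{\delta=a+1}^{\infty} \int_X m(\delta, x_t)\, p_x(x_{t-1})\, V(t-1, b, x_{t-1})\, dx_{t-1} \Big\} \\
&= \max_{0 \le a \le b} \Big\{ \sum_{\delta=0}^{a} m(\delta, x_t) \big(\theta(x_t) + V(t-1, b-\delta)\big) + \sum_{\delta=a+1}^{\infty} m(\delta, x_t)\, V(t-1, b) \Big\},
\end{aligned}
\tag{4}
$$

where the first summation is for the situation when winning the auction and the second summation is for losing. (In practice, the bid prices in various RTB ad auctions are required to be integers.) Similarly, the optimal action in state $(t, b, x)$ is

$$
a(t, b, x) = \operatorname*{argmax}_{0 \le a \le b} \Big\{ \sum_{\delta=0}^{a} m(\delta, x) \big(\theta(x) + V(t-1, b-\delta)\big) + \sum_{\delta=a+1}^{\infty} m(\delta, x)\, V(t-1, b) \Big\}, \tag{5}
$$

where the optimal bid action $a(t, b, x)$ involves three terms: $m(\delta, x)$, $\theta(x)$ and $V(t-1, \cdot)$. $V(t, b)$ is derived by marginalizing out $x$:

$$
V(t, b) = \int_X p_x(x) \max_{0 \le a \le b} \Big\{ \sum_{\delta=0}^{a} m(\delta, x) \big(\theta(x) + V(t-1, b-\delta)\big) + \sum_{\delta=a+1}^{\infty} m(\delta, x)\, V(t-1, b) \Big\}\, dx. \tag{6}
$$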
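To settle the integration over $x$ in Eq. (6), we consider an approximation by following the dependency assumption in [32]. Thus

$$
m(\delta, x) \approx m(\delta), \tag{7}
$$

where $\theta_{\text{avg}}$, used below, denotes the expectation of the pCTR $\theta$, which can be easily calculated with historical data. Taking Eq. (7) into Eq. (6), we get an approximation of the optimal value function $V(t, b)$:

$$
V(t, b) \approx \max_{0 \le a \le b} \Big\{ \sum_{\delta=0}^{a} m(\delta)\, \theta_{\text{avg}} + \sum_{\delta=0}^{a} m(\delta)\, V(t-1, b-\delta) + \sum_{\delta=a+1}^{\infty} m(\delta)\, V(t-1, b) \Big\}. \tag{8}
$$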
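Noticing that $\sum_{\delta=0}^{\infty} m(\delta, x) = 1$, Eq. (5) is rewritten as

$$
a(t, b, x) = \operatorname*{argmax}_{0 \le a \le b} \Big\{ \sum_{\delta=0}^{a} m(\delta, x) \big(\theta(x) + V(t-1, b-\delta) - V(t-1, b)\big) \Big\} = \operatorname*{argmax}_{0 \le a \le b} \Big\{ \sum_{\delta=0}^{a} m(\delta, x)\, g(\delta) \Big\}, \tag{9}
$$

where we denote $g(\delta) = \theta(x) + V(t-1, b-\delta) - V(t-1, b)$. From the definition, we know $V(t-1, b)$ monotonically increases w.r.t. $b$, i.e., $V(t-1, b) \ge V(t-1, b')$ where $b \ge b'$. As such, $V(t-1, b-\delta)$ monotonically decreases w.r.t. $\delta$. Thus $g(\delta)$ monotonically decreases w.r.t. $\delta$. Moreover, $g(0) = \theta(x) \ge 0$ and $m(\delta, x) \ge 0$. Here, we care about the value of $g(b)$. (i) If $g(b) \ge 0$, then $g(\delta') \ge g(b) \ge 0$ where $0 \le \delta' \le b$, so $m(\delta', x)\, g(\delta') \ge 0$ where $0 \le \delta' \le b$. As a result, in this case, we have $a(t, b, x) = b$. (ii) If $g(b) < 0$, then there must exist an integer $A$ such that $0 \le A < b$ and $g(A) \ge 0$, $g(A+1) < 0$. So $m(\delta', x)\, g(\delta') \ge 0$ when $\delta' \le A$ and $m(\delta', x)\, g(\delta') \le 0$ when $\delta' > A$. Consequently, in this case, we have $a(t, b, x) = A$. In conclusion, we have

$$
a(t, b, x) =
\begin{cases}
b & \text{if } g(b) \ge 0, \\
A \text{ such that } g(A) \ge 0 \text{ and } g(A+1) < 0 & \text{if } g(b) < 0.
\end{cases}
\tag{10}
$$

Figure 2 shows examples of $g(\delta)$ on campaign 2821 from the iPinYou real-world dataset. Additionally, we usually have a maximum market price $\delta_{\max}$, which is also the maximum bid price. The corresponding RLB algorithm is shown in Algorithm 1.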
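The two stages described above (a backward value update over the remaining auctions and budget, followed by a threshold search for the bid) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' Algorithm 1: the market price distribution is passed as a plain list `m` with `m[d]` the probability of market price `d` (independent of the bid request), `theta_avg` is the average pCTR, and the function names `value_table` and `optimal_bid` are hypothetical.

```python
def value_table(T, B, m, theta_avg):
    """Approximate V[t][b] by backward recursion.

    Because the probabilities in m sum to one, the update can be written as
    V(t, b) = V(t-1, b) + max_a sum_{d<=a} m[d] * (theta_avg + V(t-1, b-d) - V(t-1, b)).
    """
    d_max = len(m) - 1                      # maximum market price
    V = [[0.0] * (B + 1) for _ in range(T + 1)]   # V[0][b] = 0: no auctions left
    for t in range(1, T + 1):
        prev = V[t - 1]
        for b in range(B + 1):
            gain, best = 0.0, float("-inf")
            for a in range(min(b, d_max) + 1):    # candidate bid prices
                gain += m[a] * (theta_avg + prev[b - a] - prev[b])
                best = max(best, gain)
            V[t][b] = prev[b] + best
    return V


def optimal_bid(V, t, b, pctr, d_max):
    """Largest price a <= min(b, d_max) with g(a) >= 0, where
    g(d) = pctr + V(t-1, b-d) - V(t-1, b) is non-increasing in d."""
    prev = V[t - 1]
    bid = 0                                  # g(0) = pctr >= 0 always holds
    for d in range(1, min(b, d_max) + 1):
        if pctr + prev[b - d] - prev[b] < 0:
            break
        bid = d
    return bid
```

As a toy check, with market prices 1 and 2 equally likely (`m = [0.0, 0.5, 0.5]`), an average pCTR of 0.1, two auctions and a budget of 2, the table gives `V[2][2] = 0.125`: a high-pCTR request is bid at the full budget while a low-pCTR request is bid at 0, which saves budget for the next auction.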
Discussion on Derived Policy.
Contrary to the linear bidding strategies, which bid linearly w.r.t. the pCTR with a static parameter [18], such as Mcpc and Lin discussed in Section 5.3, our derived policy (denoted as RLB) adjusts its bidding function according to the current $t$ and $b$. As shown in Figure 3, RLB also introduces a linear-form bidding function when $b$ is large, but decreases the slope w.r.t. decreasing $b$ and increasing $t$. When $b$ is small, RLB introduces a non-linear concave-form bidding function.
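Discussion on the Approximation of $V(t, b)$. In Eq. (7), we take the approximation $m(\delta, x) \approx m(\delta)$ by following the dependency assumption in [32] and consequently get an approximation of the optimal value function $V(t, b)$ in Eq. (8). Here, we consider a more general case where such an assumption does not hold in the whole feature vector space $X$, but holds within each individual subset. Suppose $X$ can be explicitly divided into several disjoint segments, i.e., $X = \bigsqcup_i X_i$. The segmentation can be built by publisher, user demographics, etc. For each segment $X_i$, we take the approximation $m(\delta, x) \approx m_i(\delta)$ where $x \in X_i$. As such, we have

$$
V(t, b) = \sum_i \int_{X_i} p_x(x)\, V(t, b, x)\, dx \approx \sum_i P(x \in X_i) \max_{0 \le a_i \le b} \Big\{ \sum_{\delta=0}^{a_i} m_i(\delta)\, \theta^{i}_{\text{avg}} + \sum_{\delta=0}^{a_i} m_i(\delta)\, V(t-1, b-\delta) + \sum_{\delta=a_i+1}^{\infty} m_i(\delta)\, V(t-1, b) \Big\},
$$

where $P(x \in X_i) = \int_{X_i} p_x(x)\, dx$ and $\theta^{i}_{\text{avg}}$ is the average pCTR within segment $X_i$.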
Algorithm 1 gives a solution to the optimal policy. However, when it comes to the real-world scale, we should also consider the complexity of the algorithm. Algorithm 1 consists of two stages. The first one is about updating the value function $V(t, b)$, while the second stage is about taking the optimal action for the current state based on $V(t, b)$. The main complexity lies in the first stage, so we focus on it in this section. Two nested loops in the first stage lead the time complexity to $O(TB)$. As for the space complexity, we need to use a two-dimensional table to store $V(t, b)$, which will later be used when taking action. Thus the space complexity is $O(TB)$.
In consideration of the space and time complexity, Algorithm 1 can only be applied to small-scale situations. When $T$ and $B$ are very large, which is a common case in the real world, there will probably not be enough resources to get the exact value of $V(t, b)$ for every $(t, b) \in \{0, \dots, T\} \times \{0, \dots, B\}$.
With restricted computational resources, one may not be able to go through the whole value function update. Thus we propose some parameterized models to fit the value function on a small data scale, i.e., $\{0, \dots, T_0\} \times \{0, \dots, B_0\}$, and generalize to the large data scale $\{0, \dots, T\} \times \{0, \dots, B\}$.
Good parameterized models are supposed to have low deviation from the exact value of $V(t, b)$ for every $(t, b) \in \{0, \dots, T\} \times \{0, \dots, B\}$. That means low root mean square error (RMSE) in the training data and good generalization ability.
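Basically, we expect the prediction error of $\theta(x) + V(t-1, b-\delta) - V(t-1, b)$ from Eq. (9) in the training data to be low in comparison to the average CTR $\theta_{\text{avg}}$. For most $(t, b)$, $V(t, b)$ is much larger than $\theta_{\text{avg}}$. For example, if the budget $b$ is large enough, $V(t, b)$ is on the same scale as $t \times \theta_{\text{avg}}$. Therefore, if we take $V(t, b)$ as our target to approximate, it is difficult to give a low deviation in comparison to $\theta_{\text{avg}}$. Actually, when calculating $a(t, b, x)$ in Eq. (9), we care about the value of $V(t-1, b-\delta) - V(t-1, b)$ rather than $V(t-1, b-\delta)$ or $V(t-1, b)$. Thus here we introduce a new function of value differential $D(t, b)$ to replace the role of $V(t, b)$ by

$$
D(t, b) = V(t, b+1) - V(t, b). \tag{11}
$$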
Figure 4 illustrates the value of $D(t, b)$ and $V(t, b)$ on the data of an example campaign. In Figure 5, we use campaign 3386 from the iPinYou real-world dataset as an example and show some interesting observations of $D(t, b)$ (other campaigns are similar). At first, for a given $t$, we consider $D(t, b)$ as a function of $b$ and denote it as $D_t(b)$. $D_t(b)$ fluctuates heavily when $b$ is very small, and later keeps decreasing to 0. Similarly, for a given $b$, we have $D_b(t)$ as a function of $t$, and it keeps increasing. Moreover, both $D_t(b)$ and $D_b(t)$ are obviously nonlinear. Consequently, we apply neural networks to approximate them for large-scale $b$ and $t$.
As a widely used solution [2, 21], here we take a fully connected neural network with several hidden layers as a non-linear approximator. The input layer has two nodes for $t$ and $b$. The output layer has one node for $D(t, b)$ without activation function. As such, the neural network corresponds to a non-linear function of $t$ and $b$, denoted as $\mathrm{NN}(t, b)$.

Coarse-to-fine Episode Segmentation Model. Since neural networks do not guarantee good generalization ability and may suffer from overfitting, and also to avoid directly modeling $D(t, b)$ or $V(t, b)$, we explore the feasibility of mapping unseen states ($t > T_0$ and $b > B_0$) to acquainted states ($t \le T_0$ and $b \le B_0$) rather than giving a global parameterized representation. Similar to budget pacing, the first simple implicit mapping method divides the large episode into several small episodes with length $T_0$, and within each large episode allocates the remaining budget to the remaining small episodes. If the agent does not spend the budget allocated for a small episode, it will have more money to allocate to the rest of the small episodes in the large episode.
State Mapping Models. We also consider explicit mapping methods. At first, because $D_t(b)$ keeps decreasing and $D_b(t)$ keeps increasing, then for $D(t, b)$ where $t$ and $b$ are large, there should be some points $\{(t', b')\}$ where $t' \le T_0$ and $b' \le B_0$ such that $D(t', b') = D(t, b)$, as is shown in Figure 4, which confirms the existence of the mapping for $D(t, b)$. Similarly, $a(t, b, x)$ decreases w.r.t. $t$ and increases w.r.t. $b$, which can be seen in Figure 3 and is consistent with intuitions. Thus the mapping for $a(t, b, x)$ also exists. From the view of practical bidding, when the remaining number of auctions is large and the budget situation is similar, given the same bid request, the agent should give a similar bid price (see Figure 2). We consider a simple case where $b/t$ represents the budget condition. Then we have two linear mapping forms: (i) map $a(t, b, x)$ where $t > T_0$ to $a(T_0, \frac{b}{t} \times T_0, x)$; (ii) map $D(t, b)$ where $t > T_0$ to $D(T_0, \frac{b}{t} \times T_0)$. Denote $|D(t, b) - D(T_0, \frac{b}{t} \times T_0)|$ as $\mathrm{Dev}(t, T_0, b)$. Figure 6 shows that the deviations of the simple linear mapping method are low enough.
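The linear state mapping can be illustrated with a few lines of Python. This is a sketch under assumptions: the helper names `map_state` and `mapped_bid` are hypothetical, the small-scale horizon is called `T0`, and the mapped budget is rounded down to an integer price unit (the rounding rule is an assumption, not something stated in the text).

```python
def map_state(t, b, T0):
    """Map an unseen state (t, b) with t > T0 onto (T0, b / t * T0).

    The budget per remaining auction, b / t, is kept (approximately) fixed;
    states that are already within the small-scale range are returned as-is.
    """
    if t <= T0:
        return t, b
    return T0, (b * T0) // t   # floor to an integer budget (assumption)


def mapped_bid(bid_fn, t, b, pctr, T0):
    """Query a small-scale bidding function bid_fn(t, b, pctr) at the mapped state."""
    t_small, b_small = map_state(t, b, T0)
    return bid_fn(t_small, b_small, pctr)
```

For instance, with `T0 = 10000`, the state `(100000, 500000)` is mapped to `(10000, 50000)`: both have five budget units per remaining auction.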
Two real-world datasets are used in our experimental study, namely iPinYou and YOYI.
iPinYou is one of the mainstream RTB ad companies in China. The whole dataset comprises 19.5M impressions, 14.79K clicks and 16.0K CNY expense on 9 different campaigns over 10 days in 2013. We follow [31] for splitting the train/test sets and feature engineering.
YOYI is a leading RTB company focusing on multi-device display advertising in China. The YOYI dataset comprises 441.7M impressions, 416.9K clicks and 319.5K CNY expense during 8 days in Jan. 2016. The first 7 days are set as the training data while the last day is set as the test data.
For experiment reproducibility we publicize our code (the experiment code is available at https://github.com/hancai/rlbdp and the iPinYou dataset is available at http://data.computationaladvertising.org). In the paper we mainly report results on the iPinYou dataset, and further verify our algorithms over the YOYI dataset as a supplement.
The evaluation is from the perspective of an advertiser’s campaign with a predefined budget and lifetime (episode length).
Evaluation metrics. The main goal of the bidding agent is to optimise the campaign’s KPI (e.g., clicks, conversions, revenue, etc.) given the campaign budget. In our work, we consider the number of acquired clicks as the KPI, which is set as the primary evaluation measure in our experiments. We also analyze other statistics such as win rate, cost per mille impressions (CPM) and effective cost per click (eCPC).
Evaluation flow. We mostly follow [32] when building the evaluation flow, except that we divide the test data into episodes. Specifically, the test data is a list of records, each of which consists of the bid request feature vector, the market price and the user response (click) label. We divide the test data into episodes, each of which contains $T$ records and is allocated a budget $B$. Given the CTR estimator and the bid landscape forecasting, the bidding strategy goes through the test data episode by episode. Specifically, the bidding strategy generates a price for each bid request (the bid price cannot exceed the current budget). If the bid price is higher than or equal to the market price of the bid request, the advertiser wins the auction, pays the market price as cost, receives the user click as reward, and then updates the remaining auction number and budget.
Budget constraints. Obviously, if the allocated budget $B$ is too high, the bidding strategy can simply give a very high bid price each time to win all clicks in the test data. Therefore, in evaluation, budget constraints should not be higher than the historic total cost of the test data. We determine the budget in this way: $B = \mathrm{CPM}_{\text{train}} \times 10^{-3} \times T \times c_0$, where $\mathrm{CPM}_{\text{train}}$ is the cost per mille impressions in the training data and $c_0$ acts as the budget constraint parameter. Following previous work [32, 31], we run the evaluation with $c_0 = 1/32, 1/16, 1/8, 1/4, 1/2$.
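As a small worked example of the budget rule, the snippet below evaluates the episode budget for each budget parameter. The training CPM value of 80.0 is a made-up placeholder chosen only to make the arithmetic easy to follow, and the function name `episode_budget` is hypothetical.

```python
def episode_budget(cpm_train, episode_length, c0):
    """Budget for one episode: CPM_train * 1e-3 * T * c0.

    cpm_train is the cost per thousand impressions in the training data,
    so cpm_train * 1e-3 is the average cost of a single impression.
    """
    return cpm_train * 1e-3 * episode_length * c0


CPM_TRAIN = 80.0   # hypothetical value, for illustration only
T = 1000           # small-scale episode length
budgets = {c0: episode_budget(CPM_TRAIN, T, c0)
           for c0 in (1 / 32, 1 / 16, 1 / 8, 1 / 4, 1 / 2)}
```

With these placeholder numbers, winning every impression at the historical average cost would take 80 units per episode, so the largest setting ($c_0 = 1/2$) allows roughly half of that spend.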
Episode length. The episode auction number $T$ influences the complexity of our algorithms. When $T$ is high, the original Algorithm 1 is not capable of working with limited resources, which further leads to our large-scale algorithms. For the large-scale evaluation, we set $T$ as 100,000, which corresponds to a real-world 10-minute auction volume of a medium-scale RTB ad campaign. For the small-scale evaluation, we set the episode length as 1,000. In addition, we run a set of evaluations with a fixed budget parameter $c_0$ and varying episode length $T$ to give a more comprehensive performance analysis.
The following bidding policies are compared with the same CTR estimation component, which is a logistic regression model, and the same bid landscape forecasting component, which is a non-parametric method, as described in Section 2:
SSMDP is based on [1], considering the bid landscape but ignoring the feature vector of the bid request when giving the bid price. Although we regard this model as the state of the art, it is proposed to work on keyword-level bidding in sponsored search, which makes it not fine-grained enough to compare with RTB display advertising strategies.
Mcpc gives its bidding strategy as $a_{\text{Mcpc}}(t, b, x) = \mathrm{CPC} \times \theta(x)$, which matches some advertisers' requirement of maximum CPC (cost per click).
Lin is a linear bidding strategy w.r.t. the pCTR: $a_{\text{Lin}}(t, b, x) = b_0 \frac{\theta(x)}{\theta_{\text{avg}}}$, where $b_0$ is the basic bid price and is tuned using the training data [18]. This is the most widely used model in industry.
RLB is our proposed model for the small-scale problem as shown in Algorithm 1.
RLBNN is our proposed model for the large-scale problem, which uses the neural network $\mathrm{NN}(t, b)$ to approximate $D(t, b)$.
The segmentation variant of RLBNN combines the neural network with episode segmentation. For each small episode, the allocated budget is $B_s = B_r / N_r$, where $B_r$ is the remaining budget of the current large episode and $N_r$ is the remaining number of small episodes in the current large episode. Then RLBNN is run for the small episode. It corresponds to the coarse-to-fine episode segmentation model discussed in Section 4.1.
The first mapping variant of RLBNN combines the neural network with the mapping of $D(t, b)$. That is: (i) $D(t, b) = \mathrm{NN}(t, b)$ where $t \le T_0$; (ii) $D(t, b) = D(T_0, \frac{b}{t} \times T_0)$ where $t > T_0$.
The second mapping variant of RLBNN combines the neural network with the mapping of $a(t, b, x)$. That is: $a(t, b, x) = a(T_0, \frac{b}{t} \times T_0, x)$ where $t > T_0$. The last two models correspond to the state mapping models discussed in Section 4.1.
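The two baseline bidding functions and the budget split used by the episode segmentation variant are simple enough to write down directly. The Python below is a hedged sketch: the function and parameter names are illustrative assumptions, and the Mcpc and Lin formulas follow the standard forms (maximum CPC times pCTR, and a base bid scaled by pCTR relative to the average pCTR).

```python
def bid_mcpc(pctr, max_cpc):
    """Mcpc: bid the maximum acceptable cost per click times the pCTR."""
    return max_cpc * pctr


def bid_lin(pctr, base_bid, theta_avg):
    """Lin: bid linearly in the pCTR, scaled by the average pCTR."""
    return base_bid * pctr / theta_avg


def small_episode_budget(remaining_budget, remaining_small_episodes):
    """Segmentation: split what is left evenly over the small episodes left."""
    return remaining_budget / remaining_small_episodes


# Unspent money rolls over: later small episodes receive a larger share.
remaining, allocations = 1000.0, []
for k, spent in enumerate([60.0, 100.0, 90.0, 0.0]):
    allocation = small_episode_budget(remaining, 4 - k)
    allocations.append(allocation)
    remaining -= min(spent, allocation)   # cannot spend more than allocated
```

In this toy run the first small episode is allocated 250 units but spends only 60, so the second one is allocated 940 / 3, which is more than the even split of the original budget.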
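Table 2: Click improvement of RLB over Lin for each campaign under different budget conditions.

| iPinYou | 1/32 | 1/16 | 1/8 | 1/4 | 1/2 |
|---|---|---|---|---|---|
| 1458 | 4.66% | 3.96% | 3.25% | 0.21% | 1.02% |
| 2259 | 114.29% | 35.29% | 9.09% | 32.56% | 22.22% |
| 2261 | 25.00% | 6.25% | -3.70% | 6.82% | 0.00% |
| 2821 | 20.00% | 11.86% | 27.27% | 29.36% | 12.97% |
| 2997 | 23.81% | 54.55% | 85.26% | 13.04% | 3.18% |
| 3358 | 2.42% | 3.30% | 0.87% | 3.02% | 0.40% |
| 3386 | 8.47% | 22.47% | 13.24% | 14.57% | 13.40% |
| 3427 | 7.58% | 10.04% | 12.28% | 6.88% | 5.34% |
| 3476 | -4.68% | -3.79% | 2.50% | 5.43% | 0.72% |
| Average | 22.39% | 15.99% | 16.67% | 12.43% | 6.58% |
| YOYI | 3.89% | 2.26% | 7.41% | 3.48% | 1.71% |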
In this section we present the experimental results on small- and large-scale data settings respectively.
The performance comparison on iPinYou dataset under and different budget conditions are reported in Figure 7. In the comparison on total clicks (upper left plot), we find that (i) our proposed model RLB performs the best under every budget condition, verifying the effectiveness of the derived algorithm for optimizing attained clicks. (ii) Lin has the second best performance, which is a widely used bidding strategy in industry [18]. (iii) Compared to RLB and Lin, Mcpc does not adjust its strategy when the budget condition changes. Thus it performs quite well when but performs poorly on very limited budget conditions, which is consistent with the discussion in Section 2. (iv) SSMDP gives the worst performance, since it is unaware of the feature information of each bid request, which shows the advantages of RTB display advertising.
As for the comparison on win rate, CPM and eCPC, we observe that (i) under every budget condition, SSMDP keeps the highest win rate. The reason is that SSMDP considers each bid request equally, thus its optimization target is equivalent to the number of impressions. Therefore, its win rate should be the highest. (ii) Lin and RLB are very close in comparison on CPM and eCPC. RLB can generate a higher number of clicks with comparable CPM and eCPC against Lin because RLB effectively spends the budget according to the market situation, which is unaware of by Lin.
Table 2 details the click improvement of RLB over Lin under various campaigns and budget conditions. Among all 50 settings, RLB beats Lin in 46 (92%), ties in 1 (2%) and loses in 3 (6%). This shows that RLB is robust and outperforms Lin in the vast majority of cases. Specifically, for the 1/8 budget, RLB outperforms Lin by 16.7% on the iPinYou data and by 7.4% on the YOYI data. Moreover, Figure 8 shows the performance comparison under the same budget condition and different episode lengths. The findings are similar to the above results: compared to Lin, RLB attains more clicks with a similar eCPC. Note that in offline evaluations the total number of auctions is fixed, so a larger episode length also means fewer episodes. Thus the total click numbers in Figure 8 do not increase much with the episode length.
SSMDP is the only model that ignores the feature information of the bid request, and it therefore performs poorly overall. Table 3 reports in detail the clicks along with the AUC of the CTR estimator for each campaign. We find that when the performance of the CTR estimator is relatively low, e.g., for campaigns 2259, 2261, 2821 and 2997, whose AUC is below 68%, SSMDP attains a number of clicks comparable to Mcpc and Lin. By contrast, when the CTR estimator performs better, the methods that utilize it attain many more clicks than SSMDP.
Table 3: Clicks and AUC of the CTR estimator for each campaign.
iPinYou  AUC of CTR estimator  SSMDP  Mcpc  Lin  RLB

1458  97.73%  42  405  455  473 
2259  67.90%  13  11  17  23 
2261  62.16%  16  12  16  17 
2821  62.95%  49  38  59  66 
2997  60.44%  116  82  77  119 
3358  97.58%  15  144  212  219 
3386  77.96%  24  56  89  109 
3427  97.41%  20  178  279  307 
3476  95.84%  38  103  211  203 
Average  80.00%  37  114  157  170 
YOYI  87.79%  120  196  265  271 
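As a small worked example of how a relative click improvement can be derived from click counts, the sketch below applies the usual definition (RLB - Lin) / Lin to the Lin and RLB columns of Table 3 and tallies wins, ties and losses. That this is the exact definition behind the reported improvements is an assumption; the function and variable names are illustrative.

```python
# Click counts copied from Table 3: campaign -> (Lin clicks, RLB clicks).
TABLE3_CLICKS = {
    "1458": (455, 473),
    "2259": (17, 23),
    "2261": (16, 17),
    "2821": (59, 66),
    "2997": (77, 119),
    "3358": (212, 219),
    "3386": (89, 109),
    "3427": (279, 307),
    "3476": (211, 203),
    "YOYI": (265, 271),
}


def improvement_pct(baseline_clicks: int, new_clicks: int) -> float:
    # Relative improvement of the new strategy over the baseline, in percent.
    return 100.0 * (new_clicks - baseline_clicks) / baseline_clicks


def tally(clicks: dict) -> tuple:
    # Count settings where the new strategy wins, ties, or loses.
    wins = sum(1 for lin, rlb in clicks.values() if rlb > lin)
    ties = sum(1 for lin, rlb in clicks.values() if rlb == lin)
    losses = sum(1 for lin, rlb in clicks.values() if rlb < lin)
    return wins, ties, losses


improvements = {
    name: round(improvement_pct(lin, rlb), 2)
    for name, (lin, rlb) in TABLE3_CLICKS.items()
}
wins, ties, losses = tally(TABLE3_CLICKS)
```

A negative value marks a setting in which RLB attains fewer clicks than Lin, as happens for campaign 3476 in Table 3.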
In this section, we first run the value function update in Algorithm 1 under a small-scale episode length and budget, and then train a neural network on the attained data. Here we use a fully connected neural network with two hidden layers, both using the tanh activation function. The first hidden layer has 30 hidden nodes and the second has 15. Next, we apply the neural network to run bidding under a larger episode length and budget. In addition, SSMDP is not tested in this experiment because it suffers from scalability issues and would perform as poorly as in the small-scale evaluation.
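The network shape described above (two tanh hidden layers with 30 and 15 nodes) can be sketched in plain Python as follows. The two-dimensional input (a normalized remaining auction number and remaining budget), the single linear output, the initialization scheme and all function names are assumptions made for illustration; the weights here are random and untrained.

```python
import math
import random

# Hidden sizes 30 and 15 come from the text; the input size 2 and the
# single output are assumptions for this sketch.
LAYER_SIZES = [2, 30, 15, 1]


def init_params(sizes, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """Forward pass: tanh on the hidden layers, linear output layer."""
    h = list(x)
    last = len(params) - 1
    for i, (weights, biases) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        h = z if i == last else [math.tanh(v) for v in z]
    return h[0]


def count_params(params):
    """Total number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


params = init_params(LAYER_SIZES)
prediction = forward(params, [0.5, 0.25])  # hypothetical normalized input
n_trainable = count_params(params)
```

With these assumed input and output sizes the network has only a few hundred parameters, so evaluating it for each bid request is cheap compared with storing an exact value table over every state.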
Table 4 shows the performance of the neural network on iPinYou and YOYI. We can see that the RMSE is relatively low in comparison to , which means that the neural network can provide a good approximation to the exact algorithm when the agent comes to a state where .
Figure 9 shows the performance comparison on iPinYou under a fixed episode length and different budget conditions. We observe that (i) Mcpc performs similarly to what was observed in the small-scale setting. (ii) For total clicks, RLBNN performs better than Lin under 1/32, 1/16 and 1/8, and worse than Lin under 1/2, which shows that the generalization ability of the neural network is satisfactory only at small scales. At relatively large scales, the generalization of RLBNN is not reliable. (iii) Compared to RLBNN, the three more sophisticated algorithms RLBNNSeg, RLBNNMapD and RLBNNMapA are more robust and outperform Lin under every budget condition. They do not rely on the generalization ability of the approximation model, so their performance is more stable. The results clearly demonstrate that they are effective solutions for the large-scale problem. (iv) As for eCPC, all models except Mcpc are very close, which makes the proposed RLB algorithms practically effective.
Table 4: Approximation performance of the neural network.
Metric  iPinYou  YOYI

RMSE ()  0.998  1.263 
RMSE / ()  9.404  11.954 
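For reference, the root mean squared error and its ratio to a reference quantity can be computed as in the sketch below. The symbol of the reference quantity in the second row of Table 4 is not legible here, so the `scale` argument is a stand-in for it, and the numbers in the example are made up purely to exercise the functions.

```python
import math


def rmse(predicted, target):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))


def relative_rmse(predicted, target, scale):
    """RMSE divided by a positive reference quantity (assumed normalizer)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return rmse(predicted, target) / scale


# Made-up values, not experimental data.
exact_values = [1.0, 2.0, 5.0]
approx_values = [1.0, 2.0, 3.0]
error = rmse(approx_values, exact_values)
error_ratio = relative_rmse(approx_values, exact_values, scale=2.0)
```

Reporting the ratio alongside the raw RMSE makes the error comparable across datasets whose target values differ in magnitude.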
Our proposed RLB model is deployed and tested in a live environment provided by Vlion DSP. The deployment environment is based on HP ProLiant DL360p Gen8 servers. A 5-node cluster is utilized for the bidding agent, where each node runs CentOS release 6.3, with a 6-core Intel Xeon CPU E5-2620 (2.10GHz) and 64GB RAM. The model is implemented in Lua with Nginx.
The compared bidding strategy is Lin, as discussed in Section 5.3. The optimization target is clicks. The two compared methods are given the same budget, which is further allocated to episodes. Unlike in the offline evaluations, the online evaluation flow stops only when the budget is exhausted. Within an episode, a maximum bid number is set for each strategy to prevent overspending. Specifically, this maximum is mostly determined by the budget allocated to the episode, the previous CPM and the previous win rate. The number of auctions likely to be available during the episode is also considered. The agent keeps track of the remaining bid number and the remaining budget, which together serve as the campaign's real-time state. Note that the recorded remaining budget may deviate slightly due to latency; the latency is typically less than 100 ms, which is negligible. We test over 5 campaigns during 25-28 July 2016. All the methods share the same previous 7-day training data and the same CTR estimator, which is a logistic regression model trained with FTRL. The bid requests of each user are randomly sent to either method. The overall results are presented in Figure 10, while the click and cost performances w.r.t. time are shown in Figure 11.
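One plausible way to turn an episode budget, a previous CPM and a previous win rate into a cap on the number of bids is sketched below. This is an assumed estimate, not the expression used in the deployment: it converts the budget into an affordable number of impressions and divides by the win rate, then optionally caps the result by the expected number of available auctions. The function name and parameters are hypothetical.

```python
def estimate_max_bid_number(episode_budget, prev_cpm, prev_win_rate,
                            available_auctions=None):
    """Assumed estimate of how many bids an episode budget can sustain.

    episode_budget and prev_cpm must use the same currency unit;
    prev_cpm is the cost per one thousand won impressions.
    """
    if episode_budget < 0:
        raise ValueError("episode_budget must be non-negative")
    if prev_cpm <= 0:
        raise ValueError("prev_cpm must be positive")
    if not 0 < prev_win_rate <= 1:
        raise ValueError("prev_win_rate must be in (0, 1]")

    # Impressions the budget can pay for at the previous average price.
    affordable_impressions = 1000.0 * episode_budget / prev_cpm
    # Bids needed, on average, to win that many impressions.
    max_bids = int(affordable_impressions / prev_win_rate)

    # Never plan for more bids than auctions expected in the episode.
    if available_auctions is not None:
        max_bids = min(max_bids, available_auctions)
    return max_bids


# Illustrative numbers only.
uncapped = estimate_max_bid_number(60.0, prev_cpm=60.0, prev_win_rate=0.25)
capped = estimate_max_bid_number(60.0, prev_cpm=60.0, prev_win_rate=0.25,
                                 available_auctions=3000)
```

In this sketch a lower previous win rate raises the cap, since more bids are needed to spend the same budget, while the optional auction count keeps the cap realistic for low-traffic episodes.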
From the comparison, we observe the following: (i) with the same cost, RLB achieves a lower eCPC than Lin, and thus more total clicks, which shows the cost-effectiveness of RLB. (ii) RLB provides better planning than Lin: the acquired clicks and spent budget increase evenly over time. (iii) With better planning, RLB obtains a lower CPM than Lin, yielding more bids and more winning impressions. (iv) While paying a lower CPM by winning cheaper impressions, RLB achieves a CTR close to that of Lin, which leads to superior performance. In summary, the online evaluation demonstrates the effectiveness of our proposed RLB model for optimizing attained clicks with good pacing.
In this paper, we proposed a model-based reinforcement learning model (RLB) for learning the bidding strategy in RTB display advertising. The bidding strategy is naturally defined as the policy of making a bidding action given the state of the campaign's parameters and the input bid request information. With an MDP formulation, the state transition and reward function are captured via modeling the auction competition and user click, respectively. The optimal bidding policy is then derived using dynamic programming. Furthermore, to deal with the large-scale auction volume and campaign budget, we proposed neural network models to fit the differential of the values between two consecutive states. Experimental results on two real-world large-scale datasets and an online A/B test demonstrated the superiority of our RLB solutions over several strong baselines and state-of-the-art methods, as well as their high efficiency in handling large-scale data.
For future work, we will investigate model-free approaches such as Q-learning and policy gradient methods to unify utility estimation, bid landscape forecasting and bid optimization into a single optimization framework and to handle the highly dynamic environment. Also, since RLB naturally tackles the problem of budget over- or underspending across the campaign lifetime, we will compare our RLB solutions with explicit budget pacing techniques [13, 28].
We sincerely thank the engineers from YOYI DSP for providing us with the offline experiment dataset and the engineers from Vlion DSP for helping us conduct the online A/B tests.