Real-Time Bidding by Reinforcement Learning in Display Advertising

01/10/2017 ∙ by Han Cai, et al. ∙ Shanghai Jiao Tong University UCL 0

The majority of online display ads are served through real-time bidding (RTB) --- each ad display impression is auctioned off in real-time when it is just being generated from a user visit. To place an ad automatically and optimally, it is critical for advertisers to devise a learning algorithm to cleverly bid an ad impression in real-time. Most previous works consider the bid decision as a static optimization problem of either treating the value of each impression independently or setting a bid price to each segment of ad volume. However, the bidding for a given ad campaign would repeatedly happen during its life span before the budget runs out. As such, each bid is strategically correlated by the constrained budget and the overall effectiveness of the campaign (e.g., the rewards from generated clicks), which is only observed after the campaign has completed. Thus, it is of great interest to devise an optimal bidding strategy sequentially so that the campaign budget can be dynamically allocated across all the available impressions on the basis of both the immediate and future rewards. In this paper, we formulate the bid decision process as a reinforcement learning problem, where the state space is represented by the auction information and the campaign's real-time parameters, while an action is the bid price to set. By modeling the state transition via auction competition, we build a Markov Decision Process framework for learning the optimal bidding policy to optimize the advertising performance in the dynamic real-time bidding environment. Furthermore, the scalability problem from the large real-world auction volume and campaign budget is well handled by state value approximation using neural networks.



There are no comments yet.


page 6

page 8

page 9

Code Repositories

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The increased availability of big data and the improved computational power have advanced machine learning and artificial intelligence for various prediction and decision making tasks. In particular, the successful application of reinforcement learning in certain settings such as gaming control

[19] has demonstrated that machines not only can predict, but also have a potential of achieving comparable human-level control and decision making. In this paper, we study machine bidding in the context of display advertising. Auctions, particularly real-time bidding (RTB), have been a major trading mechanism for online display advertising [30, 7, 23]. Unlike the keyword-level bid decision in sponsored search [1], the advertiser needs to make the impression-level bid decision in RTB, i.e., bidding for every single ad impression in real time when it is just being generated by a user visit [18, 30] . Machine based bid decision, i.e., to calculate the strategic amount that the advertiser would like to pay for an ad opportunity, constitutes a core component that drives the campaigns’ ROI [11, 32]. By calculating an optimal bid price for each ad auction (also considering the remaining budget and the future availability of relevant ad impressions in the ad exchange) and then observing the auction result and user feedback, the advertiser would be able to refine their bidding strategy and better allocate the campaign budget across the online page view volume.

A straightforward bidding strategy in RTB display advertising is mainly based on the truthfulness of second-price auctions [6], which means the bid price for each ad impression should be equal to its true value, i.e., the action value (e.g., click value) multiplied by the action rate (e.g., click-through rate) [14]. However, for budgeted bidding in repeated auctions, the optimal bidding strategy may not be truthful but depends on the market competition, auction volume and campaign budget [31]. In [18, 32], researchers have proposed to seek the optimal bidding function that directly maximizes the campaign’s key performance indicator (KPI), e.g., total clicks or revenue, based on the static distribution of input data and market competition models. Nevertheless, such static bid optimization frameworks may still not work well in practice because the RTB market competition is highly dynamic and it is fairly common that the true data distribution heavily deviates from the assumed one during the model training [4], which requires additional control step such as budget pacing [28] to constrain the budget spending.

In this paper, we solve the issue by considering bidding as a sequential decision, and formulate it as a reinforcement learning to bid (RLB) problem. From an advertiser’s perspective, the whole ad market and Internet users are regarded as the environment. At each timestep, the advertiser’s bidding agent observes a state, which consists of the campaign’s current parameters, such as the remaining lifetime and budget, and the bid request for a specific ad impression (containing the information about the underlying user and their context) . With such state (and context), the bidding agent makes a bid action for its ad. After the ad auction, the winning results with the cost and the corresponding user feedback will be sent back to the bidding agent, which forms the reward signal of the bidding action. Thus, the bid decision aims to derive an optimal bidding policy for each given bid request.

With the above settings, we build a Markov Decision Process (MDP) framework for learning the optimal bidding policy to optimize the advertising performance. The value of each state will be calculated by performing dynamic programming. Furthermore, to handle the scalability problem for the real-world auction volume and campaign budget, we propose to leverage a neural network model to approximate the value function. Besides directly generalizing the neural network value function, we also propose a novel coarse-to-fine episode segmentation model and state mapping models to overcome the large-scale state generalization problem.

In our empirical study, the proposed solution has achieved 16.7% and 7.4% performance gains against the state-of-the-art methods on two large-scale real-world datasets. In addition, our proposed system has been deployed into a commercial RTB platform. We have performed an online A/B testing, where a 44.7% improvement in click performance was observed against a most widely used method in the industry.

2 Background and Related Work

Reinforcement Learning. An MDP provides a mathematical framework which is widely used for modelling the dynamics of an environment under different actions, and is useful for solving reinforcement learning problems [21]. An MDP is defined by the tuple . The set of all states and actions are represented by and

respectively. The reward and transition probability functions are given by

and . Dynamic programming is used in cases where the environment’s dynamics, i.e., the reward function and transition probabilities are known in advance. Two popular dynamic programming algorithms are policy iteration and value iteration. For large-scale situations, it is difficult to experience the whole state space, which leads to the use of function approximation that constructs an approximator of the entire function [8, 22]. In this work, we use value iteration for small-scale situations, and further build a neural network approximator to solve the scalability problem.

RTB Strategy. In the RTB process [32]

, the advertiser receives the bid request of an ad impression with its real-time information and the very first thing to do is to estimate the

utility, i.e., the user’s response on the ad if winning the auction. The distribution of the cost, i.e., the market price [1, 5], which is the highest bid price from other competitors, is also forecasted by the bid landscape forecasting component. Utility estimation and bid landscape forecasting are described below. Given the estimated utility and cost factors, the bidding strategy [30]

decides the final bid price with accessing the information of the remaining budget and auction volume. Thus it is crucial to optimize the final bidding strategy considering the market and bid request information with budget constraints. A recent comprehensive study on the data science of RTB display advertising is posted in


Utility Estimation. For advertisers, the utility is usually defined based on the user response, i.e., click or conversion, and can be modeled as a probability estimation task [14]. Much work has been proposed for user response prediction, e.g., click-through rate (CTR) [16], conversion rate (CVR) [14] and post-click conversion delay patterns [3]

. For modeling, linear models such as logistic regression

[14] and non-linear models such as boosted trees [10] and factorization machines [17] are widely used in practice. There are also online learning models that immediately perform updating when observing each data instance, such as Bayesian probit regression [9], FTRL learning in logistic regression [16]. In our paper, we follow [14, 32] and adopt the most widely used logistic regression for utility estimation to model the reward on agent actions.

Bid Landscape Forecasting. Bid landscape forecasting refers to modeling the market price distribution for auctions of specific ad inventory, and its c.d.f. is the winning probability given each specific bid price [5]. The authors in [15, 32, 5]

presented some hypothetical winning functions and learned the parameters. For example, a log-normal market price distribution with the parameters estimated by gradient boosting decision trees was proposed in

[5]. Since advertisers only observe the winning impressions, the problem of censored data [1, 33] is critical. Authors in [27]

proposed to leverage censored linear regression to jointly model the likelihood of observed market prices in winning cases and censored ones with losing bids. Recently, the authors in

[25] proposed to combine survival analysis and decision tree models, where each tree leaf maintains a non-parametric survival model to fit the censored market prices. In this paper, we follow [1, 33] and use a non-parametric method to model the market price distribution.

Bid Optimization. As has been discussed above, bidding strategy optimization is the key component within the decision process for the advertisers [18]. The auction theory [12] proved that truthful bidding is the optimal strategy under the second-price auction. However, truthful bidding may perform poorly when considering the multiple auctions and the budget constraint [32]. In real-world applications, the linear bidding function [18] is widely used. The authors in [32]

empirically showed that there existed non-linear bidding functions better than the linear ones under variant budget constraints. When the data changes, however, the heuristic model

[18] or hypothetical bidding functions [31, 32] cannot depict well the real data distribution. The authors in [1, 29] proposed the model-based MDPs to derive the optimal policy for bidding in sponsored search or ad selection in contextual advertising, where the decision is made on keyword level. In our work, we investigate the most challenging impression-level bid decision problem in RTB display advertising that is totally different from [1, 29]. We also tackle the scalability problem, which remains unsolved in [1], and demonstrate the efficiency and effectiveness of our method in a variety of experiments.

3 Problem and Formulation

In a RTB ad auction, each bidding agent acts on behalf of its advertiser and generates bids to achieve the campaign’s specific target. Our main goal is to derive the optimal bidding policy in a reinforcement learning fashion. For most performance-driven campaigns, the optimization target is to maximize the user responses on the displayed ads if the bid leads to auction winning. Without loss of generality, we consider clicks as our targeted user response objective, while other KPIs can be adopted similarly. The diagram of interactions between the bidding agent and the environment is shown in Figure 1.

3.1 Problem Definition

Figure 1: Diagram of reinforcement learning to bid.

Mathematically, we consider bidding in display advertising as an episodic process [1]; each episode comprises

auctions which are sequentially sent to the bidding agent. Each auction is represented by a high dimensional feature vector

, which is indexed via one-hot binary encoding. Each entry of corresponds to a category in a field, such as the category London in the field City, and the category Friday in the field Weekday. The fields consist of the campaign’s ad information (e.g., ad creative ID and campaign ID) and the auctioned impression contextual information (e.g., user cookie ID, location, time, publisher domain and URL).

At the beginning, the agent is initialized with a budget , and the advertising target is set to acquire as many clicks as possible during the following auctions. Three main pieces of information are considered by the agent (i) the remaining auction number ; (ii) the unspent budget and (iii) the feature vector . During the episode, each auction will be sent to the agent sequentially and for each of them the agent needs to decide the bid price according to the current information , and .

The agent maintains the remaining number of auctions and the remaining budget . At each timestep, the agent receives an auction (the feature vector space), and determines its bid price . We denote the market price p.d.f. given as , where is the market price and is its probability. If the agent bids at price , then it wins the auction and pays , and the remaining budget changes to . In this case, the agent can observe the user response and the market price later. Alternatively, if losing, the agent gets nothing from the auction. We take predicted CTR (pCTR) as the expected reward, to model the action utility. After each auction, the remaining number of auctions changes to . When the auction flow of this episode runs out, the current episode ends and both the remaining auction number and budget are reset.

3.2 MDP Formulation of RTB

A Markov Decision Process (MDP) provides a framework that is widely used for modeling agent-environment interactions [21]. Our notations are listed in Table 1. An MDP can be represented by a tuple , where and are two sets of all states and all possible actions in state , respectively. represents the state transition probability from state to another state when taking action , which is denoted by . Similarly, is the reward function denoted by that represents the reward received after taking action in state and then transiting to state .

Notation Description
The feature vector that represents a bid request.
The whole feature vector space.

The probability density function of

The predicted CTR (pCTR) if winning the auction of .
The p.d.f. of market price given .
The p.d.f. of market price .
The expected total reward with starting state ,
taking the optimal policy.
The expected total reward with starting state ,
taking the optimal policy.
The optimal action in state .
Table 1: A summary of our notations.

We consider as a state 111For simplicity, we slightly abuse the notation by including in the state. and assume the feature vector is drawn i.i.d. from the probability density function . The full state space is . And if , the state is regarded as a terminal state which means the end of the episode. The set of all actions available in state is , corresponding to the bid price. Furthermore, in state where , the agent, when bidding , can transit to with probability where and . That is the case of winning the auction and receiving a reward . And the agent may lose the auction whereafter transit to with probability , where . All other transitions are impossible because of the auction process. In summary, transition probabilities and reward function can be written as:


where , and . Specifically, the first equation is the transition when giving a bid price , while the second equation is the transition when losing the auction.

A deterministic policy is a mapping from each state to action , i.e., , which corresponds to the bidding strategy in RTB display advertising. According to the policy , we have the value function : the expected sum of rewards upon starting in state and obeying policy . This satisfies the Bellman equation with the discount factor since in our scenario the total click number is the optimization target, regardless of the click time.


The optimal value function is defined as . We also have the optimal policy as:


which gives the optimal action at each state and . The optimal policy is exactly the optimal bidding strategy we want to find. For notation simplicity, in later sections, we use to represent the optimal value function .

One may consider the possibility of model-free approaches [26, 20] to directly learn the bidding policy from experience. However, such model-free approaches may suffer from the problems of transition dynamics of the enormous state space, the sparsity of the reward signals and the highly stochastic environment. Since there are many previous works on modeling the utility (reward) and the market price distribution (transition probability) as discussed in Section 2, we take advantage of them and propose our model-based solution for this problem.

4 Dynamic Programming Solution

In a small scale, Eq. (3) can be solved using a dynamic programming approach. As defined, we have the optimal value function , where represents the state . Meanwhile, we consider the situations where we do not observe the feature vector ; so another optimal value function is : the expected total reward upon starting in without observing the feature vector when the agent takes the optimal policy. It satisfies . Also that, we have the optimal policy and express it as the optimal action .

From the definition, we have as the agent gets nothing when there are no remaining auctions. Combined with the transition probability and reward function described in Eq. (3.2), the definition of , can be expressed with as


where the first summation222In practice, the bid prices in various RTB ad auctions are required to be integer. is for the situation when winning the auction and the second summation is that when losing. Similarly, the optimal action in state is


where the optimal bid action involves three terms: , and . is derived by marginalizing out :


To settle the integration over in Eq. (4), we consider an approximation by following the dependency assumption in [32]. Thus


where is the expectation of the pCTR , which can be easily calculated with historical data. Taking Eq. (4) into Eq. (4), we get an approximation of the optimal value function :


Noticing that , Eq. (4) is rewritten as


where we denote . From the definition, we know monotonically increases w.r.t. , i.e., where . As such, monotonically decreases w.r.t. . Thus monotonically decreases w.r.t. . Moreover, and . Here, we care about the value of . (i) If , then where , so where . As a result, in this case, we have . (ii) If , then there must exist an integer such that and , . So when and when . Consequently, in this case, we have . In conclusion, we have

Figure 2: on campaign 2821 as an example.

Figure 2 shows examples of on campaign 2821 from iPinYou real-world dataset. Additionally, we usually have a maximum market price , which is also the maximum bid price. The corresponding RLB algorithm is shown in Algorithm 1.

0:  p.d.f. of market price , average CTR , episode length , budget
0:  value function
1:  initialize = 0
2:  for  do
3:     for  do
4:         enumerate from 0 to and set via Eq. (4)
5:     end for
6:  end for
0:  CTR estimator , value function , current state
0:  optimal bid price in current state
1:  calculate the pCTR for the current bid request:
2:  for  do
3:     if  then
5:     end if
6:  end for
Algorithm 1 Reinforcement Learning to Bid

Discussion on Derived Policy.

Figure 3: Example derived bidding functions.

Contrary to the linear bidding strategies which bids linearly w.r.t. the pCTR with a static parameter [18], such as Mcpc and Lin discussed in Section 5.3, our derived policy (denoted as RLB) adjusts its bidding function according to current and . As shown in Figure 3, RLB also introduces a linear form bidding function when is large, but decreases the slope w.r.t. decreasing and increasing . When is small (such as ), RLB introduces a non-linear concave form bidding function.

Discussion on the Approximation of . In Eq. (4), we take the approximation by following the dependency assumption in [32] and consequently get an approximation of the optimal value function in Eq. (4). Here, we consider a more general case where such assumption does not hold in the whole feature vector space , but holds within each individual subset. Suppose can be explicitly divided into several segments, i.e., . The segmentation can be built by publisher, user demographics etc. For each segment , we take the approximation where . As such, we have

4.1 Handling Large-Scale Issues

Algorithm 1 gives a solution to the optimal policy. However, when it comes to the real-world scale, we should also consider the complexity of the algorithm. Algorithm 1 consists of two stages. The first one is about updating the value function , while the second stage is about taking the optimal action for current state based on . We can see that the main complexity is on the first stage. Thus we focus on the first stage in this section. Two nested loops in the first stage lead the time complexity to . As for the space complexity, we need to use a two-dimensional table to store , which will later be used when taking action. Thus the space complexity is .

In consideration of the space complexity and the time complexity, Algorithm 1 can only be applied to small-scale situations. When we confront the situation where and are very large, which is a common case in real world, there will probably be not enough resource to get the exact value of for every .

With restricted computational resources, one may not be able to go through the whole value function update. Thus we propose some parameterized models to fit the value function on small data scale, i.e., , and generalize to the large data scale .

Good parameterized models are supposed to have low deviation to the exact value of for every . That means low root mean square error (RMSE) in the training data and good generalization ability.

Basically, we expect the prediction error of from Eq. (4) in the training data to be low in comparison to the average CTR . For most , is much larger than . For example, if the budget is large enough, is with the same scale of . Therefore, if we take as our target to approximate, it is difficult to give a low deviation in comparison to . Actually, when calculating in Eq. (4), we care about the value of rather than or . Thus here we introduce a new function of value differential to replace the role of by

Figure 4: and on campaign 3427.
Figure 5: Analysis of on campaign 3386.

Figure 4 illustrates the value of and on the data of an example campaign. In Figure 5, we use the campaign 3386 from iPinYou real-world dataset as an example and show some interesting observations of (other campaigns are similar). At first, for a given , we consider as a function of and denote it as . fluctuates heavily when is very small, and later keeps decreasing to 0. Similarly, for a given , we have as a function of and it keeps increasing. Moreover, both and are obviously nonlinear. Consequently, we apply the neural networks to approximate them for large-scale and .

As a widely used solution [2, 21], here we take a fully connected neural network with several hidden layers as a non-linear approximator. The input layer has two nodes for and . The output layer has one node for

without activation function. As such, the neural network corresponds to a non-linear function of

and , denoted as .

Coarse-to-fine Episode Segmentation Model. Since the neural networks do not guarantee good generalization ability and may suffer from overfitting, and also to avoid directly modeling or , we explore the feasibility of mapping unseen states ( and ) to acquainted states ( and ) rather than giving a global parameterized representation. Similar to budget pacing, we have the first simple implicit mapping method where we can divide the large episode into several small episodes with length and within each large episode we allocate the remaining budget to the remaining small episodes. If the agent does not spend the budget allocated for the small episode, it will have more allocated money for the rest of the small episodes in the large episode.

State Mapping Models. Also we consider explicit mapping methods. At first, because keeps decreasing and keeps increasing, then for where and are large, there should be some points where and such that as is shown in Figure 4, which confirms the existence of the mapping for . Similarly, decreases w.r.t. and increases w.r.t. , which can be seen in Figure 3 and is consistent with intuitions. Thus the mapping for also exists. From the view of practical bidding, when the remaining number of auctions are large and the budget situation is similar, given the same bid request, the agent should give a similar bid price (see Figure 2). We consider a simple case that represents the budget condition. Then here we have two linear mapping forms: (i) map where to . (ii) map where to . Denote as . Figure 6 shows that the deviations of the simple linear mapping method are low enough ().

Figure 6: on campaign 3427 as an example.

5 Experimental Setup

5.1 Datasets

Two real-world datasets are used in our experimental study, namely iPinYou and YOYI.

iPinYou is one of the mainstream RTB ad companies in China. The whole dataset comprises 19.5M impressions, 14.79K clicks and 16.0K CNY expense on 9 different campaigns over 10 days in 2013. We follow [31] for splitting the train/test sets and feature engineering.

YOYI is a leading RTB company focusing on multi-device display advertising in China. YOYI dataset comprises 441.7M impressions, 416.9K clicks and 319.5K CNY expense during 8 days in Jan. 2016. The first 7 days are set as the training data while the last day is set as the test data.

For experiment reproducibility we publicize our code333The experiment code is available at and iPinYou dataset is available at In the paper we mainly report results on iPinYou dataset, and further verify our algorithms over the YOYI dataset as supplementary.

5.2 Evaluation Methods

The evaluation is from the perspective of an advertiser’s campaign with a predefined budget and lifetime (episode length).

Evaluation metrics. The main goal of the bidding agent is to optimise the campaign’s KPI (e.g., clicks, conversions, revenue, etc.) given the campaign budget. In our work, we consider the number of acquired clicks as the KPI, which is set as the primary evaluation measure in our experiments. We also analyze other statistics such as win rate, cost per mille impressions (CPM) and effective cost per click (eCPC).

Evaluation flow. We mostly follow [32] when building the evaluation flow, except that we divide the test data into episodes. Specifically, the test data is a list of records, each of which consists of the bid request feature vector, the market price and the user response (click) label. We divide the test data into episodes, each of which contains records and is allocated with a budget . Given the CTR estimator and the bid landscape forecasting, the bidding strategy goes through the test data episode by episode. Specifically, the bidding strategy generates a price for each bid request (the bid price cannot exceed current budget). If the bid price is higher than or equal to the market price of the bid request, the advertiser wins the auction and then receives the market price as cost and the user click as reward and then updates the remaining auction number and budget.

Budget constraints. Obviously, if the allocated budget is too high, the bidding strategy can simply give a very high bid price each time to win all clicks in the test data. Therefore, in evaluation, budget constraints should not be higher than the historic total cost of the test data. We determine the budget in this way: , where is the cost per mille impressions in the training data and acts as the budget constraints parameter. Following previous work [32, 31], we run the evaluation with .

Episode length. The episode auction number influences the complexity of our algorithms. When is high, the original Algorithm 1 is not capable of working with limited resources, which further leads to our large-scale algorithms. For the large-scale evaluation, we set as 100,000, which corresponds to a real-world 10-minute auction volume of a medium-scale RTB ad campaign. And for the small-scale evaluation, we set the episode length as 1,000. In addition, we run a set of evaluations with and the episode length to give a more comprehensive performance analysis.

5.3 Compared Methods

The following bidding policies are compared with the same CTR estimation component which is a logistic regression model and the same bid landscape forecasting component which is a non-parametric method, as described in Section 2:


is based on [1], considering the bid landscape but ignoring the feature vector of bid request when giving the bid price. Although we regard this model as the state-of-the-art, it is proposed to work on keyword-level bidding in sponsored search, which makes it not fine-grained enough to compare with RTB display advertising strategies.


gives its bidding strategy as CPC , which matches some advertisers’ requirement of maximum CPC (cost per click).


is a linear bidding strategy w.r.t. the pCTR: , where is the basic bid price and is tuned using the training data [18]. This is the most widely used model in industry.


is our proposed model for the small-scale problem as shown in Algorithm 1.


is our proposed model for the large-scale problem, which uses the neural network to approximate .


combines the neural network with episode segmentation. For each small episode, the allocated budget is where is the remaining budget of the current large episode and is the remaining number of small episodes in the current large episode. Then RLB-NN is run for the small episode. It corresponds to the coarse-to-fine episode segmentation model discussed in Section 4.1.


combines the neural network with the mapping of . That is: (i) where . (ii) where .


combines the neural network with the mapping of . That is: where . The last two models correspond to the state mapping models discussed in Section 4.1.

Figure 7: Overall performance on iPinYou under and different budget conditions.
iPinYou 1/32 1/16 1/8 1/4 1/2
1458 4.66% 3.96% 3.25% 0.21% 1.02%
2259 114.29% 35.29% 9.09% 32.56% 22.22%
2261 25.00% 6.25% -3.70% 6.82% 0.00%
2821 20.00% 11.86% 27.27% 29.36% 12.97%
2997 23.81% 54.55% 85.26% 13.04% 3.18%
3358 2.42% 3.30% 0.87% 3.02% 0.40%
3386 8.47% 22.47% 13.24% 14.57% 13.40%
3427 7.58% 10.04% 12.28% 6.88% 5.34%
3476 -4.68% -3.79% 2.50% 5.43% 0.72%
Average 22.39% 15.99% 16.67% 12.43% 6.58%
YOYI 3.89% 2.26% 7.41% 3.48% 1.71%
Table 2: Click improvement of RLB over Lin for each campaign under and different budget conditions.

6 Experimental Results

In this section we present the experimental results on small- and large-scale data settings respectively.

6.1 Small-Scale Evaluation

The performance comparison on iPinYou dataset under and different budget conditions are reported in Figure 7. In the comparison on total clicks (upper left plot), we find that (i) our proposed model RLB performs the best under every budget condition, verifying the effectiveness of the derived algorithm for optimizing attained clicks. (ii) Lin has the second best performance, which is a widely used bidding strategy in industry [18]. (iii) Compared to RLB and Lin, Mcpc does not adjust its strategy when the budget condition changes. Thus it performs quite well when but performs poorly on very limited budget conditions, which is consistent with the discussion in Section 2. (iv) SS-MDP gives the worst performance, since it is unaware of the feature information of each bid request, which shows the advantages of RTB display advertising.

As for the comparison on win rate, CPM and eCPC, we observe that (i) under every budget condition, SS-MDP keeps the highest win rate. The reason is that SS-MDP considers each bid request equally, thus its optimization target is equivalent to the number of impressions. Therefore, its win rate should be the highest. (ii) Lin and RLB are very close in comparison on CPM and eCPC. RLB can generate a higher number of clicks with comparable CPM and eCPC against Lin because RLB effectively spends the budget according to the market situation, which is unaware of by Lin.

Table 2 provides a detailed performance on clicks of RLB over Lin under various campaigns and budget conditions. Among all 50 settings, RLB wins Lin in 46 (92%), ties in 1 (2%) and loses in 3 (6%) settings. It shows that RLB is robust and significantly outperforms Lin in the vast majority of the cases. Specifically, for 1/8 budget, RLB outperforms Lin by 16.7% on iPinYou data and 7.4% on YOYI data. Moreover, Figure 8 shows the performance comparison under the same budget condition () and different episode lengths. The findings are similar to the above results. Compared to Lin, RLB can attain more clicks with similar eCPC. Note that, in offline evaluations the total auction number is stationary, larger episode length also means smaller episode number. Thus the total click numbers in Figure 8 do not increase largely w.r.t. .

SS-MDP is the only model that ignores the feature information of the bid request, thus providing a poor overall performance. Table 3 reports in detail the clicks along with the AUC of the CTR estimator for each campaign. We find that when the performance of the CTR estimator is relatively low (AUC ), e.g., campaign 2259, 2261, 2821, 2997, the performance of SS-MDP on clicks is quite good in comparison to Mcpc and Lin. By contrast, when the performance of the CTR estimator gets better, other methods which utilize the CTR estimator can attain much more clicks than SS-MDP.

Figure 8: Overall performance comparison on iPinYou under and different ’s.
iPinYou AUC of SS-MDP Mcpc Lin RLB
1458 97.73% 42 405 455 473
2259 67.90% 13 11 17 23
2261 62.16% 16 12 16 17
2821 62.95% 49 38 59 66
2997 60.44% 116 82 77 119
3358 97.58% 15 144 212 219
3386 77.96% 24 56 89 109
3427 97.41% 20 178 279 307
3476 95.84% 38 103 211 203
Average 80.00% 37 114 157 170
YOYI 87.79% 120 196 265 271
Table 3: Detailed AUC and clicks ( and ).

6.2 Large-Scale Evaluation

In this section, we first run the value function update in Algorithm 1 under and , then train a neural network with the attained data (where ). Here we use a fully connected neural network with two hidden layers which use tanh activation function. The first hidden layer has 30 hidden nodes and the second one has 15 hidden nodes. Next, we apply the neural network to run bidding under and . In addition, SS-MDP is not tested in this experiment because it suffers from scalability issues and will have a similarly low performance as in the small-scale evaluation.

Table 4 shows the performance of the neural network on iPinYou and YOYI. We can see that the RMSE is relatively low in comparison to , which means that the neural network can provide a good approximation to the exact algorithm when the agent comes to a state where .

Figure 9 shows the performance comparison on iPinYou under and different budget conditions. We observe that (i) Mcpc has a similar performance to that observed in small-scale situations. (ii) For total clicks, RLB-NN performs better than Lin under 1/32, 1/16, 1/8 and performs worse than Lin under 1/2, which shows that the generalization ability of the neural network is satisfactory only in small scales. For relatively large scales, the generalization of RLB-NN is not reliable. (iii) Compared to RLB-NN, the 3 sophisticated algorithms RLB-NN-Seg, RLB-NN-MapD and RLB-NN-MapA are more robust and outperform Lin under every budget condition. They do not rely on the generalization ability of the approximation model, therefore their performance is more stable. The results clearly demonstrate that they are effective solutions for the large-scale problem. (iv) As for eCPC, all models except from Mcpc are very close, thus making the proposed RLB algorithms practically effective.

iPinYou YOYI
RMSE () 0.998 1.263
RMSE / () 9.404 11.954
Table 4: Approximation performance of the neural network.
Figure 9: Overall performance on iPinYou under and different budget conditions.

7 Online Deployment and A/B Test

Our proposed RLB model is deployed and tested in a live environment provided by Vlion DSP. The deployment environment is based on HP ProLiant DL360p Gen8 servers. A 5-node cluster is utilized for the bidding agent, where each node is in CentOS release 6.3, with 6 core Intel Xeon CPU E5-2620 (2.10GHz) and 64GB RAM. The model is implemented in Lua with Nginx.

The compared bidding strategy is Lin as discussed in Section 5.3. The optimization target is click. The two compared methods are given the same budget, which is further allocated to episodes. Unlike offline evaluations, the online evaluation flow stops only when the budget is exhausted. Within an episode, a maximum bid number is set for each strategy to prevent overspending too much. Specifically, is mostly determined by the allocated budget for the episode , previous CPM and win rate: . The possible available auction number during the episode is also considered when determining . The agent keeps the remaining bid number and budget, which we consider as and respectively. Note that the remaining budget may have some deviation due to latency. The latency is typically less than 100ms, which is negligible. We test over 5 campaigns during 25-28th of July, 2016. All the methods share the same previous 7-day training data, and the same CTR estimator which is a logistic regression model trained with FTRL. The bid requests of each user are randomly sent to either method. The overall results are presented in Figure 10, while the click and cost performances w.r.t. time are shown in Figure 11.

From the comparison, we observe the following: (i) with the same cost, RLB achieves lower eCPC than Lin, and thus more total clicks, which shows the cost effectiveness of RLB. (ii) RLB provides better planning than Lin: the acquired clicks and spent budget increase evenly across the time. (iii) With better planning, RLB obtains lower CPM than Lin, yielding more bids and more winning impressions. (iv) With lower CPM on cheap cases, RLB achieves a close CTR compared to Lin, which leads to superior performance. In summary, the online evaluation demonstrates the effectiveness of our proposed RLB model for optimizing attained clicks with a good pacing.

Figure 10: Online A/B testing results.
Figure 11: Total clicks and cost increase over episodes.

8 Conclusions

In this paper, we proposed a model-based reinforcement learning model (RLB) for learning the bidding strategy in RTB display advertising. The bidding strategy is naturally defined as the policy of making a bidding action given the state of the campaign’s parameters and the input bid request information. With an MDP formulation, the state transition and reward function are captured via modeling the auction competition and user click, respectively. The optimal bidding policy is then derived using dynamic programming. Furthermore, to deal with the large-scale auction volume and campaign budget, we proposed neural network models to fit the differential of the values between two consecutive states. Experimental results on two real-world large-scale datasets and online A/B test demonstrated the superiority of our RLB solutions over several strong baselines and state-of-the-art methods, as well as their high efficiency to handle large-scale data.

For future work, we will investigate model-free approaches such as Q-learning and policy gradient methods to unify utility estimation, bid landscape forecasting and bid optimization into a single optimization framework and handle the highly dynamic environment. Also, since RLB naturally tackles the problem of budget over- or under-spending across the campaign lifetime, we will compare our RLB solutions with the explicit budget pacing techniques [13, 28].


We sincerely thank the engineers from YOYI DSP to provide us the offline experiment dataset and the engineers from Vlion DSP to help us conduct online A/B tests.


  • [1] K. Amin, M. Kearns, P. Key, and A. Schwaighofer. Budget optimization for sponsored search: Censored learning in mdps. UAI, 2012.
  • [2] J. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. NIPS, pages 369–376, 1995.
  • [3] O. Chapelle. Modeling delayed feedback in display advertising. In KDD, 2014.
  • [4] Y. Chen, P. Berkhin, B. Anderson, and N. R. Devanur. Real-time bidding algorithms for performance-based display ad allocation. In KDD, 2011.
  • [5] Y. Cui, R. Zhang, W. Li, and J. Mao. Bid landscape forecasting in online ad exchange marketplace. In KDD, 2011.
  • [6] B. Edelman, M. Ostrovsky, and M. Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. Technical report, National Bureau of Economic Research, 2005.
  • [7] Google. The arrival of real-time bidding, 2011.
  • [8] G. J. Gordon. Stable function approximation in dynamic programming. In ICML, pages 261–268, 1995.
  • [9] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In ICML, 2010.
  • [10] X. He, J. Pan, O. Jin, T. Xu, B. Liu, T. Xu, Y. Shi, A. Atallah, R. Herbrich, S. Bowers, et al. Practical lessons from predicting clicks on ads at facebook. In ADKDD, 2014.
  • [11] K. Hosanagar and V. Cherepanov. Optimal bidding in stochastic budget constrained slot auctions. In EC, 2008.
  • [12] V. Krishna. Auction theory. Academic press, 2009.
  • [13] K.-C. Lee, A. Jalali, and A. Dasdan. Real time bid optimization with smooth budget delivery in online advertising. In ADKDD, 2013.
  • [14] K.-c. Lee, B. Orten, A. Dasdan, and W. Li. Estimating conversion rate in display advertising from past erformance data. In KDD, 2012.
  • [15] X. Li and D. Guan. Programmatic buying bidding strategies with win rate and winning price estimation in real time mobile advertising. In PAKDD. 2014.
  • [16] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, et al. Ad click prediction: a view from the trenches. In KDD, 2013.
  • [17] R. J. Oentaryo, E.-P. Lim, J.-W. Low, D. Lo, and M. Finegold. Predicting response in mobile advertising with hierarchical importance-aware factorization machine. In WSDM, 2014.
  • [18] C. Perlich, B. Dalessandro, R. Hook, O. Stitelman, T. Raeder, and F. Provost. Bid optimizing and inventory scoring in targeted online advertising. In KDD, 2012.
  • [19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 2016.
  • [20] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. Pac model-free reinforcement learning. In ICML, pages 881–888. ACM, 2006.
  • [21] R. S. Sutton and A. G. Barto. Introduction to reinforcement learning, volume 135. MIT Press Cambridge, 1998.
  • [22] G. Taylor and R. Parr. Kernelized value function approximation for reinforcement learning. In ICML, pages 1017–1024. ACM, 2009.
  • [23] J. Wang and S. Yuan. Real-time bidding: A new frontier of computational advertising research. In WSDM, 2015.
  • [24] J. Wang, W. Zhang, and S. Yuan. Display advertising with real-time bidding (RTB) and behavioural targeting. arXiv preprint arXiv:1610.03013, 2016.
  • [25] Y. Wang, K. Ren, W. Zhang, J. Wang, and Y. Yu. Functional bid landscape forecasting for display advertising. In ECML-PKDD, pages 115–131, 2016.
  • [26] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.
  • [27] W. C.-H. Wu, M.-Y. Yeh, and M.-S. Chen. Predicting winning price in real time bidding with censored data. In KDD, 2015.
  • [28] J. Xu, K.-c. Lee, W. Li, H. Qi, and Q. Lu. Smart pacing for effective online ad campaign optimization. In KDD, 2015.
  • [29] S. Yuan and J. Wang. Sequential selection of correlated ads by pomdps. In CIKM, 2012.
  • [30] S. Yuan, J. Wang, and X. Zhao. Real-time bidding for online advertising: measurement and analysis. In ADKDD, 2013.
  • [31] W. Zhang and J. Wang. Statistical arbitrage mining for display advertising. In KDD, 2015.
  • [32] W. Zhang, S. Yuan, and J. Wang. Optimal real-time bidding for display advertising. In KDD, 2014.
  • [33] W. Zhang, T. Zhou, J. Wang, and J. Xu. Bid-aware gradient descent for unbiased learning with censored data in display advertising. In KDD.