Optimizing AD Pruning of Sponsored Search with Reinforcement Learning

by   Yijiang Lian, et al.
Peking University
Baidu, Inc.

An industrial sponsored search system (SSS) can be logically divided into three modules: keyword matching, ad retrieving, and ranking. During ad retrieving, the number of ad candidates grows exponentially. A query with high commercial value might retrieve so many ad candidates that the ranking module cannot afford to process them. Due to limited latency and computing resources, the candidates have to be pruned earlier. Suppose we set a pruning line that cuts the SSS into two parts: upstream and downstream. The problem we address is how to pick the best K items from the N candidates provided by the upstream so as to maximize the total system revenue. Since the industrial downstream is very complicated and updated quickly, a crucial restriction in this problem is that the selection scheme should adapt to the downstream. In this paper, we propose a novel model-free reinforcement learning approach to this problem. Our approach treats the downstream as a black-box environment: the agent sequentially selects items and finally feeds them into the downstream, where revenue is estimated and used as a reward to improve the selection policy. To the best of our knowledge, this is the first time system optimization has been considered from a downstream adaptation view, and also the first time reinforcement learning techniques have been used to tackle this problem. The idea has been successfully realized in Baidu's sponsored search system, and a long-running online A/B test shows remarkable improvements in revenue.



1. Introduction

Web search engines play a vital role in our daily life for seeking information. Since submitted queries usually express a clear intent, a search engine provides a good platform for advertisers to accurately target clients. On this platform, advertisers submit keywords, bids and creatives for their products and services. A Keyword is a short text to be matched against query traffic, which can be seen as a bridge linking queries and ads. A Creative is the text shown to users, which generally includes a title and a description (Fig. 1). The Bid expresses the value the advertiser places on this keyword traffic.

Figure 1. A typical shown ad has a title, a description and a style.

As shown in Fig. 2, there are three main modules in a sponsored search engine (SSE): keyword matching, ad retrieving and ranking. When a query is issued, the keyword matching module retrieves all the semantically related keywords, and the ad retrieving module retrieves all the ads of the advertisers who have bought those keywords, filters out the geographically or temporally conflicting ones and equips the rest with compatible styles. These fully equipped ads are then gathered and passed through the ranking module, where many model predictions (such as CTR (click-through rate) and relevance) and filtering rules are applied, and an auction (such as GSP) is carried out for the remaining ads. Finally, the winners are shown to users.

Figure 2. 3 main modules in a sponsored search engine.

A serious problem in ad retrieving is that the number of candidates grows exponentially. For example, a query with high commercial value can retrieve 100 matching keywords, each keyword may be bought by 10 advertisers, each advertiser might design 10 different creatives for this unit, and 10 different display styles can be chosen. That is, we would get $100 \times 10 \times 10 \times 10 = 10^5$ full ad candidates in this case. Directly sending all these ads to the ranking module would consume a lot of computation time, which is unacceptable for an industrial online system. So we have to prune some candidates earlier, during their expansion. Here we set our pruning line between creative expansion and style equipping, as shown in Fig. 4. We refer to all the modules below the pruning line as the downstream system. A typical ad selection problem then emerges: how to select $K$ items from $N$ candidates to feed into the downstream system such that the total revenue is maximized?
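The multiplication in the example above can be checked with a short sketch. The fan-out numbers are the ones quoted in the text; the dictionary layout and the function name `candidate_count` are illustrative assumptions only.

```python
from math import prod

# Fan-out at each retrieval phase, taken from the example in the text.
FAN_OUT = {
    "keywords per query": 100,
    "advertisers per keyword": 10,
    "creatives per advertiser": 10,
    "styles per creative": 10,
}

def candidate_count(fan_out, phases):
    """Number of candidates after expanding the given phases in order."""
    return prod(fan_out[p] for p in phases)

phases = list(FAN_OUT)
# All four phases expanded: the full ad candidates.
full = candidate_count(FAN_OUT, phases)
# Only the first three phases: candidates before style expansion.
before_styles = candidate_count(FAN_OUT, phases[:3])
print(full, before_styles)
```

Pruning before the style phase means the selector works on the smaller of the two counts, at the cost of not yet knowing which style each ad will receive.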

There are several challenges in this problem. Firstly, a real industrial downstream system is very complicated: it usually deals with ad CTR prediction, blacklist filtering, diversity filtering, ad quality checking, etc. If the downstream system is not taken into account, many of the ads selected earlier may be filtered out later. Secondly, at this point the ad candidate is not complete, since style information is not yet available. Thus we cannot obtain a precise CTR estimate, which is a key element for winning the auction. Thirdly, this is an NP-hard combinatorial optimization problem.

Some heuristic approaches are widely employed to tackle these challenges, such as sorting all candidates by their estimated revenue and selecting the top $K$ items. The obvious disadvantages of this kind of solution are that it is impossible to design perfect rules given the complexity of the downstream system, and that the quality of the selection as a whole, such as its diversity, is easily overlooked. Even if we adopt deep neural networks trained with supervised learning, it is hard to perform well because we lack optimal training samples.

In this paper, we resort to a reinforcement learning approach. Using a model-free learning framework, the complex downstream system can be treated as a black-box environment, and the agent is an ad selector. The agent sequentially selects items and feeds the resulting ad queue into the downstream system. The downstream system estimates the final revenue for the ad queue, and this final eCPM (estimated cost per mille, i.e. per thousand impressions) is taken as the reward signal that guides the selection policy. The merits of this approach are as follows. Firstly, the complex downstream system is fully taken into account, so that we can optimize the system from a global view. Meanwhile, model-free learning frees us from having to understand the complicated system, and the ad selection module can learn to adapt smoothly to the downstream system. Secondly, we can use the trial-and-error scheme of reinforcement learning to explore better selections and gradually improve our policy.

In summary, this paper offers the following main contributions:

  1. We propose an ad pruning agent which adapts to the downstream system. We hope this idea would shed light on the future design of the industrial sponsored search system.

  2. We propose a reinforcement learning based approach to tackling the ad selection problem, and this work has been successfully applied to Baidu’s real sponsored search system.

  3. For the concern of training efficiency and safety of the real online system, a simulator system for RL training has been devised and implemented.

2. Background

2.1. Ad retrieving

The organization of ads and the retrieving process in Sponsored Search Systems (SSSs) are briefly introduced in this section. An ad in an SSS generally includes four components: keyword, bid, creative, and style. Among them, the keyword, bid, and creative are explicitly defined by advertisers, while the style is either generated by our system or provided by advertisers. Ads in an advertiser account are hierarchically organized as shown in Fig. 3. From the top down, there are accounts, campaigns, groups (or units), and ads. Flexible restrictions (e.g., target geographical area, time period, daily budget) can be set by advertisers at each level. In particular, ads in a single group are usually designed with a similar intention, so they share keywords for matching the retrieval traffic.

Figure 3. Sample of ads structure in advertiser accounts.

The ad retrieving process is carried out as shown in Fig. 4. When a query arrives, matched keywords are first triggered through the Keyword Matching module. To make retrieval feasible and efficient in practice, multi-phase retrieval is then adopted, and several key-value index tables are constructed and stored. For example, given a keyword, a list of ⟨(advertiser) user, unit⟩ pairs is first acquired with the inverted keyword-unit index. If we expand each unit with its ad information, ⟨keyword, user, unit, creative⟩ tuples are obtained. Similarly, an ad creative can be expanded with various styles. Each time we query the tables, the number of candidates may be multiplied several times, so the full quantity becomes huge. More details about a similar ad structure and retrieving process can be found in (Bendersky et al., 2010). In this paper, our pruning agent acts after indexing ad creatives and before expanding styles.

Figure 4. Candidates expand exponentially during ad retrieving.
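As a rough illustration of the multi-phase expansion described in this section, the following sketch chains two key-value lookups. The table contents, the table names and the function `expand_to_creatives` are hypothetical; they only mirror the keyword to (user, unit) to creative chain described above and are not the production index layout.

```python
# Hypothetical key-value index tables, for illustration only.
keyword_to_units = {
    "running shoes": [("user1", "unitA"), ("user2", "unitB")],
}
unit_to_creatives = {
    "unitA": ["creative1", "creative2"],
    "unitB": ["creative3"],
}

def expand_to_creatives(keywords, keyword_to_units, unit_to_creatives):
    """Expand matched keywords into (keyword, user, unit, creative) tuples."""
    candidates = []
    for keyword in keywords:
        # Phase 1: inverted keyword-unit index.
        for user, unit in keyword_to_units.get(keyword, []):
            # Phase 2: expand the unit with its creatives.
            for creative in unit_to_creatives.get(unit, []):
                candidates.append((keyword, user, unit, creative))
    return candidates

candidates = expand_to_creatives(
    ["running shoes"], keyword_to_units, unit_to_creatives
)
print(len(candidates))
```

Each extra phase adds one more nested loop, which is where the multiplicative growth in candidates comes from; the pruning agent sits after the second phase shown here.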

2.2. Terminology

For clarity, we declare some terminology and notation in the following list:

  1. SHOW denotes the total number of ads shown on result pages.

  2. CTR denotes the average click ratio received by the search engine.

  3. CPM denotes the revenue received by the search engine per 1000 searches, i.e. $\mathrm{CPM} = 1000 \times \mathrm{revenue} / \mathrm{searches}$.

  4. eCTR denotes the estimated click-through rate if an ad is shown.

  5. eCPM denotes the estimated CPM.

  6. Bid is the price offered by an advertiser for a keyword.

  7. Charge is the actual expense after the auction. Under GSP, the charge does not exceed the bid.
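To make the relation between Bid and Charge concrete, here is a minimal sketch of a generalized second-price (GSP) auction. It assumes the common quality-weighted variant in which ads are ranked by eCTR times bid and each winner pays per click the smallest price that keeps its rank; the source does not state which GSP variant the production system uses, and the function name and sample ads are hypothetical.

```python
def gsp_charges(ads, slots):
    """Return [(name, charge_per_click)] for the winning ads.

    ads: list of (name, ectr, bid) tuples.
    Assumption: rank by ectr * bid; a winner pays the lowest price at which
    its rank score still matches the score of the next-ranked ad.
    """
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    results = []
    for i, (name, ectr, bid) in enumerate(ranked[:slots]):
        if i + 1 < len(ranked):
            _, next_ectr, next_bid = ranked[i + 1]
            charge = next_ectr * next_bid / ectr
        else:
            # No competitor below; a real system would apply a reserve price.
            charge = 0.0
        results.append((name, charge))
    return results

ads = [("a", 0.10, 2.0), ("b", 0.05, 3.0), ("c", 0.02, 4.0)]
winners = gsp_charges(ads, slots=2)
print(winners)
```

Because each winner's rank score is at least that of the ad below it, the computed charge never exceeds the winner's own bid in this variant.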

3. Related Work

The combination of Deep Learning (DL) and Reinforcement Learning (RL), known as Deep Reinforcement Learning (DRL), has led to great success both in academic research and in industrial applications, such as games (Silver et al., 2016), finance (Heaton et al., 2017), healthcare (Heaton et al., 2017), as well as Google’s machine translation system (Wu et al., 2016). Recently, the use of deep neural networks in sponsored search systems has yielded great benefits, particularly for matching queries and bidwords in the semantic space (Wu et al., 2018b; Gligorijevic et al., 2018). Advanced techniques such as generative sequence-to-sequence models have also been adopted to produce bidwords or match similar short sentences (Lee et al., 2018). However, only a few existing works incorporate RL/DRL techniques into sponsored search, e.g. (Zhao et al., 2018) for Real-Time Bidding. As far as we know, even fewer have been tried for the ad pruning problem.

In this work, we propose a Reinforcement Learning based approach to optimizing ad pruning that adapts to the complicated and dynamic downstream system. The subset selection involved in ad pruning is a well-known NP-hard problem that requires exponential time to solve exactly. Previous studies rely on various hand-designed heuristics to approximate the solution. Recent advances in deep learning provide elegant and efficient methods for such combinatorial optimization problems (Caldwell et al., 2018; Bengio et al., 2018). In particular, we model ad subset selection as a sequential decision making problem, and learn to maximize the overall reward of the selected subset, which is estimated by probing the downstream system. Taking an optimization view similar to ours, Buck et al. (2017) propose an active question reformulation agent which interacts with a black-box QA system and learns to reformulate questions to elicit the best answers from the downstream.

4. Methodology

4.1. Problem Formulation

Suppose there are $N$ ad candidates provided by the upstream, denoted as $A = \{a_1, a_2, \dots, a_N\}$. We are required to select the best $K$ ads from $A$ and feed them to the downstream so as to maximize the revenue.

Figure 5. In our problem, the agent is the ad selector, and the downstream is the environment.

Figure 6. The sequential decision process of the pruning agent: the state comprises the selected ads and the unselected ones, and the action is to select one ad from the unselected set. In this illustration, three ads are picked out of the candidates provided by the upstream. At each of Steps 1 to 3 the agent selects one ad, which is then moved to the selected ad set. The output set is then sent to the downstream.

We employ a model-free reinforcement learning approach to deal with this problem. As shown in Fig. 5, the complicated downstream system is treated as a black-box environment, and the agent sequentially selects $K$ ads, one ad at each step. In particular, a Markov Decision Process is defined as follows (see Fig. 6). For each step $t$ ($0 \le t < K$), the state $s_t = (U_t, S_t)$ consists of the current unselected ad list $U_t$ and the already selected ad list $S_t$. All ads in $U_t$ and $S_t$ are drawn from the original candidate set $A$ of $N$ ads. At the start time, i.e. $t = 0$, the unselected set equals the original input while the selected set is empty, i.e. $U_0 = A$ and $S_0 = \emptyset$. Based on the current state $s_t$, each action $a_t$ chooses one candidate ad from the unselected list $U_t$ and appends it to the selected list. The policy of our agent is a selection probability distribution over the whole unselected candidate set $U_t$. It is approximated by a neural network parameterized by $\theta$, and denoted as $\pi_\theta(a_t \mid s_t)$. When the selection is finished, the ads in $S_K$ are sent to the downstream system for further ranking and auction. Finally, the ads that win this competition are shown to users. We denote the selection route as $\tau = (s_0, a_0, s_1, a_1, \dots, s_{K-1}, a_{K-1})$, and the final winner ads as $W(\tau) \subseteq S_K$. The eCPM of $W(\tau)$ is then used as the whole reward of this episode, that is,

$$R(\tau) = \sum_{a \in W(\tau)} \mathrm{eCPM}(a). \tag{1}$$

The objective function we are going to optimize is:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]. \tag{2}$$
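The sequential selection process of this section can be sketched as a single episode, under stated assumptions: the scoring function `score_fn` stands in for the policy network and is hypothetical, selection probabilities come from a softmax over the scores of the currently unselected ads, and ads are plain strings.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(candidates, score_fn, k, rng):
    """Pick k ads one at a time; return (selected, trajectory)."""
    unselected = list(candidates)
    selected = []
    trajectory = []  # (unselected snapshot, selected snapshot, chosen ad)
    for _ in range(k):
        probs = softmax([score_fn(ad, selected) for ad in unselected])
        idx = rng.choices(range(len(unselected)), weights=probs, k=1)[0]
        ad = unselected[idx]
        trajectory.append((tuple(unselected), tuple(selected), ad))
        # Move the chosen ad from the unselected list to the selected list.
        unselected.pop(idx)
        selected.append(ad)
    return selected, trajectory

candidates = ["ad%d" % i for i in range(6)]
# Hypothetical scorer: a constant score gives a uniform policy.
selected, trajectory = run_episode(
    candidates, lambda ad, chosen: 0.0, k=3, rng=random.Random(0)
)
print(selected)
```

The selected list would then be handed to the downstream system, whose estimated revenue for the resulting winners serves as the episode reward.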


4.2. Training Algorithm

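We use Policy Gradient (PG) (Sutton et al., 2000) to solve the above problem. The optimization direction is given by

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_{t=0}^{K-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right], \tag{3}$$

where $J(\theta)$ is the expected episode reward $R(\tau)$ of a selection route $\tau$ under the policy $\pi_\theta$, and the parameters are updated according to

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), \tag{4}$$

where $\alpha$ is the learning rate. The training procedure is explained in Algorithm 1.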
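A minimal sketch of one policy-gradient (REINFORCE-style) update follows. It assumes a linear scoring function followed by a softmax rather than the neural network used in the paper, so that the gradient of the log-probability has the closed form "feature vector of the chosen ad minus the probability-weighted mean feature vector". All function names and numbers are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(theta, features, chosen):
    """Gradient of log pi_theta(chosen | state) for a linear-softmax policy.

    features: one feature vector per unselected candidate.
    """
    scores = [sum(t * x for t, x in zip(theta, f)) for f in features]
    probs = softmax(scores)
    dim = len(theta)
    expected = [sum(p * f[d] for p, f in zip(probs, features)) for d in range(dim)]
    return [features[chosen][d] - expected[d] for d in range(dim)]

def reinforce_step(theta, steps, reward, lr):
    """One gradient-ascent update from a single episode.

    steps: list of (features, chosen_index), one entry per selection step.
    """
    grad = [0.0] * len(theta)
    for features, chosen in steps:
        g = grad_log_prob(theta, features, chosen)
        grad = [a + reward * b for a, b in zip(grad, g)]
    return [t + lr * g for t, g in zip(theta, grad)]

theta = [0.0, 0.0]
features = [[1.0, 0.0], [0.0, 1.0]]
new_theta = reinforce_step(theta, [(features, 0)], reward=2.0, lr=0.1)
print(new_theta)
```

With a positive reward, the update raises the score of the chosen candidate relative to the others, so the same action becomes more likely in the same state.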

while the agent’s parameter $\theta$ has not converged do
       sample a mini-batch of queries $Q$;
       for each query $q \in Q$ do
             go through the upstream to get the candidate set $A$ of $N$ ads;
             if $N > K$ then
                   Agent: sequentially pick out $K$ ads with parameter $\theta$;
                   add the selection path $\tau$ to the selection logs;
                   send the selected ads to the downstream;
                   Env: calculate the reward $R(\tau)$;
                   add the reward to the reward logs;
                   join the two logs and add the result to the training dataset;
             end if
       end for
       sample (state, action, reward) tuples from the training dataset;
       update the parameters according to Eq. (4);
end while
Algorithm 1 Training algorithm

In particular, we resort to our agent for pruning ads only if the number of candidates provided by the upstream is greater than $K$. The selection path taken by our agent is logged, and the respective reward calculated by the downstream is logged separately. The training samples are produced by joining these two logs.
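The log join can be sketched as follows. The log schema is an assumption made for illustration (both logs keyed by a query identifier, with one reward per query); the actual record formats are not given here.

```python
def join_logs(selection_log, reward_log):
    """Join per-query selection paths with per-query rewards.

    selection_log: {query_id: [(state, action), ...]}  (assumed schema)
    reward_log:    {query_id: reward}                   (assumed schema)
    Returns flat (state, action, reward) samples; queries whose reward has
    not arrived yet are skipped.
    """
    samples = []
    for query_id, path in selection_log.items():
        if query_id not in reward_log:
            continue
        reward = reward_log[query_id]
        samples.extend((state, action, reward) for state, action in path)
    return samples

selection_log = {
    "q1": [("s0", "ad3"), ("s1", "ad7")],
    "q2": [("s0", "ad1")],
}
reward_log = {"q1": 4.2}
samples = join_logs(selection_log, reward_log)
print(samples)
```

Keeping the two logs separate lets the agent and the downstream write independently; the trainer only needs the joined tuples.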

One more problem is that the agent takes $K$ selection actions for the downstream, but only one reward for the whole sequence is available. If we make no distinction between these actions and give an equal reward to each selection, the training will be inefficient. Here we adopt reward shaping (Ng et al., 1999) to overcome this shortcoming, as has been done in Machine Translation (Bahdanau et al., 2016; Wu et al., 2018a). In our scenario, the reward is set according to the eCPM contributions: the reward for selecting a shown ad is set to its eCPM contribution, while the reward for selecting an unshown ad is zero. Additionally, since the eCPM differs a lot among different ads, a logarithm transformation is applied to smooth the original eCPM and reduce this variance.
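A minimal sketch of this per-step reward assignment, assuming `log1p` as the logarithm transformation (so that an eCPM of zero maps to a reward of zero); the exact transformation used in the production system is not specified, and the sample values are made up.

```python
import math

def shaped_rewards(selected_ads, shown_ecpm):
    """Per-step rewards for one episode.

    selected_ads: ads in the order they were selected.
    shown_ecpm:   {ad: eCPM} for the ads that were finally shown.
    Shown ads get a log-smoothed eCPM reward; unshown ads get zero.
    """
    return [
        math.log1p(shown_ecpm[ad]) if ad in shown_ecpm else 0.0
        for ad in selected_ads
    ]

selected = ["a", "b", "c"]
shown = {"a": 120.0, "c": 3.0}
rewards = shaped_rewards(selected, shown)
print(rewards)
```

In this toy case the raw eCPM values differ by a factor of 40, while the shaped rewards differ by less than a factor of 4, which illustrates the variance reduction.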

4.3. Training Architecture

One may ask why we do not use the real CPM as the reward rather than eCPM. RL algorithms commonly require a large number of interactions with the environment. On the one hand, using the real CPM means that we have to wait for real user feedback, which may take a long while. On the other hand, policy exploration on real traffic would probably damage the daily revenue considerably, which matters especially for a commercial system. For efficiency and safety, we have built an industrial sponsored search system simulator for our training, which offers a reliable and dense reward estimate without doing any harm to actual revenue.

Fig. 7 illustrates the training architecture with the simulator. Firstly, the whole system is cloned as a simulation environment, and each query issued to the real online system is copied to the simulated upstream. Secondly, the simulator agent selects ads in an exploratory manner according to the current policy; the action trajectory of the selection procedure and the reward estimated by the simulated downstream are logged and stored as training samples. Thirdly, the offline trainer updates the policy parameters with the training data and pushes them to the simulator agent in near real time. With such a training architecture, the online system, the simulator system, and the trainer are decoupled, so that high throughput and low latency can be ensured for efficient training.

Figure 7. Training architecture with simulator.

Before we publish a trained model to the online agent, two steps of model evaluation are taken to minimize the possible negative impact on the online system. In the first step, we check the performance of the model policy in the simulator environment by eCPM. In the second step, CPM is evaluated on the online system with a small fraction of real traffic. Only models that perform well in both steps can be used in the entire online system.

5. Experiments

5.1. Setup

Experiments are conducted on Baidu’s sponsored search system. The model that approximates the candidate ad selection score at each step is a two-layer fully connected network. The input layer consists of designed features, and the output layer generates a one-dimensional real value as the ad score. The final selection probabilities are obtained through a softmax layer. During training, actions are sampled according to the probability distribution for policy exploration, while during online inference only the candidate with the maximum probability is chosen, for the best exploitation.
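The scoring and action-selection scheme described here can be sketched as follows. This is only an illustration under assumptions: one hidden fully connected layer with ReLU followed by a linear output layer is one reading of "two-layer", and the weights, feature vectors and function names are made up.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ad_score(features, w1, b1, w2, b2):
    """Hidden fully connected layer with ReLU, then a linear layer to one score."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def choose(probs, explore, rng):
    """Sample from the distribution when exploring, otherwise take the argmax."""
    if explore:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical weights and candidate features, for illustration only.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [1.0, 2.0], 0.0
candidates = [[0.2, 0.1], [0.1, 0.5], [0.3, 0.0]]

probs = softmax([ad_score(f, w1, b1, w2, b2) for f in candidates])
greedy = choose(probs, explore=False, rng=random.Random(0))    # online inference
sampled = choose(probs, explore=True, rng=random.Random(0))    # training
print(probs, greedy, sampled)
```

The same scores drive both modes; only the way an index is drawn from the softmax output differs between training and serving.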

More specifically, in our experiment, the maximum number of candidates is set to 1000, and the selection count $K$ is capped at a fixed value. Through a grid search over the related parameters, the training batch size is set to 128, and for each query in the batch we sample 50 states from the decision route. The Adam optimizer is adopted to update the agent parameters.

Two categories of features are designed for each candidate ad: static and dynamic. Static features describe the ad itself, the query or user itself, or ad-query properties, while dynamic features characterize attributes related to the collection of ads already selected. That is, the difference lies in whether a feature changes during the sequential selection process. The dynamic features aim to help maximize cumulative revenue while maintaining the diversity of advertisers. The main features we used are listed in Table 1.

Type       Feature Name
static     estimated click-through rate
static     query-ad relevance
static     bid
static     pre-trained query embedding
static     pre-trained ad embedding
static     whether this ad is compatible with a strong style
dynamic    whether an ad of the same advertiser has been selected
dynamic    accumulated eCPM within the same advertiser
Table 1. The main features.
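The two dynamic features in Table 1 depend on the ads already selected, so they have to be recomputed at every step. A minimal sketch, assuming each ad is a dictionary with hypothetical `advertiser` and `ecpm` fields:

```python
def dynamic_features(ad, selected):
    """Dynamic features of a candidate ad given the ads selected so far.

    ad and the entries of selected are dicts with 'advertiser' and 'ecpm'
    keys (an assumed record layout).
    """
    same = [s for s in selected if s["advertiser"] == ad["advertiser"]]
    return {
        "advertiser_already_selected": 1.0 if same else 0.0,
        "accumulated_ecpm_same_advertiser": sum(s["ecpm"] for s in same),
    }

selected = [
    {"advertiser": "u1", "ecpm": 5.0},
    {"advertiser": "u2", "ecpm": 2.0},
    {"advertiser": "u1", "ecpm": 1.5},
]
seen = dynamic_features({"advertiser": "u1", "ecpm": 3.0}, selected)
unseen = dynamic_features({"advertiser": "u3", "ecpm": 4.0}, selected)
print(seen, unseen)
```

Static features, by contrast, can be computed once per query and cached for all selection steps.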

5.2. Baselines

We compare our approach with two baselines. Both are commonly used in industry and were also adopted in our sponsored search system before this method. The first baseline ranks candidates by the expected charge from the advertiser, which equals the estimated click-through rate (eCTR) times the bid. The second ranks candidates by the estimated achievement from the downstream system, which is calculated as the expected charge times the show probability (srq). The show probability is added because advertising positions are limited and multiple factors affect the eventual show results. Here we approximate the show probability of an ad by its accumulated show proportion over the past 30 days. In the following section, we denote these two baselines by eCTR*bid and eCTR*bid*srq.
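Both baselines are simple top-K rankings and can be sketched in a few lines. The ad records, field names and numbers below are hypothetical and only serve to show that the two rules can pick different ads.

```python
def top_k(ads, k, use_srq):
    """Return the ids of the k best ads under one of the two baseline rules.

    use_srq=False ranks by eCTR*bid; use_srq=True ranks by eCTR*bid*srq.
    """
    def key(ad):
        value = ad["ectr"] * ad["bid"]
        return value * ad["srq"] if use_srq else value
    return [ad["id"] for ad in sorted(ads, key=key, reverse=True)[:k]]

# Hypothetical candidates: x has the highest expected charge but is rarely shown.
ads = [
    {"id": "x", "ectr": 0.10, "bid": 2.0, "srq": 0.1},
    {"id": "y", "ectr": 0.05, "bid": 3.0, "srq": 0.5},
    {"id": "z", "ectr": 0.02, "bid": 4.0, "srq": 0.9},
]
by_charge = top_k(ads, k=2, use_srq=False)
by_charge_and_show = top_k(ads, k=2, use_srq=True)
print(by_charge, by_charge_and_show)
```

Both rules score each ad in isolation, which is why set-level properties such as advertiser diversity cannot be expressed by them.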

5.3. Results

We show the key online A/B testing results in Table 2. The three most important indicators of our method are compared with the baselines: click count per thousand searches (CTR), shown ad count per thousand searches (SHOW), and revenue per thousand searches (CPM). The results demonstrate that our method achieves positive improvements on all three indicators over both baseline methods. Considering that the ad supply is stable, the increase in SHOW indicates that our agent does select better ad candidates, which are more compatible with the downstream system. In addition, the proposed method also gains CTR improvements of 1.11% over eCTR*bid*srq and 2.21% over eCTR*bid. This indicates that the newly shown ads are well accepted by users. Owing to the consistency between our pruning agent and the downstream system, as well as end-user preference, we finally achieve a 1.95% improvement in CPM over eCTR*bid and 0.99% over eCTR*bid*srq.

Baseline        CTR       SHOW      CPM
eCTR*bid        +2.21%    +1.27%    +1.95%
eCTR*bid*srq    +1.11%    +1.17%    +0.99%
Table 2. Online A/B test results: relative improvement of our method over each baseline.

To illustrate the training performance of our pruning agent, we observe the training curves of the agent as well as its online performance at different training steps. Fig. 8(a) displays the training loss, and Fig. 8(b) plots the recall of the top expected-charge candidates (top 110 by eCTR*bid). Both the decrease in training loss and the increase in recall meet our expectation, and convergence is reached at around 1500k steps. Notably, the best recall of the top eCTR*bid candidates is around 0.67 rather than 1, which shows that the learned policy differs substantially from the eCTR*bid baseline. Furthermore, we evaluate several model versions taken after different numbers of training steps. Table 3 shows their online performance in comparison with the eCTR*bid baseline. We can observe that the indicators improve with training steps and level off after about 2000k steps, which validates the effectiveness of our model training.

(a) Training loss with training steps (k)
(b) Recall of top eCTR*bid ads with training steps (k)
Figure 8. Training curves

Training Steps    CTR       SHOW      CPM
1                 -2.90%    -1.68%    -11.20%
500k              -0.13%    0.47%     0.54%
1000k             1.81%     1.03%     1.30%
1500k             2.02%     1.21%     1.61%
2000k             2.23%     1.26%     1.93%
2500k             2.21%     1.27%     1.95%
Table 3. Online performance with training steps, relative to the eCTR*bid baseline
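The recall metric plotted in Fig. 8(b) can be computed as below: the fraction of the top candidates by eCTR*bid that also appear in the agent's selection. The text uses the top 110; the toy data here uses the top 2, and the record layout and function name are assumptions.

```python
def recall_of_top(selected_ids, ads, top_n):
    """Share of the top_n ads by eCTR*bid that the agent also selected."""
    top = sorted(ads, key=lambda ad: ad["ectr"] * ad["bid"], reverse=True)[:top_n]
    top_ids = {ad["id"] for ad in top}
    return len(top_ids & set(selected_ids)) / len(top_ids)

# Hypothetical candidates, for illustration only.
ads = [
    {"id": "x", "ectr": 0.10, "bid": 2.0},
    {"id": "y", "ectr": 0.05, "bid": 3.0},
    {"id": "z", "ectr": 0.02, "bid": 4.0},
    {"id": "w", "ectr": 0.01, "bid": 1.0},
]
recall = recall_of_top(["y", "z"], ads, top_n=2)
print(recall)
```

A recall of 1 would mean the agent simply reproduces the eCTR*bid ranking; values below 1 measure how far the learned selection departs from it.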

It is also worth mentioning that this agent assumes the downstream system to be a complete black box, whereas it may not be. For example, in our system it is known that the downstream part selects only a limited number of ads from the same advertiser. Even though our model succeeds in learning to select diverse ads based on the given features and rewards, we also tried applying this prior knowledge directly during both training and online inference. The result shows that doing so brings a significant performance gain (nearly 10% of the time spent on selection actions is saved) without lessening the CPM revenue. We believe that making full use of similar prior knowledge in modeling is a valuable practice in industrial systems.
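One way to apply such prior knowledge is to mask out candidates whose advertiser has already reached the cap before scoring, so the policy never spends a step on an ad the downstream would discard. The sketch below is an assumption about how this could look; the cap value `max_per_advertiser` and the record layout are hypothetical, since the actual limit is not stated here.

```python
def allowed_indices(unselected, selected, max_per_advertiser):
    """Indices of unselected ads whose advertiser is still under the cap."""
    counts = {}
    for ad in selected:
        counts[ad["advertiser"]] = counts.get(ad["advertiser"], 0) + 1
    return [
        i for i, ad in enumerate(unselected)
        if counts.get(ad["advertiser"], 0) < max_per_advertiser
    ]

unselected = [
    {"id": "a", "advertiser": "u1"},
    {"id": "b", "advertiser": "u2"},
    {"id": "c", "advertiser": "u1"},
]
selected = [
    {"id": "x", "advertiser": "u1"},
    {"id": "y", "advertiser": "u1"},
]
masked = allowed_indices(unselected, selected, max_per_advertiser=2)
print(masked)
```

Only the allowed indices need to be scored and passed to the softmax, which is where the saving in selection time would come from.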

6. Concluding Remarks

In this paper we have devised a reinforcement learning method to address the ad pruning problem in a real sponsored search engine environment. During ad retrieving, the number of ad candidates expands exponentially. To meet latency requirements, these candidates need to be pruned earlier. An ad selection agent is placed at the pruning point, which cuts the system into upstream and downstream. If the downstream is not taken into account, the selected ads might be filtered out later. To address this problem, we have treated the complicated downstream as a black-box environment: our agent sequentially selects ads and feeds them into the downstream to obtain the eCPM reward for training. A long-term online A/B test on Baidu’s sponsored search engine has shown that it outperforms the rule-based selection approaches on CPM. A similar reinforcement learning based method has been applied in other Baidu products, where it also yields clear improvements.

In our view, making ad selection adapt to the downstream system is crucial. This downstream adaptation method can be applied in many other online services. For example, most online search or recommendation services have a coarse-ranking part and a refined-ranking part, and the job of coarse ranking is to select a limited number of candidates from a vast amount of items and feed them into the downstream part. Our method can be easily transplanted to these scenarios.

7. Acknowledgment

We would like to thank Tianyu Wang, Zishuai Zhang and Ruiyu Yuan for their efforts in applying this RL method to different ad product scenarios. We would also like to thank Yanjun Jiang and Tefeng Chen for their construction of the environment simulator.


References

  • D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio (2016) An actor-critic algorithm for sequence prediction. arXiv preprint arXiv:1607.07086.
  • M. Bendersky, E. Gabrilovich, V. Josifovski, and D. Metzler (2010) The anatomy of an ad: structured indexing and retrieval for sponsored search. In Proceedings of the 19th international conference on World wide web, pp. 101–110.
  • Y. Bengio, A. Lodi, and A. Prouvost (2018) Machine learning for combinatorial optimization: a methodological tour d’horizon. CoRR abs/1811.06128.
  • C. Buck, J. Bulian, M. Ciaramita, W. Gajewski, A. Gesmundo, N. Houlsby, and W. Wang (2017) Ask the right questions: active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830.
  • J. R. Caldwell, R. A. Watson, C. Thies, and J. D. Knowles (2018) Deep optimisation: solving combinatorial optimisation problems using deep neural networks. CoRR abs/1811.00784.
  • J. Gligorijevic, D. Gligorijevic, I. Stojkovic, X. Bai, A. Goyal, and Z. Obradovic (2018) Deeply supervised semantic model for click-through rate prediction in sponsored search. arXiv preprint arXiv:1803.10739.
  • J. Heaton, N. Polson, and J. H. Witte (2017) Deep learning for finance: deep portfolios. Applied Stochastic Models in Business and Industry 33 (1), pp. 3–12.
  • M. Lee, B. Gao, and R. Zhang (2018) Rare query expansion through generative adversarial networks in search advertising. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 500–508.
  • A. Y. Ng, D. Harada, and S. Russell (1999) Policy invariance under reward transformations: theory and application to reward shaping. In ICML, Vol. 99, pp. 278–287.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
  • R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063.
  • L. Wu, F. Tian, T. Qin, J. Lai, and T. Liu (2018a) A study of reinforcement learning for neural machine translation. arXiv preprint arXiv:1808.08866.
  • W. Wu, G. Liu, H. Ye, C. Zhang, T. Wu, D. Xiao, W. Lin, K. Liu, and X. Zhu (2018b) EENMF: an end-to-end neural matching framework for e-commerce sponsored search. arXiv preprint arXiv:1812.01190.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • J. Zhao, G. Qiu, Z. Guan, W. Zhao, and X. He (2018) Deep reinforcement learning for sponsored search real-time bidding. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1021–1030.