Introduction
With the advance of deep neural networks [LeCun, Bengio, and Hinton2015, Goodfellow et al.2016], deep reinforcement learning (DRL) approaches have made significant progress in a number of applications, including Atari games [Mnih et al.2015, Van Hasselt, Guez, and Silver2016, Mnih et al.2016] and robot locomotion and manipulation [Schulman et al.2015b, Levine et al.2016]. Recently, DRL techniques have also been applied successfully to optimize the decision-making process in several aspects of e-commerce, including online recommendation [Chen et al.2018], advertising bidding strategies [Jin et al.2018, Wu et al.2018, Zhao et al.2018] and product ranking [Hu et al.2018, Feng et al.2018].
In online advertising, the ad slots are traditionally fixed, and the problem is to allocate the right ad to the right slot for the right user [Mehta and others2013]. Previous studies [Jin et al.2018, Wu et al.2018, Zhao et al.2018] have shown that DRL is effective in bidding for the right display slots on behalf of advertisers on the basis of click feedback from Web users. However, in e-commerce, advertising typically works together with other services such as recommendation, as shown in Fig. 1. For each user query and impression, the platform needs to select a mixed representation consisting of both recommended and advertising products. In this setting, the display slots are not fixed a priori: the number of advertising slots and their locations change dynamically on the basis of user interests and profiles. In order to maintain the user retention rate and long-term revenue, the platform maintains global constraints such as the percentage of advertising product exposure. In such a context, a natural question arises: how can we leverage DRL to optimize advertising selection with adaptive exposure, given that some global constraints are satisfied? We call this the advertising with adaptive exposure problem.
One natural direction is to model the advertising with adaptive exposure problem as a Constrained Markov Decision Process (CMDP) [Altman1999]. The CMDP framework has been used in constrained problem settings including electric grids [Tessler, Mankowitz, and Mannor2018], networking [Hou and Zhao2017], robots [Chow et al.2015, Gu et al.2017] and portfolio optimization [Krokhmal, Palmquist, and Uryasev2002]. Although the optimal policy for a small-size CMDP with an available model can be computed using linear programming [Altman1999], it is difficult to obtain the optimal policy given the high complexity of real-world problems. Thus we need to resort to model-free RL approaches to learn approximately optimal solutions [Achiam et al.2017, Tessler, Mankowitz, and Mannor2018].
Existing model-free approaches for solving CMDP require a parameterized policy and the propagation of the constraint violation signal over the entire trajectory [Achiam et al.2017, Prashanth and Ghavamzadeh2016]. However, most reinforcement learning algorithms fail to meet these requirements. For example, Q-learning methods such as Deep Q-Network (DQN) [Mnih et al.2015] learn value functions to update the policy, and actor-critic methods such as Asynchronous Advantage Actor-Critic (A3C) [Mnih et al.2016] build the reward-to-go from an N-step sample. To address this issue, Tessler, Mankowitz, and Mannor [Tessler, Mankowitz, and Mannor2018] propose Reward Constrained Policy Optimization (RCPO), which converts the trajectory constraints into per-state penalties and dynamically adjusts the weights of the per-state penalties during learning to propagate the constraint violation signal over the entire trajectory. However, in the advertising with adaptive exposure problem, we need to satisfy both day-level (trajectory-level) and query-level (state-level) constraints. RCPO only considers trajectory-level constraints and cannot be directly applied here.
In this paper, we first model the advertising with adaptive exposure problem as a CMDP with per-state constraint (psCMDP). Then we propose a constrained two-level structured reinforcement learning framework to learn optimal advertising policies satisfying both state-level and trajectory-level constraints. In our framework, the satisfaction of the trajectory-level constraint and of the state-level constraint is separated into different levels of the learning process. The high level tackles the subproblem of selecting a sub-trajectory constraint for each low-level policy to maximize the long-term advertising revenue, while ensuring that the trajectory-level constraint is satisfied. One additional benefit of our two-level framework is that we can easily reuse the low-level policies to train the high-level constraint selection policy when the trajectory-level constraint is adjusted.
In the low level, we address the subproblem of learning an optimal advertising policy under a particular sub-trajectory constraint provided by the high level. We use an auxiliary task to train the policy [Jaderberg et al.2016]. As a result, we can apply not only algorithms which explicitly consider the constraint [Achiam et al.2017, Tessler, Mankowitz, and Mannor2018], but also off-policy methods such as DQN [Mnih et al.2015] and Deep Deterministic Policy Gradient (DDPG) [Lillicrap et al.2015]. We also propose Constrained Hindsight Experience Replay (CHER) to accelerate the low-level policy training. Our experiments show that, under the same trajectory constraint, our method effectively improves the final revenue compared to the baseline policy. In the low level, we verify that the CHER mechanism can significantly improve the training speed and reduce the deviation from the per-state constraint and its variance. Moreover, in the high level, our method can make good use of the low-level policy set to learn high-level policies for different trajectory-level constraints.
Preliminary: Constrained Reinforcement Learning
Reinforcement learning (RL) allows agents to interact with the environment by sequentially taking actions and observing rewards, with the aim of maximizing the cumulative reward [Sutton, Barto, and others1998]. RL can be modeled as a Markov Decision Process (MDP), which is defined as a tuple $(S, A, R, P, \gamma)$. $S$ is the state space and $A$ is the action space. The immediate reward function is $R: S \times A \times S \to \mathbb{R}$. $P$ is the state transition dynamics, $P: S \times A \times S \to [0, 1]$. A policy $\pi$ defines the agent's behavior. The agent uses its policy to interact with the environment and obtains a trajectory $\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$. Its goal is to learn an optimal policy $\pi^{*}$ which maximizes the expected return from the initial state:
$$\pi^{*} = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t \ge 0} \gamma^{t} R(s_t, a_t, s_{t+1})\Big] \qquad (1)$$
Here $\gamma$ is the discount factor. The Constrained Markov Decision Process (CMDP) [Altman1999] is generally used when constraints restrict the set of allowable policies for the MDP. Specifically, a CMDP is augmented with an auxiliary cost function $C: S \times A \times S \to \mathbb{R}$ and a limit $u_C$. Let $J_C(\pi)$ be the cumulative discounted cost of policy $\pi$. The expected discounted cost is defined as follows:
$$J_C(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t \ge 0} \gamma^{t} C(s_t, a_t, s_{t+1})\Big] \qquad (2)$$
The set of feasible stationary policies for a CMDP is then:
$$\Pi_C = \{\pi \in \Pi : J_C(\pi) \le u_C\} \qquad (3)$$
where $\Pi$ is the set of all stationary policies. The policy is then optimized by restricting the maximization in Equation 1 to $\pi \in \Pi_C$.
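As a rough illustration of the feasibility condition above, the following sketch estimates the discounted cost of a policy from sampled trajectories and compares it with the limit. The function names and the Monte Carlo averaging are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Sequence


def discounted_sum(values: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * values[t] for one sampled trajectory."""
    total = 0.0
    for t, value in enumerate(values):
        total += (gamma ** t) * value
    return total


def is_feasible(cost_trajectories: Sequence[Sequence[float]],
                gamma: float, limit: float) -> bool:
    """Monte Carlo check (illustrative): the average discounted cost over
    the sampled trajectories must not exceed the limit."""
    estimates = [discounted_sum(costs, gamma) for costs in cost_trajectories]
    return sum(estimates) / len(estimates) <= limit


if __name__ == "__main__":
    # Two sampled cost sequences from a hypothetical policy.
    samples = [[1.0, 1.0], [0.0, 0.0]]
    print(is_feasible(samples, gamma=0.5, limit=1.0))
```

The same `discounted_sum` helper applies to rewards, so the return in Equation 1 and the cost constraint can be estimated from the same rollouts.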
For the CMDP optimization problem, Geibel [Geibel2006] summarizes several optimization methods. First, in LinMDP, a linear program describing optimal solutions of the unconstrained MDP is augmented by an additional constraint expressing that the cost limit is required to hold. This augmented program can be solved with standard linear programming methods. Second, WeiMDP introduces a weight parameter and a derived weighted reward function, defined as:
(4) 
where the two quantities being combined are the reward and the auxiliary cost under the same transition. For a fixed weight, this new unconstrained MDP can be solved with standard methods, e.g., Q-learning [Sutton, Barto, and others1998]. Last, AugMDP augments the state space with the current accumulated cost and gives an additional penalty when the constraint is violated. For DRL methods, Achiam et al. [Achiam et al.2017] propose to replace the optimization objective and constraints with surrogate functions. They use Trust Region Policy Optimization [Schulman et al.2015a] to learn the policy and achieve near-constraint satisfaction at each iteration. Tessler, Mankowitz, and Mannor [Tessler, Mankowitz, and Mannor2018] use a method similar to WeiMDP: they use the weight as an input to the value function and dynamically adjust the weight by backpropagation.
Advertising With Adaptive Exposure
In e-commerce platforms (see Fig. 1), the e-shopping platform receives user queries in order (Table 1 summarizes the notations). When a consumer sends a shopping query, commodities are exposed for the query based on the user's shopping history and personal preferences. The commodities are composed of advertising and recommended products. From the advertising perspective, we would like to expose more ads to increase advertising income. However, since the advertiser pays an extra price for the advertising products, the ads ultimately exposed to users are not necessarily their favorite or needed products. To ensure a good shopping experience for users, we need to impose the following constraints:

Day-level constraint: the total percentage of ads exposed in one day should not exceed a certain threshold:
(5) 
Query-level constraint: the percentage of ads exposed for each query should not exceed a certain threshold:
(6)
where the day-level threshold is smaller than the query-level threshold. This means that we can exploit this inequality requirement to expose different numbers of ads to different queries according to users' profiles: we expose more ads to users who can increase the average advertising revenue, and fewer ads to the others. In this way, we can increase the total advertising revenue while satisfying both the query-level and day-level constraints.
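The two constraints can be checked with a few lines of code. The sketch below is only illustrative: the function name, the argument names, and the choice of "ads exposed divided by items exposed" as the percentage are assumptions based on the descriptions above.

```python
from typing import Sequence, Tuple


def check_exposure_constraints(ads_per_query: Sequence[int],
                               items_per_query: Sequence[int],
                               day_limit: float,
                               query_limit: float) -> Tuple[bool, bool]:
    """Return (day_level_ok, query_level_ok) for one day of queries.

    ads_per_query[i]   : number of ads exposed for the i-th query
    items_per_query[i] : number of items (ads + recommendations) exposed
    """
    # Day-level ratio: all ads exposed in the day over all exposed items.
    day_ratio = sum(ads_per_query) / sum(items_per_query)
    # Query-level ratio: computed separately for every query.
    query_ratios = [a / n for a, n in zip(ads_per_query, items_per_query)]
    day_ok = day_ratio <= day_limit
    query_ok = all(r <= query_limit for r in query_ratios)
    return day_ok, query_ok


if __name__ == "__main__":
    # Three queries, 10 exposed items each; ad counts differ per query.
    print(check_exposure_constraints([5, 1, 3], [10, 10, 10], 0.4, 0.5))
```

Because the day-level limit is the tighter one, a query may show many ads as long as other queries in the same day show fewer.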
For each query, traditional e-commerce systems expose ads in a fixed number of slots, and the relative positions of the ads are also fixed. However, this advertising mechanism is far from optimal. Different consumers have different shopping habits and preferences towards different products. Therefore, to increase the advertising revenue, we would like to adaptively expose more advertising products, at the right time, to those customers who are more likely to click and purchase these items, and fewer to the others, provided that the above two types of constraints are satisfied.
Notation  Description 
The sequence of incoming queries. , is the total number of queries visiting the platform within one day; different days have different totals.  
The i-th query in the day. In this paper, we usually use i to refer to the i-th query.  
The candidate ads set for the i-th query. , is the number of candidate ads.  
The number of recommended products.  
The number of commodities shown to query . Usually, . And  
The number of ads exposed for the i-th query  
The total percentage of ads exposed in one day.  
The percentage of ads exposed for the i-th query . .  
Features of the ad candidate set w.r.t. query .  
The maximum percentage of the total ads exposed in one day  
The maximum percentage of the ads exposed for each query 
To this end, major e-commerce companies (e.g., Alibaba) have recently begun to adopt a more flexible, mixed mechanism for exposing advertising and recommended products (Fig. 1). Specifically, for each query, the recommendation and the advertising systems first select their respective top items independently, based on their own scoring systems. Then these commodities are sorted together by their scores and the top items are exposed for this query. On the one hand, the number of ads exposed for each query is automatically adjusted according to the quality of the candidate products and ads. On the other hand, the positions of the items are determined by the quality of the products and ads, which can further improve the user experience.
The scoring systems on both the recommendation and the advertising sides can be viewed as black boxes, and the score for each item is known a priori. However, from the advertising perspective, we can adaptively adjust the score of each ad to change the relative positions of the ads and the number of ads eventually exposed. This additional advertising score adjustment component is shown in red in Fig. 1.
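The mixed exposure mechanism and the effect of score adjustment can be sketched as follows. This is a minimal illustration and not the platform's implementation: the function name `expose` is hypothetical, and scaling each ad score by a multiplicative coefficient is an assumption about how the adjustment is applied.

```python
from typing import List, Sequence, Tuple


def expose(rec_scores: Sequence[float],
           ad_scores: Sequence[float],
           ad_coeffs: Sequence[float],
           num_shown: int) -> List[Tuple[str, int]]:
    """Merge recommended items and ads by score and keep the top num_shown.

    Returns (kind, index) pairs in display order, where kind is 'rec' or 'ad'.
    Ad scores are scaled by ad_coeffs before sorting (assumed adjustment).
    """
    candidates = [("rec", i, s) for i, s in enumerate(rec_scores)]
    candidates += [("ad", j, s * c)
                   for j, (s, c) in enumerate(zip(ad_scores, ad_coeffs))]
    # Higher adjusted score means an earlier display position.
    candidates.sort(key=lambda item: item[2], reverse=True)
    return [(kind, idx) for kind, idx, _ in candidates[:num_shown]]


if __name__ == "__main__":
    rec = [0.9, 0.5, 0.3]
    ads = [0.6, 0.4]
    print(expose(rec, ads, [1.0, 1.0], 3))  # unadjusted ad scores
    print(expose(rec, ads, [1.0, 2.0], 3))  # second ad boosted
```

In this toy example, boosting the second ad raises the number of exposed ads from one to two without changing the recommendation scores, which is the lever the advertising side controls.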
Formulation
From the advertising perspective, the above advertising exposure problem can be seen as a bidding problem in which the original scores of the ads are adjusted. Formally, for each candidate ad in a query, its score is adjusted as follows:
(7) 
where the adjustment is produced by a parameterized bidding function and is applied to the original score given by the advertising system for the ad in the query. Since adjusting the score affects the probability of an ad's exposure, we define the probability of winning the bid request as a function of the bid adjustment ratio. Supposing that each advertising product yields a certain revenue in a query, then under the premise of satisfying the day-level and query-level constraints, the optimization goal of the advertising system can be written as follows:
(8) 
Since queries arrive in chronological order, to satisfy the day-level constraint, if the system has exposed too many ads so far, then it should expose fewer ads later. Hence the above problem is a sequential decision-making problem.
Since there exist both day-level and query-level constraints, the traditional CMDP cannot be directly applied here. We therefore propose a special CMDP [Altman1999], which we term CMDP with per-state constraint (psCMDP). Formally, psCMDP can be defined as a tuple . Compared to the original CMDP [Altman1999], the difference is that for each trajectory, psCMDP needs to satisfy not only the trajectory-level constraint:
(9) 
but also the perstate constraint over each query:
(10) 
where , is the limit of . So the set of feasible stationary policies for a psCMDP is:
(11) 
In psCMDP, for the above advertising optimization problem, is a set of states and for each state : . is the number of ads already exposed and is the system information, such as current time, the number of queries currently visiting. is the set of actions and for each action : . is the coefficient for the th ad for query , where . is the reward function, after the system adjusts the score by action in state . Suppose is the set of exposure ads, , where is the revenue of displaying ad in . State transition P models the dynamics of queries visiting sequence and system information change. Specifically, for the constraints:

is the advertising exposure constraint over a day (trajectory): constraint ,

is the advertising exposure constraint over each query: constraint ,
So the optimal policy is as follows:
(12) 
Solution: Constrained Twolevel Reinforcement Learning
The psCMDP optimization problem can be viewed as a CMDP problem in which the number of constraints changes dynamically, since different trajectories can have different lengths. Therefore the existing approaches [Achiam et al.2017, Tessler, Mankowitz, and Mannor2018] for addressing CMDP problems cannot be directly applied here, for the following reasons. First, the existing approaches can only handle problems in which the number of constraints is fixed in advance; this assumption usually does not hold in psCMDP. Second, existing CMDP approaches such as CPO [Achiam et al.2017] use the conjugate gradient method to optimize the policy, and the computational cost rises significantly as the number of constraints increases, which eventually makes this kind of approach infeasible. Last, existing CMDP approaches [Achiam et al.2017, Tessler, Mankowitz, and Mannor2018] usually optimize a policy directly towards maximizing the expected accumulated reward while satisfying the original trajectory-level constraints. However, as analyzed previously, in the psCMDP optimization problem we may need to learn to select different constraints over different sub-trajectories to maximize the long-term advertising revenue, as long as the constraint over the whole trajectory is satisfied.
Therefore, we propose a constrained two-level reinforcement learning framework to address this constrained advertising optimization problem, as illustrated in Fig. 2. The optimization task of the high level is to learn the optimal policy for selecting constraints for different sub-trajectories while ensuring that the constraint over the whole trajectory is satisfied (Algorithm 1, Line 4). Given a sub-trajectory constraint from the high level, the low level is responsible for learning the optimal policy over its sub-trajectory while ensuring that both the sub-trajectory constraint and the per-state constraints are satisfied (Algorithm 1, Lines 2-3). In this way, the original psCMDP optimization problem is simplified by decoupling it into two independent optimization subproblems. Another benefit of decoupling it into a two-level learning problem is that when the trajectory constraint changes, the optimal policies learned at the low level can be transferred and reused without learning from scratch, so the overall learning process can be significantly accelerated. It is worth pointing out that our framework is similar to hierarchical reinforcement learning. However, in contrast to learning how to switch options [Bacon, Harb, and Precup2017] or alleviating the sparse reward problem [Kulkarni et al.2016], our work leverages the idea of hierarchy to decompose different constraints into different levels.
Lower Level Reinforcement Learning
In the low level, we assume that the sub-trajectory constraints are selected from a discrete set. Any sub-trajectory constraint must be a stronger constraint than the per-state constraint; otherwise, it would contradict the per-state constraint. Therefore, given a sub-trajectory constraint, we can first transform it into a per-state constraint, and then the original per-state constraint and the new one can be reduced to a unified per-state constraint. Given a sub-trajectory constraint, we need to learn an optimal low-level policy while also satisfying this unified per-state constraint.
One natural approach in CMDP is to guide the agent's policy update by adding an auxiliary penalty value related to the per-state constraint to each immediate reward. During the policy update, both the current and the future penalty values are then considered. However, in our setting, since each transition satisfies the constraint independently, each action selection does not need to consider its future per-state constraints. Therefore we propose a method similar to auxiliary tasks [Jaderberg et al.2016], adding an auxiliary loss function based on the per-state constraints to the original loss function. In training, the policy is updated in the direction that minimizes the weighted sum of the two losses:
(13) 
where and are the weights, are the parameters of the value network.
For example, for DQN [Mnih et al.2015], the original Bellman residual loss function is:
(14) 
and the additional loss function for perstate constraints can be defined as follows:
(15) 
where and are the online network parameters and the target network parameters. is the function of . E.g., , where is a constant of the upper bound of . is a function to measure the degree of transition satisfying constraint . The value of increases as approaches the target of and vice versa. Therefore, can guide the agent to achieve higher Q value under which also satisfies the perstate constraints. Similarly we can modify the critic update to guide the actor update for any actorcritic algorithms such as DDPG [Lillicrap et al.2015].
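To make the weighted-sum idea concrete, the sketch below combines a one-transition Bellman residual with an auxiliary per-state constraint term. Everything specific here is an assumption for illustration: the satisfaction measure `upper_bound - |ratio - target|` is just one function that grows as the exposure ratio approaches the target, and regressing the Q value towards it with a squared error is an assumed form, not the paper's Equation 15.

```python
def td_loss(q_value: float, reward: float, next_q_max: float,
            gamma: float) -> float:
    """Squared Bellman residual for a single transition (DQN-style)."""
    target = reward + gamma * next_q_max
    return (target - q_value) ** 2


def satisfaction(ad_ratio: float, target_ratio: float,
                 upper_bound: float) -> float:
    """Assumed measure: largest when ad_ratio equals target_ratio."""
    return upper_bound - abs(ad_ratio - target_ratio)


def constraint_loss(q_value: float, ad_ratio: float, target_ratio: float,
                    upper_bound: float) -> float:
    """Assumed auxiliary loss: squared gap between Q and the measure."""
    return (satisfaction(ad_ratio, target_ratio, upper_bound) - q_value) ** 2


def total_loss(q_value: float, reward: float, next_q_max: float,
               gamma: float, ad_ratio: float, target_ratio: float,
               upper_bound: float, w_td: float, w_con: float) -> float:
    """Weighted sum of the original loss and the auxiliary loss."""
    return (w_td * td_loss(q_value, reward, next_q_max, gamma)
            + w_con * constraint_loss(q_value, ad_ratio, target_ratio,
                                      upper_bound))


if __name__ == "__main__":
    print(total_loss(1.0, 1.0, 2.0, 0.5, 0.3, 0.3, 2.0, 1.0, 0.5))
```

In a real implementation both terms would be averaged over a minibatch and differentiated with respect to the online network parameters; the scalar version only shows how the two terms are weighted.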
Constrained Hindsight Experience Replay
To increase the sample efficiency, we propose leveraging the idea of hindsight experience replay (HER) [Andrychowicz et al.2017] to accelerate the training of optimal policies for different sub-trajectory constraints. HER relieves the problem of sample inefficiency in DRL training by reusing transitions, modifying their rewards according to different goals. We extend this idea and propose constrained hindsight experience replay (CHER). Unlike HER, CHER does not directly revise the reward. Instead, it uses different constraints to define the extra loss during training. The overall algorithm for training low-level policies is given in Algorithm 2. While learning a policy that satisfies one constraint, the agent collects transitions (Algorithm 2, Lines 5-8). We can replace the constraint attached to these transitions with other constraints from the set, and then reuse the samples to train a policy satisfying another constraint (Algorithm 2, Line 12).
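The relabeling step of CHER can be sketched as below. The `Transition` layout and the function name are hypothetical; the only point illustrated is that the stored constraint is swapped while the reward is left untouched, so that the same sample can feed the extra loss of a policy trained for a different constraint.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Transition(NamedTuple):
    state: Tuple[float, ...]
    action: Tuple[float, ...]
    reward: float
    next_state: Tuple[float, ...]
    constraint: float  # constraint the collecting policy was trained for


def relabel_with_constraints(transition: Transition,
                             constraint_set: Sequence[float]
                             ) -> List[Transition]:
    """Return copies of the transition, one per alternative constraint.

    Unlike HER, the reward is not modified; only the constraint used to
    define the auxiliary loss changes.
    """
    return [transition._replace(constraint=c)
            for c in constraint_set
            if c != transition.constraint]


if __name__ == "__main__":
    t = Transition((0.0,), (1.0,), 2.5, (1.0,), 0.3)
    for copy in relabel_with_constraints(t, [0.1, 0.2, 0.3]):
        print(copy.constraint, copy.reward)
```

Each relabeled copy would then be added to the replay buffer of the policy associated with its new constraint.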
Higher Level: Constraint Choice Reinforcement Learning
The goal of the high-level task is to maximize the expected long-term advertising revenue while satisfying the trajectory constraint. The details of our proposed method, constraint choice reinforcement learning (CCRL), are as follows (Algorithm 3). Each time control is returned to the high level, its policy chooses a constraint from the discrete set of constraints for the low-level policy. When a trajectory terminates, we judge whether the trajectory satisfies the trajectory constraint, and add a delayed reward for the constraint satisfaction to the final reward:
(16) 
where the first term is the reward of the final state, the second term is the additional delayed reward, and the weight scales the delayed reward.
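A minimal sketch of this terminal reward is given below. The shape of the delayed reward is an assumption: here it is zero when the day-level exposure ratio stays within the limit and negative in proportion to the violation otherwise. The names are hypothetical.

```python
def final_reward(last_reward: float, day_ratio: float,
                 day_limit: float, weight: float) -> float:
    """Reward of the final state plus a weighted delayed reward.

    Assumed delayed reward: 0 if the trajectory constraint holds,
    otherwise the negative amount by which the limit is exceeded.
    """
    delay_reward = -max(0.0, day_ratio - day_limit)
    return last_reward + weight * delay_reward


if __name__ == "__main__":
    print(final_reward(1.0, 0.3, 0.4, 10.0))  # constraint satisfied
    print(final_reward(1.0, 0.5, 0.4, 10.0))  # constraint violated
```

Since the signal only appears at the end of the day, how far it has to propagate depends on the number of high-level decisions per trajectory.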
Since the penalty signal is only added to the final reward, it is typically difficult to propagate the penalty signal back to earlier states of the trajectory, which impedes learning efficiency. In our two-level learning framework, this issue can be alleviated to a certain extent since we can control the granularity of high-level behaviors. If we increase the period length (the length of each sub-trajectory) between two consecutive high-level actions, the number of high-level decision-making steps is reduced, which facilitates the propagation of the terminal-state penalty to previous high-level states. This setup is similar to the temporal abstraction of hierarchical reinforcement learning [Bacon, Harb, and Precup2017, Kulkarni et al.2016].
Experiments
Policy  performance  Policy  performance  Policy  performance  
pvr  eCPM  pvr  eCPM  pvr  eCPM  
baseline policy  0.3558  1.3143  baseline policy  0.4100  1.6422  baseline policy  0.4608  1.8099 
our approach  0.3605  1.3942  our approach  0.4141  1.6795  our approach  0.4673  1.9023 
0.0025  0.0131  0.0016  0.0076  0.0012  0.0052 
Experimental Setup
Our experiments are conducted on a real dataset from one of the world's largest e-commerce platforms, and the data collection scenario is consistent with the problem description in Section 3. From the log data, we collect 15 recommended products and their scores from the recommendation system as candidate recommended items, together with the information of 15 advertising products, such as eCPM, price, predicted Click-Through Rate (pCTR), and initial score. We can replay the data to train and test our algorithm offline in two respects: 1) after the ad scores are adjusted, whether the quantity of ads among the 10 exposed items meets the day-level and query-level constraints, and 2) the rewards of the exposed ads. Similar settings can be found in other research work [Cai et al.2017, Jin et al.2018, Perlich et al.2012, Zhang, Yuan, and Wang2014]. In practice, the positions of ads affect user behavior; for example, ads shown earlier are more likely to be clicked. Hence the reward is defined as:
(17) 
where is the eCPM value of the ad , and corrects the eCPM by considering the influence of different positions. is also fitted using the real data.
Since the actual amount of data is very large, we sample part of the data for environment simulation, and verify that the sampled data is representative of the real data set. When training the agent, the state consists of 46 dimensions, including the characteristics of the 15 candidate ads (eCPM, price, pCTR) and the number of exposed ads. The action consists of the coefficients used to adjust the scores of the 15 ads. After the scores are adjusted using the action, the 15 candidate advertising commodities and the 15 candidate recommended commodities are sorted based on the new scores. The reward is calculated according to the ads among the first 10 exposed items. The data in the dataset interacts with the agent in chronological order, and an episode finishes at the end of the day.
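The reward computation described above can be sketched as follows. This is an illustration under stated assumptions: the position correction is modeled as a per-position multiplicative weight on eCPM, and the weights used in the example are made up, whereas the paper fits the correction from real data.

```python
from typing import Sequence, Tuple


def exposure_reward(exposed: Sequence[Tuple[str, int]],
                    ad_ecpm: Sequence[float],
                    position_weight: Sequence[float]) -> float:
    """Sum position-corrected eCPM over the ads among the exposed items.

    exposed         : (kind, index) pairs in display order, kind in
                      {'rec', 'ad'}; only ads contribute to the reward
    ad_ecpm[j]      : eCPM of the j-th candidate ad
    position_weight : assumed multiplicative correction per display slot
    """
    reward = 0.0
    for position, (kind, index) in enumerate(exposed):
        if kind == "ad":
            reward += position_weight[position] * ad_ecpm[index]
    return reward


if __name__ == "__main__":
    shown = [("ad", 0), ("rec", 0), ("ad", 1)]
    print(exposure_reward(shown, [2.0, 4.0], [1.0, 0.5, 0.25]))
```

In the experimental setting, `exposed` would hold the first 10 of the 30 sorted candidates for each replayed query.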
Does CHER improve performance?
To verify the effectiveness of CHER, we compare its impact on learning speed and stability against a baseline DDPG [Lillicrap et al.2015] under the same network structure and parameters (more details are given in the appendix). The number of ads exposed for each query cannot exceed 5, so we use a constraint set consisting of 5 constraints. Each goal in the set represents the expected average number of ads exposed per query. Intuitively, we could use the constraint as part of the input and use a single network to satisfy all constraints [Andrychowicz et al.2017]. However, for the sake of learning stability, we use 5 different networks to satisfy the different constraints. Since we use DDPG, we add the per-state constraint loss to the critic loss during critic training:
(18) 
(19) 
where is set to 10, and is the percentage of ads exposed for the i-th query. In this way, the critic is trained to maximize the cumulative reward while satisfying the per-state constraints.
We use 4 different random seeds, and the experimental results are shown in Fig. 3. Due to the space limit, we only show part of the results; the other results are given in the Appendix. The evaluation criterion is how close the result is to the target constraint: the closer the result, the better the performance. We find that, under different constraints, DDPG with CHER is better than DDPG in terms of training speed and in reaching and maintaining the constraint. For example, under the constraint shown in Fig. 3, DDPG with CHER reaches the constraint faster than DDPG and then stays near it during learning, with small variance.
Verify the Effectiveness of Constraint Choice Reinforcement Learning
Next we evaluate CCRL based on the behavior policy set. We use double DQN with the dueling structure (more details are given in the appendix) to train the higher-level constraint selection policy (CCP). We verify that, under different trajectory constraints, CCRL can improve the final revenue while satisfying the trajectory constraint. To distinguish the policies in the behavior policy set, we label each low-level policy by the constraint it was trained with. We use one hour as the temporal abstraction, which means that the higher-level CCP makes a decision every hour. After a constraint is selected, the lower-level behavior policy for this constraint is used for one hour to adjust the scores of the advertising commodities. We use hourly information as input to the DQN, such as the eCPM of the last hour, the current time, and the PVR from the beginning of the day to the present.
To make effective comparisons, we use individual policies in the behavior policy set as the baseline policy in turn. By running a fixed baseline policy for one day, we obtain the average eCPM and the number of exposed ads. Then we set the number of exposed ads under the baseline policy as the day constraint to train the CCP. The goal of the higher-level policy is, first, to achieve approximately the same number of exposed ads as the baseline policy and, second, to improve eCPM as much as possible. We can see from Table 2 and Fig. 4 (other results are provided in the appendix) that the CCP can increase the eCPM of one day compared to the baseline policy under the same constraint. We also observe that our policy learns to expose different numbers of ads in different time periods: more ads are exposed when the value of the ads in the query is higher, and fewer ads are shown in other time slots.
To verify whether the CCP finally learns a valid allocation of advertisement quantity, we take a CCP trained under the day constraint of one baseline policy and execute it continuously for 1000 days. Then we count how often each behavior policy is selected in each hour, as shown in the appendix. We calculate 1) the number of ads exposed in the current hour divided by the total number of items exposed in the current hour (Fig. 5 (a)), and 2) the number of ads exposed in the current hour divided by the total number of items exposed per day (Fig. 5 (b)). We find that during the period from hour 11 to hour 15, the CCP more often selects behavior policies that show more ads. This is consistent with the statistics of our real online data: during this period, the average eCPM of the advertising candidate set is higher than in other periods, so exposing more ads can increase the eCPM income for the day.
Conclusion and Future Work
In this paper, we investigate the advertising exposure problem with constraints at different levels in e-commerce and propose constrained two-level reinforcement learning to solve it. By explicitly considering the different levels during learning, our approach uses CHER to accelerate training in the lower level and reuses the trained behavior policies across different constraint selection policies in the higher level. As a step towards solving real-world e-commerce problems with constraints using DRL, we believe there are many interesting questions remaining for future work. One worthwhile direction is how to generalize our framework to e-commerce problems with significantly more constraints. Another direction is how to extend the option-critic architecture [Bacon, Harb, and Precup2017] to our work, automatically learning the length of each sub-trajectory rather than fixing it.
References
 [Achiam et al.2017] Achiam, J.; Held, D.; Tamar, A.; and Abbeel, P. 2017. Constrained policy optimization. arXiv preprint arXiv:1705.10528.
 [Altman1999] Altman, E. 1999. Constrained Markov decision processes, volume 7. CRC Press.
 [Andrychowicz et al.2017] Andrychowicz, M.; Wolski, F.; Ray, A.; Schneider, J.; Fong, R.; Welinder, P.; McGrew, B.; Tobin, J.; Abbeel, O. P.; and Zaremba, W. 2017. Hindsight experience replay. In Proceedings of NIPS, 5048–5058.
 [Bacon, Harb, and Precup2017] Bacon, P.-L.; Harb, J.; and Precup, D. 2017. The option-critic architecture. In Proceedings of AAAI, 1726–1734.
 [Cai et al.2017] Cai, H.; Ren, K.; Zhang, W.; Malialis, K.; Wang, J.; Yu, Y.; and Guo, D. 2017. Real-time bidding by reinforcement learning in display advertising. In Proceedings of WSDM, 661–670. ACM.
 [Chen et al.2018] Chen, S.-Y.; Yu, Y.; Da, Q.; Tan, J.; Huang, H.-K.; and Tang, H.-H. 2018. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of SIGKDD, 1187–1196. ACM.
 [Chow et al.2015] Chow, Y.; Tamar, A.; Mannor, S.; and Pavone, M. 2015. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Proceedings of NIPS, 1522–1530.
 [Feng et al.2018] Feng, J.; Li, H.; Huang, M.; Liu, S.; Ou, W.; Wang, Z.; and Zhu, X. 2018. Learning to collaborate: Multi-scenario ranking via multi-agent reinforcement learning. In Proceedings of WWW, 1939–1948. International World Wide Web Conferences Steering Committee.
 [Geibel2006] Geibel, P. 2006. Reinforcement learning for MDPs with constraints. In Proceedings of ECML, 646–653. Springer.
 [Goodfellow et al.2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep learning, volume 1. MIT Press, Cambridge.
 [Gu et al.2017] Gu, S.; Holly, E.; Lillicrap, T.; and Levine, S. 2017. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Proceedings of ICRA, 3389–3396. IEEE.
 [Hou and Zhao2017] Hou, C., and Zhao, Q. 2017. Optimization of web service-based control system for balance between network traffic and delay. IEEE Transactions on Automation Science and Engineering (99):1–11.
 [Hu et al.2018] Hu, Y.; Da, Q.; Zeng, A.; Yu, Y.; and Xu, Y. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. arXiv preprint arXiv:1803.00710.
 [Jaderberg et al.2016] Jaderberg, M.; Mnih, V.; Czarnecki, W. M.; Schaul, T.; Leibo, J. Z.; Silver, D.; and Kavukcuoglu, K. 2016. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397.
 [Jin et al.2018] Jin, J.; Song, C.; Li, H.; Gai, K.; Wang, J.; and Zhang, W. 2018. Real-time bidding with multi-agent reinforcement learning in display advertising. arXiv preprint arXiv:1802.09756.
 [Krokhmal, Palmquist, and Uryasev2002] Krokhmal, P.; Palmquist, J.; and Uryasev, S. 2002. Portfolio optimization with conditional value-at-risk objective and constraints. Journal of Risk 4:43–68.
 [Kulkarni et al.2016] Kulkarni, T. D.; Narasimhan, K.; Saeedi, A.; and Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Proceedings of NIPS, 3675–3683.
 [LeCun, Bengio, and Hinton2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436.
 [Levine et al.2016] Levine, S.; Finn, C.; Darrell, T.; and Abbeel, P. 2016. End-to-end training of deep visuomotor policies. JMLR 17(1):1334–1373.
 [Lillicrap et al.2015] Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
 [Mehta and others2013] Mehta, A., et al. 2013. Online matching and ad allocation. Foundations and Trends in Theoretical Computer Science 8(4):265–368.
 [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529.
 [Mnih et al.2016] Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of ICML, 1928–1937.
 [Perlich et al.2012] Perlich, C.; Dalessandro, B.; Hook, R.; Stitelman, O.; Raeder, T.; and Provost, F. 2012. Bid optimizing and inventory scoring in targeted online advertising. In Proceedings of SIGKDD, 804–812. ACM.
 [Prashanth and Ghavamzadeh2016] Prashanth, L., and Ghavamzadeh, M. 2016. Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning 105(3):367–417.
 [Schulman et al.2015a] Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015a. Trust region policy optimization. In Proceedings of ICML, 1889–1897.
 [Schulman et al.2015b] Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.; and Abbeel, P. 2015b. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
 [Sutton, Barto, and others1998] Sutton, R. S.; Barto, A. G.; et al. 1998. Reinforcement learning: An introduction. MIT press.
 [Tessler, Mankowitz, and Mannor2018] Tessler, C.; Mankowitz, D. J.; and Mannor, S. 2018. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074.
 [Van Hasselt, Guez, and Silver2016] Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Proceedings of AAAI, volume 2, 5. Phoenix, AZ.
 [Wu et al.2018] Wu, D.; Chen, X.; Yang, X.; Wang, H.; Tan, Q.; Zhang, X.; and Gai, K. 2018. Budget constrained bidding by model-free reinforcement learning in display advertising. arXiv preprint arXiv:1802.08365.
 [Zhang, Yuan, and Wang2014] Zhang, W.; Yuan, S.; and Wang, J. 2014. Optimal real-time bidding for display advertising. In Proceedings of SIGKDD, 1077–1086. ACM.
 [Zhao et al.2018] Zhao, J.; Qiu, G.; Guan, Z.; Zhao, W.; and He, X. 2018. Deep reinforcement learning for sponsored search real-time bidding. arXiv preprint arXiv:1803.00259.
Appendix
A1. Network structure and training parameters
CHER
Both the actor network and the critic network are four-layer fully connected neural networks, where each of the two hidden layers consists of 20 neurons and a ReLU activation function is applied to the outputs of the hidden layers. A tanh function is applied to the output layer of the actor network to bound the size of the adjusted scores. The input of the actor network and the critic network is a tensor of shape 46 representing the feature vectors of the query's candidate ad items and the number of currently exposed items. The outputs of the actor network and the critic network are 15 actions and the corresponding Q-values, respectively. The learning rate of the actor is 0.001, the learning rate of the critic is 0.0001, and the size of the replay buffer is 50,000. The exploration rate starts from 1 and decays linearly to 0.001 after 50,000 steps.
It is worth pointing out that the environment makes certain adjustments to the action, such as adding an offset or applying a scaling, to ensure that the score adjustment is in line with the business logic. Therefore, the action output by the network is not the actual adjusted score. We treat this adjustment as part of the environment logic; it does not affect the training of the network.
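The actor architecture and the exploration schedule described above can be illustrated with a minimal pure-Python sketch. It assumes that "four-layer" counts the input layer (so there are three weight matrices), that the input is a flat 46-dimensional vector, and that the exploration rate is held at its final value once the decay ends. The weight initialization, the helper names (`init_layer`, `dense`, `actor_forward`, `linear_epsilon`), and the random input are hypothetical and are not taken from our implementation.

```python
import math
import random

STATE_DIM, HIDDEN, N_ACTIONS = 46, 20, 15  # sizes quoted in the text


def init_layer(n_in, n_out, rng):
    """Return (weights, biases); the uniform init is an illustrative choice."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def relu(z):
    return max(0.0, z)


def dense(x, layer, activation):
    """Apply one fully connected layer followed by `activation`."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def actor_forward(state, layers):
    """46 inputs -> 20 ReLU -> 20 ReLU -> 15 tanh outputs in [-1, 1]."""
    h = dense(state, layers[0], relu)
    h = dense(h, layers[1], relu)
    return dense(h, layers[2], math.tanh)


def linear_epsilon(step, start=1.0, end=0.001, decay_steps=50_000):
    """Exploration rate: linear from `start` to `end`, then held at `end`."""
    fraction = min(max(step, 0), decay_steps) / decay_steps
    return start + fraction * (end - start)


rng = random.Random(0)
layers = [init_layer(STATE_DIM, HIDDEN, rng),
          init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, N_ACTIONS, rng)]
state = [rng.random() for _ in range(STATE_DIM)]  # placeholder features
action = actor_forward(state, layers)             # 15 bounded raw actions
```

The tanh output only bounds the raw action; as noted above, the environment still shifts and scales it before it becomes an actual score adjustment.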
Constraint Choice Reinforcement Learning
The DQN is a three-layer neural network. The hidden layer consists of 20 neurons, and a ReLU activation function is applied to its outputs. The hidden layer output is then connected to two heads: 1) a layer with as many nodes as there are actions, which approximates the action advantage value A(s, a), and 2) a single node, which approximates the state value V(s). Finally, we obtain the Q-value Q(s, a). The size of the replay buffer is 5000. We use prioritized replay to sample from the replay buffer. The learning rate is 0.0006. The exploration rate starts from 1 and decays linearly to 0.001 after 1000 steps. Also we set the discount factor .
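As an illustration of how the two heads can be combined, the sketch below uses the mean-subtracted aggregation that is common in dueling architectures, Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'). This particular aggregation is an assumption made for the example, and the function name `dueling_q_values` is hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) via the mean-subtracted form.

    Assumed aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a').
    """
    if not advantages:
        raise ValueError("advantages must be non-empty")
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


# Toy values for one state with three actions.
q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
greedy_action = max(range(len(q)), key=q.__getitem__)
```

Subtracting the mean advantage makes the decomposition into a state value and advantages identifiable, while leaving the greedy action unchanged.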
A2. Constrained Hindsight Experience Replay
Under different constraints, DDPG with CHER outperforms plain DDPG in training speed and in reaching and maintaining the constraints; see Fig. 6.
A3. Constraint Selection Policy Training Result
Fig. 7 compares the experimental results with those of the other baselines. The constraint selection policy clearly increases the one-day eCPM compared to the baseline policy under the same constraint.
A4. The Numbers of Different Behavior Policies Per Hour
Fig. 8 shows that during the period from hour 11 to hour 15, the constraint selection policy more often selects the behavior policy that shows more ads. This is consistent with the statistics of our real online data: during this period, the average eCPM of the advertising candidate set is higher than in other time periods, so exposing more ads can increase the eCPM income for the day.