The rapid development of the Internet and smart devices has created a favorable environment for the advertising industry. As a billion-dollar business, real-time bidding (RTB) has gained continuous attention in recent years.
Bidding System. Online users, advertisers and ad platforms constitute the main players in real-time bidding. A typical RTB setup (Fig. 1) consists of publishers, supply-side platforms (SSP), data management platforms (DMP), an ad exchange (ADX), and demand-side platforms (DSP). In one bidding round, when an online browsing activity triggers an ad request, the SSP sends this request to the DSP through the ADX, where eligible ads compete for the impression. The bidding agent, the DSP, represents advertisers: it comes up with an optimal bid and transmits the bid back to the ADX (usually within 200 ms), where the winner is selected to be displayed and is charged according to the generalized second price (GSP).
Our work focuses on the DSP, where bidding optimization happens. To conduct real-time bidding, two fundamental challenges need to be addressed. Firstly, the RTB environment is highly dynamic. In [27, 21, 29], researchers make the strong assumption that the bidding process is stationary over time. However, the sequence of user queries (e.g., those incurring impressions, clicks, or conversions) is time-dependent and mostly unpredictable, and the outcome of one auction round influences the next. Traditional algorithms usually learn an independent predictor and conduct fixed optimization that amounts to a greedy strategy, which often does not lead to the optimal return. Agents with reinforcement learning (RL) address this challenge to some extent [28, 7]. By learning from both the immediate feedback and the long-term reward, RL-based methods are able to alleviate the instability.
However, these methods are all limited to either revenue or ROI, which is only one part of the overall utility of the industry. In the problem of RTB, we posit that the utility is two-fold, as outlined: (i) the cumulative cost should be kept within the budget; (ii) the overall revenue should be maximized. In sum, the second challenge is that real-world RTB industry needs to consider multiple objectives, which is not adequately addressed in the existing literature.
To address the aforementioned challenges, we propose a Multi-Objective Actor-Critic model, named MoTiAC. We generalize the popular asynchronous advantage actor-critic (A3C) reinforcement learning algorithm to multiple objectives in the RTB setting. Our model employs several local actor-critic networks with different objectives to interact with the same environment, and then updates the global network asynchronously according to the different reward signals. Instead of using a fixed linear combination of different objectives, MoTiAC is able to decide on adaptive weights over time according to how well the current situation conforms with the agent's prior. To the best of our knowledge, this is the first multi-objective reinforcement learning model for RTB problems. We comprehensively evaluate our model on a large-scale industrial dataset, and the experimental results verify the superiority of the approach.
Contributions. The contributions of our work can be summarized as follows:
We identify two critical challenges in RTB and provide motivation to use multi-objective RL as the solution.
We generalize A3C and propose a novel multi-objective actor-critic model MoTiAC for optimal bidding, which to our knowledge is the first in the literature.
We mathematically prove that our model will converge to Pareto optimality and empirically evaluate MoTiAC using a proprietary real-world commercial dataset.
2 Related Work
Real-time Bidding. RTB has generated much interest in the research community. Extensive research has been devoted to predicting user behaviors, e.g., click-through rate (CTR) and conversion rate (CVR). However, bidding optimization remains one of the most challenging problems.
In the past, researchers have proposed static methods for optimal bidding, such as constrained optimization, to perform impression-level evaluation. However, traditional methods ignore that real-world situations in RTB are often dynamic, due to the unpredictability of user behavior and the different marketing plans of advertisers. To account for this uncertainty, the auction process of optimal bidding is now commonly formulated as a Markov decision process (MDP) [3, 21, 22, 7, 10], solved with tools from the reinforcement learning literature.
With the advancement of GPUs and deep learning (DL), more successful deep RL algorithms [12, 9, 11] have been proposed and applied to various domains. Meanwhile, there have been previous attempts to address the multi-objective reinforcement learning (MORL) problem, where the objectives are mostly combined by static or adaptive linear weights [13, 1] (single-policy) or captured by a set of policies and evolving preferences (multiple-policy). For more detailed discussions of MORL, we encourage readers to refer to [16, 24].
3.1 Background of oCPA
In the online advertising realm, there are three main ways of pricing. Cost-per-mille (CPM) is the first standard, where revenue is proportional to impressions. Cost-per-click (CPC) is a performance-based model, i.e., the platform gets paid only when users click the ad. Then comes the cost-per-acquisition (CPA) model, with payment attached to each conversion event. Despite all of these changes, one thing stays the same: ad platforms try to maximize revenue while keeping the overall cost within the budget predefined by advertisers.
In this work, we focus on one pricing model, called optimized cost-per-acquisition (oCPA), in which advertisers set a target CPA price for each conversion, and platforms charge for every click.
3.2 Bidding Process and RTB Problem
As shown in Fig. 2, in each bidding round the smart agent is required to generate a cpcbid for each ad. Meanwhile, the predicted CVR and CTR (denoted as pCVR and pCTR) are calculated based on the features. The estimated cost-per-mille bid, called eCPM, is given by,
which will be ranked later. The ad with the highest eCPM wins the bid and is displayed on the user's screen. This happens in the impression stage. After the impression, the ad gets a chance to be clicked and, hopefully, to convert later. The overall cost is calculated by multiplying the total clicks by the per-click charge (the second largest cpcbid in that round according to GSP), and it should stay within the advertiser's budget. The overall revenue is calculated by multiplying the total conversions by the target CPA price. The key part of the RTB problem is how to calculate cpcbid and properly allocate impressions to ads so as to meet two objectives: (i) the cost stays within the pre-defined budget, (ii) while the total revenue is maximized.
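One bidding round as described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the eCPM formula is not preserved in this text, so the sketch uses the common convention eCPM = 1000 × pCTR × cpcbid, and the `Ad` class, the `run_auction` function and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ad:
    name: str
    cpcbid: float  # per-click bid produced by the bidding agent
    pctr: float    # predicted click-through rate


def ecpm(ad: Ad) -> float:
    # ASSUMPTION: standard convention, expected cost per 1000 impressions.
    return 1000.0 * ad.pctr * ad.cpcbid


def run_auction(ads):
    """Return (winning ad, price charged per click) for one round."""
    if len(ads) < 2:
        raise ValueError("a second-price round needs at least two ads")
    winner = max(ads, key=ecpm)  # ads are ranked by eCPM
    # The charge is the second largest cpcbid in the round, as in the text.
    price = sorted((ad.cpcbid for ad in ads), reverse=True)[1]
    return winner, price


ads = [Ad("a", 2.0, 0.020), Ad("b", 1.5, 0.020), Ad("c", 1.0, 0.030)]
winner, price = run_auction(ads)
print(winner.name, price)
```

In this toy round, ad "a" has the highest eCPM and would be charged the second largest cpcbid for each click it receives.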
In this section, we formulate the multi-objective actor-critic model (MoTiAC), to address the aforementioned problem in dynamic RTB environment. The organization of this section is: (i) in Sec. 4.1, we will give a brief introduction of A3C model in RTB setting; (ii) we propose to use Reward Partition for multiple rewards and justify its superiority in Sec. 4.2; (iii) in Sec. 4.3, we present interesting analysis of model updating and prove that it will converge to Pareto optimality.
4.1 A3C Model in RTB
The actor-critic model was first proposed in [8]; [9] then generalized the discrete action space to a continuous one, and [11] advanced it with multiple actor-critics updating the same global network asynchronously (A3C). A typical actor-critic reinforcement learning setting consists of:
action: the action is the generated cpcbid for each ad based on the input state. Instead of using a discrete action space, the output of our model is the action distribution, from which we sample the cpcbid.
reward: reward is a feedback signal from the environment to evaluate how good the current action is. In our model, each objective has its own reward.
policy: the policy defines how the agent takes action $a$ at state $s$, which can be denoted as a probability distribution $\pi(a \mid s)$. The policy is the key in RL-based models.
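In an actor-critic architecture, the critic uses learning methods with a neural network, and the actor is updated in an approximated gradient direction based on the information provided by the critic. There exists a self-improving loop in this process: following the current policy $\pi_\theta$, the RL agent plays an action $a_1$ in state $s_1$ and receives reward signal $r_1$ from the environment. This action then leads to a new state $s_2$, and the agent repeats this round again. One trajectory can be written as $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \ldots\}$. For each policy $\pi_\theta$, we define the utility function as,
$$U(\pi_\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[R(\tau)\big], \tag{2}$$
where $p_\theta(\tau)$ denotes the distribution of trajectories under policy $\pi_\theta$, and $R(\tau)$ is a return function over trajectory $\tau$, typically calculated by summing all the reward signals in the trajectory. After sampling trajectories from policy $\pi_\theta$, the parameters $\theta$ are updated after one or several rounds based on the tuples $(s_t, a_t, r_t)$ in each trajectory in order to maximize the utility $U(\pi_\theta)$. Stochastic gradient descent (SGD), applied to the negated utility, is used in the updating of actor parameters ($\eta$ is the learning rate),
$$\theta \leftarrow \theta + \eta \sum_{t} R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \tag{3}$$
where $R_t = \sum_{i \ge t} \gamma^{\,i-t} r_i$ is the cumulative discounted reward and $\gamma$ denotes the decaying factor. The critic then calibrates the gradient by adding a baseline reward using the value network $V(s)$. Consider the simple and robust Monte Carlo method (for the advantage part); formally,
$$\theta \leftarrow \theta + \eta \sum_{t} \big(R_t - V(s_t)\big) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t). \tag{4}$$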
Asynchronous advantage actor-critic (A3C) [11] is a distributed variant of the actor-critic model. In A3C, there is one global network and several local networks. The local networks copy the global parameters periodically, and they run in parallel, pushing gradients to the global network asynchronously.
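To make the return and advantage terms concrete, here is a minimal sketch in plain Python of the Monte Carlo quantities that a single worker would compute from one trajectory. The function names and the toy numbers are illustrative and not taken from the paper; in practice the value estimates would come from the critic network.

```python
def discounted_returns(rewards, gamma):
    """Cumulative discounted reward R_t for every step t of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t reuses R_{t+1}: R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def monte_carlo_advantages(rewards, values, gamma):
    """Advantage R_t - V(s_t), with V(s_t) supplied by the critic."""
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    return [r - v for r, v in zip(discounted_returns(rewards, gamma), values)]


rewards = [1.0, 1.0, 1.0]        # hypothetical per-step rewards
values = [1.0, 1.0, 1.0]         # hypothetical critic estimates V(s_t)
print(discounted_returns(rewards, 0.5))
print(monte_carlo_advantages(rewards, values, 0.5))
```

Each advantage weights the log-probability gradient of the corresponding action; a positive advantage makes that action more likely under the updated policy.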
4.2 MoTiAC with Reward Partition
As stated in Sec. 3.2, the RTB business requires multiple objectives. A natural way is to linearly integrate them into a single one, and we call it Reward Combination. However, it is usually ineffective in most real-world cases. Thus, we are motivated to propose Reward Partition. In this subsection, we consider the general $K$-objective case.
Reward Combination. One intuitive way of handling multiple objectives is to (i) first compute a linear combination of the rewards, in which each element of the weight vector $\omega$ quantifies the relative importance of the corresponding objective: $r = \sum_{k=1}^{K} \omega_k r_k$ for $K$ objectives; (ii) and then define a single-objective agent whose expected return equals the value function $V(s)$.
However, these implementations overlook that one or both of the steps can be infeasible or undesirable in practice: (i) A weighted combination is only valid when objectives do not compete. In the RTB setting, however, the relation between objectives can be complicated, and they usually conflict. (ii) The intuitive combination might flatten the gradient with respect to each objective, so the agent is likely to confine itself to a narrow region of the search space. (iii) A pre-defined combination may not be flexible enough, especially in a changing environment. Overall, Reward Combination is unstable and inappropriate for the RTB problem.
Reward Partition. We therefore propose the Reward Partition scheme. In our architecture, we design a reward for each objective and employ one group of local networks on a feature subset with that reward. Note that all the local networks share the same structure. There is one global network with an actor and multiple critics in our model. At the start of one iteration, each local network copies parameters from the global network and begins exploration. Local networks from each group then explore based on their own objective and asynchronously push weighted gradients to the actor and to one of the critics (partial update) in the global network (Fig. 3).
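Formally, we denote the total utility and value function of the $k$-th group ($k = 1, \ldots, K$) as $U^k(\pi_\theta)$ and $V_k(s)$, respectively. The parameter updating can be formulated as,
$$\theta \leftarrow \theta + \eta \, \omega_k \sum_{t} \big(R^k_t - V_k(s_t)\big) \, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \tag{6}$$
where $\eta$ is the learning rate, $\omega_k$ is the weight attached to the gradients of the $k$-th group, and $R^k_t$ is the cumulative discounted reward of the $k$-th objective.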
Motivated by Bayesian RL [5], we parameterize the weight $\omega_k$ of each objective by introducing a latent multinomial variable over the objectives, conditioned on the trajectory $\tau$. We call it the agent's prior. In the beginning, we set the initial prior as,
where the trajectory has just begun. When the trajectory is up to a given state, we update the posterior using Bayes' rule,
where the likelihood term tells how well the current trajectory agrees with the utility of objective $k$.
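The Bayes-rule step can be illustrated with a short, generic sketch. It is not the authors' exact update: the uniform starting prior and the likelihood values below are assumptions made for the example, since the text does not preserve how the initial prior or the agreement between a trajectory and an objective is computed.

```python
def update_objective_weights(prior, likelihoods):
    """One Bayes-rule step: posterior_k is proportional to likelihood_k * prior_k."""
    if len(prior) != len(likelihoods):
        raise ValueError("prior and likelihoods must have the same length")
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("at least one objective needs non-zero support")
    return [u / total for u in unnormalized]


# ASSUMPTION: two objectives and a uniform starting prior.
weights = [0.5, 0.5]
# HYPOTHETICAL likelihoods: how well the trajectory so far agrees with
# objective 1 and objective 2 respectively.
weights = update_objective_weights(weights, [0.25, 0.75])
print(weights)
```

The resulting weights always sum to one, so they can be used directly to scale the gradients pushed by each group of local networks.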
Our scheme shows several advantages. First, different rewards are not explicitly combined in the model, so the conflicts between them are addressed to some extent. Second, each local network aims at only one objective, so the model can explore a relatively larger space. Third, we do not use a complex reward combination, which is usually hard to learn in most real cases. Instead, we use multiple value functions to approximate multiple single rewards on subsets of features, making the critics easy to learn.
4.3 Analysis of MoTiAC
In this section, we will view our model from mathematical perspective and provide sound properties of MoTiAC. Firstly, we show that if we attach the weights of Reward Combination to the gradients in Reward Partition, the result of parameters updating should be identical on average. Secondly, we prove that with the specially designed agent’s prior (in Eqn. (9)), our model will converge to Pareto optimality.
Gradient Analysis. For Reward Combination, we have mentioned that the rewards of different objectives are linearly aggregated by a weight vector. As in Eqn. (5), by applying standard SGD (Eqn. (3)), the parameter is updated,
while in Reward Partition, each group of local networks updates gradients w.r.t. its own objective. If the updating follows the same weights, we can easily aggregate Eqn. (6). Then, the expectation of the gradient is given by,
By comparing the updating formulas of Reward Combination and Reward Partition, we find that the difference lies in the advantage part, and that the effect of the update depends exactly on how well the critic(s) can learn from the reward(s). By learning at a decomposed level, Reward Partition improves on Reward Combination by using easy-to-learn functions to approximate single rewards, and thus yields a better policy.
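The claim that the two schemes agree on average, apart from the advantage baseline, can be checked with a one-line derivation. The notation here is assumed for illustration rather than taken from the source: $\omega_k$ is a fixed weight, $r^k_i$ the reward of objective $k$ at step $i$, $\gamma$ the discount factor, and $R^k_t$ the discounted return of objective $k$.

```latex
\begin{aligned}
R_t \;=\; \sum_{i \ge t} \gamma^{\,i-t} \Big( \sum_{k=1}^{K} \omega_k\, r^k_i \Big)
    \;=\; \sum_{k=1}^{K} \omega_k \sum_{i \ge t} \gamma^{\,i-t}\, r^k_i
    \;=\; \sum_{k=1}^{K} \omega_k\, R^k_t .
\end{aligned}
```

Under these assumptions the return of the combined reward equals the weighted sum of the per-objective returns, so any remaining difference between the two updates comes from the baselines: one critic fitted to the combined reward versus $K$ critics, each fitted to a single reward.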
Convergence Analysis. Next, we prove that the global policy will converge to Pareto optimality between these objectives. The utility expectation of the $k$-th objective is denoted as $U^k(\pi)$. We begin the analysis with Theorem 1,
Theorem 1 (Pareto Optimality). If $\pi^*$ is a Pareto optimal policy, then for any other policy $\pi$, one can find at least one $k$, with $1 \le k \le K$, such that $U^k(\pi^*) \ge U^k(\pi)$.
The multi-objective setting assumes that the possible policy set spans a convex space ($K$-simplices, where $K$ is the number of objectives). Based on Theorem 1, the optimal policy of any affine interpolation of the objective utilities will also be optimal. We restate this in Theorem 2 by considering only the non-negative region.
Theorem 2. $\pi^*$ is Pareto optimal iff there exist weights $\omega_k \ge 0$ ($k = 1, \ldots, K$) such that,
$$\pi^* \in \arg\max_{\pi} \sum_{k=1}^{K} \omega_k \, U^k(\pi).$$
We derive the gradient by aggregating Eqn. (6) as,
In the experiment, we use real-world industry data to answer the following questions. Q1: how well can MoTiAC perform in general? Q2: what is the best way to combine multiple objectives? Q3: where and why does MoTiAC work?
|Date||# of Ads||# of clicks||# of conversions|
5.1 Experiment Setup
Dataset. In the experiment, the dataset is collected from company T's ad bidding system, ranging from Jan. 7th 2019 to Jan. 11th 2019. There are nearly 10,000 ads each day, with a huge volume of click and conversion logs. Following the real-world business, the bidding interval is set to 10 minutes (144 bidding sessions per day), which is much shorter than 1 hour. Basic statistics can be found in Table 1.
The evaluation requires a large memory load. We implement all the methods with PyTorch on two 128 GB-memory machines with 56 CPUs. We perform five-fold cross-validation, i.e., using 4 days for training and the remaining day for testing, and then report the averaged results. Similar settings can be found in the literature [3, 21, 29].
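The evaluation protocol can be read as a leave-one-day-out split over the five days of data. The sketch below shows that reading together with the session arithmetic; the date strings and the helper name are illustrative, and the exact split used by the authors may differ.

```python
# Five consecutive days, Jan. 7th to Jan. 11th 2019.
DAYS = ["2019-01-07", "2019-01-08", "2019-01-09", "2019-01-10", "2019-01-11"]

# A 10-minute bidding interval gives 24 * 60 / 10 sessions per day.
SESSIONS_PER_DAY = 24 * 60 // 10


def leave_one_day_out(days):
    """Return one (training days, test day) pair per day in the dataset."""
    return [([d for d in days if d != test_day], test_day) for test_day in days]


splits = leave_one_day_out(DAYS)
for train_days, test_day in splits:
    print(test_day, "<-", train_days)
```

Averaging a metric over the five test days then gives the reported number for each method.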
5.2 Compared Baselines
We carefully select related methods for comparison, and adopt the same settings for all compared methods, with 200 iterations. Proportional-Integral-Derivative (PID) [2] is a widely used feedback control policy, which produces the control signal from a linear combination of the proportional, the integral and the derivative terms. PID requires no training. In company T's online ad system, PID is currently used to control bidding. We employ it as a standard baseline and report the experimental results of the other methods relative to it. Two state-of-the-art RTB methods are selected: Reinforcement Learning to Bid (RLB) [3] and Distributed Coordinated Multi-Agent Bidding (DCMAB) [7]. Since they are both based on DQN, we use the same discrete action space (with an interval of 0.01) for both. Aggregated A3C (Agg-A3C) [11] is a standard A3C with a linearly combined reward; we implement it to compare with MoTiAC and show the superiority of our Reward Partition scheme. In the experiment, without loss of generality, we linearly combine multiple rewards (following Reward Combination) for all baselines. We also adopt two simple variants of our model that consider only one of the objectives: Objective1-A3C (O1-A3C) and Objective2-A3C (O2-A3C). We denote our model as Multi-objective Actor-Critics (MoTiAC).
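For readers unfamiliar with the PID baseline, a textbook discrete-time PID controller looks roughly as follows. This is a generic sketch and not company T's production controller: the gains, the time step and the choice of error signal (for instance, target ROI minus current ROI) are assumptions.

```python
class PIDController:
    """Textbook discrete PID: u = kp*e + ki*sum(e*dt) + kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=1.0):
        # Integral term accumulates the error over time.
        self.integral += error * dt
        # Derivative term is zero on the first call (no previous error yet).
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# HYPOTHETICAL gains and error values, chosen only to show the mechanics.
controller = PIDController(kp=1.0, ki=0.5, kd=0.25)
first = controller.step(2.0)
second = controller.step(1.0)
print(first, second)
```

Because the control signal depends only on the current, accumulated and differenced error, the controller reacts to immediate feedback and needs no training, which matches the role PID plays as a baseline here.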
5.3 Evaluation Metrics
In Sec. 3.2, we stated that the agent's goal is to (i) keep the cost within the budget and (ii) maximize total revenue. In the experiments, we follow the industrial convention and re-define our goal as maximizing both revenue and return on investment (ROI). We introduce the two terms in detail below.
Revenue. Revenue is a frequently used indicator of an advertiser's earnings, and it is proportional to the number of conversions of each ad. In the experimental comparison, we use the total revenue of all ads to show the achievement of the first objective.
ROI. Cost is the amount of money invested by advertisers; in this setting, it is the total amount charged for clicks. The ratio of revenue to cost is the so-called return on investment (ROI), showing the joint benefits of all advertisers. We use it to indicate our second objective.
To make the comparison easier, we also use the R-score proposed in [10] to evaluate model performance. The higher the R-score, the more satisfied the advertisers and the platform will be. Note that most of the comparison results are reported relative to PID, i.e., divided by the corresponding PID value, except in Sec. 5.6.
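The two metrics and the normalization against PID amount to simple arithmetic, sketched below. The function names and the sample figures are hypothetical, and the R-score is left out because its formula is not given in this text.

```python
def revenue(conversions, target_cpa):
    """Revenue is proportional to conversions (target CPA per conversion)."""
    return conversions * target_cpa


def cost(clicks, price_per_click):
    """Cost is the total amount charged for clicks."""
    return clicks * price_per_click


def roi(total_revenue, total_cost):
    """Return on investment: revenue divided by cost."""
    if total_cost <= 0:
        raise ValueError("cost must be positive")
    return total_revenue / total_cost


def relative_to_baseline(value, baseline_value):
    """Express a metric relative to the baseline (PID) value."""
    return value / baseline_value


# HYPOTHETICAL numbers for one ad.
total_revenue = revenue(conversions=10, target_cpa=12.0)
total_cost = cost(clicks=50, price_per_click=2.0)
model_roi = roi(total_revenue, total_cost)
print(model_roi, relative_to_baseline(model_roi, 0.96))
```

A relative value above 1 means the model beats PID on that metric; an ROI near 1 means revenue and cost are balanced.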
| Model | Relative Revenue | Relative ROI | R-score |
| --- | --- | --- | --- |
| DCMAB | 1.0019 (+0.19%) | 0.9665 (-3.35%) | 0.9742 |
| RLB | 0.9840 (-1.60%) | 1.0076 (+0.76%) | 0.9966 |
| Agg-A3C | 1.0625 (+6.25%) | 0.9802 (-1.98%) | 0.9929 |
| O1-A3C | 0.9744 (-2.56%) | 1.0170 (+1.70%) | 1.0070 |
| O2-A3C | 1.0645 (+6.45%) | 0.9774 (-2.26%) | 0.9893 |
| MoTiAC | 1.0421 (+4.21%) | 1.0267 (+2.67%) | 1.0203 |
5.4 Answer to Q1: General Experiment
We report basic comparison of MoTiAC and other approaches in Table 2. Note that results are on the basis of PID, and values in parentheses show improvement/reduction percentage.
Result. In general, MoTiAC achieves the highest ROI and the highest overall R-score among all methods, while its Revenue lift (+4.21%) is exceeded only by Agg-A3C and O2-A3C, both of which lose ROI relative to PID. DCMAB is shown to be the worst one relatively: although it gains a slightly higher Revenue (first objective) than PID, its cumulative ROI is much lower than that of PID, which is unacceptable. The reason might be that ads are hard to cluster in RTB dynamics, so the multiple agents in DCMAB cannot exploit their advantage. Similarly, RLB gives a mediocre performance. We deem that its intrinsic dynamic programming algorithm tends to be conservative: it seems to give up less profitable Revenue in order to maintain the overall ROI. These two methods also suggest that a discrete action space is not an optimal setting for this problem. The poor result of solely applying the weighted sum in a standard A3C (Agg-A3C) is not surprising: because the RTB environment varies a lot, a fixed formula for reward aggregation cannot capture that.
It is worth noting that the two ablation models O1-A3C and O2-A3C present two extreme situations. O1-A3C performs well on the second objective but poorly on the first, and vice versa for O2-A3C. By shifting the priority of different objectives over time, our proposed MoTiAC uses the agent's prior as a reference for future decisions, which captures the dynamics of the RTB sequence. As a result, it is the only method that improves both Revenue and ROI over PID, and it obtains the best R-score.
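The trade-off in Table 2 can also be read through the lens of Pareto dominance. The sketch below applies a standard dominance check to the reported (relative Revenue, relative ROI) pairs; the helper names are illustrative, and the equal-weight sum at the end is just one possible scalarization, not the R-score.

```python
# (relative Revenue, relative ROI) pairs as reported in Table 2.
RESULTS = {
    "DCMAB": (1.0019, 0.9665),
    "RLB": (0.9840, 1.0076),
    "Agg-A3C": (1.0625, 0.9802),
    "O1-A3C": (0.9744, 1.0170),
    "O2-A3C": (1.0645, 0.9774),
    "MoTiAC": (1.0421, 1.0267),
}


def dominates(u, v):
    """True if u is at least as good as v everywhere and better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(results):
    """Names of the methods that no other method dominates."""
    return sorted(
        name
        for name, point in results.items()
        if not any(dominates(other, point) for other in results.values())
    )


def best_weighted(results, weights):
    """Method maximizing a weighted sum of the two metrics."""
    return max(results, key=lambda n: sum(w * x for w, x in zip(weights, results[n])))


front = pareto_front(RESULTS)
print(front)
print(best_weighted(RESULTS, (0.5, 0.5)))
```

On these numbers, Agg-A3C, O2-A3C and MoTiAC are mutually non-dominated, while DCMAB, RLB and O1-A3C are each dominated by MoTiAC; with equal weights on the two metrics, MoTiAC has the largest weighted sum.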
5.5 Answer to Q2: Variation of the Objective Weights
To give a comprehensive view of MoTiAC, we have tried different ways to aggregate objectives in our model. We mainly consider and present the interesting variants of the objective weights.
Variation of the objective weights. In this experiment, four variants are considered. Since we have two objectives, there is one weight for the first objective and one for the second:
equal priority: both objectives receive the same weight;
changing priority: the weights shift between the two objectives over time, controlled by a scalar;
random priority: the weights are drawn at random;
Bayesian priority: One can refer to Eqn. (9).
As shown in Fig. 4, we present training curves for ROI and Revenue. The first three strategies are designed before training and do not adjust to the changing environment. It turns out that they perform similarly on both objectives and gain a decent improvement over PID, by around +2.5% in ROI and +3% in Revenue. To be more specific, with equal priority, the ROI curve generally drops as the iterations go up, which stems from the fact that fixed equal weights cannot fit the dynamic environment. For changing priority, it is interesting that ROI first increases and then decreases as the priority shifts, because different priorities lead to different optima. With random priority, the curves fluctuate sharply within a small range, since the priority also fluctuates randomly. The Bayesian priority, on the contrary, sets the priority based on the conformity between the agent's prior (learned from the history) and the current state. Reward Partition with the agent's prior dominates the first three strategies, with an increasingly higher ROI (+2.7%) and a better Revenue (around +4.2%).
|Model||Revenue (CNY)||Cost (CNY)||ROI|
5.6 Answer to Q3: Case Study
In this section, we try to understand where and why MoTiAC works well in the RTB problem. We choose two typical ads with large numbers of conversions and show the bidding process within 24 hours. As PID is the current model in our real ad system, we compare PID with our model and draw the resulting ROI and Revenue curves in Fig. 5 and Fig. 6. We also collect the final results in Table 3. Note that in real-world business, only the final number matters. Therefore, in this problem, we only care about the final results.
Fig. 5 shows the ad's response to the PID and MoTiAC models. For the Revenue curve, both models rise over time, as expected, and the result of MoTiAC dominates that of PID. The ROI curve fluctuates a lot in the beginning. It is easy to see that the PID model is able to adjust this metric through its negative feedback mechanism: it drags the red dash-dot line quickly towards 1.0 and afterwards maintains this trend. The grey solid line shows the process of MoTiAC. We can observe that MoTiAC tries to lift the very low value at first, then starts to explore (maintaining a relatively low ROI) at around 6h. However, at the end of the day, the two models reach a similar ROI of around 1.0 (the desirable result in RTB).
Fig. 6 shows a different ad with a rather low initial ROI. For this ad, both models first try to lift the ROI. In the left panel, the red dashed curve of PID rises sharply from 0 to about 0.7 at time 8h. The likely explanation is that PID has given up most of the bid chances and concentrates only on those with a high conversion rate (CVR), which is why we observe a low Revenue gain for the PID model in the right panel from 8h to around 21h. Although its ROI curve remains relatively low, MoTiAC is able to select good impression-level chances in that situation while still considering the other objective. At 24h, neither model can bring ROI up to 1.0, but MoTiAC finally surpasses PID in this metric because of the high volume of previously gained Revenue. In sum, with long-term consideration, MoTiAC beats PID on both the cumulative ROI and Revenue.
We conclude that PID is greedy because of its immediate feedback mechanism: it is always concerned with the current situation and never considers further benefits. When the current state is under control, as shown in Fig. 5 (after 4h), PID appears conservative and follows a short-sighted strategy, which usually results in a seemingly good ROI and a poor Revenue (like the red curve in Fig. 6). In contrast, our model MoTiAC possesses an overall perspective. It foresees the long-run benefit and keeps exploring, even if this means temporarily deviating from the right direction (ROI curve for the ad after 3h) or slowing down the rising pace (ROI curve for the ad at 8h). With this global view, MoTiAC can finally reach a similar ROI but a better Revenue than PID.
6 Conclusion and Future Directions
In this paper, we propose Multi-Objective Actor-Critics (MoTiAC for short) for bidding optimization in RTB systems. To the best of our knowledge, MoTiAC is the first to utilize specialized actor-critics to solve the problem of multi-objective bid optimization. By learning priors from historical data, our model is able to follow adaptive strategies in a dynamic RTB environment and output the optimal bidding policy. We conduct extensive experiments on a real-world industrial dataset and provide an analysis of model properties, especially the convergence to Pareto optimality. Empirical results show that on off-line ad click data, MoTiAC outperforms the state-of-the-art bidding algorithms and can generate a +4.2% lift in revenue and +2.7% in ROI for T's advertising platform.
One future direction could be to extend the multi-objective solution with priors to multi-agent reinforcement learning. Another possible direction is to apply this method to other real-world RL applications.
- [1] (2019) Dynamic weights in multi-objective deep reinforcement learning. In ICML, pp. 11–20.
- [2] (1993) Development of the PID controller. IEEE Control Systems, pp. 58–62.
- [3] (2017) Real-time bidding by reinforcement learning in display advertising. In WSDM, pp. 661–670.
- [4] (2017) Toward negotiable reinforcement learning: shifting priorities in Pareto optimal sequential decision-making. arXiv:1701.01302.
- [5] Bayesian reinforcement learning: a survey. Foundations and Trends® in Machine Learning 8 (5-6), pp. 359–483.
- [6] (2017) DeepFM: a factorization-machine based neural network for CTR prediction. In IJCAI, pp. 1725–1731.
- [7] (2018) Real-time bidding with multi-agent reinforcement learning in display advertising. arXiv:1802.09756.
- [8] (2000) Actor-critic algorithms. In NeurIPS, pp. 1008–1014.
- [9] (2015) Continuous control with deep reinforcement learning. arXiv:1509.02971.
- [10] (2019) Reinforcement learning with sequential information clustering in real-time bidding. In CIKM, pp. 1633–1641.
- [11] (2016) Asynchronous methods for deep reinforcement learning. In ICML, pp. 1928–1937.
- [12] (2013) Playing Atari with deep reinforcement learning. arXiv:1312.5602.
- [13] (2018) Multi-reward reinforced summarization with saliency and entailment. arXiv:1804.06451.
- [14] (2012) Bid optimizing and inventory scoring in targeted online advertising. In SIGKDD, pp. 804–812.
- [15] (2015) Multi-objective reinforcement learning with continuous Pareto frontier approximation. In AAAI.
- [16] (2013) A survey of multi-objective sequential decision-making. JAIR 48, pp. 67–113.
- [17] (2018) Multi-task learning as multi-objective optimization. In NeurIPS, pp. 525–536.
- [18] (2000) Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, pp. 1057–1063.
- [19] (2017) Hybrid reward architecture for reinforcement learning. In NeurIPS, pp. 5392–5402.
- [20] (2007) Position auctions. IJIO 25 (6), pp. 1163–1178.
- [21] (2017) LADDER: a human-level bidding agent for large-scale real-time online auctions. arXiv:1708.05565.
- [22] (2018) Budget constrained bidding by model-free reinforcement learning in display advertising. In CIKM.
- [23] (2015) Smart pacing for effective online ad campaign optimization. In SIGKDD, pp. 2217–2226.
- [24] (2019) A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In NeurIPS.
- [25] (2018) A nonparametric delayed feedback model for conversion rate prediction. arXiv:1802.00255.
- [26] (2013) Real-time bidding for online advertising: measurement and analysis. In ADKDD, pp. 3.
- [27] (2014) Optimal real-time bidding for display advertising. In SIGKDD, pp. 1077–1086.
- [28] (2018) Deep reinforcement learning for sponsored search real-time bidding. arXiv:1803.00259.
- [29] (2017) Optimized cost per click in Taobao display advertising. In SIGKDD, pp. 2191–2200.