1 Introduction
In E-commerce, online advertising plays an essential role in helping merchants reach their target users, and Real-Time Bidding (RTB) (Zhang et al., 2014, 2016; Zhu et al., 2017) is an important mechanism for it. In RTB, each advertiser is allowed to bid for every individual ad impression opportunity. Within a period of time, a number of impression opportunities (user requests) arrive sequentially. For each impression, each advertiser offers a bid based on the impression value (e.g., revenue) and competes with other bidders in real time. The advertiser with the highest bid wins the auction, displays its ad, and enjoys the impression value. Displaying an ad also incurs a cost: in the Generalized Second-Price (GSP) auction (Edelman et al., 2007), the winner is charged according to the second-highest bid. The typical objective of an advertiser is to maximize the cumulative revenue of the impressions it wins over a time period under a fixed budget constraint.
In a digital age, to drive conversion, advertisers can reach and influence users across various channels such as display ads, social ads and paid search ads (Ren et al., 2018). As illustrated in Figure 9, the user’s decision to convert (purchase a product) is usually driven by multiple interactions with ads. Each ad exposure influences the user’s preferences and interests, and therefore contributes to the final conversion. However, existing advertising systems (Yuan et al., 2013; Zhang et al., 2014; Ren et al., 2017; Zhu et al., 2017; Jin et al., 2018; Ren et al., 2019) mainly focus on maximizing the single-step revenue while ignoring the contribution of previous exposures to the final conversion, and thus usually fall into suboptimal solutions. The reason is that simply optimizing the total immediate revenue cannot guarantee the maximization of long-term cumulative revenue. Besides, some works (Boutilier & Lu, 2016; Du et al., 2017; Cai et al., 2017; Wu et al., 2018) optimize the overall revenue over an extra-long request sequence (billions of requests) using a single Constrained Markov Decision Process (CMDP) (Altman, 1999). However, these methods are also myopic, as they ignore the mental evolution of each user and long-term advertising effects. Their learning is particularly inefficient as well.
Apart from the myopic approaches, some studies consider the long-term effect of each ad exposure. Multi-touch attribution (MTA) (Ji & Wang, 2017; Ren et al., 2018; Du et al., 2019) studies how to assign credit to the ad displays that precede a conversion. However, these methods only determine the contribution of each ad exposure and do not provide methods to optimize the strategies. Besides, since all media channels could affect users’ conversions, Li et al. (2018) and Nuara et al. (2019) propose multi-channel budget allocation algorithms to help advertisers understand how particular channels contribute to user conversions. They optimize the budget allocation among all channels accordingly to maximize the overall revenue. However, the granularity of their optimization is too coarse: they only optimize the budget allocation at the channel level and do not specifically optimize the advertising sequence for each user, which could lead to suboptimal overall performance.
Considering the shortcomings of existing works, we aim at optimizing the budget allocation of an advertiser among all users such that the cumulative revenue of the advertiser is maximized, by explicitly taking into consideration the long-term influence of ad exposures on individual users. This problem consists of two levels of coupled optimization: bidding strategy learning for each user and budget allocation among users, which we term the Dynamic Knapsack Problem. Different from the traditional knapsack problem, a number of challenges arise: 1) Given the estimated long-term value and cost for each user, the optimization space of the budget allocation grows exponentially in the number of users. Besides, since different advertising policies for each user lead to different long-term values and costs, the overall optimization space is extremely large. 2) The long-term cumulative value and cost for each user are unknown and difficult to estimate accurately.
To address the above challenges, we propose a novel bilevel optimization framework, Multi-channel Sequential Budget Constrained Bidding (MSBCB), which transforms the original bilevel optimization problem into an equivalent two-level optimization with a significantly reduced search space. The higher level only needs to optimize over a one-dimensional variable, and the lower level learns the optimal bidding policy for each user and computes the corresponding optimal budget allocation solution. For the lower level, we derive an optimal reward function with a theoretical guarantee. Besides, we also propose an action space reduction approach to significantly increase the learning efficiency of the lower level. Finally, extensive offline analyses and online A/B testing conducted on one of the world’s largest E-commerce platforms, Taobao, show the superior performance of our algorithm over state-of-the-art baselines.
2 Formulation: Dynamic Knapsack Problem
Within a time period of $T$ days, we assume that there are $N$ users visiting the E-commerce platform. Each user may interact with the app multiple times and trigger multiple advertising requests. During the sequential interactions between an ad and a user, each ad exposure can influence the user’s mind and therefore contributes to the final conversion. Given a fixed ad, for each user $i$, we build a separate Markov Decision Process (MDP) (Sutton & Barto, 2018) to model the user’s sequential interactions with that ad. We use $\pi_i$ to denote the advertising policy of the ad towards user $i$, which takes user $i$’s state as input and outputs the auction bid. Details of the MDP are discussed in Section 3.2. For the fixed ad, we define $V_G(i|\pi_i)$ and $C_G(i|\pi_i)$ as the expected long-term cumulative value and cost for user $i$ under policy $\pi_i$. Formally,
$$V_G(i|\pi_i) = \mathbb{E}\left[V_G^i \mid \pi_i\right] = \mathbb{E}\left[\sum_{t=1}^{T_i} v_t \,\middle|\, \pi_i\right], \qquad C_G(i|\pi_i) = \mathbb{E}\left[C_G^i \mid \pi_i\right] = \mathbb{E}\left[\sum_{t=1}^{T_i} c_t \,\middle|\, \pi_i\right] \tag{1}$$
where $v_t$ and $c_t$ represent the value (i.e., the revenue) and cost obtained from each request according to policy $\pi_i$, $V_G^i$ and $C_G^i$ represent the long-term cumulative value and cumulative cost, and $T_i$ is the length of the interaction sequence between user $i$ and the current ad.
Given the above definitions, for an advertiser, our target is to maximize its long-term cumulative revenue over $T$ days under a budget constraint $B$, which is formulated as:
$$\max_{\pi} \max_{X} \sum_{i=1}^{N} x_i \, V_G(i|\pi_i) \quad \text{s.t.} \quad \sum_{i=1}^{N} x_i \, C_G(i|\pi_i) \le B \tag{2}$$
where $\pi = \{\pi_1, \ldots, \pi_N\}$ collects the advertising policies of all users, $X = \{x_1, \ldots, x_N\}$, and $x_i \in \{0, 1\}$ indicates whether user $i$ is selected. Since whether an ad is displayed to user $i$ does not have any impact on the behaviors of another user $j$, the policy $\pi_i$, the long-term value $V_G(i|\pi_i)$ and the long-term cost $C_G(i|\pi_i)$ of different users are independent. Thus, given any fixed advertising policy $\pi_i$, $V_G(i|\pi_i)$ and $C_G(i|\pi_i)$ for each user are fixed, and the inner optimization of Equation (2) can be viewed as a classic knapsack problem. The items to be put into the knapsack are the users. However, different advertising policies lead to different values and costs for each user, so we define Equation (2) as a Dynamic Knapsack Problem, in which the value and cost of each item in the knapsack are dynamic. From the perspective of optimization, Formulation (2) is a typical bilevel optimization, where the optimization of $X$ is embedded (nested) within the optimization of $\pi$. This bilevel optimization is challenging for the following reasons:

The optimization space of the joint policy $\pi$ is continuous (because the bid space is continuous). The optimization space of the user selection $X$ is discrete and grows exponentially in the number of users (hundreds of millions). Therefore, the combined solution space of $\pi$ and $X$ is enormous and is difficult or even impossible to optimize directly.

The long-term value $V_G(i|\pi_i)$ and cost $C_G(i|\pi_i)$ of each user are unknown and variable, so efficient approaches are required to estimate them online from limited samples.
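To make the size of this search space concrete, the following minimal Python sketch solves a toy instance of the Dynamic Knapsack Problem by exhaustive enumeration. The toy data, the finite per-user policy sets and the function name `brute_force_dynamic_knapsack` are illustrative assumptions rather than part of the formulation above; real bid policies are continuous, and the number of users makes such enumeration infeasible.

```python
from itertools import product

# Hypothetical toy data: for each user, a list of (value, cost) pairs,
# one pair per candidate advertising policy.
USERS = [
    [(4.0, 1.0), (7.0, 2.0)],
    [(3.0, 1.0), (5.0, 3.0)],
    [(2.0, 2.0), (6.0, 4.0)],
]


def brute_force_dynamic_knapsack(users, budget):
    """Enumerate every joint (policy, selection) plan and keep the best feasible one."""
    best_value, best_plan = 0.0, None
    # A choice of -1 means "user not selected"; any other choice indexes a policy.
    choices = [range(-1, len(policies)) for policies in users]
    for plan in product(*choices):
        picked = [users[i][p] for i, p in enumerate(plan) if p >= 0]
        value = sum(v for v, _ in picked)
        cost = sum(c for _, c in picked)
        if cost <= budget and value > best_value:
            best_value, best_plan = value, plan
    return best_value, best_plan


best_value, best_plan = brute_force_dynamic_knapsack(USERS, budget=4.0)
```

With three users and two candidate policies each, the sketch already visits 27 joint plans; the count multiplies with every additional user, which is why direct optimization is impractical at the scale described above.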
3 Methodology: MSBCB Framework
3.1 Bilevel Decomposition and Proof of Correctness
Based on the above analysis, the bilevel optimization (2) is computationally prohibitive and cannot be solved directly. In this paper, we first decompose it into an equivalent twolevel sequential optimization process. When taking a fixed policy as input, we denote the optimal solution of the degraded and static Knapsack Problem as . Further, the global optimal solution of Problem (2) could be defined as:
(3) 
where are independent variables and is the global optimal solution. To obtain , we must firstly specify the form of the function .
When taking a fixed policy as input, computing the optimal user selection is a classic static knapsack problem. However, another challenge in online advertising is that user requests arrive sequentially in real time, so real-time decisions are required. Complicated algorithms (e.g., dynamic programming) are not applicable because the values and costs of all users are not completely known in advance.
In contrast, the Greedy algorithm can compute a greedy solution without completely knowing the whole set of candidate users beforehand; we discuss this later. Besides, the Greedy algorithm achieves a nearly optimal solution in online advertising (Zhang et al., 2014; Wu et al., 2018). As proved by Dantzig (1957), if $C_G(i|\pi_i) \le (1-\lambda)B$ for every user $i$, with $0 \le \lambda \le 1$, i.e., the cumulative cost for each user is much less than the budget, the Greedy algorithm achieves an approximation ratio of $\lambda$, which means the value of the greedy solution is at least $\lambda$ times that of the optimal solution. The closer $\lambda$ gets to 1, the higher the quality of the greedy solution. In online advertising, $\lambda$ is usually greater than 99.9%. Thus, the greedy solution is approximately optimal. We provide the detailed data and proof in Section B.1 of the Appendix. Therefore, in this paper, we adopt the Greedy algorithm to solve the static knapsack problem.
We define the ratio of a user’s long-term value to long-term cost, $CPR_i = V_G(i|\pi_i) / C_G(i|\pi_i)$, as the Cost-Performance Ratio (CPR) of each user $i$. The greedy solution is computed by:

Sorting all users according to their Cost-Performance Ratio in descending order;

Picking users from top to bottom until the cumulative cost would violate the budget constraint.
An illustration is shown in Figure 2. In this example, the budget constraint . We denote the of the last picked user as , the threshold of the costperformance ratio. In this example, the . The advantage is that the Greedy algorithm only selects users whose . If we could estimate the beforehand, the Greedy algorithm could compute the solution online, without completely knowing the values and costs of all users.
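The two greedy steps and the role of the threshold can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `greedy_select`, the toy values and costs, and the choice to stop just before the budget would be exceeded are ours for illustration, and every cost is assumed to be positive.

```python
def greedy_select(users, budget):
    """Pick users in descending CPR order while the budget allows.

    users: dict mapping a user id to (long-term value, long-term cost), cost > 0.
    Returns the selected ids and the CPR of the last selected user (the threshold).
    """
    ranked = sorted(users, key=lambda u: users[u][0] / users[u][1], reverse=True)
    selected, spent, threshold = [], 0.0, None
    for user in ranked:
        value, cost = users[user]
        if spent + cost > budget:  # the next user would break the budget
            break
        selected.append(user)
        spent += cost
        threshold = value / cost
    return selected, threshold


# Hypothetical toy data: user id -> (value, cost).
toy_users = {"a": (8.0, 2.0), "b": (9.0, 3.0), "c": (4.0, 2.0), "d": (3.0, 3.0)}
selected, cpr_thr = greedy_select(toy_users, budget=7.0)

# Once the threshold is known, the same set can be recovered user by user,
# without sorting the whole population first.
online = [u for u, (v, c) in toy_users.items() if v / c >= cpr_thr]
```

The last line illustrates why an estimate of the threshold is enough for online use: each user can be accepted or rejected independently as soon as its own ratio is known.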
Since we adopt the Greedy algorithm, which prefers users with larger CPR (it only picks users whose CPR is above the threshold), according to Equation 3, an intuitive way to further improve the solution quality is to optimize the policy of each user such that each user’s CPR is maximized. However, this intuition is incorrect. Maximizing the CPR of each user cannot guarantee that the greedy solution is maximized. Next, we show that even when all users’ CPRs are maximized, we can still improve the solution quality by increasing certain users’ allocated budgets and decreasing their CPRs in exchange for greater overall cumulative value. Before going into the details, we first give Lemma 1.
Lemma 1. For each user $i$, the cumulative value $V_G(i|\pi_i)$ increases monotonically with the cost $C_G(i|\pi_i)$ within the range of all possible optimal policies.
Proof. We assume that the maximum budget allocated to each user as , where is the maximum cost user can consume. Then, for each user , within the current budget constraint , the optimal advertising policy must be the one which could maximize the cumulative value, i.e., . Obviously, as moves from to , we will get a set of optimal policies , whose cost and value are both increasing. An illustration is shown in Figure 3. Thus we complete the proof.
As illustrated in Figure 4, each user’s (the width of each rectangular slice) is maximized initially. According to Lemma 1, for a user , if we increase by , i.e., increase the height of user by , the corresponding will also increase. We denote this increase in value as . Since there is a budget limit, a small increased height will squeeze out a small area nearby the , whose height is also and width is ^{1}^{1}1Since , the area squeezed out could be considered as a tiny and smooth change and the width of the last user is approximately equal to . We denote the increased area by reshaping user as and the decreased area due to extrusion as . Overall, if , the total area will be further increased. For any user , yields:
(4) 
where and are caused by the change of , e.g., from to . We denote as and as . We conclude that the greedy solution can be further improved if there exists any user whose current policy can be further improved to such that . Otherwise, the current solution is optimal. Finally, we provide the definition of the optimal in Theorem 1.
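Theorem 1. Under the Greedy paradigm, for any given threshold $CPR_{thr}$, the optimal advertising policy for each user $i$ is the one that maximizes $V_G(i|\pi_i) - CPR_{thr} \cdot C_G(i|\pi_i)$. In other words, $\pi_i^*$ is defined as:
$$\pi_i^* = \arg\max_{\pi_i} \left[ V_G(i|\pi_i) - CPR_{thr} \cdot C_G(i|\pi_i) \right] \tag{5}$$
We denote $\pi^* = \{\pi_1^*, \ldots, \pi_N^*\}$. The corresponding greedy solution is the optimal Greedy solution of the Dynamic Knapsack Problem defined in Equation (2).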
Proof of Theorem 1. We define , where is defined according to Equation (5), . We prove Theorem 1 by contradiction. Given the threshold , we firstly assume that is not the optimal greedy solution of the Dynamic Knapsack Problem, which means we could at least find a user , whose policy could be further improved to policy such that the overall area is increased. This means we could find a better policy for user such that according to Equation (4), where and ( increases monotonically with the increase of according to Lemma 1). Further, yields:
(6)  
Equation (6) indicates that
which contradicts the definition of in Equation (5). Thus, the theorem statement is obtained.
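The gap between maximizing each user's CPR and applying the rule of Theorem 1 can be seen on a small example. The Python sketch below is illustrative only: the toy policies, the budget of 4 and the threshold of 2.5 are assumed values chosen by hand (in the framework the threshold is tuned by the higher level), and all function names are hypothetical.

```python
# Hypothetical toy data: per user, candidate policies as (value, cost) pairs.
USERS = [
    [(4.0, 1.0), (7.0, 2.0)],
    [(3.0, 1.0), (5.0, 3.0)],
    [(2.0, 2.0), (6.0, 4.0)],
]


def greedy_total(items, budget):
    """Total value of the greedy solution over (value, cost) items."""
    total_value, spent = 0.0, 0.0
    for value, cost in sorted(items, key=lambda vc: vc[0] / vc[1], reverse=True):
        if spent + cost > budget:
            break
        total_value += value
        spent += cost
    return total_value


def max_cpr_policies(users):
    """Baseline: each user keeps the policy with the highest value/cost ratio."""
    return [max(policies, key=lambda vc: vc[0] / vc[1]) for policies in users]


def threshold_policies(users, cpr_thr):
    """Theorem 1 rule: each user keeps the policy maximizing value - cpr_thr * cost."""
    return [max(policies, key=lambda vc: vc[0] - cpr_thr * vc[1]) for policies in users]


def threshold_solution(users, cpr_thr):
    """Select users whose chosen policy has CPR >= cpr_thr; return (value, cost)."""
    chosen = [vc for vc in threshold_policies(users, cpr_thr) if vc[0] >= cpr_thr * vc[1]]
    return sum(v for v, _ in chosen), sum(c for _, c in chosen)


baseline_value = greedy_total(max_cpr_policies(USERS), budget=4.0)
theorem_value, theorem_cost = threshold_solution(USERS, cpr_thr=2.5)
```

In this toy instance the max-CPR baseline leaves part of the budget unused, while the threshold rule moves the first user to a policy with a lower ratio but a larger value and still stays within the budget.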
We present the overall MSBCB framework in Algorithm 1, which involves a two-level sequential optimization process. (1) Lower-level: Given any threshold $CPR_{thr}$, we obtain the optimal advertising policy $\pi^*$ following Equation 5 of Theorem 1, which is discussed in Section 3.2. Then, based on $CPR_{thr}$ and the optimized $\pi^*$, we acquire the Greedy solution by selecting users whose CPR exceeds $CPR_{thr}$, which is detailed in Section 3.3. (2) Higher-level: However, the current $CPR_{thr}$ might deviate from the optimal threshold $CPR_{thr}^*$, which means that selecting all users whose CPR exceeds $CPR_{thr}$ might violate the budget constraint or lead to a substantial budget surplus. Thus, we optimize the current $CPR_{thr}$ towards $CPR_{thr}^*$ in Section 3.4. Overall, the optimization space of the user selection is reduced from an exponentially large discrete space to a one-dimensional continuous variable $CPR_{thr}$. We conclude that Algorithm 1 iteratively converges to a unique and approximately optimal solution. We present the proof of convergence in Section B.3 of the Appendix.
3.2 Lower-level Advertising Policy Optimization with Reinforcement Learning
Given a threshold $CPR_{thr}$ as input, we aim to acquire the optimal advertising policy $\pi_i^*$ defined in Equation (5) of Theorem 1. Combining the definitions of the long-term value $V_G(i|\pi_i)$ and cost $C_G(i|\pi_i)$ with Equation (5), we have
$$\pi_i^* = \arg\max_{\pi_i} \mathbb{E}\left[\sum_{t=1}^{T_i} \left(v_t - CPR_{thr} \cdot c_t\right) \,\middle|\, \pi_i\right] \tag{7}$$
Accordingly, we define $r_t$, i.e., $v_t - CPR_{thr} \cdot c_t$, as the immediate profit acquired at each step $t$. The objective of Equation (7) is to obtain the optimal advertising policy that maximizes the expected long-term cumulative profit. To solve this sequential decision-making problem, we formulate it as an MDP and use Reinforcement Learning (RL) (Sutton & Barto, 2018) techniques to acquire the optimal policy $\pi_i^*$.
We consider an episodic MDP, where an episode starts with the first interaction between a user and an ad, and ends with a purchase or when the maximum number of steps is exceeded. The MDP is defined as follows:

State $s$: The state should in principle reflect the user request status, ad info, user-ad interaction history info and the RTB environment.

Action $a$: The action each agent can take in the RTB platform is the bid, which is a real number between 0 and an upper bound $bid_{max}$, i.e., $a \in [0, bid_{max}]$.

Reward $r$: The immediate reward at step $t$ is defined as $r_t = v_t - CPR_{thr} \cdot c_t$.

Transition probability $P$: The transition probability is defined as the probability of the state transitioning from $s$ to $s'$ when taking action $a$.

Discount factor $\gamma$: The bidding agent aims to maximize the total discounted reward from step $t$ onwards, where $\gamma \in [0, 1]$.
For each user $i$, we define the state-action value function $Q(s, a)$ as the expected cumulative reward achieved by following the advertising policy $\pi_i$. The MDP can be solved using existing Deep Reinforcement Learning (DRL) algorithms such as DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015) and PPO (Schulman et al., 2017). After sufficient training, we acquire the optimized advertising policies for all users.
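The reward shaping used by the lower level can be illustrated with a short Python sketch. It computes the per-step profit (value minus threshold times cost) and the discounted return of one hypothetical episode; the function names and the numbers are assumptions for illustration and are independent of any particular DRL algorithm.

```python
def profit_rewards(values, costs, cpr_thr):
    """Per-step reward: value minus cpr_thr times cost, for one interaction sequence."""
    return [v - cpr_thr * c for v, c in zip(values, costs)]


def discounted_return(rewards, gamma):
    """Discounted sum of rewards, accumulated backwards from the last step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: two paid exposures without revenue, then a conversion.
values = [0.0, 0.0, 10.0]
costs = [1.0, 1.0, 1.0]
rewards = profit_rewards(values, costs, cpr_thr=2.0)
episode_return = discounted_return(rewards, gamma=1.0)
```

The first two steps carry negative rewards, so a purely single-step view would reject them, whereas the episode as a whole is profitable at this threshold; this is the long-term effect the MDP formulation is meant to capture.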
3.3 Lower-level User Selection by Greedy Algorithm
Taking the current threshold $CPR_{thr}$ and the optimized advertising policies as inputs, we aim to obtain the greedy solution of the Dynamic Knapsack Problem. In reality, we cannot know all users’ request sequences and their values and costs beforehand because user requests arrive sequentially in real time. Thus, many complicated methods that depend on complete data for all users, e.g., the dynamic programming approach (Martello et al., 1999), are not applicable. Even the traditional Greedy algorithm cannot be applied directly. Fortunately, the greedy solution can be computed online in a simple way: given the threshold $CPR_{thr}$, the agent only has to select, online, the users whose CPRs are greater than the threshold (an illustration is shown in Figure 2). Therefore, we only have to estimate the CPR, i.e., the ratio of the long-term value to the long-term cost, for each user $i$. To acquire the long-term value and cost, besides $Q(s, a)$, we also maintain two other state value functions, one for the cumulative value and one for the cumulative cost, according to the Bellman Equation (Sutton & Barto, 2018).
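The online selection rule can be sketched as a simple streaming filter in Python. This is a minimal sketch: the function name `select_online` and the input format are hypothetical, and the value and cost estimates are assumed to be supplied by the two learned value functions mentioned above.

```python
def select_online(arrivals, cpr_thr):
    """Yield user ids whose estimated CPR exceeds cpr_thr, in arrival order.

    arrivals: iterable of (user_id, estimated value, estimated cost) tuples.
    """
    for user_id, value_estimate, cost_estimate in arrivals:
        # Cross-multiplied form of value / cost > cpr_thr; valid for cost >= 0.
        if value_estimate > cpr_thr * cost_estimate:
            yield user_id


# Hypothetical stream of users with estimated long-term value and cost.
stream = [("u1", 6.0, 2.0), ("u2", 3.0, 3.0), ("u3", 5.0, 2.0)]
targeted = list(select_online(stream, cpr_thr=2.0))
```

Each decision uses only the arriving user's own estimates and the current threshold, so no knowledge of future arrivals is needed.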
3.4 Higher-level Optimization by Feedback Control
However, the current estimated threshold $CPR_{thr}$ might deviate from the optimal threshold $CPR_{thr}^*$. Thus, selecting all users whose CPR exceeds $CPR_{thr}$ might violate the budget constraint or lead to a substantial budget surplus. Only when the estimated $CPR_{thr}$ is exactly equal to the optimal $CPR_{thr}^*$ will the actual total advertising cost equal the budget. To achieve this, we design a feedback control mechanism, i.e., a PID controller (Åström & Hägglund, 1995), to dynamically adjust $CPR_{thr}$ towards $CPR_{thr}^*$ according to the actual feedback of the overall cost. The core formula is:
(8) 
where is the actual feedback cost of the current period, is the budget, and are the overall cost and the overall budget of the most recent periods. and are two learning rates. The main idea is when the actual cost exceeds (is less than) the budget, the threshold will be increased (decreased) accordingly such that less (more) users will be selected, which will reduce (increase) the cost in turn. The first term is designed to keep up with the latest changes. The second term is designed to stabilize learning.
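The direction of this feedback adjustment can be sketched in Python. The exact form of Equation (8) is not reproduced here, so the multiplicative update below, the function name `update_threshold` and the default learning rates are assumptions made purely for illustration; the sketch only preserves the described behavior of one term that tracks the current period and one term that tracks the most recent periods.

```python
def update_threshold(cpr_thr, cost_now, budget_now, cost_recent, budget_recent,
                     rate_now=0.5, rate_recent=0.25):
    """Raise the threshold when spending exceeds the budget, lower it otherwise.

    Assumed update rule (not the paper's exact Equation (8)):
    cpr_thr * (1 + rate_now * overshoot_now + rate_recent * overshoot_recent).
    """
    overshoot_now = cost_now / budget_now - 1.0            # current period
    overshoot_recent = cost_recent / budget_recent - 1.0   # most recent periods
    return cpr_thr * (1.0 + rate_now * overshoot_now + rate_recent * overshoot_recent)


raised = update_threshold(2.0, cost_now=120.0, budget_now=100.0,
                          cost_recent=330.0, budget_recent=300.0)
lowered = update_threshold(2.0, cost_now=80.0, budget_now=100.0,
                           cost_recent=270.0, budget_recent=300.0)
```

Overspending raises the threshold so that fewer users qualify, and underspending lowers it so that more users qualify, which matches the behavior described for the two terms.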
3.5 Action Space Reduction for RL in Advertising
However, when applying the RL approaches mentioned in Section 3.2 to online advertising, one typical issue is that sample utilization is inefficient. The main reason is that the action space of the agent is continuous, so the whole bid range needs to be fully explored in all states. To resolve this problem, we reduce the continuous action space (i.e., the bid range) to a binary one (i.e., $a \in \{0, 1\}$) by making full use of prior knowledge in advertising, which greatly improves the sample utilization of the RL approaches. Specifically, since different bids can only result in two different outcomes $a \in \{0, 1\}$, where $a = 1$ or $0$ indicates whether the ad is displayed to the user, we only have to evaluate the expected returns resulting from $a = 0$ or $a = 1$ in each state. We denote the greedy action based on the current value estimations as:
$$a^* = \arg\max_{a \in \{0, 1\}} Q(s, a) \tag{9}$$
Then, to obtain an executable bid, for $a^* = 0$, we could offer a low enough bid to make sure that it is impossible to win the auction. For $a^* = 1$, we propose an optimal bid function that outputs a bid greater than the second-highest bid while not overbidding.
In detail, we maintain two stateaction value functions and . Since the reward function is defined as , we have . Then yields:
(10)  
If , the expected immediate cost is 0 (since the ad is not exposed). If , we denote the expected immediate cost as , whose value depends on the pricing model. In online advertising, typical pricing models includes CPM (Cost Per Mille, the advertiser bid for impressions and is charged based on impressions), CPC (Cost Per Click, the advertiser bid for clicks and is charged based on clicks) and CPS (Cost Per Sales, the advertiser bid for conversions and is charged based on conversions). If CPM is used, , where denotes the second highest bid in the auction. If CPC is used, , where pCTR represents the predicted ClickThrough Rate. If CPS is used, , where pCVR represents the predicted Conversion Rate. For ease of presentation, we take CPM for an example. Under CPM,
(11)  
Notice that the second highest bid is unknown until the current auction is finished. Substituting Equation (11) into Equation (10), we acquire
(12)  
where . We denote the term on the right of the ’’ in Equation (12) as . And we conclude that the bidding agent can always set the bid price during the online bidding phase, which is the optimal action without any loss of accuracy. Refer to Section B.2 of the Appendix for proof. For CPC or CPS, the optimal bid formula can be easily acquired by substituting the corresponding into Equation 11. Here, we reaffirm that our action space reduction technique is a generalized design and is applicable to different pricing models.
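The idea behind the bid function under CPM can be sketched in Python. This is a hedged reconstruction under our own assumptions rather than the paper's exact Equation (12): the profit value is taken to be the value estimate minus the threshold times the cost estimate, the cost of showing the ad is split into the second-highest price plus an expected future cost, any discounting is assumed to be folded into the estimates, and the threshold is assumed to be positive. All function and argument names are hypothetical.

```python
def q_profit(q_value, q_cost, cpr_thr):
    """Profit value: value estimate minus cpr_thr times cost estimate."""
    return q_value - cpr_thr * q_cost


def optimal_bid_cpm(qv_show, qv_skip, qc_after_show, qc_skip, cpr_thr):
    """Largest second price at which showing the ad is still preferred (CPM).

    qc_after_show is the expected cost after this request if the ad is shown,
    i.e. the cost estimate for showing minus the price of the current impression.
    """
    return (qv_show - qv_skip) / cpr_thr - (qc_after_show - qc_skip)


def prefers_show(second_price, qv_show, qv_skip, qc_after_show, qc_skip, cpr_thr):
    """Check directly whether showing beats skipping once the second price is known."""
    q_show = q_profit(qv_show, second_price + qc_after_show, cpr_thr)
    q_skip = q_profit(qv_skip, qc_skip, cpr_thr)
    return q_show > q_skip


# Hypothetical estimates for one state.
estimates = dict(qv_show=10.0, qv_skip=6.0, qc_after_show=3.0, qc_skip=2.5, cpr_thr=2.0)
bid = optimal_bid_cpm(**estimates)
```

Because a second-price auction is won exactly when the bid exceeds the second-highest price, bidding this value wins in the cases where showing the ad is preferred and loses otherwise, even though the second-highest price is unknown at bidding time.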
4 Empirical Evaluation: Simulations
We start with designing simulation experiments to shed light on the contributions of the proposed framework MSBCB under more controlled settings. Similar to the simulation settings of (Ie et al., 2019), we assume there are a set of users , a set of ads and a set of commodity categories . Each ad has an associated category. Each user has various degrees of interests in commodity categories, which is influenced by the displayed ad. When user consumes ad , his interest in category is nudged stochastically, biased slightly towards increasing his interest, but allows some chance of decreasing his interest. We set , and in the following experiments. Detailed settings of the simulation environment can be found in Section D of the Appendix.
4.1 Baselines
We compare our MSBCB with the following baseline strategies:

Myopic Approaches: (1) Manual Bid is a strategy in which the agent continuously bids at the same price, initialized by the advertiser. (2) Contextual Bandit (Zhang et al., 2014) aims at maximizing the accumulated short-term value of each request based on the Greedy framework.

Greedy with maximized CPR: This approach is similar to our method under the Greedy framework, except that each user’s policy is optimized by maximizing the long-term CPR. In the offline simulation, we enumerate all policies for each user and select the one that maximizes its CPR. This approach is named Greedy+maxCPR.

Greedy with state-of-the-art RL approaches: These baselines, i.e., Greedy+DQN, Greedy+DDPG and Greedy+PPO, use the same reward function as our MSBCB for the lower-level policy optimization. The difference is that our MSBCB leverages the action space reduction technique. For DQN and PPO, we discretize the bid action space evenly into 11 real numbers as the valid actions.

Undecomposed Optimization: These baselines are RL approaches (DQN, DDPG and PPO) based on the Constrained Markov Decision Process (CMDP). They are named Constrained+DQN, Constrained+DDPG and Constrained+PPO, respectively. We follow the CMDP design and settings in (Wu et al., 2018).

Offline Optimal: The optimal solution of the Dynamic Knapsack Problem can be computed by dynamic programming in offline simulation because we can enumerate all possible policies to get the corresponding long-term values and costs for each user. Note that since users’ request sequences are unknown beforehand and there is only one chance for the ad to bid for each request in online advertising systems, the optimal solution can only be obtained in offline simulation.
4.2 Experimental Results
We conduct an extensive analysis of our MSBCB in the following 5 aspects. All approaches aim to maximize the advertiser’s cumulative revenue under a fixed budget constraint. All experimental results are averaged over 10 runs. The hyperparameters for each algorithm are set to the best values we found by grid search.
Myopic vs Non-myopic. To show the benefits of upgrading the myopic advertising system into a farsighted one, we compare the cumulative revenue achieved by our MSBCB with that of the two myopic baselines. The learning curves and results are shown in Figure 5 and Table 1. We see that MSBCB outperforms Manual Bid and Contextual Bandit by a large margin, which indicates that taking into account the long-term effect of each ad exposure could significantly improve the cumulative advertising results.
MSBCB vs the Offline Optimal. In Figure 5, we also compare our MSBCB with the Offline Optimal, which is computed by a modified dynamic programming algorithm. We see that as the training continues, our MSBCB gradually achieves an approximately optimal solution. Detailed results are summarized in Table 1. Our MSBCB empirically achieves an approximation ratio of 98.53% (±0.36%).
MSBCB vs Greedy with maximized CPR. As discussed in Section 3.1, under the Greedy framework, maximizing each user’s CPR cannot guarantee that the greedy solution of the Dynamic Knapsack Problem (2) is maximized. The optimal advertising policy for each user is given by Theorem 1. To experimentally verify the correctness of Theorem 1, we compare the cumulative revenue achieved by MSBCB and Greedy with maximized CPR. As shown in Figure 6 and Table 1, MSBCB outperforms Greedy with maximized CPR (a revenue of 87947.99 versus 83668.70 in Table 1).
MSBCB vs Greedy with state-of-the-art RL approaches. Besides, to show the effectiveness of the action space reduction proposed in Section 3.5, we compare MSBCB with the state-of-the-art DRL approaches under the Greedy framework. As shown in Figure 6 and Table 1, MSBCB outperforms Greedy+DQN, Greedy+DDPG and Greedy+PPO in both cumulative revenue and convergence speed, which shows that the action space reduction effectively improves the sample efficiency of RL approaches.
Decomposed MSBCB vs Undecomposed optimization. Similar to (Wu et al., 2018), the undecomposed optimization baselines consider all users’ requests as a whole and model the budget allocation among all requests as a CMDP. As shown in Figure 7 and Table 1, MSBCB outperforms the CMDP-based RL approaches by a large margin. The reason for the poor performance of the CMDP-based approaches is that they model all users’ requests as a single sequence, so the learning process is particularly inefficient. In contrast, our MSBCB decomposes the whole-sequence optimization into an efficient two-level optimization process and can thus achieve better performance more easily.
Table 1.

| Method | Revenue | Cost | Revenue Improvement | Approximation Ratio |
|---|---|---|---|---|
| Manual Bid | 38838.28 | 11995.10 | -48.31% | 43.5% |
| Contextual Bandit | 75137.30 | 11995.46 | 0% | 84.15% |
| Constrained+PPO | 61890.92 | 11954.07 | -17.63±16.11% | 69.31±13.56% |
| Constrained+DDPG | 74259.12 | 11996.12 | -1.19±3.66% | 83.17±3.08% |
| Constrained+DQN | 70662.65 | 11881.12 | -5.96±7.83% | 79.14±6.59% |
| Greedy+maxCPR | 83668.70 | 11914.12 | 11.35±2.84% | 93.70±2.36% |
| Greedy+PPO | 76970.35 | 11825.59 | 2.44±3.52% | 86.20±2.93% |
| Greedy+DDPG | 80424.69 | 11841.28 | 7.04±1.13% | 90.07±0.92% |
| Greedy+DQN | 84117.09 | 11794.24 | 11.95±4.96% | 94.21±4.14% |
| MSBCB | 87947.99 | 11957.57 | 17.95±0.42% | 98.50±0.33% |
| MSBCB (enum) | 89251.77 | 11988.36 | 18.78% | 99.96% |
| Offline Optimal | 89291.11 | 11999.23 | 18.84% | 100.00% |
The complete comparison of all approaches is shown in Table 1. The budget constraint is set to 12000 for all experiments. In Table 1, we also add MSBCB (enum), which is the theoretical upper bound of our MSBCB. The difference between MSBCB (enum) and MSBCB is that MSBCB (enum) computes the optimal advertising policy for each user by enumerating all possible policies: instead of utilizing the RL approach, it directly finds the policy that maximizes the objective in Equation (5). We see that MSBCB (enum) is very close to the optimal solution and reaches an approximation ratio of 99.96%.
4.3 Effectiveness of Action Space Reduction
As shown in Table 2, MSBCB reaches a revenue of 75000 in only 61 epochs, requiring more than 60% fewer samples than the state-of-the-art RL baselines that do not use the action space reduction technique. As for the learning process, our MSBCB reaches the same revenue (80000) more than 10 times faster than the baselines, requiring more than 90% fewer samples, and finally reaches the highest revenue. Thus, with the action space reduction technique, our MSBCB reaches a higher performance at a faster speed and significantly improves the sample efficiency. More analysis of our MSBCB (e.g., its convergence behavior) and the hyperparameter settings of the offline experiments are shown in Section D of the Appendix.

Table 2.

| Method | #Epoch (revenue 75000) | #Samples (revenue 75000) | #Epoch (revenue 80000) | #Samples (revenue 80000) | #Epoch (revenue 85000) | #Samples (revenue 85000) |
|---|---|---|---|---|---|---|
| Greedy+PPO | 817 | 4183040 | - | - | - | - |
| Greedy+DDPG | 154 | 788480 | 853 | 4362240 | - | - |
| Greedy+DQN | 373 | 1909760 | 754 | 3855360 | - | - |
| MSBCB | 61 | 312320 | 71 | 363520 | 104 | 532480 |
5 Empirical Evaluation: Online A/B Testing
We deployed MSBCB on one of the world’s largest E-commerce platforms, Taobao. Our platform is authorized by the advertisers to dynamically adjust their bid prices for each user request according to its value in the real-time auction. In the online experiments, we compare MSBCB with two models widely used in the industry.

Cross Entropy Method (CEM), which is a deployed production model, whose target is to optimize the immediate rewards. We consider CEM as the control group in the following evaluations.

Contextual Bandit, which has been explained in the previous section and is reserved as a contrast test.
The experiment involves 135,858,118 users and 72,147 ad items from 186 advertisers. For a fair comparison, we control the consumers and the advertisers involved in the A/B testing to be homogeneous. In detail, the 135,858,118 users are randomly and evenly divided into 3 groups. For users in group #1, all 186 advertisers adopt the CEM algorithm. For users in group #2, all 186 advertisers adopt the Contextual Bandit algorithm. For users in group #3, all 186 advertisers adopt our MSBCB. Table 3 summarises the effects of the Contextual Bandit and our MSBCB compared to the Cross Entropy Method from Dec. 10 to Dec. 20, 2019. From Table 3, we see that our MSBCB achieves a +10.08% improvement in revenue and a +10.31% improvement in ROI with almost the same cost (-0.20%). The results indicate that upgrading the myopic advertising strategy into a farsighted one could significantly improve the cumulative revenue. Besides, as shown in Figure 8, the daily ROI improvement also demonstrates the effectiveness of our MSBCB compared with the Contextual Bandit.
Table 3.

| Method | Revenue | Cost | CVR | #PV | ROI |
|---|---|---|---|---|---|
| Contextual Bandit | +0.91% | -3.26% | +4.78% | +4.62% | +4.31% |
| MSBCB | +10.08% | -0.20% | +6.04% | +15.37% | +10.31% |
Given that only 186 advertisers take part in our online experiment, one frequently asked question is “How does MSBCB work across all ads?” Since 186 is relatively small compared with the total number of advertisers, their policy updates would not cause dramatic changes to the RTB environment. In other words, the RTB environment is still approximately stationary from a single-ad perspective. This setting also works well with our practical business model, which provides better service for VIP advertisers (about 0.2% of all the advertisers). In the case that the majority of the advertisers adopt MSBCB, the system can no longer be treated as stationary from any single ad’s perspective, and explicit multi-agent modeling and coordination should be incorporated. Detailed analysis of the improvement in revenue for each advertiser is presented in Table 7 and Figure 19 of the Appendix. More details about the deployment and experimental results (e.g., the online model architecture) can also be found in Sections C and E of the Appendix.
6 Conclusion
We formulate the multi-channel sequential advertising problem as a Dynamic Knapsack Problem, whose target is to maximize the long-term cumulative revenue over a period of time under a budget constraint. We decompose the original problem into an easier bilevel optimization, which significantly reduces the solution space. For the lower-level optimization, we derive an optimal reward function with theoretical guarantees and design an action space reduction technique to improve the sample efficiency. Extensive offline experimental analysis and online A/B testing demonstrate the superior performance of our MSBCB over the state-of-the-art baselines in terms of cumulative revenue.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Grant Nos.: 61702362, U1836214), the Special Program of Artificial Intelligence and the Special Program of Artificial Intelligence of Tianjin Municipal Science and Technology Commission (No.: 569 17ZXRGGX00150) and the Alibaba Group through the Alibaba Innovative Research Program. We deeply appreciate all teammates from Alibaba Group for their significant support of the online experiments.
References
 Altman (1999) Altman, E. Constrained Markov decision processes, volume 7. CRC Press, 1999.
 Åström & Hägglund (1995) Åström, K. J. and Hägglund, T. PID controllers: theory, design, and tuning, volume 2. Instrument society of America Research Triangle Park, NC, 1995.
 Boutilier & Lu (2016) Boutilier, C. and Lu, T. Budget allocation using weakly coupled, constrained markov decision processes. 2016.
 Cai et al. (2017) Cai, H., Ren, K., Zhang, W., Malialis, K., Wang, J., Yu, Y., and Guo, D. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 661–670. ACM, 2017.
 Dantzig (1957) Dantzig, G. B. Discrete-variable extremum problems. Operations research, 5(2):266–288, 1957.
 Du et al. (2017) Du, M., Sassioui, R., Varisteas, G., Brorsson, M., Cherkaoui, O., et al. Improving real-time bidding using a constrained markov decision process. In International Conference on Advanced Data Mining and Applications, pp. 711–726. Springer, 2017.
 Du et al. (2019) Du, R., Zhong, Y., Nair, H., Cui, B., and Shou, R. Causally driven incremental multi touch attribution using a recurrent neural network. arXiv preprint arXiv:1902.00215, 2019.
 Edelman et al. (2007) Edelman, B., Ostrovsky, M., and Schwarz, M. Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords. American economic review, 97(1):242–259, 2007.
 Ie et al. (2019) Ie, E., Jain, V., Wang, J., Navrekar, S., Agarwal, R., Wu, R., Cheng, H.-T., Lustman, M., Gatto, V., Covington, P., et al. Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. arXiv preprint arXiv:1905.12767, 2019.
 Ji & Wang (2017) Ji, W. and Wang, X. Additional multi-touch attribution for online advertising. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 Jin et al. (2018) Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2193–2201. ACM, 2018.

 Li et al. (2018) Li, P., Hawbani, A., et al. An efficient budget allocation algorithm for multi-channel advertising. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 886–891. IEEE, 2018.
 Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 Martello et al. (1999) Martello, S., Pisinger, D., and Toth, P. Dynamic programming and strong bounds for the 0-1 knapsack problem. Management Science, 45(3):414–424, 1999.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 Nuara et al. (2019) Nuara, A., Sosio, N., Trovò, F., Zaccardi, M. C., Gatti, N., and Restelli, M. Dealing with interdependencies and uncertainty in multi-channel advertising campaigns optimization. In The World Wide Web Conference, pp. 1376–1386. ACM, 2019.
 Ren et al. (2017) Ren, K., Zhang, W., Chang, K., Rong, Y., Yu, Y., and Wang, J. Bidding machine: Learning to bid for directly optimizing profits in display advertising. IEEE Transactions on Knowledge and Data Engineering, 30(4):645–659, 2017.
 Ren et al. (2018) Ren, K., Fang, Y., Zhang, W., Liu, S., Li, J., Zhang, Y., Yu, Y., and Wang, J. Learning multi-touch conversion attribution with dual-attention mechanisms for online advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1433–1442. ACM, 2018.
 Ren et al. (2019) Ren, K., Qin, J., Zheng, L., Yang, Z., Zhang, W., and Yu, Y. Deep landscape forecasting for real-time bidding advertising. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 363–372. ACM, 2019.
 Roberge (2015) Roberge, M. The Sales Acceleration Formula: Using Data, Technology, and Inbound Selling to go from $0 to $100 Million. John Wiley & Sons, 2015.
 Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
 Wu et al. (2018) Wu, D., Chen, X., Yang, X., Wang, H., Tan, Q., Zhang, X., Xu, J., and Gai, K. Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1443–1451. ACM, 2018.
 Yuan et al. (2013) Yuan, S., Wang, J., and Zhao, X. Real-time bidding for online advertising: measurement and analysis. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising, pp. 1–8, 2013.
 Zhang et al. (2014) Zhang, W., Yuan, S., and Wang, J. Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1077–1086. ACM, 2014.
 Zhang et al. (2016) Zhang, W., Ren, K., and Wang, J. Optimal real-time bidding frameworks discussion. arXiv preprint arXiv:1602.01007, 2016.
 Zhou et al. (2018) Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068. ACM, 2018.
 Zhu et al. (2017) Zhu, H., Jin, J., Tan, C., Pan, F., Zeng, Y., Li, H., and Gai, K. Optimized cost per click in taobao display advertising. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2191–2200. ACM, 2017.
Appendix A Background of Online Advertising
Online advertising is a marketing strategy that uses an advertising platform as a medium to obtain website traffic and target customers, and to deliver the marketing messages of advertisers to suitable customers.
Platform. The advertising platform plays an important role in connecting consumers and advertisers. For consumers, it provides multiple advertising channels to explore, e.g., channels on news media, social media, and E-commerce websites and apps. For advertisers, it provides automated bidding strategies to compete for consumers in all channels under real-time bidding (RTB), in which advertisers bid for ad exposures and each exposure opportunity goes to the highest bidder at a cost equal to the second-highest bid in the auction.
Consumers. Consumers explore multiple channels over several visits to the platform within a couple of days. A consumer’s final purchase of an item is usually the result of a gradual process, which often includes the phases of Awareness, Interest, Desire, and Action (AIDA) (Roberge, 2015). The consumer’s decision to convert (purchase a product) is usually driven by multiple touchpoints (exposures) with ads. Each advertising exposure during these sequential interactions can influence the consumer’s mind (preferences and interests) and therefore contribute to the final conversion.
Advertisers. The goal of advertisers is to cultivate the consumer’s awareness and interest and finally to drive purchases. As different ad strategies can affect consumers’ AIDA, an advertiser should develop a competitive strategy to win ad exposures in the RTB setting. In the Cost Per Click (CPC) setting, when the ad is displayed to a consumer, the advertiser pays a commission to the platform after the consumer clicks the ad. When the consumer purchases the advertised item, the advertiser receives the corresponding revenue.
The objective of an advertiser is usually to optimize the accumulated revenue within a time period under a budget constraint. A strategy that maximizes the short-term revenue of each ad exposure on different channels independently is unreasonable, since the final purchase is the result of long-term sequential ad-consumer interactions and the consumer’s visits to different channels are interdependent. Therefore, the advertiser must develop a strategy to overcome the following two key challenges: (1) find the optimal interaction sequence, including the number of interactions, the channel selection and the channel order, for a targeted consumer; (2) choose targeted consumers and allocate the predefined limited budget among their interaction sequences.
An example of a user’s shopping journey is shown in Figure 9. At time $t_1$, a user visits the news media channel and triggers an advertising exposure opportunity. Then, the advertising agent executes a display action and leaves an exposure on the user. After that, the user becomes aware of and interested in the commodity, so he clicks the hyperlink and is taken to the landing (detail) page of the commodity in the shopping app. After fully understanding the product information, the user leaves the shopping app. After a period of time, the user comes back to the shopping app at time $t_2$ and triggers an exposure opportunity for banner advertising. The advertising agent executes a display action as well. Consequently, the user’s desire is stimulated. At time $t_3$, the user makes a purchase. In this example, the ad exposure at time $t_1$ influences the user’s mind and contributes to the ad exposure at time $t_2$ and to the delayed purchase, which means that an ad exposure on one channel influences the user’s preferences and interests and therefore contributes to the final conversion. Thus, the goal of advertising should be to maximize the total cumulative revenue over a period of time instead of simply maximizing the immediate revenue.
Appendix B Proof and Analysis
B.1 Knapsack Problem in Online Advertising Settings
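Theorem 2. The greedy solution to the proposed dynamic knapsack problem of online advertising is $\rho$-approximately optimal, where $\rho > 99.9\%$.
Proof. In the proposed online advertising problem, each user $i$ has a value $v_i$ (i.e., the profit of the advertiser when the user purchases the commodity) and a weight $w_i$ (i.e., the total budget consumed on the target user in real-time bidding to reach the final purchase). As an item (i.e., a user) is non-splittable, the proposed dynamic knapsack problem is essentially a 0-1 knapsack problem, which aims to maximize the total value of the knapsack given a fixed capacity $B$. For each item, we can calculate the Cost-Performance Ratio (CPR) as $CPR_i = v_i / w_i$. Sort all items in descending order of CPR, i.e., $CPR_1 \ge CPR_2 \ge \dots \ge CPR_N$, where $N$ is the number of items. For all $i$, $v_i \ge 0$ and $w_i > 0$. We denote the optimal solution of this 0-1 knapsack problem by $V_{opt}$ and the greedy solution by $V_{greedy}$, where $V_{opt}$ and $V_{greedy}$ represent the total value of the knapsack.
Assume $\hat{B} = B - \sum_{i=1}^{k} w_i$ is the remaining budget after the greedy algorithm, where $k$ is the index of the last item picked by the greedy algorithm. The following inequality holds:
(13)  $V_{opt} \le V_{greedy} + \hat{B} \cdot CPR_k$
This is because:
(1) If the knapsack can hold all the items after the greedy algorithm, the optimal solution is equal to the greedy solution. As $\hat{B} \ge 0$ and $CPR_k \ge 0$, we have $V_{opt} = V_{greedy} \le V_{greedy} + \hat{B} \cdot CPR_k$.
(2) If the knapsack cannot hold all the items after the greedy algorithm, as $CPR_{k+1} \le CPR_k$, we have $V_{opt} \le \sum_{i=1}^{k} v_i + \hat{B} \cdot CPR_{k+1} \le \sum_{i=1}^{k} v_i + \hat{B} \cdot CPR_k$, where the first inequality is the bound given by the fractional relaxation of the knapsack problem. This derivation can be simplified to $V_{opt} \le V_{greedy} + \hat{B} \cdot CPR_k$.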
In online advertising settings, the budget spent on a single user is much smaller than the advertiser’s total budget. We collect statistics from one of the world’s largest E-commerce platforms to verify this. On Feb. 3, 2020, a total of 1,136,149 ads resulted in 983,414,548 user-ad sequences (a user sequence consists of multiple interactions of the same user with the same ad), with an average of 865 user sequences per ad. The interactions of each ad with its users form a knapsack problem, where each user sequence is an item in the knapsack. On average, the maximum budget consumed by a single user sequence accounts for 0.07068% of the advertiser’s total budget capacity. We also list details of the 5 ads with the largest budget consumption in Table 4, where the maximum budget consumed by any user sequence is much smaller than 1/1000 (specifically, smaller than 3/10000) of the total budget of each ad.
Ad  #User Sequences  Budget  Avg Cost  (Avg Cost)/Budget  Max Cost  (Max Cost)/Budget 

Ad 1  2460976  119352.51  0.048498039  0.0000406343%  20.04  0.0167905979% 
Ad 2  2674738  114388.54  0.04276626  0.000037388%  26.22  0.0229218766% 
Ad 3  2848816  90113.08  0.031631766  0.0000351023%  15.29  0.0169675701% 
Ad 4  2107497  82951.82  0.03936035  0.0000474497%  5.6  0.0067509067% 
Ad 5  1087011  77140.49  0.070965694  0.0000919954%  19.32  0.0250452130% 
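The last column of Table 4 can be recomputed directly from the budget and maximum-cost columns. The short sketch below does so and checks the "smaller than 3/10000" statement; the dictionary `table4` and the variable names are ours and only restate the numbers listed in the table.

```python
# (budget, max cost of a single user sequence) per ad, copied from Table 4.
table4 = {
    "Ad 1": (119352.51, 20.04),
    "Ad 2": (114388.54, 26.22),
    "Ad 3": (90113.08, 15.29),
    "Ad 4": (82951.82, 5.6),
    "Ad 5": (77140.49, 19.32),
}

# Ratio of the most expensive user sequence to the total budget of each ad.
ratios = {ad: max_cost / budget for ad, (budget, max_cost) in table4.items()}

for ad, ratio in ratios.items():
    print(f"{ad}: (Max Cost)/Budget = {ratio:.10%}")

worst_ratio = max(ratios.values())
print(f"Largest ratio over the five ads: {worst_ratio:.6%}")
print("All below 3/10000:", worst_ratio < 3 / 10000)
```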
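As proposed in Dantzig (1957), if $w_i \le \epsilon B$ for every item $i$, where $w_i$ is the budget consumed by item (user sequence) $i$ and $B$ is the total budget, the greedy algorithm achieves an approximation guarantee of $1 - \epsilon$. We can conclude from the above statistics that $\epsilon < 0.1\%$, which means the approximation ratio $\rho = V_{greedy} / V_{opt}$ of the greedy solution value $V_{greedy}$ to the optimal solution value $V_{opt}$ is greater than $99.9\%$.
This claim can be proved as follows:
(1) If the knapsack can hold all the items after the greedy algorithm, the greedy solution is equal to the optimal solution, which is trivially approximately optimal.
(2) If the knapsack cannot hold all the items after the greedy algorithm, the next item $k+1$ in CPR order does not fit into the remaining budget $\hat{B}$, so we have $\hat{B} < w_{k+1}$, that is, $\hat{B} < \epsilon B$. Moreover, every picked item has a CPR of at least $CPR_k$, so $V_{greedy} \ge CPR_k \cdot (B - \hat{B})$. According to Formula 13, we have
(14)  $\rho = \frac{V_{greedy}}{V_{opt}} \ge \frac{V_{greedy}}{V_{greedy} + \hat{B} \cdot CPR_k} \ge \frac{B - \hat{B}}{B} = 1 - \frac{\hat{B}}{B} > 1 - \epsilon$
Therefore, in theory, the greedy solution in our online advertising settings is approximately optimal, and $\rho$ is greater than 99.9% in our case.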
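The guarantee can be illustrated numerically on a toy instance. The sketch below is a minimal illustration, not the production implementation: it sorts items by value/weight ratio (CPR), picks them greedily until the next item no longer fits, and compares the result with a brute-force optimum and with the bound $1 - \max_i w_i / B$. The item list, the budget and the function names are hypothetical.

```python
from itertools import combinations


def greedy_by_cpr(items, budget):
    """Pick items in descending value/weight order until the next one does not fit."""
    order = sorted(items, key=lambda item: item[0] / item[1], reverse=True)
    total_value, remaining = 0.0, budget
    for value, weight in order:
        if weight > remaining:
            break  # stop at the first item that exceeds the remaining budget
        total_value += value
        remaining -= weight
    return total_value


def brute_force_optimum(items, budget):
    """Exact 0-1 knapsack optimum by enumerating all subsets (toy sizes only)."""
    best = 0.0
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            if sum(weight for _, weight in subset) <= budget:
                best = max(best, sum(value for value, _ in subset))
    return best


# Hypothetical (value, weight) pairs; each weight plays the role of a user's cost.
items = [(6.0, 2.0), (5.0, 2.0), (4.0, 2.0), (3.0, 2.0),
         (2.5, 2.0), (7.0, 5.0), (1.0, 1.0), (0.5, 1.0)]
budget = 10.0

greedy_value = greedy_by_cpr(items, budget)
optimal_value = brute_force_optimum(items, budget)
bound = 1.0 - max(weight for _, weight in items) / budget

print(f"greedy = {greedy_value}, optimal = {optimal_value}")
print(f"ratio  = {greedy_value / optimal_value:.4f} >= bound = {bound:.4f}")
```

In this toy instance the largest item takes half of the budget, so the bound is loose. In the advertising setting described above, where a single user sequence consumes less than 3/10000 of the budget, the same bound is above 99.9%.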
B.2 Regretless Optimal Bidding Strategy
Theorem 3. During the online bidding phase, the bidding agent can always set the bid price as:
(15)  
where . is a regretless optimal bidding strategy without any loss of accuracy.
Proof. Since is unknown until the current auction is finished, we prove the regretless of from the following two cases:

If : , which means the agent should take action in this case. Exactly, is greater than the second highest price based on the condition for entering the current branch. Thus, the agent will always win the auction and the executed action is indeed .

If : , which means the agent should take action in this case. Exactly, is less than the second highest price according to the condition. Thus, the agent will always lose the auction and the executed action is indeed .
Thus, we complete the proof.
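The two cases of the proof can be mimicked with a small sketch of a second-price auction. It assumes a hypothetical break-even price `break_even`, i.e., the highest price at which displaying the ad is still preferable to skipping; this quantity only stands in for the bid of Equation 15, whose exact form is not reproduced here. Ties between the bid and the second-highest price are ignored.

```python
def preferred_action(break_even: float, second_price: float) -> int:
    """1 (display) if paying the second-highest price is still worthwhile, else 0 (skip)."""
    return 1 if second_price < break_even else 0


def executed_action(bid: float, second_price: float) -> int:
    """Second-price auction outcome: the ad is displayed only if the bid wins."""
    return 1 if bid > second_price else 0


break_even = 2.5  # hypothetical value, for illustration only
second_prices = [0.5, 1.0, 2.4, 2.6, 4.0]  # unknown to the bidder before the auction

# Bidding the break-even price yields the preferred action for every second price.
no_regret = all(
    executed_action(break_even, price) == preferred_action(break_even, price)
    for price in second_prices
)

# Bidding below the break-even price can lose an auction that was worth winning.
underbid_regret = executed_action(2.0, 2.4) != preferred_action(break_even, 2.4)

print("bid = break-even always matches the preferred action:", no_regret)
print("bid below break-even can cause regret:", underbid_regret)
```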
B.3 Convergence Analysis of MSBCB
The overall framework of MSBCB can be described as follows:

(1) Let the budget constraint of an advertiser be $B$. Given a CPR threshold $CPR_{thr}$, we can use reinforcement learning algorithms to ensure that the policy $\pi_i$ of each user $i$ is optimized for this threshold and converges to the optimal policy $\pi_i^*$ under the current $CPR_{thr}$. Further, picking all users whose $CPR_i \ge CPR_{thr}$ will result in a total cost of $C$ (i.e., the advertiser spends a budget $C$).
(2) As the current estimated threshold $CPR_{thr}$ might deviate from the optimal $CPR_{thr}^*$, $C$ may not be equal to the budget $B$. Thus, we design a PID controller to dynamically adjust the estimated $CPR_{thr}$ so as to minimize the gap between the budget constraint $B$ and the actual feedback of the daily cost $C$.
As described in Figure 10, MSBCB repeats the above two steps iteratively. Given an updated $CPR_{thr}$, each $\pi_i$ will be optimized by the lower-level reinforcement learning algorithms and will move towards the optimal $\pi_i^*$. As a result, users whose optimized $CPR_i \ge CPR_{thr}$ will be selected and we get the daily cost $C$. Then, the current $CPR_{thr}$ will be updated so that the gap between the cost $C$ and the budget $B$ is further reduced. Thus, $CPR_{thr}$ will move towards the optimal $CPR_{thr}^*$ gradually. As long as the learning rates of $CPR_{thr}$ and $\pi_i$ are small enough, the overall iterations will finally converge. In this paper, we also validate the convergence of our MSBCB in the experiments. As shown in Section 4.2 of the paper, our method converges quickly and finally reaches an approximation ratio of 98.53%.
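The upper-level loop can be sketched as follows. This is a simplified illustration under strong assumptions: the per-user CPR values and costs are held fixed (i.e., the lower-level policies are not re-optimized between iterations), the user population is synthetic, and the PID gains are hypothetical values chosen for this toy example. The class and function names are ours.

```python
class PIDController:
    """Textbook discrete PID controller (positional form)."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float) -> float:
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def daily_cost(cprs, costs, threshold):
    """Total cost of all users whose CPR is at least the threshold."""
    return sum(cost for cpr, cost in zip(cprs, costs) if cpr >= threshold)


# Synthetic users: CPR values 0.01, 0.02, ..., 10.00, each costing 1 unit.
cprs = [i / 100 for i in range(1, 1001)]
costs = [1.0] * len(cprs)
budget = 300.0

initial_threshold = 5.0
threshold = initial_threshold
pid = PIDController(kp=0.3, ki=1.5, kd=0.1)  # hypothetical gains

for day in range(200):
    cost = daily_cost(cprs, costs, threshold)
    error = (cost - budget) / budget  # positive when overspending
    # Overspending raises the threshold so that fewer users are selected.
    threshold = max(0.0, initial_threshold + pid.step(error))

final_cost = daily_cost(cprs, costs, threshold)
print(f"threshold = {threshold:.3f}, daily cost = {final_cost}, budget = {budget}")
```

In the full method the CPR of each user also changes as the lower-level policy is re-optimized for the new threshold, which is why small learning rates on both levels are required for convergence.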
Appendix C Deployment
Here we give the online deployment details of our MSBCB.
C.1 Myopic to Non-Myopic Advertising System Upgrade Solution
A myopic advertising system includes several key components, as Figure 11 shows: (1) The Log module collects auction information and user feedback. (2) Training data are constructed from the logs, followed by model training with offline evaluation. (3) The real-time prediction (RTP) module provides the myopic value prediction service for user-ad pairs and periodically pulls newly trained models. (4) The Merger module receives the user visit and requests the myopic value from RTP, with which the ad bid adjustment ratios and ranking scores are calculated (in advertising, the ranking score is $pCTR \times bid$, where $pCTR$ is the predicted Click-Through Rate and $bid$ is the bidding price). Finally, the top-scored ads are delivered to the user.
The above myopic advertising system can be upgraded to a non-myopic system through the following key changes. (1) The Log module needs to keep long-term auction information and users’ feedback, and these data are used to construct features and long-term labels for training. Besides, the logged data have to track each advertised item’s budget and current cost, which are fed to a PID control module to compute $CPR_{thr}$ for user selection in Merger. (2) Model training can use Monte Carlo (MC) or Temporal Difference (TD) methods. For MC, the long-term labels are the cumulative rewards of a sequence and the training becomes a supervised regression problem. For TD, one-step or multi-step rewards are used to compute a bootstrapped long-term value using a separate network for training. (3) The RTP module should periodically pull both the myopic and the non-myopic newly trained models and provide the corresponding value prediction services. (4) Merger maintains a $CPR_{thr}$ table, which is updated periodically from the PID module. When a user visit comes, Merger requests both the myopic and the long-term values from RTP (the long-term value and the long-term cost in our paper), and together with $CPR_{thr}$ decides whether to select the current user and how to adjust the bid.
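The difference between the MC and TD training labels mentioned above can be sketched on a single logged sequence. The example is a minimal illustration: the reward sequence, the discount factor `gamma` and the next-state values (which would come from the separate target network) are hypothetical numbers, and the function names are ours.

```python
def monte_carlo_labels(rewards, gamma=1.0):
    """Long-term label at each step: discounted sum of all rewards from that step on."""
    labels, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        labels.append(running)
    return labels[::-1]


def td_targets(rewards, next_values, gamma=1.0):
    """One-step TD target: immediate reward plus the bootstrapped value of the next state."""
    return [reward + gamma * value for reward, value in zip(rewards, next_values)]


# A logged sequence of three exposures; the purchase (revenue 10) occurs at the last step.
rewards = [0.0, 0.0, 10.0]

# Hypothetical predictions of a separate target network for the next states;
# the value after the final step is 0 because the sequence has ended.
target_net_next_values = [8.0, 9.5, 0.0]

mc_labels = monte_carlo_labels(rewards)
td_labels = td_targets(rewards, target_net_next_values)

print("MC labels:", mc_labels)
print("TD targets:", td_labels)
```

With MC labels, the early exposures are credited with the full delayed revenue of the sequence, which turns training into a supervised regression problem. With TD targets, the credit depends on the current estimate of the target network.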