Dynamic Knapsack Optimization Towards Efficient Multi-Channel Sequential Advertising

by   Xiaotian Hao, et al.

In E-commerce, advertising is essential for merchants to reach their target users. The typical objective is to maximize the advertiser's cumulative revenue over a period of time under a budget constraint. In real applications, an advertisement (ad) usually needs to be exposed to the same user multiple times until the user finally contributes revenue (e.g., places an order). However, existing advertising systems mainly focus on the immediate revenue of single ad exposures, ignoring the contribution of each exposure to the final conversion, and thus usually fall into suboptimal solutions. In this paper, we formulate the sequential advertising strategy optimization as a dynamic knapsack problem. We propose a theoretically guaranteed bilevel optimization framework, which significantly reduces the solution space of the original optimization problem while ensuring the solution quality. To improve the exploration efficiency of reinforcement learning, we also devise an effective action space reduction approach. Extensive offline and online experiments show the superior performance of our approaches over state-of-the-art baselines in terms of cumulative revenue.



1 Introduction

In E-commerce, online advertising plays an essential role for merchants to reach their target users, in which Real-time Bidding (RTB) (Zhang et al., 2014, 2016; Zhu et al., 2017) is an important mechanism. In RTB, each advertiser is allowed to bid for every individual ad impression opportunity. Within a period of time, a number of impression opportunities (user requests) arrive sequentially. For each impression, each advertiser offers a bid based on the impression value (e.g., revenue) and competes with other bidders in real time. The advertiser with the highest bid wins the auction, displays its ad, and enjoys the impression value. Displaying an ad also incurs a cost: in the Generalized Second-Price (GSP) Auction (Edelman et al., 2007), the winner is charged according to the second highest bid. The typical advertising objective for an advertiser is to maximize its cumulative revenue of winning impressions over a time period under a fixed budget constraint.

Figure 1: An illustration of the sequential multiple interactions (across different channels) between a user and an ad. Each ad exposure has long-term influence on the user’s final purchase decision.

In the digital age, to drive conversions, advertisers can reach and influence users across various channels such as display ads, social ads and paid search ads (Ren et al., 2018). As illustrated in Figure 1, the user's decision to convert (purchase a product) is usually driven by multiple interactions with ads. Each ad exposure influences the user's preferences and interests, and therefore contributes to the final conversion. However, existing advertising systems (Yuan et al., 2013; Zhang et al., 2014; Ren et al., 2017; Zhu et al., 2017; Jin et al., 2018; Ren et al., 2019) mainly focus on maximizing the single-step revenue while ignoring the contribution of previous exposures to the final conversion, and thus usually fall into suboptimal solutions: simply optimizing the total immediate revenue cannot guarantee the maximization of the long-term cumulative revenue. Besides, there exist some works (Boutilier & Lu, 2016; Du et al., 2017; Cai et al., 2017; Wu et al., 2018) which optimize the overall revenue over an extra-long (billions of requests) sequence using a single Constrained Markov Decision Process (CMDP) (Altman, 1999). However, the optimization of these methods is still myopic, as they ignore the mental evolution of each user and the long-term advertising effects; their learning is particularly inefficient as well.

Apart from the myopic approaches, there exists some literature considering the long-term effect of each ad exposure. Multi-touch attribution (MTA) methods (Ji & Wang, 2017; Ren et al., 2018; Du et al., 2019) study the assignment of credit to the ad displays preceding a conversion. However, these methods only aim to quantify the contribution of each ad exposure and do not provide ways to optimize the advertising strategies. Besides, since all media channels can affect users' conversions, Li et al. (2018) and Nuara et al. (2019) propose multi-channel budget allocation algorithms to help advertisers understand how particular channels contribute to user conversions; they optimize the budget allocation among channels accordingly to maximize the overall revenue. However, the granularity of their optimization is too coarse: they only optimize the budget allocation at the channel level and do not specifically optimize the advertising sequence for each user, which can lead to suboptimal overall performance.

Considering the shortcomings of existing works, we aim to optimize the budget allocation of an advertiser among all users such that the advertiser's cumulative revenue is maximized, by explicitly taking into consideration the long-term influence of ad exposures on individual users. This problem consists of two levels of coupled optimization: bidding strategy learning for each user and budget allocation among users, which we term the Dynamic Knapsack Problem. Compared with the traditional knapsack problem, a number of challenges arise: 1) Given the estimated long-term value and cost for each user, the optimization space of the budget allocation grows exponentially in the number of users; besides, since different advertising policies for each user lead to different long-term values and costs, the overall optimization space is extremely large. 2) The long-term cumulative value and cost for each user are unknown and difficult to estimate accurately.

To address the above challenges, we propose a novel bilevel optimization framework: Multi-channel Sequential Budget Constrained Bidding (MSBCB), which transforms the original bilevel optimization problem into an equivalent two-level optimization with a significantly reduced search space. The higher level only needs to optimize a one-dimensional variable, while the lower level learns the optimal bidding policy for each user and computes the corresponding optimal budget allocation. For the lower level, we derive an optimal reward function with a theoretical guarantee. Besides, we propose an action space reduction approach to significantly increase the learning efficiency of the lower level. Finally, extensive offline analyses and online A/B testing conducted on one of the world's largest E-commerce platforms, Taobao, show the superior performance of our algorithm over state-of-the-art baselines.

2 Formulation: Dynamic Knapsack Problem

Within a time period of T days, we assume that there are N users visiting the E-commerce platform. Each user may interact with the app multiple times and trigger multiple advertising requests. During the sequential interactions between an ad and a user, each ad exposure could influence the user's mind and therefore contributes to the final conversion. Given a fixed ad, for each user i, we build a separate Markov Decision Process (MDP) (Sutton & Barto, 2018) to model the user's sequential interactions with the same ad. We use π_i to denote the advertising policy of the ad towards user i, which takes user i's state as input and outputs the auction bid. Details of the MDP will be discussed in Section 3.2. For the fixed ad, we define V(π_i) and C(π_i) as the expected long-term cumulative value and cost for each user i under policy π_i. Formally,

    V(π_i) = E_{π_i} [ Σ_{t=1}^{T_i} v_t ],    C(π_i) = E_{π_i} [ Σ_{t=1}^{T_i} c_t ]        (1)
where v_t and c_t represent the value (i.e., the revenue) and cost obtained from each request t according to policy π_i, V(π_i) and C(π_i) represent the long-term cumulative value and cumulative cost, and T_i is the length of the interaction sequence between user i and the current ad.

Given the above definitions, for an advertiser, our target is to maximize its long-term cumulative revenue over T days under a budget constraint B, which is formulated as:

    max_{π} max_{x} Σ_{i=1}^{N} x_i · V(π_i)    s.t.  Σ_{i=1}^{N} x_i · C(π_i) ≤ B,  x_i ∈ {0, 1}        (2)
where π = (π_1, ..., π_N), x = (x_1, ..., x_N), and x_i ∈ {0, 1} indicates whether user i is selected. Since whether displaying an ad to user i does not have any impact on the behaviors of other users, V(π_i) and C(π_i) among different users are independent. Thus, given any fixed advertising policy π, V(π_i) and C(π_i) for each user are fixed and the inner optimization over x in Equation (2) can be viewed as a classic knapsack problem, where the items to be put into the knapsack are the users. However, different advertising policies lead to different V(π_i)s and C(π_i)s for each user, so we define Equation (2) as a Dynamic Knapsack Problem, where the value and cost of each item in the knapsack are dynamic. From the perspective of optimization, Formulation (2) is a typical bilevel optimization, where the optimization of x is embedded (nested) within the optimization of π. This bilevel optimization is challenging due to the following reasons:

  1. The optimization space of the joint policy π is continuous (since the bid space is continuous). The optimization space of x is discrete and grows exponentially in the number of users (hundreds of millions). Therefore, the solution space of the combination of π and x is enormous and is thus difficult or even impossible to optimize directly.

  2. The values of V(π_i) and C(π_i) are unknown and variable; efficient approaches are required to estimate these values online from limited samples.
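To make the nested structure of the problem concrete, the following toy sketch (numbers and candidate-policy sets are invented for illustration; the real system has hundreds of millions of users and continuous policy spaces) enumerates a handful of candidate policies per user, solves the induced static knapsack for each combination, and keeps the best:

```python
from itertools import product

# Toy "dynamic knapsack": each user has several candidate policies, each
# inducing a different (long-term value V, long-term cost C) pair.
candidates = [
    [(6.0, 3.0), (9.0, 6.0)],   # user 0: (V, C) under policy a / policy b
    [(5.0, 2.0), (7.0, 5.0)],   # user 1
    [(4.0, 4.0), (4.5, 6.0)],   # user 2
]
BUDGET = 9.0

def best_selection(vc_pairs, budget):
    """Exact 0/1 knapsack (inner optimization over x) by subset enumeration."""
    best_v, best_x = 0.0, ()
    n = len(vc_pairs)
    for x in product([0, 1], repeat=n):
        cost = sum(xi * c for xi, (_, c) in zip(x, vc_pairs))
        value = sum(xi * v for xi, (v, _) in zip(x, vc_pairs))
        if cost <= budget and value > best_v:
            best_v, best_x = value, x
    return best_v, best_x

# Outer loop over one policy choice per user (the "dynamic" part): solve the
# induced static knapsack and keep the best combination of policies + selection.
(value, selection), policies = max(
    (best_selection(choice, BUDGET), choice)
    for choice in product(*candidates)
)
```

The doubly exponential enumeration here is exactly why the real problem is intractable to solve directly, motivating the decomposition of Section 3.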

3 Methodology: MSBCB Framework

3.1 Bilevel Decomposition and Proof of Correctness

Based on the above analysis, the bilevel optimization (2) is computationally prohibitive and cannot be solved directly. In this paper, we first decompose it into an equivalent two-level sequential optimization process. Taking a fixed policy π as input, we denote the optimal solution of the degraded, static Knapsack Problem as x*(π). Further, the global optimal solution of Problem (2) could be defined as:

    π* = argmax_{π_1, ..., π_N} Σ_{i=1}^{N} x_i*(π) · V(π_i)        (3)
where the π_i are independent variables and (π*, x*(π*)) is the global optimal solution. To obtain π*, we must first specify the form of the function x*(π).

Taking a fixed policy π as input, computing x*(π) is a classic static knapsack problem. However, another challenge in online advertising is that user requests arrive sequentially in real time, so real-time decision making is required. Complicated algorithms (e.g., dynamic programming) are not applicable because all users' values and costs are never known completely in advance.

On the contrary, the Greedy algorithm can compute a greedy solution without knowing the whole set of candidate users beforehand; we will discuss this later. Besides, the Greedy algorithm can achieve a nearly optimal solution in online advertising (Zhang et al., 2014; Wu et al., 2018). As proved by Dantzig (1957), if C(π_i) ≪ B, ∀i, i.e., the cumulative cost for each user is much less than the budget, the Greedy algorithm achieves an approximation ratio of 1 − ε with ε = max_i C(π_i)/B, which means the greedy solution is at least (1 − ε) times the optimal solution. The closer 1 − ε gets to 1, the higher the quality of the greedy solution. In online advertising, any single user's cost is a vanishing fraction of the budget, so 1 − ε is usually greater than 99.9% and the greedy solution is approximately optimal. We provide the detailed data and proof in Section B.1 of the Appendix. Therefore, in this paper, we adopt the Greedy algorithm, i.e., x*(π) = Greedy(π).

We define CPR_i = V(π_i)/C(π_i) as the Cost-Performance Ratio of each user i. The greedy solution is computed by:

  1. Sort all users according to the Cost-Performance Ratio CPR_i in descending order;

  2. Pick users from top to bottom until the cumulative cost violates the budget constraint.

Figure 2: The solution computing process of the Greedy algorithm.

An illustration is shown in Figure 2, together with the budget constraint B. We denote the CPR of the last picked user as CPR_thr, the threshold of the cost-performance ratio. The advantage is that the Greedy algorithm only selects users whose CPR_i ≥ CPR_thr. If we could estimate CPR_thr beforehand, the Greedy algorithm could compute the solution online, without knowing the values and costs of all users in advance.
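The two-step procedure and the threshold view above can be sketched as follows (illustrative numbers; `greedy_select` is a hypothetical helper):

```python
# Greedy knapsack by cost-performance ratio: sort users by CPR = V/C,
# then pick from the top until the budget would be violated.
def greedy_select(users, budget):
    """users: list of (user_id, V, C). Returns picked ids and CPR_thr."""
    ranked = sorted(users, key=lambda u: u[1] / u[2], reverse=True)
    picked, spent, cpr_thr = [], 0.0, None
    for uid, v, c in ranked:
        if spent + c > budget:        # next user would violate the budget
            break
        picked.append(uid)
        spent += c
        cpr_thr = v / c               # CPR of the last picked user
    return picked, cpr_thr

users = [("u1", 8.0, 2.0), ("u2", 6.0, 3.0), ("u3", 5.0, 5.0), ("u4", 1.0, 2.0)]
picked, cpr_thr = greedy_select(users, budget=10.0)
```

Once `cpr_thr` is known, the same selection can be reproduced online with the stateless rule "pick user i iff CPR_i ≥ cpr_thr", with no sorting over the full population.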

Now that x*(π) = Greedy(π) and the Greedy algorithm prefers users with larger CPR_i (it only picks users whose CPR_i ≥ CPR_thr), according to Equation (3), an intuitive way to further improve the solution quality is to optimize π_i for each user such that each CPR_i is maximized, i.e., π_i = argmax_{π_i} CPR_i. However, this intuition is incorrect: maximizing the CPR of each user cannot guarantee that the greedy solution is maximized. Next, we show that, even when all users' CPRs are maximized, we can still further improve the solution quality by increasing certain users' allocated budgets and decreasing their CPRs in exchange for greater overall cumulative value. Before going into the details, we first give Lemma 1.

Figure 3: V(π_i) increases monotonically with C(π_i).

Lemma 1. For each user i, the cumulative value V(π_i) increases monotonically with the increase of the cost C(π_i) within the range of all possible optimal policies.

Proof. We denote the maximum budget allocated to each user i as B_i ∈ [0, C_i^max], where C_i^max is the maximum cost user i can consume. Then, for each user i, within the current budget constraint B_i, the optimal advertising policy must be the one which maximizes the cumulative value, i.e., π_i* = argmax_{π_i: C(π_i) ≤ B_i} V(π_i). Obviously, as B_i moves from 0 to C_i^max, we obtain a set of optimal policies whose costs and values are both increasing. An illustration is shown in Figure 3. This completes the proof.

Figure 4: The x-axis denotes each user's CPR_i and the y-axis denotes the cumulative cost of the Greedy algorithm. All users are sorted in descending order of CPR_i and arranged from bottom to top. Each rectangular slice's area (in gray) represents V(π_i) = CPR_i × C(π_i), where CPR_i and C(π_i) are the width and height. Note that the height of each rectangular slice is much less than the budget constraint, i.e., C(π_i) ≪ B. The red dashed line marks the position of the budget constraint. The total area of all rectangular slices under the red dashed line constitutes the greedy solution.

As illustrated in Figure 4, each user's CPR_i (the width of each rectangular slice) is maximized initially. According to Lemma 1, for a user i, if we increase C(π_i) by ΔC_i, i.e., increase the height of user i's slice by ΔC_i, the corresponding V(π_i) will also increase; we denote this increase in value as ΔV_i. Since there is a budget limit, a small increased height will squeeze out a small area near the budget line, whose height is also ΔC_i and whose width is approximately CPR_thr. (Since C(π_i) ≪ B, the area squeezed out can be considered a tiny and smooth change, and the width of the last user is approximately equal to CPR_thr.) We denote the increased area obtained by reshaping user i as ΔV_i and the decreased area due to extrusion as ΔC_i · CPR_thr. Overall, if ΔV_i − ΔC_i · CPR_thr > 0, the total area will be further increased. For any user i, this yields:

    ΔV_i − ΔC_i · CPR_thr > 0        (4)
where ΔV_i and ΔC_i are caused by the change of the policy, e.g., from π_i to π_i′: we denote V(π_i′) − V(π_i) as ΔV_i and C(π_i′) − C(π_i) as ΔC_i. We conclude that the greedy solution can be further improved if there exists any user i whose current policy π_i can be improved to π_i′ such that ΔV_i − ΔC_i · CPR_thr > 0. Otherwise, the current solution is optimal. Finally, we provide the definition of the optimal π_i* in Theorem 1.

Theorem 1. Under the Greedy paradigm (x*(π) = Greedy(π)), for any given CPR_thr, the optimal advertising policy for each user i is the one which maximizes V(π_i) − CPR_thr · C(π_i). In other words, π_i* is defined as:

    π_i* = argmax_{π_i} [ V(π_i) − CPR_thr · C(π_i) ]        (5)
We denote π* = (π_1*, ..., π_N*). The corresponding solution (π*, Greedy(π*)) is the optimal Greedy solution of the Dynamic Knapsack Problem defined in Equation (2).

Proof of Theorem 1. We define π* = (π_1*, ..., π_N*), where each π_i* is defined according to Equation (5). We prove Theorem 1 by contradiction. Given the threshold CPR_thr, we first assume that (π*, Greedy(π*)) is not the optimal greedy solution of the Dynamic Knapsack Problem, which means we could find at least one user i whose policy π_i* could be further improved to a policy π_i′ such that the overall area is increased. This means we could find a better policy π_i′ for user i such that ΔV_i − ΔC_i · CPR_thr > 0 according to Equation (4), where ΔV_i = V(π_i′) − V(π_i*) and ΔC_i = C(π_i′) − C(π_i*) ≥ 0 (V(π_i) increases monotonically with the increase of C(π_i) according to Lemma 1). Further, this yields:

    V(π_i′) − CPR_thr · C(π_i′) > V(π_i*) − CPR_thr · C(π_i*)        (6)
Equation (6) indicates that π_i′ achieves a greater value of V(π_i) − CPR_thr · C(π_i) than π_i*, which contradicts the definition of π_i* in Equation (5). Thus, the theorem statement is obtained.

1:  Input: an initial CPR_thr;
2:  Output: the optimal greedy solution of the Dynamic Knapsack Problem;
3:  for each period until convergence do
4:     Taking the current estimated CPR_thr as input, the agent optimizes the advertising policy for each user according to Section 3.2 and acquires the optimal π*.
5:     Based on the current estimated CPR_thr and the obtained π*, the agent calculates the greedy solution according to Section 3.3 and collects the actual feedback cost and the predefined budget.
6:     Update the estimated CPR_thr towards the optimal CPR*_thr by minimizing the gap between the actual feedback cost and the budget according to Section 3.4.
7:  end for
Algorithm 1 MSBCB Framework.

We present the overall MSBCB framework in Algorithm 1, which involves a two-level sequential optimization process. (1) Lower level: Given any CPR_thr, we obtain the optimal advertising policy π_i* following Equation (5) of Theorem 1, which will be discussed in Section 3.2. Then, based on CPR_thr and the optimized π*, we acquire the Greedy solution by selecting users whose CPR_i ≥ CPR_thr, which will be detailed in Section 3.3. (2) Higher level: However, the current CPR_thr might deviate from the optimal CPR*_thr, which means selecting all users whose CPR_i ≥ CPR_thr might violate the budget constraint or lead to a substantial budget surplus. Thus, we optimize the current CPR_thr towards CPR*_thr in Section 3.4. Overall, the optimization space of x is reduced from a 2^N discrete space to a one-dimensional continuous variable CPR_thr. We conclude that Algorithm 1 iteratively converges to a unique and approximately optimal solution; we present the proof of convergence in Section B.3 of the Appendix.

3.2 Lower-level Advertising Policy Optimization with Reinforcement Learning

Given a threshold CPR_thr as input, we aim to acquire the optimal advertising policy π_i* defined in Equation (5) of Theorem 1. Combining the definitions of V(π_i) and C(π_i) with Equation (5), we have

    π_i* = argmax_{π_i} E_{π_i} [ Σ_{t=1}^{T_i} (v_t − CPR_thr · c_t) ]        (7)
Accordingly, we define r_t = v_t − CPR_thr · c_t as the immediate profit acquired at each step t. The objective of Equation (7) is to obtain the optimal advertising policy which maximizes the expected long-term cumulative profit. To solve this sequential decision making problem, we formulate it as an MDP and use Reinforcement Learning (RL) (Sutton & Barto, 2018) techniques to acquire the optimal policy π_i*.

We consider an episodic MDP, where an episode starts with the first interaction between a user and an ad, and ends with a purchase or with exceeding the maximum number of steps. The MDP is defined as follows:

  • State s_t: The state should in principle reflect the user request status, the ad info, the user-ad interaction history and the RTB environment.

  • Action a_t: The action each agent can take in the RTB platform is the bid, a real number between 0 and the upper bound bid_max, i.e., a_t ∈ [0, bid_max].

  • Reward r_t: The immediate reward at step t is defined as r_t = v_t − CPR_thr · c_t.

  • Transition probability P(s_{t+1}|s_t, a_t): The probability of the state transitioning from s_t to s_{t+1} when taking action a_t.

  • Discount factor γ: The bidding agent aims to maximize the total discounted reward from step t onwards, where γ ∈ [0, 1].

For each user i, we define the state-action value function Q(s_t, a_t) as the expected cumulative reward achieved by following the advertising policy π_i. The MDP can be solved using existing Deep Reinforcement Learning (DRL) algorithms such as DQN (Mnih et al., 2013), DDPG (Lillicrap et al., 2015) and PPO (Schulman et al., 2017). After sufficient training, we acquire the optimized advertising policies for all users.
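As a minimal illustration of the lower-level learning problem, the tabular Q-learning sketch below uses the profit reward r_t = v_t − CPR_thr · c_t on a two-state toy environment (entirely invented) in which an ad must be shown twice before the user converts:

```python
import random

CPR_THR = 1.0                 # threshold handed down by the higher level
GAMMA, ALPHA, EPS = 0.95, 0.1, 0.2

# (state, action) -> (next_state, immediate value v, immediate cost c).
# Action 1 bids high enough to win the impression; action 0 gives it up.
ENV = {
    (0, 0): (0, 0.0, 0.0),
    (0, 1): (1, 0.0, 1.0),    # first exposure: pay, no revenue yet
    (1, 0): (1, 0.0, 0.0),
    (1, 1): (None, 5.0, 1.0), # second exposure converts (episode ends)
}

Q = {sa: 0.0 for sa in ENV}
rng = random.Random(0)
for _ in range(2000):
    s = 0
    for _ in range(10):                          # cap episode length
        if rng.random() < EPS:                   # epsilon-greedy exploration
            a = rng.choice((0, 1))
        else:
            a = max((0, 1), key=lambda x: Q[(s, x)])
        s2, v, c = ENV[(s, a)]
        r = v - CPR_THR * c                      # immediate profit (Eq. 7)
        future = 0.0 if s2 is None else GAMMA * max(Q[(s2, 0)], Q[(s2, 1)])
        Q[(s, a)] += ALPHA * (r + future - Q[(s, a)])
        if s2 is None:
            break
        s = s2
```

A myopic agent would never take the first, loss-making exposure; with the long-term profit reward, the learned Q values favor exposing in both states.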

3.3 Lower-level User Selection by Greedy Algorithm

Taking the current CPR_thr and the optimized advertising policies π* as inputs, we aim to obtain the greedy solution of the Dynamic Knapsack Problem. In reality, we cannot know all users' request sequences and their values and costs beforehand because user requests arrive sequentially in real time. Thus, many methods that depend on the completeness of all users' data, e.g., the dynamic programming approach (Martello et al., 1999), are not applicable; even the traditional Greedy algorithm cannot be applied directly. Fortunately, the greedy solution can be computed online in an easy way: given the threshold CPR_thr, the agent only has to select online those users whose CPRs are greater than the threshold (an illustration is shown in Figure 2). Therefore, we only have to estimate CPR_i = V(π_i)/C(π_i) for each user i. To acquire V(π_i) and C(π_i), besides Q(s, a), we also maintain two other state-action value functions Q_v(s, a) and Q_c(s, a) according to the Bellman Equation (Sutton & Barto, 2018), where Q_v estimates the expected cumulative value and Q_c the expected cumulative cost.

3.4 Higher-level Optimization by Feedback Control

However, the current estimated threshold CPR_thr might be biased from the optimal CPR*_thr. Thus, selecting all users whose CPR_i ≥ CPR_thr might violate the budget constraint or lead to a substantial budget surplus. Only when the estimated CPR_thr is exactly the same as the optimal CPR*_thr will the actual total advertising cost equal the budget. To achieve this, we design a feedback control mechanism, i.e., a PID controller (Åström & Hägglund, 1995), to dynamically adjust CPR_thr towards CPR*_thr according to the actual feedback of the overall cost. The core formula is:

    CPR_thr ← CPR_thr · (1 + η_1 · (C − B)/B + η_2 · (C_K − B_K)/B_K)        (8)
where C is the actual feedback cost of the current period, B is the budget, and C_K and B_K are the overall cost and the overall budget of the most recent K periods. η_1 and η_2 are two learning rates. The main idea is that when the actual cost exceeds (falls below) the budget, the threshold is increased (decreased) accordingly so that fewer (more) users are selected, which in turn reduces (increases) the cost. The first term is designed to keep up with the latest changes; the second term is designed to stabilize learning.
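A hedged sketch of this controller, assuming a multiplicative two-term form of the update (the exact gains `eta1`, `eta2` and window `k` here are illustrative, not the production settings):

```python
from collections import deque

class ThresholdController:
    """Adjust CPR_thr from spend feedback: a current-period term plus a
    recent-window term that damps oscillation."""

    def __init__(self, cpr_thr, eta1=0.3, eta2=0.1, k=5):
        self.cpr_thr, self.eta1, self.eta2 = cpr_thr, eta1, eta2
        self.costs = deque(maxlen=k)     # costs of the last k periods
        self.budgets = deque(maxlen=k)   # budgets of the last k periods

    def update(self, cost, budget):
        self.costs.append(cost)
        self.budgets.append(budget)
        recent = (sum(self.costs) - sum(self.budgets)) / sum(self.budgets)
        self.cpr_thr *= 1.0 + self.eta1 * (cost - budget) / budget \
                            + self.eta2 * recent
        return self.cpr_thr

ctrl = ThresholdController(cpr_thr=1.0)
t1 = ctrl.update(cost=120.0, budget=100.0)   # overspent -> threshold rises
t2 = ctrl.update(cost=80.0, budget=100.0)    # underspent -> threshold falls
```

Overspending raises the threshold (fewer users pass the CPR test), underspending lowers it, and the windowed term keeps single noisy periods from whipsawing the threshold.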

3.5 Action Space Reduction for RL in Advertising

However, when applying the RL approaches mentioned in Section 3.2 to online advertising, one typical issue is inefficient sample utilization. The main reason is that the action space of the agent is continuous, so the whole range [0, bid_max] needs to be explored in all states. To resolve this problem, we reduce the continuous action space [0, bid_max] to a binary one, {0, 1}, by making full use of prior knowledge in advertising, which greatly improves the sample utilization of the RL approaches. Specifically, since different bids can only result in two different outcomes a_t ∈ {1, 0}, where a_t = 1 or 0 indicates whether the ad is displayed to the user, we only have to evaluate the different expected returns resulting from a_t = 1 or a_t = 0. We denote the greedy action based on the current value estimations as:

    a_t* = argmax_{a ∈ {0, 1}} Q(s_t, a)        (9)
Then, to obtain an executable bid: for a_t* = 0, we can offer a low enough bid, e.g., 0, to make sure that it is impossible to win the auction; for a_t* = 1, we propose an optimal bid function which outputs a bid greater than the second highest bid while not overbidding.

In detail, we maintain two state-action value functions Q_v(s_t, a_t) and Q_c(s_t, a_t). Since the reward function is defined as r_t = v_t − CPR_thr · c_t, we have Q(s_t, a_t) = Q_v(s_t, a_t) − CPR_thr · Q_c(s_t, a_t). Then a_t* = 1 yields:

    Q_v(s_t, 1) − CPR_thr · Q_c(s_t, 1) ≥ Q_v(s_t, 0) − CPR_thr · Q_c(s_t, 0)        (10)
If a_t = 0, the expected immediate cost is 0 (since the ad is not exposed). If a_t = 1, we denote the expected immediate cost as c_t, whose value depends on the pricing model. In online advertising, typical pricing models include CPM (Cost Per Mille: the advertiser bids for impressions and is charged per impression), CPC (Cost Per Click: the advertiser bids for clicks and is charged per click) and CPS (Cost Per Sales: the advertiser bids for conversions and is charged per conversion). If CPM is used, the expected immediate cost is the second highest bid in the auction, denoted bid_2nd. If CPC is used, the expected immediate cost also involves the predicted Click-Through Rate (pCTR); if CPS is used, it involves the predicted Conversion Rate (pCVR). For ease of presentation, we take CPM as an example. Under CPM,

    c_t = bid_2nd        (11)
Notice that the second highest bid bid_2nd is unknown until the current auction is finished. Substituting Equation (11) into Equation (10) and separating the immediate cost from the future cost, we acquire

    bid_2nd ≤ (Q_v(s_t, 1) − Q_v(s_t, 0)) / CPR_thr − (Q_c^f(s_t, 1) − Q_c(s_t, 0))        (12)
where Q_c^f(s_t, 1) = Q_c(s_t, 1) − bid_2nd denotes the expected future cost after the current impression. We denote the term on the right of the '≤' in Equation (12) as bid*. We conclude that the bidding agent can always set the bid price a_t = bid* during the online bidding phase, which is the optimal action without any loss of accuracy; refer to Section B.2 of the Appendix for the proof. For CPC or CPS, the optimal bid formula can be acquired by substituting the corresponding expected immediate cost into Equation (11). Here, we reaffirm that our action space reduction technique is a generalized design and is applicable to different pricing models.
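Putting Equation (12) to work, a minimal sketch of the online bidding rule under CPM (function and variable names are hypothetical; `qc_future1` plays the role of Q_c^f(s_t, 1), the expected cost after the current impression):

```python
BID_MAX = 10.0   # illustrative bid cap

def choose_bid(qv1, qv0, qc_future1, qc0, cpr_thr):
    """Return the bid implementing the greedy action argmax_a Q(s, a).

    Q(s,1) >= Q(s,0) holds iff bid_2nd <= bid_star, so bidding exactly
    bid_star wins the auction precisely when winning is profitable.
    """
    bid_star = (qv1 - qv0) / cpr_thr - (qc_future1 - qc0)
    # Clamp: a non-positive bid_star means "give up" (impossible-to-win bid).
    return min(max(bid_star, 0.0), BID_MAX)
```

Because the second price is unknown at bid time, bidding the indifference point `bid_star` delegates the win/lose decision to the auction itself: the ad wins exactly in the auctions where exposure is worth its cost.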

4 Empirical Evaluation: Simulations

We start by designing simulation experiments to shed light on the contributions of the proposed MSBCB framework under more controlled settings. Similar to the simulation settings of (Ie et al., 2019), we assume there is a set of users, a set of ads and a set of commodity categories. Each ad has an associated category. Each user has various degrees of interest in the commodity categories, which are influenced by the displayed ads. When a user consumes an ad, his interest in the ad's category is nudged stochastically, biased slightly towards increasing his interest, but with some chance of decreasing it. The sizes of the user, ad and category sets used in the following experiments, together with detailed settings of the simulation environment, can be found in Section D of the Appendix.

4.1 Baselines

We compare our MSBCB with the following baseline strategies:

  • Myopic Approaches: (1) Manual Bid is a strategy in which the agent continuously bids the same price initialized by the advertiser. (2) Contextual Bandit (Zhang et al., 2014) aims at maximizing the accumulated short-term value of each request under the Greedy framework.

  • Greedy with maximized CPR: This approach is similar to our method under the Greedy framework except that each π_i is optimized by maximizing the long-term CPR_i. In the offline simulation, we enumerate all policies for each user and select the one which maximizes its CPR. This approach is named Greedy+maxCPR.

  • Greedy with state-of-the-art RL approaches: These baselines, i.e., Greedy+DQN, Greedy+DDPG and Greedy+PPO, utilize the same reward function as our MSBCB for the lower-level optimization of π_i. The difference is that our MSBCB leverages the action space reduction technique. For DQN and PPO, we discretize the bid action space evenly into 11 real numbers as the valid actions.

  • Undecomposed Optimization: These baselines are RL approaches (DQN, DDPG and PPO) based on the Constrained Markov Decision Process (CMDP), named Constrained+DQN, Constrained+DDPG and Constrained+PPO respectively. We follow the CMDP design and settings in (Wu et al., 2018).

  • Offline Optimal: The optimal solution of the Dynamic Knapsack Problem can be computed by dynamic programming in offline simulation because we could enumerate all possible policies to get the corresponding long-term values and costs for each user. Note that since users’ request sequences are unknown beforehand and there is only one chance for the ad to bid for each request in the online advertising systems, the optimal solution can only be obtained in offline simulation.

4.2 Experimental Results

We conduct extensive analyses of our MSBCB from the following five aspects. All approaches aim to maximize the advertiser's cumulative revenue under a fixed budget constraint. All experimental results are averaged over 10 runs. The hyperparameters for each algorithm are set to the best we found after grid-search optimization.

Figure 5: Values comparisons (learning curves) of the myopic approaches with non-myopic approaches and the offline optimal.

Myopic vs Non-myopic. To show the benefits of upgrading the myopic advertising system into a farsighted one, we compare the cumulative revenue achieved by our MSBCB with the two myopic baselines. The learning curves and results are shown in Figure 5 and Table 1. We see that MSBCB outperforms Manual Bid and Contextual Bandit by a large margin, which indicates that taking into account the long-term effect of each ad exposure can significantly improve the cumulative advertising results.

MSBCB vs the Offline Optimal. In Figure 5, we also compare our MSBCB with the Offline Optimal, which is computed by a modified dynamic programming algorithm. We see that as training continues, our MSBCB gradually achieves an approximately optimal solution. Detailed results are summarized in Table 1. Our MSBCB empirically achieves an approximation ratio of 98.53% (±0.36%).

MSBCB vs Greedy with maximized CPR. As discussed in Section 3.1, under the Greedy framework, maximizing each user's CPR cannot guarantee that the greedy solution of the Dynamic Knapsack Problem (2) is maximized; the optimal advertising policy for each user is instead given by Theorem 1. To experimentally verify Theorem 1, we compare the cumulative revenue achieved by MSBCB and Greedy with maximized CPR. As shown in Figure 6 and Table 1, MSBCB outperforms Greedy with maximized CPR by a clear margin.

Figure 6: Value comparisons of MSBCB with the Greedy with maximized CPR and the Greedy with state-of-the-art RL.

MSBCB vs Greedy with state-of-the-art RL approaches. Besides, to show the effectiveness of the action-space reduction proposed in Section 3.5, we compare MSBCB with the state-of-the-art DRL approaches under the Greedy framework. As shown in Figure 6 and Table 1, MSBCB outperforms Greedy+DQN, Greedy+DDPG and Greedy+PPO both in the cumulative revenue and the convergence speed, which shows that the action space reduction effectively improves the sample efficiency of RL approaches.

Figure 7: Values comparison (learning curves) of MSBCB and state-of-the-art CMDP based RL approaches.

Decomposed MSBCB vs Undecomposed optimization. Similar to (Wu et al., 2018), the undecomposed optimization baselines consider all users' requests as a whole and model the budget allocation among all requests as a CMDP. As shown in Figure 7 and Table 1, MSBCB outperforms the CMDP-based RL approaches by a large margin. The reason for the poor performance of the CMDP-based approaches is that they model all users' requests as one whole sequence, which makes the learning process particularly inefficient. In contrast, our MSBCB decomposes the whole-sequence optimization into an efficient two-level optimization process and thus achieves better performance more easily.

Method               Revenue    Cost      Revenue Improv.    Approximation Ratio
Manual Bid           38838.28   11995.10  -48.31%            43.50%
Contextual Bandit    75137.30   11995.46  0%                 84.15%
Constrained + PPO    61890.92   11954.07  -17.63±16.11%      69.31±13.56%
Constrained + DDPG   74259.12   11996.12  -1.19±3.66%        83.17±3.08%
Constrained + DQN    70662.65   11881.12  -5.96±7.83%        79.14±6.59%
Greedy + maxCPR      83668.70   11914.12  11.35±2.84%        93.70±2.36%
Greedy + PPO         76970.35   11825.59  2.44±3.52%         86.20±2.93%
Greedy + DDPG        80424.69   11841.28  7.04±1.13%         90.07±0.92%
Greedy + DQN         84117.09   11794.24  11.95±4.96%        94.21±4.14%
MSBCB                87947.99   11957.57  17.95±0.42%        98.50±0.33%
MSBCB (enum)         89251.77   11988.36  18.78%             99.96%
Offline Optimal      89291.11   11999.23  18.84%             100.00%
Table 1: Cumulative values, costs, value improvements (over Contextual Bandit) and the approximation ratios of all approaches.

The complete comparisons of all approaches are shown in Table 1; the budget constraint is set to 12000 for all experiments. In Table 1, we also add MSBCB (enum), which is the theoretical upper bound of our MSBCB. The difference between the two is that MSBCB (enum) computes the optimal advertising policy for each user by enumerating all possible policies instead of using the RL approach, and can therefore find the policy which exactly maximizes V(π_i) − CPR_thr · C(π_i). We see that MSBCB (enum) is very close to the optimal solution and reaches an approximation ratio of 99.96%.

4.3 Effectiveness of Action Space Reduction

As shown in Table 2, MSBCB achieves a revenue of 75000 in only 61 epochs, using more than 60% fewer samples than the state-of-the-art RL baselines that do not use the action-space reduction technique. In terms of the learning process, our MSBCB reaches the same revenue level (80000) more than 10 times faster than the baselines, reducing the number of samples by more than 90%, and finally attains the highest revenue. Thus, with the action space reduction technique, our MSBCB reaches a higher performance at a faster speed and significantly improves the sample efficiency. Further analysis of our MSBCB (e.g., its convergence behavior) and the hyperparameter settings of the offline experiments are given in Section D of the Appendix.

Revenue level 75000 80000 85000
Method #Epoch #Samples #Epoch #Samples #Epoch #Samples
Greedy+PPO 817 4183040 - - - -
Greedy+DDPG 154 788480 853 4362240 - -
Greedy+DQN 373 1909760 754 3855360 - -
MSBCB 61 312320 71 363520 104 532480
Table 2: The training epochs and the number of samples needed by different approaches to reach the same revenue level ('-' indicates the level was never reached).

5 Empirical Evaluation: Online A/B Testing

We deployed MSBCB on one of the world’s largest E-commerce platforms, Taobao. Our platform is authorized by the advertisers to dynamically adjust their bid prices for each user request according to its value in the real-time auction. In the online experiments, we compare MSBCB with two models widely used in the industry.

  • Cross Entropy Method (CEM), which is a deployed production model, whose target is to optimize the immediate rewards. We consider CEM as the control group in the following evaluations.

  • Contextual Bandit, which has been explained in previous section and is reserved as a contrast test.

The experiment involves 135,858,118 users and 72,147 ad items from 186 advertisers. For a fair comparison, we control the consumers and advertisers involved in the A/B testing to be homogeneous. Specifically, the 135,858,118 users are randomly and evenly divided into 3 groups. For users in group #1, all 186 advertisers adopt the CEM algorithm; for users in group #2, all 186 advertisers adopt the Contextual Bandit algorithm; and for users in group #3, all 186 advertisers adopt our MSBCB. Table 3 summarises the effects of the Contextual Bandit and our MSBCB compared to the Cross Entropy Method from Dec. 10 to Dec. 20, 2019. From Table 3, we see that our MSBCB achieves a +10.08% improvement in revenue and a +10.31% improvement in ROI with almost the same cost (-0.20%). The results indicate that upgrading the myopic advertising strategy into a farsighted one can significantly improve the cumulative revenue. Besides, as shown in Figure 8, the daily ROI improvement also demonstrates the effectiveness of our MSBCB compared with the Contextual Bandit.

Method Revenue Cost CVR #PV ROI
Contextual Bandit +0.91% -3.26% +4.78% +4.62% +4.31%
MSBCB +10.08% -0.20% +6.04% +15.37% +10.31%
Table 3: The overall performance comparisons of the A/B testing. CVR represents the Conversion Rate of the users. #PV represents the number of page views. ROI means Return On Investment. (Note that CEM is the control group, and the improvements of Contextual Bandit and MSBCB are measured relative to CEM.)

Given that only 186 advertisers took part in our online experiment, a frequently asked question is: “How does MSBCB work across all ads?” Since 186 is small relative to the total number of advertisers, their policy updates do not cause dramatic changes to the RTB environment. In other words, the RTB environment remains approximately stationary from a single-ad perspective. This setting also matches our practical business model: providing better service for VIP advertisers (about 0.2% of all advertisers). If the majority of advertisers were to adopt MSBCB, the environment could no longer be treated as stationary from any single ad's perspective, and explicit multi-agent modeling and coordination would need to be incorporated. A detailed analysis of the revenue improvement for each advertiser is presented in Table 7 and Figure 19 of the Appendix. More details about the deployment and experimental results (e.g., the online model architecture) can also be found in Sections C and E of the Appendix.

Figure 8: Daily ROI improvement comparisons of Contextual Bandit and MSBCB over Cross Entropy Method.

6 Conclusion

We formulate the multi-channel sequential advertising problem as a Dynamic Knapsack Problem, whose target is to maximize the long-term cumulative revenue over a period of time under a budget constraint. We decompose the original problem into an easier bilevel optimization, which significantly reduces the solution space. For the lower-level optimization, we derive an optimal reward function with theoretical guarantees and design an action space reduction technique to improve the sample efficiency. Extensive offline experimental analysis and online A/B testing demonstrate the superior performance of our MSBCB over the state-of-the-art baselines in terms of cumulative revenue.


This work is supported by the National Natural Science Foundation of China (Grant Nos. 61702362 and U1836214), the Special Program of Artificial Intelligence of the Tianjin Municipal Science and Technology Commission (No. 17ZXRGGX00150), and the Alibaba Group through the Alibaba Innovative Research Program. We deeply appreciate all teammates from the Alibaba Group for their significant support of the online experiments.


  • Altman (1999) Altman, E. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • Åström & Hägglund (1995) Åström, K. J. and Hägglund, T. PID controllers: theory, design, and tuning, volume 2. Instrument society of America Research Triangle Park, NC, 1995.
  • Boutilier & Lu (2016) Boutilier, C. and Lu, T. Budget allocation using weakly coupled, constrained Markov decision processes. 2016.
  • Cai et al. (2017) Cai, H., Ren, K., Zhang, W., Malialis, K., Wang, J., Yu, Y., and Guo, D. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 661–670. ACM, 2017.
  • Dantzig (1957) Dantzig, G. B. Discrete-variable extremum problems. Operations research, 5(2):266–288, 1957.
  • Du et al. (2017) Du, M., Sassioui, R., Varisteas, G., Brorsson, M., Cherkaoui, O., et al. Improving real-time bidding using a constrained Markov decision process. In International Conference on Advanced Data Mining and Applications, pp. 711–726. Springer, 2017.
  • Du et al. (2019) Du, R., Zhong, Y., Nair, H., Cui, B., and Shou, R. Causally driven incremental multi touch attribution using a recurrent neural network. arXiv preprint arXiv:1902.00215, 2019.
  • Edelman et al. (2007) Edelman, B., Ostrovsky, M., and Schwarz, M. Internet advertising and the generalized second-price auction: Selling billions of dollars worth of keywords. American economic review, 97(1):242–259, 2007.
  • Ie et al. (2019) Ie, E., Jain, V., Wang, J., Navrekar, S., Agarwal, R., Wu, R., Cheng, H.-T., Lustman, M., Gatto, V., Covington, P., et al. Reinforcement learning for slate-based recommender systems: A tractable decomposition and practical methodology. arXiv preprint arXiv:1905.12767, 2019.
  • Ji & Wang (2017) Ji, W. and Wang, X. Additional multi-touch attribution for online advertising. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Jin et al. (2018) Jin, J., Song, C., Li, H., Gai, K., Wang, J., and Zhang, W. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2193–2201. ACM, 2018.
  • Li et al. (2018) Li, P., Hawbani, A., et al. An efficient budget allocation algorithm for multi-channel advertising. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 886–891. IEEE, 2018.
  • Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Martello et al. (1999) Martello, S., Pisinger, D., and Toth, P. Dynamic programming and strong bounds for the 0-1 knapsack problem. Management Science, 45(3):414–424, 1999.
  • Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Nuara et al. (2019) Nuara, A., Sosio, N., Trovò, F., Zaccardi, M. C., Gatti, N., and Restelli, M. Dealing with interdependencies and uncertainty in multi-channel advertising campaigns optimization. In The World Wide Web Conference, pp. 1376–1386. ACM, 2019.
  • Ren et al. (2017) Ren, K., Zhang, W., Chang, K., Rong, Y., Yu, Y., and Wang, J. Bidding machine: Learning to bid for directly optimizing profits in display advertising. IEEE Transactions on Knowledge and Data Engineering, 30(4):645–659, 2017.
  • Ren et al. (2018) Ren, K., Fang, Y., Zhang, W., Liu, S., Li, J., Zhang, Y., Yu, Y., and Wang, J. Learning multi-touch conversion attribution with dual-attention mechanisms for online advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1433–1442. ACM, 2018.
  • Ren et al. (2019) Ren, K., Qin, J., Zheng, L., Yang, Z., Zhang, W., and Yu, Y. Deep landscape forecasting for real-time bidding advertising. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 363–372. ACM, 2019.
  • Roberge (2015) Roberge, M. The Sales Acceleration Formula: Using Data, Technology, and Inbound Selling to Go from $0 to $100 Million. John Wiley & Sons, 2015.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.
  • Wu et al. (2018) Wu, D., Chen, X., Yang, X., Wang, H., Tan, Q., Zhang, X., Xu, J., and Gai, K. Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1443–1451. ACM, 2018.
  • Yuan et al. (2013) Yuan, S., Wang, J., and Zhao, X. Real-time bidding for online advertising: measurement and analysis. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising, pp. 1–8, 2013.
  • Zhang et al. (2014) Zhang, W., Yuan, S., and Wang, J. Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1077–1086. ACM, 2014.
  • Zhang et al. (2016) Zhang, W., Ren, K., and Wang, J. Optimal real-time bidding frameworks discussion. arXiv preprint arXiv:1602.01007, 2016.
  • Zhou et al. (2018) Zhou, G., Zhu, X., Song, C., Fan, Y., Zhu, H., Ma, X., Yan, Y., Jin, J., Li, H., and Gai, K. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1059–1068. ACM, 2018.
  • Zhu et al. (2017) Zhu, H., Jin, J., Tan, C., Pan, F., Zeng, Y., Li, H., and Gai, K. Optimized cost per click in taobao display advertising. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2191–2200. ACM, 2017.

Appendix A Background of Online Advertising

Online advertising is a marketing strategy that uses an advertising platform as a medium to obtain website traffic and deliver the marketing messages of advertisers to suitable customers.

Platform. The advertising platform plays an important role in connecting consumers and advertisers. For consumers, it provides multiple advertising channels to explore, e.g., channels on news media, social media, and E-commerce websites and apps. For advertisers, it provides automated bidding strategies to compete for consumers across all channels under the real-time bidding (RTB) setting, in which advertisers bid for ad exposures and each exposure opportunity goes to the highest bidder at a cost equal to the second-highest bid in the auction.
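The second-price settlement rule described above can be sketched as follows; this is a simplified single-slot auction with illustrative bids, not the platform's production auction.

```python
def run_second_price_auction(bids):
    """bids: dict mapping advertiser id -> bid price.
    The highest bidder wins the exposure and pays the second-highest bid."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    cost = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, cost

winner, cost = run_second_price_auction({"A": 2.5, "B": 1.8, "C": 0.9})
# winner is "A"; the cost equals B's bid (1.8), not A's own bid of 2.5.
```

Note that the winner's payment depends only on the competing bids, which is what makes truthful bidding attractive in this mechanism.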

Consumers. Consumers explore multiple channels during their several visits to the platform over a couple of days. A consumer’s final purchase of an item is usually a gradually evolving process, which often includes the phases of Awareness, Interest, Desire, and Action (AIDA) (Roberge, 2015). The consumer’s decision to convert (purchase a product) usually has to be driven by multiple touchpoints (exposures) with ads. Each ad exposure during these sequential interactions can influence the consumer’s mind (preferences and interests) and therefore contribute to the final conversion.

Advertisers. The goal of advertisers is to cultivate the consumer’s awareness and interest and finally drive a purchase. As different ad strategies affect the consumers’ AIDA phases differently, an advertiser should develop a competitive strategy to win ad exposures in the RTB setting. Under the Cost Per Click (CPC) setting, when the ad is displayed to a consumer, the advertiser pays a commission to the platform after the consumer clicks the ad. When the consumer purchases the advertised item, the advertiser receives the corresponding revenue.

The objective of an advertiser is usually to maximize the accumulated revenue within a time period under a budget constraint. A strategy that independently maximizes the short-term revenue of each ad exposure on different channels is clearly suboptimal, since the final purchase results from a long-term sequence of ad-consumer interactions and the consumer’s visits to different channels are interdependent. Therefore, the advertiser must develop a strategy that overcomes the following two key challenges: (1) finding the optimal interaction sequence, including the number of interactions, the channel selection and the channel order, for a targeted consumer; and (2) choosing the targeted consumers and allocating the predefined limited budget to them across multiple interaction sequences.

Figure 9: An illustration of the sequential multiple interactions (across different channels) between a user and an ad. Each ad exposure has long-term influence on the user’s final purchase decision.

An example of a user’s shopping journey is shown in Figure 9. At time t1, a user visits the news media channel and triggers an advertising exposure opportunity. The advertising agent executes a display action and leaves an exposure on the user. After that, the user becomes aware of and interested in the commodity, so he clicks the hyperlink and is quickly taken to the landing (detail) page of the commodity in the shopping app. After fully understanding the product information, the user leaves the shopping app. After a period of time, the user comes back to the shopping app at time t2 and triggers an exposure opportunity of banner advertising; the advertising agent executes a display action as well, and the user’s desire is stimulated. At time t3, the user makes a purchase. In this example, the ad exposure at time t1 influences the user’s mind and contributes to the ad exposure at time t2 and the delayed purchase, which means an ad exposure on one channel influences the user’s preferences and interests, and therefore contributes to the final conversion. Thus, the goal of advertising should be to maximize the total cumulative revenue over a period of time instead of simply maximizing the immediate revenue.

Appendix B Proof and Analysis

B.1 Knapsack Problem in Online Advertising Settings

Theorem 2. The greedy solution to the proposed dynamic knapsack problem of online advertising is approximately optimal when the weight of every single item is much smaller than the knapsack capacity, i.e., max_i w_i ≪ B.

Proof. In the proposed online advertising problem, each user i is associated with a value v_i (i.e., the advertiser's profit when the user purchases the commodity) and a weight w_i (i.e., the total budget consumed on that user in real-time bidding to reach the final purchase). As each item (i.e., user) is non-splittable, the proposed dynamic knapsack problem is essentially a 0-1 knapsack problem which aims to maximize the total value of the knapsack given a fixed capacity B. For each item, we calculate the Cost-Performance Ratio (CPR) as CPR_i = v_i / w_i. Sort all items in descending order of CPR, i.e., CPR_1 ≥ CPR_2 ≥ ... ≥ CPR_n. Let OPT and G denote the total knapsack values of the optimal solution and the greedy solution, respectively.

Assume B_r is the remaining budget after the greedy algorithm terminates; the following inequality holds:

OPT − G ≤ B_r · CPR_k,

where k is the index of the last item picked by the greedy algorithm. This is because:

  1. If the knapsack can hold all the items after the greedy algorithm, the optimal solution is equal to the greedy solution, i.e., OPT = G. As B_r · CPR_k ≥ 0, we have OPT − G ≤ B_r · CPR_k.

  2. If the knapsack cannot hold all the items after the greedy algorithm, then since the items are sorted in descending order of CPR and the 0-1 optimum is upper-bounded by the optimum of the fractional relaxation, we have OPT ≤ G + B_r · CPR_k, which simplifies to OPT − G ≤ B_r · CPR_k.

In online advertising settings, the budget spent on a single user is much smaller than the advertiser’s total budget. We conduct statistics on one of the world’s largest E-commerce platforms to verify this. On Feb 3rd, 2020, a total of 1,136,149 ads resulted in 983,414,548 user-ad sequences (a user sequence consists of multiple interactions of the same user with the same ad), with an average of 865 user sequences per ad. The interactions between each ad and its users form a knapsack problem, in which each user sequence is an item. The average maximum budget consumed by a user sequence accounts for 0.07068% of the advertiser's total budget capacity. We also list details of the 5 ads with the largest budget consumption in Table 4, where the maximum budget consumed by any user sequence is much smaller than 1/1000 (specifically, smaller than 3/10000) of each ad's total budget.

Ad #Users Sequences Budget Avg Cost (Avg Cost)/Budget Max Cost (Max Cost)/Budget
Ad 1 2460976 119352.51 0.048498039 0.0000406343% 20.04 0.0167905979%
Ad 2 2674738 114388.54 0.04276626 0.000037388% 26.22 0.0229218766%
Ad 3 2848816 90113.08 0.031631766 0.0000351023% 15.29 0.0169675701%
Ad 4 2107497 82951.82 0.03936035 0.0000474497% 5.6 0.0067509067%
Ad 5 1087011 77140.49 0.070965694 0.0000919954% 19.32 0.0250452130%
Table 4: Detailed Comparison between an ad’s total budget and cost on a user sequence.

As proposed in Dantzig (1957), the greedy algorithm achieves a strong approximation guarantee when every item's weight is small relative to the capacity. We can conclude from the above statistics that max_i w_i ≪ B, i.e., the total budget B is much greater than the cost of any single user sequence.

The claim above can be verified in both cases:

  1. If the knapsack can hold all the items after the greedy algorithm, the greedy solution is equal to the optimal solution, which is trivially approximately optimal.

  2. If the knapsack cannot hold all the items after the greedy algorithm, we have B_r < max_i w_i, that is, B_r / B < 3/10000. Moreover, since every item picked by the greedy algorithm has CPR at least CPR_k, we have G ≥ (B − B_r) · CPR_k. According to Formula 13, we have

G / OPT ≥ G / (G + B_r · CPR_k) ≥ (B − B_r) / B = 1 − B_r / B > 1 − 3/10000 > 99.9%.

Therefore, in theory, the greedy solution in our online advertising setting is approximately optimal, and the approximation ratio is much greater than 99.9% in our case.
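The greedy CPR procedure analyzed in this subsection can be sketched as follows; the item values, weights, and budget are illustrative, not taken from the statistics above.

```python
# A minimal sketch of the greedy CPR (Cost-Performance Ratio) procedure:
# sort items by value/weight and fill the budget greedily.

def greedy_knapsack(items, budget):
    """items: list of (value, weight) pairs.
    Returns (total_value, indices of picked items)."""
    order = sorted(range(len(items)),
                   key=lambda i: items[i][0] / items[i][1], reverse=True)
    total_value, remaining, picked = 0.0, budget, []
    for i in order:
        v, w = items[i]
        if w <= remaining:          # items are non-splittable (0-1 knapsack)
            picked.append(i)
            total_value += v
            remaining -= w
    return total_value, picked

# When every weight is tiny relative to the budget (as in Table 4), the
# leftover budget B_r -- and hence the optimality gap -- is tiny too.
value, picked = greedy_knapsack([(10, 2), (9, 3), (4, 2), (2, 2)], budget=5)
# picks items 0 and 1 (CPRs 5.0 and 3.0) for a total value of 19
```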

B.2 Regretless Optimal Bidding Strategy

Theorem 3. During the online bidding phase, the bidding agent can always set the bid price as:


This bid price constitutes a regretless optimal bidding strategy without any loss of accuracy.

Proof. Since the second-highest price is unknown until the current auction is finished, we prove the regretless property of the strategy by considering the following two cases:

  1. If the current exposure is worth winning, the agent should win the auction. By the condition for entering this branch, the bid price is greater than the second-highest price. Thus, the agent will always win the auction, and the executed action is indeed the desired one.

  2. If the current exposure is not worth winning, the agent should lose the auction. By the condition, the bid price is less than the second-highest price. Thus, the agent will always lose the auction, and the executed action is again the desired one.

Thus, we complete the proof.
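The two cases above can be illustrated with a small sketch: under second-price settlement, an agent that bids exactly the value it attaches to winning wins precisely when the realized second-highest price is below that value, so the executed outcome always matches the action it would have chosen with full knowledge of the price. All numbers are illustrative.

```python
def outcome(bid, second_price):
    """Second-price rules: win iff bid exceeds the second-highest price,
    and pay that price upon winning."""
    won = bid > second_price
    return won, (second_price if won else 0.0)

value = 3.0  # the (illustrative) value the agent attaches to winning
for second_price in (1.0, 2.9, 3.1, 5.0):
    won, cost = outcome(value, second_price)
    assert won == (value > second_price)  # wins exactly when worthwhile
    assert cost <= value                  # never pays more than its value
```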

B.3 Convergence Analysis of MSBCB

Figure 10: Convergence demonstration of MSBCB

The overall framework of MSBCB can be described as follows:

  1. Let the budget constraint of an advertiser be B. Given a CPR threshold, we can use reinforcement learning algorithms to ensure that each user's policy is optimized and converges to the optimal policy under the current threshold. Further, picking all users whose maximum CPR exceeds the threshold results in a total cost C (i.e., the advertiser spends a budget of C).

  2. As the current estimated threshold might deviate from the optimal one, C may not equal the budget B. Thus, we design a PID controller to dynamically adjust the estimated threshold so as to minimize the gap between the budget constraint B and the actual feedback of the daily cost C.

As described in Figure 10, MSBCB repeats the above two steps iteratively. Given an updated threshold, each user's policy is optimized by the lower-level reinforcement learning algorithm and moves towards the optimal one. As a result, users whose optimized maximum CPR exceeds the threshold are selected, yielding the daily cost C. Then, the current threshold is updated so that the gap between the cost C and the budget B is further minimized. Thus, the threshold moves towards its optimal value gradually. As long as the learning rates of the two levels are small enough, the overall iteration will finally converge. We also validate the convergence of our MSBCB in the experiments: as shown in Section 4.2 of the paper, our method converges quickly and finally reaches an approximation ratio of 98.53%.
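The higher-level loop above can be sketched as a standard PID controller; the gains, step size, and the monotone toy cost model standing in for C(threshold) below are illustrative assumptions, not the production settings.

```python
# Hedged sketch of the higher-level PID loop: adjust the CPR threshold so
# that the realized daily cost tracks the budget B. All constants are toy.

class PIDController:
    def __init__(self, kp=0.5, ki=0.1, kd=0.05):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def daily_cost(threshold):
    # Toy stand-in for C(threshold): raising the threshold filters out more
    # users, so the realized cost decreases monotonically in the threshold.
    return 20000.0 / (1.0 + threshold)

budget, threshold = 12000.0, 1.0
pid = PIDController()
for _ in range(200):
    error = daily_cost(threshold) - budget   # overspend -> raise threshold
    threshold += 0.0001 * pid.step(error)
    threshold = max(threshold, 0.0)
# threshold settles near the value where daily_cost(threshold) == budget
```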

Appendix C Deployment

Here we give the online deployment details of our MSBCB.

C.1 Myopic to Non-Myopic Advertising System Upgrade Solution

Figure 11: Online System

A myopic advertising system includes several key components, as Figure 11 shows: (1) The Log module collects auction information and user feedback. (2) Training data are constructed from the logs, followed by model training with offline evaluation. (3) The Real-Time Prediction (RTP) module provides myopic value predictions for user-ad pairs; RTP periodically pulls newly trained models. (4) The Merger module receives the user visit and requests myopic values from RTP, with which ad bid adjustment ratios and ranking scores are calculated (in advertising, the ranking score is pCTR × bid, where pCTR is the predicted Click-Through Rate and bid is the bidding price). Finally, the top-scored ads are delivered to the user. The above myopic advertising system can be upgraded to a non-myopic system with the following key changes. (1) The Log module needs to keep long-term auction information and user feedback, which are used to construct features and long-term labels for training. Besides, the logged data have to track each advertised item's budget and current cost, which are fed to a PID control module to compute the threshold used for user selection in the Merger. (2) Model training can use Monte Carlo (MC) or Temporal Difference (TD) methods. For MC, the long-term labels are the cumulative rewards of a sequence and training becomes a supervised regression problem. For TD, one-step or multi-step rewards are used to compute a bootstrapped long-term value using a separate network for training. (3) The RTP module should periodically pull both the myopic and non-myopic newly trained models and provide the corresponding value prediction services. (4) The Merger maintains a threshold table which is updated periodically from the PID module. When a user visit comes, the Merger requests both the myopic and long-term values from RTP, and together with the threshold decides the selection of the current user and the bid adjustment.
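The Merger's ranking step can be sketched as follows, using the pCTR × bid ranking score mentioned above together with a long-term bid-adjustment ratio; the ad records, field names, and numbers are illustrative assumptions, not the production implementation.

```python
def rank_ads(candidates, top_k=2):
    """candidates: list of dicts with predicted CTR ('pctr'), base bid
    ('bid') and a long-term bid adjustment ratio ('ratio'). Ads are ranked
    by score = pctr * adjusted bid, and the top-scored ads are delivered."""
    scored = [(c["pctr"] * c["bid"] * c["ratio"], c["ad_id"])
              for c in candidates]
    scored.sort(reverse=True)
    return [ad_id for _, ad_id in scored[:top_k]]

ads = [
    {"ad_id": "a1", "pctr": 0.03, "bid": 2.0, "ratio": 1.2},  # farsighted boost
    {"ad_id": "a2", "pctr": 0.05, "bid": 1.0, "ratio": 1.0},
    {"ad_id": "a3", "pctr": 0.02, "bid": 1.5, "ratio": 0.8},
]
top = rank_ads(ads)  # scores: a1 = 0.072, a2 = 0.050, a3 = 0.024
```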

C.2 Long-Term Value Prediction Model

Figure 12: Long-Term Value Prediction Model