Jointly Learning to Recommend and Advertise

Online recommendation and advertising are two major income channels for online recommendation platforms (e.g., e-commerce and news feed sites). However, most platforms optimize their recommending and advertising strategies separately, with different teams using different techniques, which may lead to suboptimal overall performance. To this end, we propose a novel two-level reinforcement learning framework that jointly optimizes the recommending and advertising strategies: the first level generates a list of recommendations to optimize user experience in the long run; the second level then inserts ads into the recommendation list so as to balance the immediate advertising revenue from advertisers against the negative influence of ads on long-term user experience. Specifically, the first level tackles the large combinatorial action space problem of selecting a subset of items from the large item space, while the second level handles three internally related tasks: (i) whether to insert an ad and, if so, (ii) which ad to insert and (iii) where to insert it. Experimental results based on real-world data demonstrate the effectiveness of the proposed framework.





1. Introduction

Practical e-commerce and news-feed platforms generally expose a hybrid list of recommended and advertised items (e.g., products, services, or information) to users, where the recommending and advertising algorithms are typically optimized for different metrics (Feng et al., 2018). Recommender systems (RS) capture users’ implicit preferences from historical behaviors (e.g., clicks, ratings and reviews) and generate a set of items that best match those preferences; thus, the RS aims to optimize user experience or engagement. Advertising systems (AS), in contrast, assign the right ad to the right user in the right ad slot to maximize the revenue, click-through rate (CTR) or return on investment (ROI) of advertisers. Optimizing the recommending and advertising algorithms independently may therefore lead to suboptimal overall performance, since exposing more ads to increase advertising revenue harms user experience, and vice versa. Hence, there is an increasing demand for a unified framework that jointly optimizes recommending and advertising so as to optimize the overall performance.

Efforts have been made to display recommended and advertised items together. These approaches treat ads as recommendations and rank all items in a hybrid list to optimize the overall ranking score (Wang et al., 2018a). However, this has two major drawbacks. First, solely maximizing the overall ranking score may result in suboptimal advertising revenue. Second, in the real-time bidding (RTB) environment, the Vickrey-Clarke-Groves (VCG) mechanism is needed to calculate the bid of each ad in the list, and it suffers from many serious practical problems (Rothkopf, 2007). This calls for methods that optimize not only the metrics of the RS and AS separately, but also the overall performance. Moreover, more practical mechanisms such as generalized-second-price (GSP) should be considered to compute the bid of each ad.

Figure 1. An example of rec-ads mixed display for one user request.

To achieve this goal, we propose a two-level framework for rec-ads mixed display. Figure 1 illustrates the high-level idea of how the framework works. Upon a user’s request, the first level (i.e., the RS) generates a list of recommendations (rec-list) according to the user’s historical behaviors, aiming to optimize long-term user experience or engagement. The main challenge in building the first level is the high computational complexity of the combinatorial action space, i.e., selecting a subset of items from the large item space. The second level (i.e., the AS) then inserts ads into the rec-list generated by the first level, and it needs to make three decisions: whether to insert an ad into the given rec-list and, if so, which ad to insert and where. For example, in Figure 1, the AS decides to insert an advertisement between two adjacent items of the rec-list. The optimal ad should jointly maximize the immediate advertising revenue from advertisers in the RTB environment and minimize the negative influence of ads on user experience in the long run. Finally, the target user browses the mixed rec-ads list and provides feedback. According to this feedback, the RS and AS update their policies and generate the mixed rec-ads list for the next iteration.

Most existing supervised-learning-based recommending and advertising methods are designed to maximize the immediate (short-term) reward and suggest items following fixed greedy strategies. They overlook long-term user experience and revenue. We therefore build a two-level reinforcement learning framework for Rec/Ads Mixed display (RAM), which continuously updates its recommending and advertising strategies during interactions with users, and whose optimal strategy is designed to maximize the expected long-term cumulative reward from users. Meanwhile, to effectively leverage users’ historical behaviors collected under other policies, we design an off-policy training approach that can pre-train the framework before it is launched online, so as to reduce the bad user experience in the initial online stage when new algorithms have not yet been well trained. We conduct experiments with real-world data to demonstrate the effectiveness of the proposed framework.

2. Problem Statement

As mentioned in Section 1, we consider the rec/ads mixed display task as a two-level reinforcement learning problem and model it as a Markov Decision Process (MDP), in which the RS and AS sequentially interact with users (the environment $\mathcal{E}$) by generating a sequence of rec-ads hybrid lists over time, so as to maximize the cumulative reward from $\mathcal{E}$. Next, we define the five elements $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ of the MDP.

  • State space $\mathcal{S}$. A state $s_t \in \mathcal{S}$ includes a user’s recommendation and advertisement browsing history before time $t$ and the contextual information of the current request at time $t$. The rec-list generated by the RS is also considered part of the state for the AS.

  • Action space $\mathcal{A}$. $a_t = (a_t^{rs}, a_t^{as}) \in \mathcal{A}$ is the action of the RS and AS, where $a_t^{rs}$ of the RS is to generate a rec-list, and $a_t^{as}$ of the AS is to make three internally related decisions: whether to insert an ad into the current rec-list and, if so, which ad to choose and where in the rec-list to insert it. We denote $\mathcal{A}_t^{rs}$ and $\mathcal{A}_t^{as}$ as the rec and ad candidate sets for time $t$, respectively. Without loss of generality, we assume that the length of any rec-list is fixed and that the AS can insert at most one ad into a rec-list.

  • Reward $\mathcal{R}$. After an action $a_t$ is executed at state $s_t$, the user browses the mixed rec-ads list and provides feedback. The RS and AS receive the immediate rewards $r_t(s_t, a_t^{rs})$ and $r_t(s_t, a_t^{as})$ based on this feedback. We discuss the rewards in more detail in the following sections.

  • Transition probability $\mathcal{P}$. $P(s_{t+1} \mid s_t, a_t)$ is the probability of transitioning from state $s_t$ to $s_{t+1}$ after executing action $a_t$. The MDP is assumed to satisfy the Markov property $P(s_{t+1} \mid s_t, a_t, \ldots, s_1, a_1) = P(s_{t+1} \mid s_t, a_t)$.

  • Discount factor $\gamma$. The discount factor $\gamma \in [0, 1]$ balances current and future rewards: (1) $\gamma = 1$: all future rewards are fully counted in the current action; and (2) $\gamma = 0$: only the immediate reward is counted.
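
The role of the discount factor can be illustrated with a small numeric sketch. The function name `discounted_return` and the reward values are assumptions made for illustration only; they are not part of the framework.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i] over a finite reward sequence."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 2.0, 3.0]  # illustrative per-request rewards
print(discounted_return(rewards, 1.0))  # every future reward counts fully
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
print(discounted_return(rewards, 0.5))  # intermediate trade-off
```

With these toy rewards, the two extreme settings reduce to the plain sum of the sequence and to its first element, respectively.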

Figure 2. The agent-user interactions in MDP.

Figure 2 illustrates the user-agent interactions in the MDP. With the above definitions, the problem can be formally defined as follows: given the historical MDP, i.e., $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, the goal is to find a two-level rec/ads policy $\pi = \{\pi_{rs}, \pi_{as}\}: \mathcal{S} \to \mathcal{A}$ that maximizes the cumulative reward from users, i.e., simultaneously optimizes the user experience and the advertising revenue.

3. The Two-Level Reinforcement Learning Framework

In this section, we discuss the two-level deep reinforcement learning framework for rec/ads mixed display. We first introduce the first-level deep Q-network (i.e., the RS), which generates a list of recommendations (rec-list) according to the user’s historical behaviors; we then propose a novel DQN architecture as the second level (i.e., the AS), which inserts ads into the rec-list generated by the RS. Finally, we discuss how to train the framework on offline data.

3.1. Deep Q-network for Recommendations

Given a user request, the RS returns a list of items according to the user’s historical behaviors, which poses two major challenges: (i) the high computational complexity of the combinatorial action space $\binom{|\mathcal{A}_t^{rs}|}{k}$, i.e., selecting $k$ items from the large item space $\mathcal{A}_t^{rs}$, and (ii) how to approximate the action-value function (Q-function) for a list of items in the two-level reinforcement learning framework. In this subsection, we introduce and enhance a cascaded version of the Deep Q-network to tackle these challenges. We first describe the processing of state and action features, and then present the cascaded Deep Q-network and its optimization algorithm.

3.1.1. The Processing of State and Action Features for RS

As mentioned in Section 2, a state $s_t$ includes a user’s rec/ads browsing history and the contextual information of the current request. The browsing history contains a sequence of recommended items and a sequence of advertised items the user has browsed. Two RNNs with GRUs (gated recurrent units) are used to capture the user’s preferences for recommendations and advertisements separately. The final hidden state of each RNN represents the user’s preference for recommended items, $p_t^{rec}$ (or for ads, $p_t^{ad}$). The contextual information $c_t$ of the current user request includes the app version, operating system (e.g., iOS and Android), feed type (swiping up/down the screen), etc. The state $s_t$ is the concatenation of $p_t^{rec}$, $p_t^{ad}$ and $c_t$:

$$s_t = \mathrm{concat}(p_t^{rec}, p_t^{ad}, c_t) \tag{1}$$

For the transition from $s_t$ to $s_{t+1}$, the recommended and advertised items browsed at time $t$ are appended to the end of the recommendation and advertisement browsing histories, which yields $p_{t+1}^{rec}$ and $p_{t+1}^{ad}$, respectively. The action $a_t^{rs} = \{a_t^{rs}(1), \ldots, a_t^{rs}(k)\}$ is the embedding of the list of $k$ items that will be displayed for the current request. Next, we detail the cascaded Deep Q-network.

Figure 3. The architecture of the cascaded DQN for RS.
Figure 4. (a)(b) Two conventional Deep Q-network architectures. (c) Overview of the proposed DQN Architecture.

3.1.2. The Cascaded DQN for RS

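Recommending a list of items from the large item space is challenging because (i) the combinatorial action space has high computational complexity, and (ii) the order of items in the list also matters (Zhao et al., 2018a). For example, a user may give different feedback on the same item if it is placed at different positions in the list. To resolve these challenges, we leverage a cascaded version of DQN, which generates a list by sequentially selecting items in a cascading manner (Chen et al., 2019). Given state $s_t$, the optimal action is denoted as:

$$a_t^{rs*} = a_t^{rs*}(1{:}k) = \arg\max_{a_t^{rs}} Q^*\big(s_t, a_t^{rs}(1{:}k)\big) \tag{2}$$

The key idea of the cascaded DQN is inspired by the fact that:

$$\max_{a_t^{rs}(1:k)} Q^*\big(s_t, a_t^{rs}(1{:}k)\big) = \max_{a_t^{rs}(1)} \Big( \max_{a_t^{rs}(2:k)} Q^*\big(s_t, a_t^{rs}(1{:}k)\big) \Big) \tag{3}$$

which implies a cascade of mutually consistent Q-functions $Q^{1*}, \ldots, Q^{k*}$:

$$\begin{aligned}
a_t^{rs*}(1) &= \arg\max_{a_t^{rs}(1)} \Big\{ Q^{1*}\big(s_t, a_t^{rs}(1)\big) := \max_{a_t^{rs}(2:k)} Q^*\big(s_t, a_t^{rs}(1{:}k)\big) \Big\} \\
a_t^{rs*}(2) &= \arg\max_{a_t^{rs}(2)} \Big\{ Q^{2*}\big(s_t, a_t^{rs*}(1), a_t^{rs}(2)\big) := \max_{a_t^{rs}(3:k)} Q^*\big(s_t, a_t^{rs}(1{:}k)\big) \Big\} \\
&\;\;\vdots \\
a_t^{rs*}(k) &= \arg\max_{a_t^{rs}(k)} \Big\{ Q^{k*}\big(s_t, a_t^{rs*}(1{:}k-1), a_t^{rs}(k)\big) := Q^*\big(s_t, a_t^{rs}(1{:}k)\big) \Big\}
\end{aligned} \tag{4}$$

By applying these functions in a cascading fashion, we reduce the computational complexity of obtaining the optimal action from $O\big(\binom{|\mathcal{A}_t^{rs}|}{k}\big)$ to $O(k\,|\mathcal{A}_t^{rs}|)$. The RS can then select $k$ items sequentially following the above equations. Note that items already recommended in the current recommendation session are excluded from being recommended again. Next, we detail how to estimate the cascaded Q-functions $\{Q^{j*} \mid j \in [1, k]\}$.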
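
The cascading selection described above can be sketched as a greedy loop over list positions. This is a minimal sketch under stated assumptions: `stage_q` is a hypothetical stand-in for the j-th cascaded Q-function (the state is omitted), and the toy scores are invented for illustration.

```python
def cascaded_select(candidates, k, stage_q):
    """Greedily build a k-item list, one position at a time.

    stage_q(j, prefix, item) is a hypothetical stand-in for the j-th
    cascaded Q-function evaluated on prefix + [item].
    """
    remaining = list(candidates)
    if k > len(remaining):
        raise ValueError("k cannot exceed the number of candidates")
    chosen = []
    evaluations = 0
    for j in range(1, k + 1):
        best_item, best_value = None, float("-inf")
        for item in remaining:
            value = stage_q(j, chosen, item)
            evaluations += 1
            if value > best_value:
                best_item, best_value = item, value
        chosen.append(best_item)
        remaining.remove(best_item)  # never recommend the same item twice
    return chosen, evaluations


# Toy scores; later positions are down-weighted, so the order matters.
scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 0.7}


def toy_stage_q(j, prefix, item):
    return scores[item] / j


rec_list, n_eval = cascaded_select(scores, k=2, stage_q=toy_stage_q)
print(rec_list, n_eval)
```

Each position costs at most one Q evaluation per remaining candidate, so the total number of evaluations is bounded by k times the size of the candidate set rather than by the number of possible lists.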


3.1.3. The Estimation of Cascaded Q-functions

Figure 3 illustrates the architecture of the cascaded DQN, where the two input sequences are users’ rec and ads browsing histories. The original model in (Chen et al., 2019) uses $k$ layers to process $k$ items separately without efficient weight sharing, which is crucial for handling large action sizes (Song et al., 2019). To address this, we replace the $k$ separate layers with an RNN with GRU, where the input of the $j$-th RNN unit is the feature of the $j$-th item in the list, and the final hidden state of the RNN is taken as the representation of the list. Since all RNN units share the same parameters, the framework accommodates any action size $k$.

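To ensure that the cascaded DQN selects the optimal action, i.e., a sequence of $k$ optimal sub-actions, the functions $\{Q^{j*} \mid j \in [1, k]\}$ should satisfy the following constraints:

$$Q^{j*}\big(s_t, a_t^{rs*}(1{:}j)\big) = Q^*\big(s_t, a_t^{rs*}(1{:}k)\big), \quad \forall j \in [1, k] \tag{5}$$

i.e., the optimal value of $Q^{j*}$ should be equivalent to $Q^*$ for all $j$. The cascaded DQN enforces these constraints in a soft and approximate way, with loss functions defined as:

$$\Big( y_t^{rs} - Q^j\big(s_t, a_t^{rs}(1{:}j)\big) \Big)^2, \quad \text{where } y_t^{rs} = r_t(s_t, a_t^{rs}) + \gamma\, Q^*\big(s_{t+1}, a_{t+1}^{rs*}(1{:}k)\big), \quad \forall j \in [1, k] \tag{6}$$

i.e., all $k$ functions $Q^j$ fit against the same target $y_t^{rs}$. We then update the parameters of the cascaded DQN by performing gradient steps over the above loss. We detail the reward function $r_t(s_t, a_t^{rs})$ in the following subsections. Next, we introduce the second-level DQN for advertising.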
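
The point that every cascaded Q-function is regressed toward one shared target can be sketched numerically as follows. The function name `cascaded_losses` and all values are illustrative assumptions; in practice the Q estimates would come from the network rather than from a list of constants.

```python
def cascaded_losses(q_values, reward, gamma, next_q_star):
    """Squared TD error of each cascaded Q^j against one shared target.

    q_values[j - 1] is the current estimate of Q^j for the first j items;
    next_q_star is the value of the best full list at the next state.
    """
    target = reward + gamma * next_q_star  # the same target for every j
    return [(target - q) ** 2 for q in q_values]


losses = cascaded_losses(q_values=[1.0, 1.5, 2.0], reward=0.5,
                         gamma=0.5, next_q_star=3.0)
print(losses)
```

Here the shared target equals 2.0, so the estimate that already matches it contributes zero loss while the others are pulled toward it.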

3.2. Deep Q-network for Online Advertising

As mentioned in Section 1, building the advertising system (AS) is challenging. First, the AS needs to make three decisions, i.e., whether, which and where to insert. Second, these three decisions are interdependent, so traditional DQN architectures cannot be directly applied; for example, only when the AS decides to insert an ad does it also need to determine which ad to insert and where. Third, the AS needs not only to maximize ad income but also to minimize the negative influence on user experience. To tackle these challenges, we next detail a novel Deep Q-network architecture.

3.2.1. The Processing of State and Action Features for AS

We leverage the same architecture as in Section 3.1.1 to obtain the state $s_t$. Furthermore, since the task of the AS is to insert an ad into a given rec-list, the output of the first-level DQN, i.e., the current rec-list $a_t^{rs}$, is also considered part of the state for the AS. For the action $a_t^{as} = (a_t^{ad}, a_t^{loc})$ of the AS, $a_t^{ad}$ is the embedding of a candidate ad. Given a rec-list of $k$ items, there exist $k+1$ possible locations. Thus, we use a one-hot vector $a_t^{loc} \in \mathbb{R}^{k+1}$ to indicate the location at which the selected ad is inserted.

3.2.2. The Proposed DQN Architecture

Given state $s_t$ and rec-list $a_t^{rs}$, the action of the AS, $a_t^{as}$, contains three sub-actions: (i) whether to insert an ad into rec-list $a_t^{rs}$ and, if so, (ii) which ad is best and (iii) where the optimal location is. Note that in this work we assume that the AS can insert at most one ad into a given rec-list. We next discuss the limitations of directly applying the two classic Deep Q-network (DQN) architectures shown in Figures 4(a) and (b) to this task. The architecture in Figure 4(a) takes only the state ($s_t$ and $a_t^{rs}$) as input and outputs the Q-values of all $k+1$ possible locations. This DQN can select the optimal location, but it cannot choose the optimal ad to insert. The architecture in Figure 4(b) takes a state-action pair as input and outputs the Q-value for a specific ad. This DQN can determine the optimal ad but cannot determine where to insert it. One solution is to add the location information (e.g., a one-hot vector) to the input. However, this requires one evaluation per ad-location pair, i.e., $O(k \cdot |\mathcal{A}_t^{as}|)$ evaluations, to find the optimal Q-value, which is not practical in real-world advertising systems. Note that neither of the two classic DQNs can decide whether to insert an ad into a given rec-list.

To address the above challenges, we propose a novel DQN architecture for online advertising in a given rec-list $a_t^{rs}$, as shown in Figure 4(c). It builds on the two classic DQN architectures. The proposed DQN takes the state $s_t$ (including $a_t^{rs}$) and a candidate ad $a_t^{ad}$ as input, and outputs the Q-values for all $k+1$ locations. This architecture inherits the advantages of both classic DQNs: it can evaluate the Q-values of combinations of two internally related types of sub-actions at the same time. In this paper, we evaluate the Q-values of all possible locations for an ad simultaneously. To determine whether to insert an ad (the first sub-action), we extend the output layer from $k+1$ to $k+2$ units, where the additional unit corresponds to the Q-value of not inserting an ad into rec-list $a_t^{rs}$. Therefore, the proposed DQN can simultaneously determine the three aforementioned sub-actions according to the Q-values of the ad-location combinations $(a_t^{ad}, a_t^{loc})$; when the no-insertion unit yields the maximal Q-value, the AS inserts no ad into rec-list $a_t^{rs}$, where a zero vector represents inserting no ad.

More details of the proposed DQN architecture are illustrated in Figure 5. First, whether to insert an ad into a given rec-list is affected by the state $s_t$ and the rec-list $a_t^{rs}$ (especially the quality of the rec-list). For example, if a user is satisfied with a rec-list, the AS may prefer to insert an ad into it; conversely, if a user is unsatisfied with a rec-list and is likely to leave, the AS will not insert an ad. Second, the reward for an ad-location pair is related to all of the available information. Thus, we divide the Q-function into a value function $V(s_t)$, related to $s_t$ and $a_t^{rs}$, and an advantage function $A(s_t, a_t^{ad})$, determined by $s_t$, $a_t^{rs}$ and $a_t^{ad}$ (Wang et al., 2015).
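
How a value term and per-location advantages could be combined and then searched over ads and locations is sketched below. This is a simplified illustration under stated assumptions: the Q-value is taken as the plain sum of value and advantage (the mean-subtracted variant of Wang et al. is another common choice), index 0 is assumed to denote "insert no ad", and the function names and numbers are hypothetical.

```python
def as_q_values(value, advantages):
    """Combine a scalar V and per-location advantages A into Q-values.

    advantages has k + 2 entries for one candidate ad: index 0 stands for
    "insert no ad", indices 1..k+1 for the k + 1 slots of a k-item rec-list.
    """
    return [value + a for a in advantages]


def best_ad_location(value, advantages_per_ad):
    """Return (ad_id, location, q) with the largest Q over ads and locations."""
    best = (None, 0, float("-inf"))
    for ad_id, advantages in advantages_per_ad.items():
        for loc, q in enumerate(as_q_values(value, advantages)):
            if q > best[2]:
                best = (ad_id, loc, q)
    ad_id, loc, q = best
    # Location 0 means no ad is inserted, whichever ad produced that score.
    return (None, 0, q) if loc == 0 else (ad_id, loc, q)


k = 2  # rec-list length, so each ad has k + 2 = 4 outputs
advantages_per_ad = {
    "ad_1": [0.0, 0.25, -0.5, 0.5],
    "ad_2": [0.0, 0.75, 0.25, -0.25],
}
print(best_ad_location(1.0, advantages_per_ad))
```

A single forward pass per candidate ad yields the scores of all locations at once, which is the property that distinguishes this output layout from scoring each ad-location pair separately.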

3.2.3. The Action Selection in RTB Setting

In the real-time bidding environment, advertisers bid on each ad slot in real time as soon as an impression is generated by a consumer visit (Cai et al., 2017). In other words, given an ad slot, the specific ad to display is determined by the bids from advertisers, i.e., by the bidding system (BS), rather than by the platform; the BS aims to maximize the immediate advertising revenue of each ad slot. In this paper, as mentioned in Section 1, the optimal ad selection policy should not only maximize the immediate advertising revenue (controlled by the BS), but also minimize the negative influence of ads on user experience in the long run (controlled by the AS). To achieve this goal, the AS first calculates the Q-values for all candidate ads and all possible locations, referred to as $Q(s_t, \mathcal{A}_t^{as})$, which captures the long-term influence of ads on user experience; the BS then selects the ad that achieves the trade-off between the immediate advertising revenue and the long-term Q-values:

$$a_t^{as} = \mathrm{BS}\big(Q(s_t, \mathcal{A}_t^{as})\big) \tag{7}$$

where the operation $Q(s_t, \mathcal{A}_t^{as})$ goes through all candidate ads (input layer) and all locations (output layer), including the location that represents not inserting an ad. We introduce more details about the bidding system (BS) in the following sections.

It is noteworthy that we maximize immediate rather than long-term advertising revenue because: (i) as mentioned above, the ad to insert is determined by advertisers rather than by the platform (the action is not generated by the agent); and (ii) in the generalized-second-price (GSP) setting, the highest bidder pays the price bid by the second-highest bidder (the immediate advertising revenue). If the immediate advertising revenue were used as the reward $r_t(s_t, a_t^{as})$, the resulting $Q(s_t, a_t^{as})$ would represent long-term advertising revenue, yet the ad cannot be selected according to that Q-value.

3.3. The Optimization Task

Given a user request and state $s_t$, the RS and AS sequentially generate actions $a_t^{rs}$ and $a_t^{as}$, i.e., a rec-ads hybrid list, and the target user then browses the list and provides feedback. The two-level framework aims to (i) optimize the long-term user experience or engagement with recommended items (RS), (ii) maximize the immediate advertising revenue from advertisers in the RTB environment (BS), and (iii) minimize the negative influence of ads on long-term user experience (AS). The second sub-goal is achieved automatically by the bidding system, i.e., the advertiser with the highest bid wins the ad slot auction. Therefore, we next design reward functions that help the RL components of the framework achieve the first and third goals.

Figure 5. The architecture of the proposed DQN for AS.

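The framework is general enough to cover rec-ads mixed display applications on e-commerce, news feed and video platforms. We therefore design different reward functions for different types of platforms. For the first-level DQN (RS), to evaluate the user experience, we have

$$r_t(s_t, a_t^{rs}) = \begin{cases} \text{income} & \text{e-commerce platforms} \\ \text{dwell time} & \text{news/video platforms} \end{cases} \tag{8}$$

where user experience is measured by the income from the recommended items in the hybrid list on e-commerce platforms, and by the dwell time on the recommendations on news/video platforms. Based on the reward function $r_t(s_t, a_t^{rs})$, we update the parameters of the cascaded DQN (RS) by performing gradient steps over the loss in Eq (6). We introduce separate evaluation and target networks (Mnih et al., 2013) to help smooth the learning and avoid the divergence of parameters, where $\theta^{rs}$ represents all parameters of the evaluation network, and the parameters $\theta_T^{rs}$ of the target network are fixed when optimizing the loss function. The derivatives of the loss function with respect to the parameters $\theta^{rs}$ are as follows:

$$\nabla_{\theta^{rs}} \Big( y_t^{rs} - Q^j\big(s_t, a_t^{rs}(1{:}j); \theta^{rs}\big) \Big)^2 = -2\, \Big( y_t^{rs} - Q^j\big(s_t, a_t^{rs}(1{:}j); \theta^{rs}\big) \Big) \nabla_{\theta^{rs}} Q^j\big(s_t, a_t^{rs}(1{:}j); \theta^{rs}\big) \tag{9}$$

where the target is $y_t^{rs} = r_t(s_t, a_t^{rs}) + \gamma\, Q^*\big(s_{t+1}, a_{t+1}^{rs*}(1{:}k); \theta_T^{rs}\big)$, $\forall j \in [1, k]$.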

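For the second-level DQN (AS), since users leaving the platform is the major risk of inserting ads improperly or too frequently, we evaluate user experience by whether the user leaves the platform after browsing the current rec-ads hybrid list, and we have:

$$r_t(s_t, a_t^{as}) = \begin{cases} \text{positive reward} & \text{if the user continues} \\ \text{lower reward} & \text{if the user leaves} \end{cases} \tag{10}$$

In other words, the AS receives a positive reward if the user continues to browse the next list, and a lower reward otherwise. The optimal $Q^*(s_t, a_t^{as})$, i.e., the maximum expected return achieved by the optimal policy, then follows the Bellman equation (Bellman, 2013):

$$Q^*(s_t, a_t^{as}) = \mathbb{E}_{s_{t+1}} \Big[ r_t(s_t, a_t^{as}) + \gamma\, Q^*\big(s_{t+1}, \mathrm{BS}(Q^*(s_{t+1}, \mathcal{A}_{t+1}^{as}))\big) \,\Big|\, s_t, a_t^{as} \Big] \tag{11}$$

and the second-level DQN can be optimized by minimizing the loss function:

$$L(\theta^{as}) = \mathbb{E}_{s_t, a_t^{as}, r_t, s_{t+1}} \Big( y_t^{as} - Q(s_t, a_t^{as}; \theta^{as}) \Big)^2 \tag{12}$$

where $y_t^{as} = r_t(s_t, a_t^{as}) + \gamma\, Q\big(s_{t+1}, \mathrm{BS}(Q(s_{t+1}, \mathcal{A}_{t+1}^{as}; \theta_T^{as})); \theta_T^{as}\big)$ is the target of the current iteration. We again introduce separate evaluation and target networks (Mnih et al., 2013), with parameters $\theta^{as}$ and $\theta_T^{as}$, for the second-level DQN (AS), and $\theta_T^{as}$ is fixed when optimizing the loss function in Eq (12), i.e., $L(\theta^{as})$. The derivatives of the loss function w.r.t. the parameters $\theta^{as}$ are:

$$\nabla_{\theta^{as}} L(\theta^{as}) = -2\, \mathbb{E}_{s_t, a_t^{as}, r_t, s_{t+1}} \Big[ \big( y_t^{as} - Q(s_t, a_t^{as}; \theta^{as}) \big) \nabla_{\theta^{as}} Q(s_t, a_t^{as}; \theta^{as}) \Big] \tag{13}$$

with the same target $y_t^{as}$ as in Eq (12). The operation $Q(s_{t+1}, \mathcal{A}_{t+1}^{as})$ looks through the candidate ad set and all locations (including the location of inserting no ad).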

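Input: historical offline logs, replay buffer $\mathcal{D}$
Output: well-trained recommending policy $\pi_{rs}^*$ and advertising policy $\pi_{as}^*$

1:  Initialize the capacity of replay buffer $\mathcal{D}$
2:  Initialize the action-value functions of the RS and AS with random weights
3:  for each session do
4:     Initialize state $s_0$
5:     for each request $t$ in the session do
6:         Observe state $s_t$
7:         Execute actions $a_t^{rs}$ and $a_t^{as}$ sequentially according to the off-policy $b(s_t)$
8:         Observe rewards $r_t(s_t, a_t^{rs})$ and $r_t(s_t, a_t^{as})$ from the offline log
9:         Update the state from $s_t$ to $s_{t+1}$
10:         Store transition $(s_t, a_t^{rs}, a_t^{as}, r_t(s_t, a_t^{rs}), r_t(s_t, a_t^{as}), s_{t+1})$ into the replay buffer $\mathcal{D}$
11:         Sample a minibatch of transitions from the replay buffer $\mathcal{D}$
12:         Generate RS’s next action $a_{t+1}^{rs}$ according to Eq.(4)
13:         Generate AS’s next action $a_{t+1}^{as}$ according to Eq.(7)
14:         Set the RS target $y_t^{rs} = r_t(s_t, a_t^{rs}) + \gamma\, Q^*(s_{t+1}, a_{t+1}^{rs})$
15:         Update parameters $\theta^{rs}$ of the RS by minimizing the loss in Eq.(6) according to Eq.(9)
16:         Set the AS target $y_t^{as} = r_t(s_t, a_t^{as}) + \gamma\, Q(s_{t+1}, a_{t+1}^{as}; \theta_T^{as})$
17:         Update parameters $\theta^{as}$ of the AS by minimizing the loss in Eq.(12) according to Eq.(13)
18:     end for
19:  end for
Algorithm 1 Off-policy Training of the RAM Framework.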

3.4. Off-policy Training

Training the two-level reinforcement learning framework requires a large amount of user-system interaction data, and collecting it online may result in bad user experience in the initial stage, when new algorithms have not yet been well trained. To address this challenge, we propose an off-policy training approach that effectively utilizes logs of users’ historical behaviors collected under other policies. The offline log records the interaction history between the behavior policy $b(s_t)$ (the current recommendation and advertising strategies) and users’ feedback. Our RS and AS take actions based on the off-policy $b(s_t)$ and obtain feedback from the offline log. The off-policy training procedure is presented in detail in Algorithm 1.

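| Metrics | Values | W&D | DFM | GRU | DRQN | RAM-l | RAM-n |
|---|---|---|---|---|---|---|---|
| Session dwell time | value | 17.61±0.16 | 17.95±0.19 | 18.56±0.21 | 18.99±0.18 | 19.61±0.23 | 19.49±0.16 |
| | improv.(%) | 11.35 | 9.25 | 5.66 | 3.26 | - | 0.61 |
| | p-value | 0.000 | 0.000 | 0.000 | 0.000 | - | 0.006 |
| Session length | value | 8.79±0.06 | 8.90±0.07 | 9.29±0.09 | 9.37±0.10 | 9.76±0.09 | 9.68±0.06 |
| | improv.(%) | 11.03 | 9.66 | 5.06 | 4.16 | - | 0.83 |
| | p-value | 0.000 | 0.000 | 0.000 | 0.000 | - | 0.009 |
| Session ad revenue | value | 1.07±0.03 | 1.13±0.02 | 1.23±0.04 | 1.34±0.03 | 1.49±0.06 | 1.56±0.07 |
| | improv.(%) | 45.81 | 38.05 | 26.83 | 16.42 | 4.70 | - |
| | p-value | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 | - |

Table 1. Performance comparison.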

There are two phases in each iteration of a training session. In the transition generation phase, for the state $s_t$ (line 6), the RS and AS sequentially take actions $a_t^{rs}$ and $a_t^{as}$ based on the behavior policy $b(s_t)$ (line 7) in a standard off-policy way (Degris et al., 2012); the RS and AS then receive the rewards $r_t(s_t, a_t^{rs})$ and $r_t(s_t, a_t^{as})$ from the offline log (line 8) and update the state to $s_{t+1}$ (line 9); finally, the RS and AS store the transition in the replay buffer $\mathcal{D}$ (line 10). In the model training phase, the proposed framework first samples a minibatch of transitions from $\mathcal{D}$ (line 11), then generates the actions $a_{t+1}^{rs}$ and $a_{t+1}^{as}$ of the next iteration according to Eqs.(4) and (7) (lines 12-13), and finally updates the parameters of the RS and AS by minimizing Eqs.(6) and (12) (lines 14-17). To help avoid the divergence of parameters and smooth the training, we introduce separate evaluation and target Q-networks (Mnih et al., 2013). Note that when $b(s_t)$ decides not to insert an ad (line 7), we denote $a_t^{as}$ as an all-zero vector.
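
The replay buffer used in the two phases above can be sketched as a fixed-capacity store with uniform minibatch sampling. This is a minimal sketch, not the authors' implementation: the class name `ReplayBuffer`, the tuple layout, and the use of `None` in place of the all-zero "no ad" action are assumptions made for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of logged transitions; oldest entries drop first."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, transition):
        self._items.append(transition)

    def sample(self, batch_size):
        # Sample without replacement, capped at the current buffer size.
        size = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), size)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    # (state, rs_action, as_action, rs_reward, as_reward, next_state)
    buffer.store((f"s{t}", f"rec{t}", None, 1.0, 1.0, f"s{t + 1}"))
batch = buffer.sample(2)
print(len(buffer), sorted(tr[0] for tr in batch))
```

Because the capacity is 3, only the three most recent transitions remain after five insertions, and every sampled minibatch is drawn from those.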

1:  Initialize the action-value functions of the RS and AS with well-trained weights
2:  for each session do
3:     Initialize state $s_0$
4:     for each request $t$ in the session do
5:         Observe state $s_t$
6:         Generate action $a_t^{rs}$ according to Eq.(4)
7:         Generate action $a_t^{as}$ according to Eq.(7)
8:         Execute actions $a_t^{rs}$ and $a_t^{as}$
9:         Observe rewards $r_t(s_t, a_t^{rs})$ and $r_t(s_t, a_t^{as})$ from the user
10:         Update the state from $s_t$ to $s_{t+1}$
11:     end for
12:  end for
Algorithm 2 Online Test of the RAM Framework.

3.5. Online Test

We present the online test procedure in Algorithm 2. The process is similar to the transition generation stage of Algorithm 1. We next detail each iteration of a test session. First, the well-trained RS generates a rec-list following its learned policy (line 6) according to the current state $s_t$ (line 5). Second, the well-trained AS, working collaboratively with the BS, decides whether to insert an ad into the rec-list following its learned policy (line 7). Third, the rewards are observed from the target user’s response to the hybrid list of recommended and advertised items (lines 8 and 9). Finally, the state transitions to $s_{t+1}$ (line 10).

| Statistic | Value |
|---|---|
| session | 1,000,000 |
| user | 188,409 |
| normal video | 17,820,066 |
| ad video | 10,806,778 |
| session dwell time | 17.980 min |
| session length | 55.032 videos |
| session ad revenue | 0.667 |
| rec-list with ad | 55.23% |

Table 2. Statistics of the dataset.

4. Experiment

In this section, we will conduct extensive experiments using data from a short video site to assess the effectiveness of the proposed RAM framework. We first introduce the experimental settings, then compare the RAM framework with state-of-the-art baselines, and finally conduct component and parameter analysis on RAM.

4.1. Experimental Settings

Since no public dataset contains both recommended and advertised items, we collected a dataset from a short video site, TikTok, in March 2019. In total, we collected 1,000,000 sessions in chronological order, where the first 70% are used as the training/validation set and the remaining 30% as the test set. More statistics of the dataset are given in Table 2. There are two types of videos in the dataset: regular videos (recommended items) and ad videos (advertised items). The features of a regular video are: id, like score, finish score, comment score, follow score and group score, where the scores are predicted by the platform. The features of an ad video are: id, image size, bid-price, hidden-cost, predicted-ctr and predicted-recall, where the last four are predicted by the platform. It is worth noting that (i) the effectiveness of the calculated features has been validated in the business of the short video site, (ii) we discretize each numeric feature into a one-hot vector, and (iii) the baselines are based on the same features for a fair comparison.
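
Discretizing a numeric feature into a one-hot vector, as in point (ii), can be sketched as follows. The bucket boundaries and the function name `one_hot_bucket` are assumptions for illustration; the actual bucketing scheme used on the dataset is not specified here.

```python
from bisect import bisect_right


def one_hot_bucket(value, boundaries):
    """Map a numeric feature to a one-hot vector over len(boundaries) + 1 bins."""
    index = bisect_right(boundaries, value)  # bin that contains the value
    vector = [0] * (len(boundaries) + 1)
    vector[index] = 1
    return vector


boundaries = [0.25, 0.5, 0.75]  # illustrative cut points for a score in [0, 1]
print(one_hot_bucket(0.6, boundaries))
print(one_hot_bucket(0.1, boundaries))
```

Values below the first boundary fall into the first bin and values at or above the last boundary fall into the final bin, so every input maps to exactly one active position.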

Figure 6. Performance comparison of different variants

4.2. Evaluation Metrics

The reward used to evaluate the user experience of a list of regular videos is the dwell time (min), and the reward of ad videos depends on whether users leave the site or continue to browse. We use the session dwell time, session length, and session ad revenue as metrics to measure the performance of a test session.

4.3. Architecture Details

Next, we detail the architecture of RAM to ease reproducibility. The number of candidate regular/ad videos (selected by external recall systems) for a request is and respectively, and the size of regular/ad video representation is . There are regular videos in a rec-list. The initial state of a session is collected from its first three requests, and the dimensions of are , respectively. For the second-level DQN (AS), two separate 2-layer neural networks are used to generate the value function and the advantage function, respectively, where the output layer has $k+2$ units, i.e., $k+2$ possible ad locations including not inserting an ad. We empirically set the size of the replay buffer to 10,000, and the discount factor of the MDP . We select important hyper-parameters such as and via cross-validation, and we tune the parameters of the baselines for a fair comparison. In the following subsections, we present more details of the parameter sensitivity of the RAM framework.

4.4. Advertisement Selection

The advertising system (AS) and bidding system (BS) work collaboratively to insert an ad (or not) into a given rec-list, i.e., Equation (7). We design two AS+BS approaches as follows:

  • RAM-l: the optimal ad-location pair directly optimizes the linear summation of immediate advertising revenue and long-term user experience:

    $$a_t^{as} = \arg\max_{a_t^{as}} \big( Q(s_t, a_t^{as}) + \alpha \cdot rev_t(a_t^{as}) \big)$$

    where $\alpha$ weights the second term, and $rev_t(a_t^{as})$ is the immediate advertising revenue if an ad is inserted, and 0 otherwise;

  • RAM-n: a nonlinear approach in which the AS first selects a subset of ad-location pairs (of size $N$) with the best long-term user experience $Q(s_t, a_t^{as})$, and the BS then chooses from this subset the pair with the maximal immediate advertising revenue $rev_t(a_t^{as})$.
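
The two selection rules can be contrasted on a toy candidate set. This is a hedged sketch: the function names `ram_l` and `ram_n`, the parameter names `alpha` and `n`, and all Q-values and revenues are illustrative assumptions rather than values from the experiments.

```python
def ram_l(pairs, alpha):
    """Pick the pair maximising Q + alpha * revenue (linear trade-off)."""
    return max(pairs, key=lambda p: p["q"] + alpha * p["rev"])


def ram_n(pairs, n):
    """Keep the n pairs with the highest Q, then pick the highest revenue."""
    shortlist = sorted(pairs, key=lambda p: p["q"], reverse=True)[:n]
    return max(shortlist, key=lambda p: p["rev"])


# Each entry is one ad-location pair; "none" is the no-insertion option
# and therefore earns zero immediate revenue.
pairs = [
    {"name": "none", "q": 2.0, "rev": 0.0},
    {"name": "ad_1@loc_2", "q": 1.5, "rev": 1.0},
    {"name": "ad_2@loc_1", "q": 1.0, "rev": 4.0},
]
print(ram_l(pairs, alpha=0.1)["name"])  # small weight on revenue
print(ram_l(pairs, alpha=1.0)["name"])  # large weight on revenue
print(ram_n(pairs, n=2)["name"])        # revenue decides within the top-2 by Q
```

In this toy setting a small weight keeps the no-insertion option, a large weight favors the high-revenue ad, and the shortlist rule inserts an ad as soon as any revenue-bearing pair survives the Q-based filter.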

4.5. Overall Performance Comparison

The experiment is based on a simulated online environment, which provides the dwell time, the leave/continue signal, and the ad revenue for a given mixed rec-ads list. The simulator shares a similar architecture to Figure 5, while its output layer predicts the dwell time, whether the user will leave, and the ad revenue of the current mixed rec-ads list. We compare the proposed framework with the following representative baseline methods:

  • W&D (Cheng et al., 2016): This baseline jointly trains a wide linear model with feature transformations and a deep feedforward neural network with embeddings for general recommender systems with sparse inputs. One W&D estimates the recommending scores of regular videos, and each time we recommend the videos with the highest scores, while another W&D predicts whether to insert an ad and estimates the CTR of ads.

  • DFM (Guo et al., 2017): DeepFM is a deep model that combines the W&D model with a factorization machine (FM). It models high-order feature interactions like W&D and low-order interactions like FM. Similar to W&D, we develop two DeepFMs for the RS and AS.

  • GRU (Hidasi et al., 2015): GRU4Rec is an RNN with GRU that predicts what the user will click next according to her/his behavior history. We also develop two such networks for the RS and AS, respectively.

  • DRQN (Hausknecht and Stone, 2015): Deep Recurrent Q-Networks address the partial observation problem by considering the previous context with a recurrent structure. DRQN uses an RNN architecture to encode the observations before the current time. Two DRQNs are developed for the RS and AS, respectively.

The results are shown in Table 1. We make the following observations: (1) GRU performs better than W&D and DFM, since W&D and DFM neglect users’ sequential behaviors within a session, while GRU can capture the sequential patterns. (2) DRQN outperforms GRU, since DRQN aims to maximize the long-term rewards of a session, while GRU targets the immediate reward of each request. This result demonstrates the advantage of introducing RL for online recommendation and advertising. (3) RAM-l and RAM-n achieve better performance than DRQN, which validates the effectiveness of the proposed two-level DQN framework, where the RS generates a rec-list and the AS decides how to insert ads. (4) RAM-n outperforms RAM-l in session ad revenue, since the second step of RAM-n selects the ad-location pair with maximal immediate advertising revenue, which gives a higher probability of inserting ads. To sum up, RAM outperforms representative baselines, which demonstrates its effectiveness in online recommendation and advertising.

4.6. Component Study

To understand the impact of model components of RAM, we systematically eliminate the corresponding components of RAM by defining the following variants:

  • RAM-1: This variant has the same neural network architectures as the RAM framework, but we train it in a supervised learning manner.

  • RAM-2: In this variant, we evaluate the contribution of recurrent neural networks, so we replace RNNs by fully-connected layers. Specifically, we concatenate recommendations or ads into one vector and then feed it into fully-connected layers.

  • RAM-3: In this variant, we use the original cascaded DQN architecture in (Chen et al., 2019) as RS.

  • RAM-4: For this variant, we do not divide the Q-function of AS into the value function and the advantage function.

  • RAM-5: This variant leverages an additional input to represent the location, and uses the DQN in Figure 4(b) for AS.

The results are shown in Figure 6. By comparing RAM and its variants, we make the following observations: (1) RAM-1 demonstrates the advantage of reinforcement learning over supervised learning for jointly optimizing recommendation and online advertising; (2) RAM-2 validates that capturing users’ sequential behaviors can enhance the performance; (3) RAM-3 supports the effectiveness of the RNN over separate layers for larger action spaces; (4) RAM-4 suggests that dividing the Q-function of AS into the value function and the advantage function can boost the performance; (5) RAM-5 validates the advantage of the proposed AS architecture (over classic DQN architectures), which inputs a candidate ad and outputs the Q-values for all possible locations. In summary, leveraging a suitable RL policy and proper neural network components can improve the overall performance.
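The two AS design choices examined by RAM-4 and RAM-5 can be illustrated with a small sketch: a Q-value formed as a state value plus a per-location advantage, computed for one candidate ad at a time and for all locations at once. This is a toy illustration under stated assumptions, not the network used in the paper: the numbers are invented, the function names are hypothetical, location 0 is assumed to mean "do not insert an ad", and the mean-advantage subtraction used in the dueling-network literature is omitted for simplicity.

```python
def q_values_for_ad(state_value, advantages):
    """Combine a scalar state value V with one advantage A per location: Q = V + A."""
    return [state_value + adv for adv in advantages]


def choose_ad_and_location(state_value, advantages_per_ad):
    """Score every candidate ad at every location and return the best (ad, location, Q).

    advantages_per_ad maps a candidate ad id to a list with one advantage per location.
    Location 0 stands for "do not insert an ad" (an assumption of this sketch).
    """
    best = None
    for ad_id, advantages in advantages_per_ad.items():
        for location, q in enumerate(q_values_for_ad(state_value, advantages)):
            if best is None or q > best[2]:
                best = (ad_id, location, q)
    return best


# Hypothetical advantages for two candidate ads over three locations.
candidates = {
    "ad_a": [0.0, 0.4, -0.1],
    "ad_b": [0.0, 0.2, 0.9],
}
ad_id, location, q = choose_ad_and_location(1.0, candidates)
insert_ad = location != 0  # all three AS decisions follow from a single argmax
```

Because each forward pass covers every location for a given ad, the three decisions (whether to insert, which ad, which location) reduce to one maximization over the resulting Q-values.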

4.7. Parameter Sensitivity Analysis

Our method has two key hyper-parameters, i.e., (i) the parameter of RAM-l, and (ii) the parameter of RAM-n. To study their sensitivities, we fix the other parameters and investigate how the RAM framework performs as each of these two parameters changes.

Figure 7(a) illustrates the sensitivity of the RAM-l parameter. We observe that as it increases, one of the two metrics improves while the other decreases. This observation is reasonable because when we decrease the importance of the second term of Equation (14), the AS will insert more ads or choose the ads likely to bring more revenue, while ignoring their negative impact on regular recommendations. Figure 7(b) shows the sensitivity of the RAM-n parameter. As it increases, the ad revenue improves and the user-experience metric decreases. With a smaller value, the first step of RAM-n tends to select mostly ad-location pairs that do not insert an ad, which results in lower ad revenue and better user experience; on the contrary, with a larger value, the first step returns more pairs with non-zero ad revenue, and the second step then leads to higher ad revenue. In a nutshell, both results demonstrate that recommended and advertised items influence each other: inserting more ads can lead to more ad revenue but worse user experience, and vice versa. Therefore, online platforms should carefully select these hyper-parameters according to their business demands.
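The trade-off described above can be made concrete with a toy sweep. The sketch below is not Equation (14) and does not use the paper's parameters: it assumes a hypothetical linear score `ad_revenue + weight * user_experience_effect`, and the three options and their numbers are invented for illustration.

```python
# Hypothetical options for one request: (label, ad_revenue, user_experience_effect).
OPTIONS = [
    ("no_ad", 0.0, 0.0),
    ("cheap_ad", 1.0, -0.5),
    ("premium_ad", 3.0, -4.0),
]


def pick_option(weight):
    """Pick the option maximizing ad_revenue + weight * user_experience_effect."""
    return max(OPTIONS, key=lambda option: option[1] + weight * option[2])


def sweep(weights):
    """Return {weight: chosen option} for each candidate weight."""
    return {weight: pick_option(weight) for weight in weights}


# A larger weight puts more emphasis on user experience.
results = sweep([0.0, 1.0, 3.0])
```

With these made-up numbers, a weight of 0 selects the highest-revenue ad, a weight of 1 selects the less intrusive ad, and a weight of 3 inserts no ad at all, mirroring the observation that more ad revenue comes at the cost of user experience.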

Figure 7. Parameter sensitivity analysis

5. Related Work

In this section, we briefly summarize the works related to our study, which can be mainly grouped into the following categories.

The first category related to this paper is reinforcement learning-based recommender systems. A DDPG algorithm is used to mitigate the large action space problem in real-world RL-based RS (Dulac-Arnold et al., 2015). A tree-structured policy gradient is presented in (Chen et al., 2018a) to avoid the inconsistency of DDPG-based RS. Biclustering is also used to model RS as grid-world games to reduce the action space (Choi et al., 2018). A Double DQN-based approximate regretted reward technique is presented to address the issue of unstable reward distribution in dynamic RS environments (Chen et al., 2018c). A pairwise RL-based RS framework is proposed to capture users’ positive and negative feedback to improve recommendation performance (Zhao et al., 2018b). A page-wise RS is proposed to simultaneously recommend a set of items and display them in a 2-dimensional page (Zhao et al., 2018a). A DQN-based framework is proposed to address issues in the news feed scenario, such as optimizing only the current reward, not considering labels, and lack of diversity (Zheng et al., 2018). An RL-based explainable RS is presented that explains recommendations and can flexibly control the explanation quality according to the scenario (Wang et al., 2018b). A policy gradient-based RS for YouTube is proposed to address the biases in logged data by introducing a simulated historical policy and a novel top-K off-policy correction (Chen et al., 2018b).

The second category related to this paper is RL-based online advertising techniques, which fall into two groups. The first group is guaranteed delivery (GD), where ads are charged on a pay-per-campaign basis for a pre-specified number of deliveries (Salomatin et al., 2012). A multi-agent RL method is presented to control cooperative policies for the publishers to optimize their targets in a dynamic environment (Wu et al., 2018a). The second group is real-time bidding (RTB), which allows an advertiser to bid on each ad impression in a very short time slot. The ad selection task is typically modeled as a multi-armed bandit (MAB) problem, supposing that arms are i.i.d., feedback is immediate and environments are stationary (Nuara et al., 2018; Gasparini et al., 2018; Tang et al., 2013; Xu et al., 2013; Yuan et al., 2013; Schwartz et al., 2017). The problem of online advertising with budget constraints and variable costs is studied in the MAB setting (Ding et al., 2013), where pulling the arms of the bandit yields random rewards and incurs random costs. However, the MAB setting considers the bid decision as a static optimization problem, whereas the bidding for a given ad campaign happens repeatedly until the budget runs out. To address these challenges, the MDP setting has also been studied for RTB (Cai et al., 2017; Wang et al., 2018a; Zhao et al., 2018a; Rohde et al., 2018; Wu et al., 2018b; Jin et al., 2018). A model-based RL framework is proposed to learn bid strategies in the RTB setting (Cai et al., 2017), where the state value is approximated by a neural network to better handle the large-scale auction volume and the limited budget. A model-free RL method is also designed to solve the budget-constrained bidding problem, where a RewardNet is presented to generate rewards and avoid the reward design trap (Wu et al., 2018b). A multi-agent RL framework is presented to consider other advertisers’ bidding as the state, and a clustering method is leveraged to handle the large number of advertisers (Jin et al., 2018).
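The bandit formulation of ad selection mentioned above can be sketched with the standard UCB1 rule. This is a generic textbook illustration, not the method of any cited work: the click probabilities are invented, and the simulation bakes in exactly the assumptions listed in the text (i.i.d. arms, immediate feedback, stationary environment).

```python
import math
import random


def ucb1(click_probs, rounds, seed=0):
    """Simulate UCB1 ad selection; return per-ad pull counts and click totals."""
    rng = random.Random(seed)
    n_arms = len(click_probs)
    pulls = [0] * n_arms
    clicks = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1  # show each ad once to initialize its estimate
        else:
            # Empirical click rate plus an exploration bonus that shrinks with pulls.
            arm = max(
                range(n_arms),
                key=lambda a: clicks[a] / pulls[a]
                + math.sqrt(2.0 * math.log(t) / pulls[a]),
            )
        # Stationary Bernoulli feedback, observed immediately after the impression.
        reward = 1.0 if rng.random() < click_probs[arm] else 0.0
        pulls[arm] += 1
        clicks[arm] += reward
    return pulls, clicks


# Hypothetical click-through rates for three ads.
pulls, clicks = ucb1([0.05, 0.10, 0.60], rounds=2000)
```

Each round is treated in isolation here, with no budget and no effect of one impression on later ones, which is the limitation that motivates the MDP formulations discussed next in the text.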

6. Conclusion

In this paper, we propose a two-level deep reinforcement learning framework, RAM, with novel Deep Q-network architectures for the mixed display of recommendation and advertising in online recommender systems. Upon a user’s request, the RS (i.e., the first level) first recommends a list of items based on the user’s historical behaviors; then the AS (i.e., the second level) inserts ads into the rec-list by making three decisions: whether to insert an ad into the rec-list and, if yes, which ad to select and at which location to insert it. The proposed two-level framework aims to simultaneously optimize the long-term user experience and immediate advertising revenue. It is worth noting that the proposed AS architecture combines the advantages of two conventional DQN architectures and can evaluate the Q-values of two kinds of related actions simultaneously. We evaluate our framework with extensive experiments based on data from a short video site. The results show that our framework can significantly improve online recommendation and advertising performance in online platforms.

There are several interesting research directions. First, besides dwell time, session length and advertising revenue, we would like to incorporate more metrics such as video finish rate and user followers. Second, in addition to optimizing recommendation and advertising performance in only one scenario (e.g., the entrance page) of a platform, we plan to develop a framework that jointly optimizes multiple scenarios. Finally, since the proposed framework RAM is quite general for evaluating the Q-values of two or more types of internally related actions, we would like to investigate more applications beyond short video sites, such as news feeds, e-commerce and video games.

References


  • R. Bellman (2013) Dynamic programming. Courier Corporation. Cited by: §3.3.
  • H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo (2017) Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 661–670. Cited by: §3.2.3, §5.
  • H. Chen, X. Dai, H. Cai, W. Zhang, X. Wang, R. Tang, Y. Zhang, and Y. Yu (2018a) Large-scale interactive recommendation with tree-structured policy gradient. arXiv preprint arXiv:1811.05869. Cited by: §5.
  • M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. Chi (2018b) Top-k off-policy correction for a reinforce recommender system. arXiv preprint arXiv:1812.02353. Cited by: §5.
  • S. Chen, Y. Yu, Q. Da, J. Tan, H. Huang, and H. Tang (2018c) Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1187–1196. Cited by: §5.
  • X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song (2019) Generative adversarial user model for reinforcement learning based recommendation system. In International Conference on Machine Learning, pp. 1052–1061. Cited by: §3.1.2, §3.1.3, 3rd item.
  • H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. (2016) Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pp. 7–10. Cited by: 1st item.
  • S. Choi, H. Ha, U. Hwang, C. Kim, J. Ha, and S. Yoon (2018) Reinforcement learning based recommender system using biclustering technique. arXiv preprint arXiv:1801.05532. Cited by: §5.
  • T. Degris, M. White, and R. S. Sutton (2012) Off-policy actor-critic. arXiv preprint arXiv:1205.4839. Cited by: §3.4.
  • W. Ding, T. Qin, X. Zhang, and T. Liu (2013) Multi-armed bandit with budget constraint and variable costs. In Twenty-Seventh AAAI Conference on Artificial Intelligence, Cited by: §5.
  • G. Dulac-Arnold, R. Evans, H. van Hasselt, P. Sunehag, T. Lillicrap, J. Hunt, T. Mann, T. Weber, T. Degris, and B. Coppin (2015) Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679. Cited by: §5.
  • J. Feng, H. Li, M. Huang, S. Liu, W. Ou, Z. Wang, and X. Zhu (2018) Learning to collaborate: multi-scenario ranking via multi-agent reinforcement learning. In Proceedings of the 2018 World Wide Web Conference, pp. 1939–1948. Cited by: §1.
  • M. Gasparini, A. Nuara, F. Trovò, N. Gatti, and M. Restelli (2018) Targeting optimization for internet advertising by learning from logged bandit feedback. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §5.
  • H. Guo, R. Tang, Y. Ye, Z. Li, and X. He (2017) DeepFM: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247. Cited by: 2nd item.
  • M. Hausknecht and P. Stone (2015) Deep recurrent q-learning for partially observable mdps. In 2015 AAAI Fall Symposium Series, Cited by: 4th item.
  • B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk (2015) Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939. Cited by: 3rd item.
  • J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang (2018) Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 2193–2201. Cited by: §5.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §3.3, §3.3, §3.4.
  • A. Nuara, F. Trovo, N. Gatti, and M. Restelli (2018) A combinatorial-bandit algorithm for the online joint bid/budget optimization of pay-per-click advertising campaigns. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §5.
  • D. Rohde, S. Bonner, T. Dunlop, F. Vasile, and A. Karatzoglou (2018) RecoGym: a reinforcement learning environment for the problem of product recommendation in online advertising. arXiv preprint arXiv:1808.00720. Cited by: §5.
  • M. H. Rothkopf (2007) Thirteen reasons why the vickrey-clarke-groves process is not practical. Operations Research 55 (2), pp. 191–197. Cited by: §1.
  • K. Salomatin, T. Liu, and Y. Yang (2012) A unified optimization framework for auction and guaranteed delivery in online advertising. In Proceedings of the 21st ACM international conference on Information and knowledge management, pp. 2005–2009. Cited by: §5.
  • E. M. Schwartz, E. T. Bradlow, and P. S. Fader (2017) Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science 36 (4), pp. 500–522. Cited by: §5.
  • H. Song, H. Jang, H. T. Hong, S. Yun, D. Yun, H. Chung, and Y. Yi (2019) Solving continual combinatorial selection via deep reinforcement learning. Cited by: §3.1.3.
  • L. Tang, R. Rosales, A. Singh, and D. Agarwal (2013) Automatic ad format selection via contextual bandits. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management, pp. 1587–1594. Cited by: §5.
  • W. Wang, J. Jin, J. Hao, C. Chen, C. Yu, W. Zhang, J. Wang, Y. Wang, H. Li, J. Xu, et al. (2018a) Learning to advertise with adaptive exposure via constrained two-level reinforcement learning. arXiv preprint arXiv:1809.03149. Cited by: §1, §5.
  • X. Wang, Y. Chen, J. Yang, L. Wu, Z. Wu, and X. Xie (2018b) A reinforcement learning framework for explainable recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 587–596. Cited by: §5.
  • Z. Wang, N. D. Freitas, and M. Lanctot (2015) Dueling network architectures for deep reinforcement learning. Cited by: §3.2.2.
  • D. Wu, C. Chen, X. Yang, X. Chen, Q. Tan, J. Xu, and K. Gai (2018a) A multi-agent reinforcement learning method for impression allocation in online display advertising. arXiv preprint arXiv:1809.03152. Cited by: §5.
  • D. Wu, X. Chen, X. Yang, H. Wang, Q. Tan, X. Zhang, J. Xu, and K. Gai (2018b) Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1443–1451. Cited by: §5.
  • M. Xu, T. Qin, and T. Liu (2013) Estimation bias in multi-armed bandit algorithms for search advertising. In Advances in Neural Information Processing Systems, pp. 2400–2408. Cited by: §5.
  • S. Yuan, J. Wang, and M. van der Meer (2013) Adaptive keywords extraction with contextual bandits for advertising on parked domains. arXiv preprint arXiv:1307.3573. Cited by: §5.
  • X. Zhao, L. Xia, L. Zhang, Z. Ding, D. Yin, and J. Tang (2018a) Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Recommender Systems Conference, pp. 95–103. Cited by: §3.1.2, §5, §5.
  • X. Zhao, L. Zhang, Z. Ding, L. Xia, J. Tang, and D. Yin (2018b) Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1040–1048. Cited by: §5.
  • G. Zheng, F. Zhang, Z. Zheng, Y. Xiang, N. J. Yuan, X. Xie, and Z. Li (2018) DRN: a deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 167–176. Cited by: §5.