Deep Reinforcement Learning for Online Advertising in Recommender Systems

09/09/2019 ∙ by Xiangyu Zhao, et al. ∙ 0

With the recent prevalence of Reinforcement Learning (RL), there has been tremendous interest in utilizing RL for online advertising in recommendation platforms (e.g. e-commerce and news feed sites). However, most RL-based advertising algorithms focus solely on optimizing the revenue of ads while ignoring the possible negative influence of ads on the user experience of recommended items (products, articles and videos). Developing an optimal advertising algorithm in recommendations faces immense challenges because interpolating ads improperly or too frequently may degrade user experience, while interpolating fewer ads will reduce the advertising revenue. Thus, in this paper, we propose a novel advertising strategy for the rec/ads trade-off. To be specific, we develop a reinforcement learning based framework that can continuously update its advertising strategies and maximize reward in the long run. Given a recommendation list, we design a novel Deep Q-network architecture that can determine three internally related tasks jointly, i.e., (i) whether to interpolate an ad or not in the recommendation list, and if yes, (ii) the optimal ad and (iii) the optimal location to interpolate. The experimental results based on real-world data demonstrate the effectiveness of the proposed framework.




1 Introduction

Online advertising is a form of advertising that leverages the Internet to deliver promotional marketing messages to consumers. The goal of online advertising is to assign the right ads to the right consumers so as to maximize the revenue, click-through rate (CTR) or return on investment (ROI) of the advertising campaign. The two main marketing strategies in online advertising are guaranteed delivery (GD) and real-time bidding (RTB). For guaranteed delivery, ad exposures to consumers are guaranteed by contracts signed between advertisers and publishers in advance [Jia et al.2016]. For real-time bidding, advertisers bid on each ad impression in real time, as soon as the impression is generated by a consumer visit [Cai et al.2017]. However, the majority of online advertising techniques are based on offline/static optimization algorithms that treat each impression independently and maximize the immediate revenue for each impression, which is challenging in real-world business, especially when the environment is unstable. Therefore, considerable effort has gone into developing reinforcement learning based online advertising techniques [Cai et al.2017, Wang et al.2018a, Zhao et al.2018b, Rohde et al.2018, Wu et al.2018b, Jin et al.2018], which continuously update their advertising strategies during the interactions with consumers and derive the optimal strategy by maximizing the expected long-term cumulative revenue from consumers. However, most existing works focus on maximizing the income of ads, while ignoring the negative influence of ads on user experience for recommendations.

Figure 1: An example of online advertising within recommendation list for one user request

Designing an appropriate advertising strategy is a challenging problem, since (i) displaying too many ads or improper ads will degrade user experience and engagement; and (ii) displaying insufficient ads will reduce the advertising revenue of the platforms. In real-world platforms, as shown in Figure 1, ads are often displayed with normal recommended items, where recommendation and advertising strategies are typically developed by different departments, and optimized by different techniques with different metrics [Feng et al.2018]. Upon a user’s request, the recommendation system first generates a list of recommendations according to the user’s interests, and then the advertising system needs to make three decisions (sub-actions), i.e., whether to interpolate an ad in the current recommendation list (rec-list); and if yes, the advertising system also needs to choose the optimal ad and interpolate it into the optimal location (e.g. in Figure 1 the advertising agent (AA) decides to interpolate an ad between two adjacent items of the rec-list). The first sub-action maintains the frequency of ads, while the other two sub-actions aim to control the appropriateness of ads. The goal of the advertising strategy is to simultaneously maximize the income of ads and minimize the negative influence of ads on user experience.

Figure 2: Classic DQN architectures for online advertising.

The above-mentioned three decisions (sub-actions) are internally related, i.e., (only) when the AA decides to interpolate an ad, the locations and candidate ads together determine the rewards. Figure 2 illustrates the two conventional Deep Q-network (DQN) architectures for online advertising. Note that in this paper we suppose (i) there are $|A|$ candidate ads for each request, and (ii) the length of the recommendation list (or rec-list) is $L$. The DQN in Figure 2(a) takes the state as input and outputs the Q-values of all $L+1$ locations. This architecture can determine the optimal location but cannot choose the specific ad to interpolate. The DQN in Figure 2(b) inputs a state-action pair and outputs the Q-value corresponding to a specific action (ad). This architecture can select a specific ad but cannot decide the optimal location. Taking a representation of the location (e.g. a one-hot vector) as an additional input is an alternative, but $O(|A| \cdot L)$ evaluations (one per ad-location pair) are necessary to find the optimal action-value function $Q^*(s, a)$, which prevents the DQN architecture from being adopted in practical advertising systems. It is worth noting that neither architecture can determine whether to interpolate an ad (or not) into a given rec-list. Thus, in this paper, we design a new DEep reinforcement learning framework with a novel DQN architecture for online Advertising in Recommender systems (DEAR), which can determine the aforementioned three tasks simultaneously with reasonable time complexity. We summarize our major contributions as follows:

  • We identify the phenomenon of online advertising with recommendations and provide a principled approach for a better advertising strategy;

  • We propose a deep reinforcement learning based framework DEAR and a novel Q-network architecture, which can simultaneously determine whether to interpolate an ad, the optimal location and which ad to interpolate;

  • We demonstrate the effectiveness of the proposed framework on a real-world short video site.

2 Problem Statement

In this paper, we study the advertising problem within a rec-list as a Markov Decision Process (MDP), in which an Advertising-Agent (AA) interacts with the environment (or users) by sequentially interpolating ads into a sequence of rec-lists over time, so as to maximize the cumulative reward from the environment. Formally, the MDP consists of a tuple of five elements $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$:

  • State space $\mathcal{S}$: A state $s_t \in \mathcal{S}$ is defined as a user’s browsing history before time $t$ and the information of the current request at time $t$. More specifically, a state $s_t$ consists of a user’s recommendation and ad browsing history, the rec-list and the contextual information of the current request.

  • Action space $\mathcal{A}$: The action $a_t = (a_t^{ad}, a_t^{loc}) \in \mathcal{A}$ of the AA is to determine three internally related tasks, i.e., whether to interpolate an ad in the current rec-list (this decision is encoded in $a_t^{loc}$; more details are presented in the following sections); if yes, the AA needs to choose a specific ad $a_t^{ad}$ and interpolate it into the optimal location $a_t^{loc}$ in the rec-list. Without loss of generality, we assume that the AA can interpolate at most one ad into a rec-list, but it is straightforward to extend the setting to multiple ads.

  • Reward $\mathcal{R}$: After the AA takes an action $a_t$ at the state $s_t$, i.e., (not) interpolating an ad into a rec-list, a user browses this mixed rec-ad list and provides her feedback. The AA will receive the immediate reward $r(s_t, a_t)$ based on the user’s feedback. The reward is two-fold: (i) the income of the ad, which depends on the quality of the ad, and (ii) the influence of the ad on the user experience.

  • Transition probability $\mathcal{P}$: The transition probability $p(s_{t+1} \mid s_t, a_t)$ defines the state transition from $s_t$ to $s_{t+1}$ after taking action $a_t$. We assume that the MDP satisfies the Markov property $p(s_{t+1} \mid s_t, a_t, \ldots, s_1, a_1) = p(s_{t+1} \mid s_t, a_t)$.

  • Discount factor $\gamma$: The discount factor $\gamma \in [0, 1]$ is introduced to measure the present value of future reward. When $\gamma = 1$, all future rewards are fully counted into the current action; on the contrary, when $\gamma = 0$, only the immediate reward is considered.

With the above-mentioned notations and definitions, the problem of ad interpolation into recommendation lists can be formally defined as follows: Given the historical MDP, i.e., $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, the goal is to find an advertising policy $\pi: \mathcal{S} \to \mathcal{A}$, which can maximize the cumulative reward from users, i.e., maximizing the income of ads and minimizing the negative influence on user experience.
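To make the role of the discount factor concrete, the following minimal sketch computes the discounted cumulative reward of one session. The function name `discounted_return` and the reward values are hypothetical and only serve as an illustration; they are not taken from the dataset.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one session."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical immediate rewards of three consecutive requests.
rewards = [1.0, 0.5, 2.0]
print(discounted_return(rewards, 1.0))  # all future rewards fully counted
print(discounted_return(rewards, 0.0))  # only the immediate reward counts
```

With a discount factor of 1 the sketch returns the plain sum of rewards, and with 0 it returns only the first reward, which matches the two extreme cases described in the problem statement.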

3 The Proposed Framework

In this section, we will propose a deep reinforcement learning framework for online advertising in recommender systems. To be more specific, we will first propose a novel DQN architecture, which could tackle the aforementioned three tasks simultaneously. Then, we discuss how to train the framework via offline users’ behavior log.

Figure 3: (a) Overview of the proposed DQN architecture. (b) The detailed architecture of the proposed DQN.

3.1 The DQN Architecture for Online Advertising

As mentioned above, online advertising in recommender systems is challenging because (i) the action of the advertising agent (AA) is complex and consists of three sub-actions, i.e., whether to interpolate an ad into the current rec-list and, if yes, which ad is optimal and where the best location is; (ii) the three sub-actions are internally related, i.e., when the AA decides to interpolate an ad, the candidate ads and locations interact to determine the reward, which prevents traditional DQN architectures from being employed in online advertising systems; and (iii) the AA should simultaneously maximize the income of ads and minimize the negative influence of ads on user experience. To address these challenges, we propose a deep reinforcement learning framework with a novel Deep Q-network architecture. In the following, we first introduce the processing of state and action features, and then we illustrate the proposed DQN architecture with its optimization algorithm.

3.1.1 The Processing of State and Action Features

The state $s_t$ consists of a user’s rec/ads browsing history, the contextual information and the rec-list of the current request. The recommendation (or ad) browsing history is a sequence of recommendations (or ads) the user has browsed. We leverage two RNNs with Gated Recurrent Units (GRU) to capture users’ sequential preference of recommendations and ads separately. The inputs of each RNN are the features of the user’s recently browsed recommendations (or ads), and we use the final hidden state of the RNN as the representation of the user’s dynamic preference of recommendations $p_t^{rec}$ (or ads $p_t^{ad}$). Here we leverage GRU rather than Long Short-Term Memory (LSTM) because of GRU’s simpler architecture and fewer parameters.

The contextual information feature $c_t$ of the current user request consists of information such as the OS (iOS or Android), app version and feed type (swiping up/down the screen) of the user’s current request. Next, we represent the rec-list of the current request by the concatenated features of the $L$ recommended items that will be displayed in the current request, and we transform them into a low-dimensional dense vector $rec_t$. Note that other architectures like CNN for NLP [Kim2014] can also be leveraged. Finally, we get a low-dimensional representation of the state $s_t$ by concatenating $p_t^{rec}$, $p_t^{ad}$, $c_t$ and $rec_t$:

$$s_t = \mathrm{concat}(p_t^{rec}, p_t^{ad}, c_t, rec_t) \quad (1)$$

For the transition from $s_t$ to $s_{t+1}$, the recommendations and ads browsed at time $t$ will be added into the browsing history to generate $p_{t+1}^{rec}$ and $p_{t+1}^{ad}$, $c_{t+1}$ depends on the user’s behavior at time $t+1$, and $rec_{t+1}$ comes from the recommendation system. For the action $a_t = (a_t^{ad}, a_t^{loc})$, $a_t^{ad}$ is the feature of a candidate ad, and $a_t^{loc}$ is the location to interpolate the selected ad (given a list of $L$ recommendations, there exist $L+1$ possible locations). Next, we will elaborate on the proposed DQN architecture.

3.1.2 The Proposed DQN Architecture

Given the state $s_t$, the action of the AA consists of three sub-actions, i.e., (i) whether to interpolate an ad, and if yes, (ii) where the optimal location is and (iii) which ad is optimal.

We first consider tackling sub-actions (ii) and (iii) simultaneously. In other words, we aim to estimate the Q-values of all possible locations $\{a_t^{loc}\}$ for any given candidate ad $a_t^{ad}$. To incorporate these two sub-actions into one framework, we propose a novel DQN architecture, as illustrated in Figure 3(a), which builds on the two conventional Deep Q-network architectures shown in Figure 2. The inputs are the representations of the state $s_t$ and any candidate ad $a_t^{ad}$, while the outputs are the action-values (Q-values) corresponding to the $L+1$ locations. In this way, the proposed DQN architecture takes advantage of both traditional DQN architectures and can simultaneously evaluate the Q-values of two types of internally related sub-actions, i.e., it evaluates the Q-values of all possible locations for an ad.

To incorporate the first sub-action (whether to interpolate an ad or not) into the above DQN architecture, we consider not interpolating an ad as a special location $0$, and extend the length of the output layer from $L+1$ to $L+2$, where $Q(s_t, a_t^{ad})^0$ corresponds to the Q-value of not incorporating an ad into the current rec-list. Therefore, the proposed DQN architecture can take the three sub-actions simultaneously, where the Q-value depends on the combination of the ad-location pair; and when $Q(s_t, a_t^{ad})^0$ of any candidate ad corresponds to the maximal Q-value, the AA will not interpolate an ad into the current rec-list.
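The joint decision described above can be sketched as a search over a table of Q-values with one row per candidate ad and one column per output unit. This is only an illustration under stated assumptions: the helper names `select_action` and `interpolate` are hypothetical, the Q-values are made-up numbers, and output index 0 is assumed to encode "do not interpolate an ad", with the remaining indices mapping to the insertion slots of the rec-list.

```python
def select_action(q_values):
    """Pick the (ad, slot) pair with the largest Q-value.

    q_values[i][j] is the Q-value of candidate ad i at output index j.
    Index 0 means "do not interpolate an ad" (assumed convention);
    indices 1..L+1 are the L+1 slots of a rec-list of length L.
    Returns (None, None) when no ad should be interpolated.
    """
    best_ad, best_loc, best_q = None, None, float("-inf")
    for ad_idx, row in enumerate(q_values):
        for loc_idx, q in enumerate(row):
            if q > best_q:
                best_ad, best_loc, best_q = ad_idx, loc_idx, q
    if best_loc is None or best_loc == 0:
        return None, None
    return best_ad, best_loc - 1  # 0-based slot in the rec-list


def interpolate(rec_list, ad, slot):
    """Insert `ad` before position `slot` (0..L) of the rec-list."""
    return rec_list[:slot] + [ad] + rec_list[slot:]


# Hypothetical Q-values for 2 candidate ads and a rec-list of length 3,
# so each row has L + 2 = 5 entries.
q_values = [
    [0.2, 0.1, 0.4, 0.3, 0.0],
    [0.2, 0.5, 0.9, 0.1, 0.3],
]
ad_idx, slot = select_action(q_values)
if ad_idx is not None:
    print(interpolate(["i1", "i2", "i3"], "ad%d" % ad_idx, slot))
```

Each candidate ad requires one forward pass that yields a full row, so the search costs one network evaluation per ad rather than one per ad-location pair.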

The detailed DQN architecture is illustrated in Figure 3(b). On one hand, whether to interpolate an ad into the current rec-list is mainly determined by the state $s_t$ (the browsing history, the contextual information and especially the quality of the current rec-list), e.g., if a user has a good experience with the current rec-list, the advertising agent may prefer to interpolate an ad into the current rec-list; while if a user has a bad experience with the current rec-list, the user is likely to leave, and the AA will not insert an ad, to avoid increasing this possibility. On the other hand, the reward for choosing an ad and location is closely related to all the features (both the current rec-list and the ads). According to this observation, we divide the Q-function into the value function $V(s_t)$, which is determined by the state features, and the advantage function $A(s_t, a_t)$, which is determined by the features of both state and action [Wang, Freitas, and Lanctot2015].
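The decomposition into a value function and an advantage function can be sketched as follows. The text does not spell out how the two streams are combined, so this sketch assumes a plain sum, and optionally the mean-centered advantage used in the dueling architecture of Wang, Freitas, and Lanctot; the function name `combine_dueling` and all numbers are hypothetical.

```python
def combine_dueling(value, advantages, center=False):
    """Combine a scalar V(s) with per-location advantages A(s, a).

    value:      scalar state value V(s), computed from state features only.
    advantages: one advantage per output unit (locations, incl. "no ad").
    center:     if True, subtract the mean advantage before adding V(s),
                as in the dueling DQN paper (an assumption here).
    """
    if center and advantages:
        mean_adv = sum(advantages) / len(advantages)
        advantages = [a - mean_adv for a in advantages]
    return [value + a for a in advantages]


# Hypothetical outputs of the two streams for one candidate ad.
print(combine_dueling(1.0, [0.0, 0.5, -0.5, 1.0]))
print(combine_dueling(1.0, [0.0, 0.5, -0.5, 1.0], center=True))
```

Centering does not change which location has the largest Q-value for a given ad; it only makes the split between the two streams identifiable.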

3.1.3 Discussion

There exist two classic DQN architectures as illustrated in Figure 2, where (i) the left one takes the state as input and outputs the Q-values of all actions, and (ii) the right one takes a state-action pair and outputs the Q-value of this pair. These two conventional architectures can only evaluate Q-values for one level of actions, e.g., the agent in a Maze environment can only choose to go up, down, left, or right [Brockman et al.2016]. Compared with these two traditional architectures, the proposed DEAR takes a state-action pair of one level of actions, and outputs the Q-values corresponding to the combination of this state-action pair and another level of actions. Hierarchical reinforcement learning (HRL) architectures like [Kulkarni et al.2016] can also handle multiple levels of tasks. However, HRL frameworks suffer from instability when multiple levels are trained jointly [Nachum et al.2018]. To the best of our knowledge, the proposed DEAR architecture is the first individual DQN architecture that can evaluate the Q-values of two or more levels of internally related actions simultaneously with reasonable time complexity. This design is general and has many other possible applications. For example, in a Maze environment the input of DEAR can be the pair of the agent’s location (state) and the direction to go (action); DEAR can then output the Q-values corresponding to the location, the direction and how many steps to go in this direction (another level of related actions).

3.2 The Reward Function

After the AA executes an action $a_t$ at the state $s_t$, i.e., interpolating an ad into a rec-list (or not), a user browses this mixed rec-ad list and provides her feedback. In the problem of online advertising with normal recommendation, the AA aims to simultaneously maximize the income of ads and minimize the negative influence of ads on user experience (i.e. to optimize user experience). Thus the immediate reward $r_t(s_t, a_t)$ is two-fold: (i) the income of the ad $r_t^{ad}$, and (ii) the user experience $r_t^{ex}$.

In practical platforms, the major risk of interpolating ads improperly or too frequently is that users will leave the platform. Thus, user experience is measured by whether she/he will leave the platform after browsing the current rec-ad list, and we have:

$$r_t^{ex} = \begin{cases} 1 & \text{if the user continues} \\ -1 & \text{if the user leaves} \end{cases} \quad (2)$$

in other words, the AA will receive a positive reward if the user continues to browse the next list, and a negative reward otherwise. Then, we design the reward function as follows:

$$r_t(s_t, a_t) = r_t^{ad} + \alpha \cdot r_t^{ex} \quad (3)$$

where $r_t^{ad}$ is the income of the ad, which is a positive value if an ad is interpolated and 0 otherwise. The hyper-parameter $\alpha$ controls the importance of the second term, which measures the influence of an ad on user experience. Based on the reward function, the optimal action-value function $Q^*(s_t, a_t)$, which has the maximum expected return achievable by the optimal policy, should follow the Bellman equation [Bellman2013] as:

$$Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1}} \Big[ r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \,\Big|\, s_t, a_t \Big] \quad (4)$$

where the $\max_{a_{t+1}}$ operation needs to look through all candidate ads $\{a_{t+1}^{ad}\}$ (input) and all locations $\{a_{t+1}^{loc}\}$ (output).
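As a minimal sketch of the two ingredients just described, the snippet below computes the two-fold immediate reward and a one-step Bellman target whose maximum runs over all candidate ads (rows) and all locations (columns). It assumes that the user-experience term is +1 when the user continues and -1 when she leaves; the function names and all numeric values are hypothetical.

```python
def immediate_reward(ad_income, user_continues, alpha):
    """Two-fold reward: ad income plus alpha times the experience term.

    The experience term is assumed to be +1 if the user continues
    browsing and -1 if she leaves the platform.
    """
    r_ex = 1.0 if user_continues else -1.0
    return ad_income + alpha * r_ex


def td_target(reward, gamma, next_q_values, terminal=False):
    """One-step target r + gamma * max over next-step (ad, location) pairs.

    next_q_values[i][j] is the Q-value of candidate ad i at output
    index j in the next state; every row is assumed to be non-empty.
    """
    if terminal or not next_q_values:
        return reward
    best_next = max(max(row) for row in next_q_values)
    return reward + gamma * best_next


r = immediate_reward(ad_income=0.5, user_continues=True, alpha=1.0)
print(td_target(r, gamma=0.5, next_q_values=[[0.0, 2.0], [1.0, 0.5]]))
```

The nested `max` mirrors the remark that the maximum has to look through every candidate ad (network input) and every location (network output).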

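1:  Initialize the capacity of replay buffer $\mathcal{D}$
2:  Initialize action-value function $Q$ with random weights
3:  for session $= 1, \ldots, M$ do
4:     Initialize state $s_0$ from previous sessions
5:     for $t = 1, \ldots, T$ do
6:        Observe state $s_t = \mathrm{concat}(p_t^{rec}, p_t^{ad}, c_t, rec_t)$
7:        Execute action $a_t$ following off-policy $b(s_t)$
8:        Calculate reward $r_t = r_t^{ad} + \alpha \cdot r_t^{ex}$ from offline log
9:        Update state $s_t$ to $s_{t+1}$
10:        Store transition $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer $\mathcal{D}$
11:        Sample mini-batch of transitions $(s, a, r, s')$ from the replay buffer $\mathcal{D}$
12:        Set $y = r$ if $s'$ is terminal, and $y = r + \gamma \max_{a'} Q(s', a'; \theta^T)$ otherwise
13:        Minimize $(y - Q(s, a; \theta))^2$ according to Eq.(6)
14:     end for
15:  end for
Algorithm 1 Off-policy Training of DEAR Framework.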
session            | user           | normal video       | ad video
1,000,000          | 188,409        | 17,820,066         | 10,806,778

session dwell time | session length | session ad revenue | rec-list with ad
17.980 min         | 55.032 videos  | 0.667              | 55.23%
Table 1: Statistics of the dataset.

3.3 The Optimization Task

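The Deep Q-network, i.e., the action-value function $Q(s_t, a_t)$, can be optimized by minimizing a sequence of loss functions $L(\theta)$:

$$L(\theta) = \mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \big( y_t - Q(s_t, a_t; \theta) \big)^2 \quad (5)$$

where $y_t = \mathbb{E}_{s_{t+1}} \big[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^T) \mid s_t, a_t \big]$ is the target for the current iteration. We introduce separate evaluation and target networks [Mnih et al.2013] to help smooth the learning and avoid the divergence of parameters, where $\theta$ represents all parameters of the evaluation network, and the parameters of the target network $\theta^T$ are fixed when optimizing the loss function $L(\theta)$. The derivatives of the loss function $L(\theta)$ with respect to the parameters $\theta$ are as follows:

$$\nabla_\theta L(\theta) = -2 \, \mathbb{E}_{s_t, a_t, r_t, s_{t+1}} \big[ \big( y_t - Q(s_t, a_t; \theta) \big) \nabla_\theta Q(s_t, a_t; \theta) \big] \quad (6)$$

where $y_t$ is the target defined above, and $\max_{a_{t+1}}$ will look through the candidate ad set $\{a_{t+1}^{ad}\}$ and all locations $\{a_{t+1}^{loc}\}$ (including the location that represents not interpolating an ad). Note that a recall mechanism is employed by the platform to select a subset of ads that may generate maximal revenue, and to filter out ads that have run out of their budget (RTB) or have fulfilled the guaranteed delivery amount (GD). In this paper, we mainly focus on the income of the platform and user experience.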

3.4 Off-policy Training Task

We train the proposed framework based on users’ offline log, which records the interaction history between the behavior policy $b(s_t)$ (the advertising strategy in use) and users’ feedback. Our AA takes the action based on the off-policy $b(s_t)$ and obtains the feedback from the offline log. We present our off-policy training algorithm in detail in Algorithm 1.

In each iteration of a training session, there are two stages. For the transition storing stage: given the state $s_t$ (line 6), the AA takes action $a_t$ according to the behavior policy $b(s_t)$ (line 7), which follows a standard off-policy way [Degris, White, and Sutton2012]; then the AA observes the reward $r_t$ from the offline log (line 8) and updates the state to $s_{t+1}$ (line 9); and finally the AA stores the transition $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer $\mathcal{D}$ (line 10). For the model training stage: the AA samples a mini-batch of transitions $(s, a, r, s')$ from the replay buffer $\mathcal{D}$ (line 11), and then updates the parameters according to Equation (6) (line 13). Note that in line 7, when the behavior policy $b(s_t)$ decides not to interpolate an ad, we use an all-zero vector as $a_t^{ad}$.
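The transition storing stage can be sketched with a fixed-capacity replay buffer that is filled by replaying logged interactions. This is a simplified illustration, not the training code used in the experiments: the class `ReplayBuffer`, the helper `fill_from_log`, the tuple layout of a logged step and the +1/-1 encoding of the user-experience term are all assumptions, and the gradient step of the model training stage is omitted.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Sample without replacement; shrink the batch if the buffer is small.
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def fill_from_log(buffer, logged_session, alpha):
    """Storing stage for one session replayed from the offline log.

    Each logged step is assumed to be a tuple
    (state, action, ad_income, user_continues, next_state), where the
    action was chosen by the behavior policy in use at logging time.
    """
    for state, action, ad_income, user_continues, next_state in logged_session:
        reward = ad_income + alpha * (1.0 if user_continues else -1.0)
        buffer.store(state, action, reward, next_state)


# Hypothetical log; the second step has no ad, so the ad feature is all-zero
# and the location index 0 stands for "no ad".
session = [
    ("s0", ([0.3, 0.7], 2), 0.5, True, "s1"),
    ("s1", ([0.0, 0.0], 0), 0.0, True, "s2"),
    ("s2", ([0.9, 0.1], 4), 1.0, False, "s3"),
]
buf = ReplayBuffer(capacity=2)
fill_from_log(buf, session, alpha=1.0)
print(len(buf), buf.sample(2))
```

Because the buffer holds only the most recent transitions, the mini-batches drawn in the model training stage mix steps from different sessions, which decorrelates consecutive updates.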

1:  Initialize the proposed DQN with well trained weights
2:  for session $= 1, \ldots, M$ do
3:     Initialize state $s_0$
4:     for $t = 1, \ldots, T$ do
5:        Observe state $s_t$
6:        Execute action $a_t$ following $Q^*(s_t, a_t)$
7:        Observe rewards $r_t(s_t, a_t)$ from user
8:        Update the state from $s_t$ to $s_{t+1}$
9:     end for
10:  end for
Algorithm 2 Online Test of the DEAR Framework.

3.5 Online Test Task

The online test algorithm is presented in Algorithm 2, which is similar to the transition generating stage in Algorithm 1. In each iteration of the session, given the current state $s_t$ (line 5), the AA decides to interpolate an ad into the rec-list (or not) by the well-trained advertising policy $Q^*(s_t, a_t)$ (line 6), then the target user browses the mixed rec-ad list and provides her/his feedback (line 7). Finally, the AA updates the state to $s_{t+1}$ (line 8) and goes to the next iteration.

4 Experiments

In this section, we conduct extensive experiments on a real short video site to evaluate the effectiveness of the proposed framework. We mainly focus on three questions: (i) how the proposed framework performs compared to representative baselines; (ii) how the components in the framework contribute to the performance; and (iii) how the hyper-parameters impact the performance. We first introduce experimental settings. Then we seek answers to the above three questions.

4.1 Experimental Settings

Since no public dataset contains both recommended and advertised items, we train our model on a dataset collected in March 2019 from a short video site, where there are two types of videos, i.e., normal videos (recommended items) and ad videos (advertised items). The features for a normal video contain: id, like score, finish score, comment score, follow score and group score, where the scores are predicted by the platform. The features for an ad video consist of: id, image size, pricing, hidden-cost, rc-preclk and recall-preclk, where the last four are predicted by the platform. Note that (i) the predicted features are widely and successfully used in many applications such as recommendation and advertising in the platform, (ii) we discretize each feature as a one-hot vector, and (iii) the same features are used by baselines for a fair comparison. We collect 1,000,000 sessions in temporal order to train the proposed framework in an off-policy manner. More statistics about the dataset are shown in Table 1.

For the architecture of DEAR, the dimension of ad and normal videos’ features is , the length of a rec-list is , and the size of ad candidate set for a request is . For a given session, the initial user browsing history is collected from the first three requests of the session. The dimensions of are . We leverage two 2-layer neural networks to generate $V(s_t)$ and $A(s_t, a_t)$, respectively. The length of the output layer is $L+2$, i.e., there are $L+2$ possible locations including the one representing not to interpolate an ad. We set the discount factor , and the size of the replay buffer is 10,000. For the hyper-parameters of the proposed framework such as $\alpha$, we select them via cross-validation. Correspondingly, we also do parameter-tuning for baselines for a fair comparison. We will discuss more details about hyper-parameter selection for the DEAR framework in the following subsections. Reward $r_t^{ad}$ is the revenue of ad videos, and $r_t^{ex}$ is 1 if the user continues to browse the next list and 0 otherwise. To measure the online performance, we leverage the accumulated rewards in the session as the metric.

Figure 4: Overall performance comparison.

4.2 Overall Performance Comparison

We compare the proposed framework with the following representative baseline methods:

  • W&D [Cheng et al.2016]: This baseline is a wide & deep model for jointly training feed-forward neural networks with embeddings and a linear model with feature transformations for generic recommender systems with sparse inputs. We further augment W&D to predict whether to interpolate an ad and to estimate the CTR of ads. W&D is the behavior policy $b(s_t)$ in use on the video platform.

  • DFM [Guo et al.2017]: DeepFM is a deep neural network model that integrates the architectures of factorization-machine (FM) and wide & deep model. It models low-order feature interactions like FM and models high-order feature interactions like W&D.

  • GRU [Hidasi et al.2015]: GRU4Rec utilizes RNN with Gated Recurrent Units (GRU) to predict what user will click/order next based on the clicking/ordering histories. We also augment it for ads interpolation.

  • HDQN [Kulkarni et al.2016]: This baseline is a hierarchical DQN framework where the high-level DQN determines the locations (including the location of not interpolating ad), and the low-level DQN selects a specific ad.

The results are shown in Figure 4. We make the following observations:

  1. DFM achieves better performance than W&D, since DeepFM can be trained end-to-end without any feature engineering, and its wide part and deep part share the same input and also the embedding vector.

  2. GRU outperforms W&D and DFM, since GRU can capture the temporal sequence of user behaviors within one session, while W&D and DFM neglect it.

  3. HDQN performs better than GRU, since GRU is designed to maximize the immediate reward of each request, while HDQN aims to maximize the rewards in the long run. This result suggests that introducing reinforcement learning can improve the long-term performance of online recommendation and advertising.

  4. DEAR outperforms HDQN, since HRL frameworks like HDQN are not stable when multiple levels are jointly trained by an off-policy manner [Nachum et al.2018].

To sum up, DEAR outperforms representative baselines, which demonstrates its effectiveness in online advertising. Note that the improvement of DEAR is statistically significant; we omit the results of the hypothesis test because of the space limitation.

variant | reward | improvement | p-value
DEAR-1  | 9.936  | 10.32%      | 0.000
DEAR-2  | 10.02  | 9.056%      | 0.000
DEAR-3  | 10.39  | 5.495%      | 0.001
DEAR-4  | 10.57  | 3.689%      | 0.006
DEAR    | 10.96  | -           | -
Table 2: Component study results.

4.3 Component Study

To answer the second question, we systematically eliminate the corresponding components of DEAR by defining the following variants:

  • DEAR-1: This variant shares the same architecture as the proposed model, but we train the framework in a supervised learning manner.

  • DEAR-2: This variant evaluates the effectiveness of the RNNs; hence we replace each RNN by two fully-connected layers (FCNs), concatenate the recommended or advertised items as one vector and feed it into the corresponding FCN.

  • DEAR-3: This variant leverages the DQN architecture in Figure 2(b) with an additional input, which represents the location by a one-hot vector.

  • DEAR-4: The architecture of this variant does not divide the Q-function into the value function and the advantage function for the AA.

The results are shown in Table 2. It can be observed:

  1. DEAR-1 validates the effectiveness of introducing reinforcement learning for online advertising.

  2. DEAR-2 demonstrates that capturing the user’s sequential preference over recommended and advertised items can boost the performance.

  3. DEAR-3 validates the effectiveness of the proposed DEAR architecture, which takes an ad $a_t^{ad}$ as input and outputs the Q-values corresponding to all possible locations $\{a_t^{loc}\}$ for the given ad, over the conventional DQN architecture.

  4. DEAR-4 supports the observation that whether to interpolate an ad into the rec-list mainly depends on the state $s_t$ (especially the quality of the rec-list), while the reward for selecting an ad and location depends on both $s_t$ and $a_t^{ad}$ (ad). Thus dividing $Q(s_t, a_t)$ into the value function $V(s_t)$ and the advantage function $A(s_t, a_t)$ can improve the performance.

In summary, introducing RL and appropriately designing neural network architecture can boost the performance.

Figure 5: Parameter sensitivity analysis.

4.4 Parameter Sensitivity Analysis

In this section, we investigate how the proposed framework DEAR performs with the changes of $\alpha$ in Equation (3), while fixing other parameters. We select the accumulated $R^{ad} = \sum_t r_t^{ad}$ and the accumulated $R^{ex} = \sum_t r_t^{ex}$ of the whole session as the metrics to evaluate the performance.

Figure 5 demonstrates the parameter sensitivity of $\alpha$. We find that with the increase of $\alpha$, the performance of $R^{ex}$ improves, while $R^{ad}$ decreases. On one hand, when we increase the importance of the second term in Equation (3), the AA tends to interpolate fewer ads or select the ads that will not degrade the user’s experience, although they may generate suboptimal revenue. On the other hand, when we decrease the importance of the second term of Equation (3), the AA prefers to interpolate more ads or choose the ads that will lead to maximal revenue, while ignoring the negative impact of ads on the user’s experience.
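The trade-off controlled by the weight of the user-experience term can be illustrated with a toy calculation. The three options and their numbers below are purely hypothetical and are not results from the experiments; the sketch also compares immediate expected rewards only, whereas the trained agent compares long-term Q-values.

```python
def expected_reward(ad_income, expected_r_ex, alpha):
    """Expected immediate reward: ad income + alpha * experience term."""
    return ad_income + alpha * expected_r_ex


def best_option(options, alpha):
    """Return the name of the option with the largest expected reward."""
    return max(options, key=lambda name: expected_reward(*options[name], alpha))


# Hypothetical options: (ad income, expected user-experience term).
OPTIONS = {
    "high-revenue ad": (1.0, 0.2),
    "mild ad": (0.4, 0.8),
    "no ad": (0.0, 0.9),
}

for alpha in (0.5, 2.0, 5.0):
    print(alpha, best_option(OPTIONS, alpha))
```

In this toy setting a small weight favors the high-revenue ad, a moderate weight favors the less intrusive ad, and a large weight favors showing no ad at all, which is the same direction of change as described for the sensitivity analysis.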

5 Related Work

In this section, we briefly review works related to our study. In general, the related work can be mainly grouped into the following categories.

The first category related to this paper is guaranteed delivery (GD), where ads that share a single idea and theme are grouped into campaigns, and are charged on a pay-per-campaign basis for the pre-specified number of deliveries (clicks or impressions) [Salomatin, Liu, and Yang2012]. Most popular GD solutions are based on offline optimization algorithms, which are then adjusted for the online setup. However, deriving the optimal strategy to allocate impressions is challenging, especially when the environment is unstable in real-world applications. In [Wu et al.2018a], a multi-agent reinforcement learning (MARL) approach is proposed to derive cooperative policies for the publisher to maximize its target in an unstable environment. They formulated the impression allocation problem as an auction problem where each contract can submit virtual bids for individual impressions. With this formulation, they derived the optimal impression allocation strategy by solving the optimal bidding functions for contracts.

The second category related to this paper is RTB, which allows an advertiser to submit a bid for each individual impression in a very short time frame. The ad selection task is typically modeled as a multi-armed bandit (MAB) problem under the setting that samples from each arm are i.i.d., feedback is immediate, and rewards are stationary [Yang and Lu2016, Nuara et al.2018, Gasparini et al.2018, Tang et al.2013, Xu, Qin, and Liu2013, Yuan, Wang, and van der Meer2013, Schwartz, Bradlow, and Fader2017]. The problem of multi-armed bandits with budget constraints and variable costs is studied in [Ding et al.2013]. In this case, pulling an arm of the bandit yields a random reward at a random cost, and the algorithm aims to maximize the long-term reward by pulling arms under a constrained budget. Under the MAB setting, the bid decision is considered a static optimization problem that either treats the value of each impression independently or sets a bid price for each segment of ad volume. However, bidding for a given ad campaign happens repeatedly during its life span until the budget runs out. Thus, the MDP setting has also been studied [Cai et al.2017, Wang et al.2018a, Zhao et al.2018b, Rohde et al.2018, Wu et al.2018b, Jin et al.2018]. A model-based reinforcement learning framework is proposed to learn bid strategies in RTB advertising [Cai et al.2017], where a neural network is used to approximate the state value, which better handles the scalability problem of large auction volume and limited campaign budget. A model-free deep reinforcement learning method is proposed to solve the bidding problem with a constrained budget [Wu et al.2018b]: the problem is modeled as a λ-control problem, and RewardNet is designed to generate rewards, instead of using the immediate reward, so as to avoid the reward design trap. A multi-agent bidding model takes the other advertisers’ bidding in the system into consideration, and a clustering approach is introduced to handle the challenge of a large number of advertisers [Jin et al.2018].
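To make the budget-constrained bandit setting above concrete, the following is a minimal sketch in which each pull returns a random reward at a random cost and play stops once the budget would be exceeded. The selection rule is a simple epsilon-greedy heuristic on the empirical reward-to-cost ratio, chosen only for illustration; it is not the algorithm of [Ding et al.2013]. The function name `budgeted_epsilon_greedy`, the Bernoulli rewards, the uniform cost model, and the toy arm parameters are all assumptions.

```python
import random
from typing import Dict, List, Tuple


def budgeted_epsilon_greedy(
    arms: List[Tuple[float, float]],
    budget: float,
    epsilon: float = 0.1,
    seed: int = 0,
) -> Dict[str, object]:
    """Pull arms with random rewards and random costs until the budget runs out.

    Each arm is a hypothetical (reward_probability, mean_cost) pair.
    """
    rng = random.Random(seed)
    n = len(arms)
    pulls = [0] * n
    reward_sum = [0.0] * n
    cost_sum = [0.0] * n
    spent = 0.0

    while True:
        untried = [i for i in range(n) if pulls[i] == 0]
        if untried:
            arm = untried[0]  # try every arm once before exploiting
        elif rng.random() < epsilon:
            arm = rng.randrange(n)  # explore uniformly at random
        else:
            # exploit: highest empirical reward per unit of cost
            arm = max(range(n), key=lambda i: reward_sum[i] / cost_sum[i])

        reward_prob, mean_cost = arms[arm]
        cost = rng.uniform(0.5 * mean_cost, 1.5 * mean_cost)  # assumed cost model
        if spent + cost > budget:
            break  # the next pull is not affordable, so stop
        reward = 1.0 if rng.random() < reward_prob else 0.0  # assumed Bernoulli reward

        pulls[arm] += 1
        reward_sum[arm] += reward
        cost_sum[arm] += cost
        spent += cost

    return {"pulls": pulls, "total_reward": sum(reward_sum), "spent": spent}


# Toy environment: three arms with different reward rates and mean costs.
result = budgeted_epsilon_greedy([(0.2, 1.0), (0.6, 1.0), (0.4, 2.0)], budget=200.0)
print(result)
```

Ranking arms by reward per unit of cost, rather than by reward alone, reflects the fact that under a budget an expensive arm must earn proportionally more to be worth pulling.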

The third category related to this paper is reinforcement learning based recommender systems, which typically formulate the recommendation task as a Markov Decision Process (MDP) and model the recommendation procedure as sequential interactions between users and the recommender system [Zhao et al.2019b, Zhao et al.2018a]. Practical recommender systems typically have millions of items (discrete actions) to recommend [Zhao et al.2016, Guo et al.2016]. Thus, most RL-based models become inefficient, since they cannot handle such a large discrete action space. A Deep Deterministic Policy Gradient (DDPG) algorithm is introduced to mitigate the large action space issue in practical RL-based recommender systems [Dulac-Arnold et al.2015]. To avoid the inconsistency of DDPG and improve recommendation performance, a tree-structured policy gradient is proposed in [Chen et al.2018a]. A biclustering technique is also introduced to model recommender systems as grid-world games so as to reduce the state/action space [Choi et al.2018]. To solve the unstable reward distribution problem in dynamic recommendation environments, an approximate regretted reward technique is proposed with Double DQN to obtain a reference baseline from individual customer samples [Chen et al.2018c]. Users’ positive and negative feedback, i.e., purchase/click and skip behaviors, are jointly considered in one framework to boost recommendations, since both types of feedback can represent part of users’ preference [Zhao et al.2018c]; improvements in both architecture and formulation are introduced to capture both types of feedback in a unified RL framework. A page-wise recommendation framework is proposed to jointly recommend a page of items and display them within a 2-D page [Zhao et al.2017, Zhao et al.2018b], where a CNN is introduced to capture the item display patterns and users’ feedback on each item in the page. A multi-agent model-based reinforcement learning framework (DeepChain) is proposed for the whole-chain recommendation problem [Zhao et al.2019c], which is able to collaboratively train multiple recommendation agents for different scenarios with a model-based optimization algorithm. A user simulator, RecSimu, based on the Generative Adversarial Network (GAN) framework is presented for RL-based recommender systems [Zhao et al.2019a]; it models real users’ behaviors from users’ historical logs and tackles two challenges: (i) the recommended item distribution is complex within users’ historical logs, and (ii) labeled training data from each user is limited. In the news feed scenario, a DQN-based framework is proposed to handle the challenges of conventional models, i.e., (1) only modeling current reward such as CTR, (2) not considering click/skip labels, and (3) feeding similar news to users [Zheng et al.2018]. An RL framework for explainable recommendation is proposed in [Wang et al.2018b], which can explain any recommendation model and can flexibly control the explanation quality based on the application scenario. A policy gradient-based top-K recommender system for YouTube is developed in [Chen et al.2018b], which addresses biases in logged data by incorporating a learned logging policy and a novel top-K off-policy correction. Other applications include sellers’ impression allocation [Cai et al.2018a], fraudulent behavior detection [Cai et al.2018b], and user state representation [Liu et al.2018].

6 Conclusion

In this paper, we propose DEAR, a deep reinforcement learning framework with a novel Deep Q-network architecture for online advertising in recommender systems. It is able to (i) determine three internally related actions at the same time, i.e., whether to interpolate an ad in a rec-list and, if so, which ad and which location are optimal; and (ii) simultaneously maximize the revenue of ads and minimize the negative influence of ads on user experience. It is worth noting that the proposed DQN architecture combines the advantages of two conventional DQN architectures and can evaluate the Q-values of two or more kinds of related actions simultaneously. We evaluate our framework with extensive experiments based on data from a short video site. The results show that our framework can significantly improve online advertising performance in recommender systems.
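The joint decision summarized above can be illustrated with a minimal sketch of greedy action selection. It assumes a Q-function that takes a state and one candidate ad and returns one Q-value per location, with location 0 reserved for "insert no ad"; this output layout, the names `select_ad_and_location` and `toy_q_fn`, and the toy Q-values are assumptions made for illustration and do not reproduce the trained network.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical interface: Q(state, ad) -> one Q-value per location,
# where index 0 means "do not insert an ad".
QFunction = Callable[[Sequence[float], str], List[float]]


def select_ad_and_location(
    q_fn: QFunction,
    state: Sequence[float],
    candidate_ads: Sequence[str],
    num_locations: int,
) -> Tuple[Optional[str], int, float]:
    """Greedily pick the (ad, location) pair with the highest Q-value.

    Returns (None, 0, q) when not inserting an ad scores best.
    """
    best_ad: Optional[str] = None
    best_loc = 0
    best_q = float("-inf")
    for ad in candidate_ads:
        q_row = q_fn(state, ad)
        if len(q_row) != num_locations + 1:
            raise ValueError("expected one Q-value per location plus the no-ad slot")
        for loc, q in enumerate(q_row):
            if q > best_q:
                # location 0 means no ad is shown, so no ad is recorded
                best_ad, best_loc, best_q = (ad if loc > 0 else None), loc, q
    return best_ad, best_loc, best_q


def toy_q_fn(state: Sequence[float], ad: str) -> List[float]:
    """Stand-in for a trained network: a fixed lookup table of made-up Q-values."""
    table = {"ad_a": [0.1, 0.5, 0.2], "ad_b": [0.1, 0.3, 0.9]}
    return table[ad]


print(select_ad_and_location(toy_q_fn, [0.0], ["ad_a", "ad_b"], num_locations=2))
```

Because a single maximization over the per-location outputs covers the no-ad option, the choice of ad, and the choice of location, the three decisions are made jointly rather than by three separate models.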

There are several interesting research directions. First, rather than optimizing advertising strategies alone, we would like to develop a framework that jointly optimizes advertising and recommending strategies. Second, since the proposed framework DEAR is quite general for evaluating the Q-values of two or more types of internally related actions, we would like to investigate more applications beyond online advertising, such as recommendations and video games.