Hybrid Transfer in Deep Reinforcement Learning for Ads Allocation

04/02/2022
by   Guogang Liao, et al.

Ads allocation, which allocates ads and organic items to limited slots in feed with the purpose of maximizing platform revenue, has become a popular research problem. However, e-commerce platforms usually have multiple entrances for different categories, and some entrances have few visits. The data accumulated on these entrances can hardly support the learning of a good agent. To address this challenge, we present Similarity-based Hybrid Transfer for Ads Allocation (SHTAA), which can effectively transfer both samples and knowledge from a data-rich entrance to data-poor entrances. Specifically, we define an uncertainty-aware Markov Decision Process (MDP) similarity which can estimate the MDP similarity of different entrances. Based on the MDP similarity, we design a hybrid transfer method (consisting of instance transfer and strategy transfer) to efficiently transfer samples and knowledge from one entrance to another. Both offline and online experiments on the Meituan food delivery platform demonstrate that our method helps to learn a better agent for the data-poor entrance and increases the revenue for the platform.


1. Introduction

Figure 1. The Meituan food delivery platform has multiple entrances for different categories (e.g., Homepage, Food, Dessert and so on), facilitating users' access to different categories of content.

Ads and organic items are mixed together and displayed to users in e-commerce feed nowadays (yan2020LinkedInGEA; Ghose2009AnEA; li2020deep), and how to allocate the limited ad slots reasonably and effectively to maximize platform revenue has gained growing attention (Wang2011LearningTA; Mehta2013OnlineMA; zhang2018whole). Recent dynamic ads allocation strategies model the problem as a Markov Decision Process (MDP) (sutton1998introduction) and solve it using reinforcement learning (RL) (zhang2018whole; liao2021cross; zhao2019deep; Feng2018LearningTC; zhao2020jointly). For instance, xie2021hierarchical propose a hierarchical RL-based framework that first decides the type of item to present and then determines the specific item for each slot. liao2021cross propose CrossDQN, which takes the items crossed according to the action as input and allocates the slots one screen at a time.

However, these RL-based algorithms face one major challenge when applied to e-commerce platforms. As shown in Figure 1, the Meituan food delivery platform has multiple entrances for different categories (e.g., Homepage, Food, Dessert and so on), facilitating users' access to different categories of content. Some entrances have few visits, which can hardly support the learning of a good agent for ads allocation. Therefore, it is desirable for a learning algorithm to transfer samples and leverage knowledge acquired on a data-rich entrance to improve the performance of the ads allocation agent on a data-poor entrance.

Transfer learning (TL) in RL has rarely been used for ads allocation. Meanwhile, TL in RL for other scenarios (e.g., computer vision, natural language processing and other knowledge engineering areas (zhu2020transfer; giannopoulos2021deep; tao2021repaint; liu2019value; tirinzoni2018importance)) has some limitations. For instance, tirinzoni2018importance present an algorithm called IWFQI for transferring samples in batch RL that uses importance weighting to automatically account for the difference between the source and target distributions, but IWFQI does not fully exploit possible similarities between tasks. liu2019value quantify the environmental dynamics of an MDP by N-step return (NSR) values and present a knowledge transfer method called NSR-based value function transfer. However, they ignore the uncertainty of the NSR model itself when transferring, even though the rewards can induce randomness in the observed long-term return (dabney2018distributional).

To this end, we present Similarity-based Hybrid Transfer for Ads Allocation (SHTAA) (the code and a data example will be made accessible after review), which can effectively transfer samples and knowledge from a data-rich entrance to the learning of the ads allocation agent on a data-poor entrance. Specifically, we quantify the environmental dynamics of an MDP by predicted distributional N-step return (NSR) values and define an uncertainty-aware MDP similarity based on them. Then we design a hybrid transfer framework (consisting of instance transfer and strategy transfer) based on the similarity value, which can effectively transfer samples and knowledge from the data-rich source task to the data-poor target task. This is a meaningful attempt at applying transfer learning to ads allocation. We have conducted several offline experiments and evaluated our approach on a real-world food delivery platform. The experimental results show that SHTAA can effectively improve the performance of the agent for ads allocation on the data-poor entrance and obtain significant improvements in platform revenue.

2. Problem Formulation

On the Meituan food delivery platform, multiple slots are displayed on each screen, and we allocate the slots screen by screen in the feed of a request. The ads allocation problem for each entrance can therefore be formulated as an MDP $\langle \mathcal{S}, \mathcal{A}, r, \mathcal{P}, \gamma \rangle$, the elements of which are defined as follows:


  • State space $\mathcal{S}$. A state $s_t \in \mathcal{S}$ consists of the information of the candidate items (i.e., the ads sequence and the organic items sequence available at the $t$-th step), the user (e.g., age, gender), and the context (e.g., order time, order location). For different entrances, we use the same set of features for each type of item to facilitate transfer.

  • Action space $\mathcal{A}$. An action $a_t \in \mathcal{A}$ is the decision whether to display an ad on each slot of the current screen, which is formulated as follows:

    $a_t = \big(x_t^{(1)}, \dots, x_t^{(K)}\big), \quad x_t^{(k)} \in \{0, 1\}, \qquad (1)$

    where $x_t^{(k)}$ denotes whether an ad is displayed on the $k$-th slot of the current screen. In our scenarios, the action spaces of different entrances are the same. Notice that we do not change the order of the items within the ads sequence and the organic items sequence.

  • Reward $r$. After the system takes an action at a state, a user browses the mixed list and gives feedback. The reward $r_t$ is calculated based on the feedback and consists of the ads revenue $r_t^{\text{ad}}$ and the service fee $r_t^{\text{fee}}$:

    $r_t = r_t^{\text{ad}} + r_t^{\text{fee}}. \qquad (2)$

    Besides, we define the N-step return (NSR) (liu2019value) after the current state $s_t$ (a small numeric sketch is given at the end of this section):

    $\mathrm{NSR}(s_t) = \sum_{n=0}^{N-1} \gamma^{\,n}\, r_{t+n}. \qquad (3)$
  • Transition probability $\mathcal{P}$. $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is the state transition probability from $s_t$ to $s_{t+1}$ after taking the action $a_t$, where $t$ is the index of the screen. When the user pulls down, the state transits to the state $s_{t+1}$ corresponding to the next screen. The items selected by $a_t$ for display are removed from the candidate list in the next state $s_{t+1}$. If the user no longer pulls down, the transition terminates.

  • Discount factor $\gamma$. The discount factor $\gamma \in [0, 1]$ balances the short-term and long-term rewards.

We denote the dataset for the source task as $\mathcal{D}_s$ and the dataset for the target task as $\mathcal{D}_t$. Given the MDP formulated above, the objective is to find an ads allocation policy for the target task based on $\mathcal{D}_s$ and $\mathcal{D}_t$ that maximizes the cumulative reward.
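
To make the NSR in Eq. (3) concrete, here is a minimal numeric sketch of the discounted N-step return; the specific values of the discount factor, step number, and rewards are illustrative only.

    # Minimal sketch of the N-step return (NSR) in Eq. (3): a discounted sum
    # of the rewards observed on the next N screens after the current state.
    def n_step_return(rewards, gamma=0.9, n=3):
        return sum(gamma ** i * r for i, r in enumerate(rewards[:n]))

    # Example: rewards observed on the next three screens of a request.
    print(n_step_return([1.0, 0.5, 0.2], gamma=0.9, n=3))  # 1.0 + 0.9*0.5 + 0.81*0.2 = 1.612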

3. Methodology

We now introduce the Similarity-based Hybrid Transfer for Ads Allocation (SHTAA) in detail. The two main ideas are: i) using the predicted uncertainty-aware N-step return (NSR) to measure MDP similarity, and ii) proposing a hybrid transfer approach (consisting of instance transfer and strategy transfer) for selective transfer and avoidance of negative transfer.

3.1. Uncertainty-Aware MDP Similarity

We propose an uncertainty-aware MDP similarity in which the distribution over the NSR is modeled explicitly instead of only its mean being estimated.

We combine the high-dimensional representation capability of deep learning models and the distribution modeling capability of Gaussian processes (GP) (rasmussen2003gaussian) to construct the NSR model, which outputs the distribution of the NSR of an MDP. From a Bayesian perspective on function learning, we impose a prior distribution on the function and (approximately) infer the posterior distribution over functions as the learned model, which yields distributional NSR predictions and thus provides uncertainty estimates along with the predictions.

The prior distribution expresses our prior belief; here, we assume a GP prior over the NSR function $f$, i.e., $f \sim \mathcal{GP}\big(m(\cdot), k(\cdot, \cdot)\big)$, where $m$ is the mean function and $k$ is the covariance kernel function. Similar to du2021exploration, we combine the basic GP kernel function with a deep learning model to generate a deep kernel function and use the sparse variational GP (SVGP) method (hensman2013gaussian; hensman2015scalable; titsias2009variational) for inference. Finally, we approximate the posterior distribution. Following salimbeni2017doubly, we can calculate the predicted mean and variance of the NSR. We denote the predicted distributions of the NSR as follows:

$P_s(s_t) = \mathcal{N}\big(\mu_s(s_t), \sigma_s^2(s_t)\big), \quad P_t(s_t) = \mathcal{N}\big(\mu_t(s_t), \sigma_t^2(s_t)\big), \qquad (4)$

where $P_s$ and $P_t$ are the predicted distributions from the NSR models pre-trained on the source and target datasets, respectively.
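
For concreteness, below is a minimal GPyTorch sketch of an SVGP NSR model with a deep kernel; the network architecture, feature dimension, kernel choice, and class name are our own assumptions and not the paper's implementation.

    import torch
    import gpytorch

    class DeepKernelNSRModel(gpytorch.models.ApproximateGP):
        # Sparse variational GP over NSR values whose kernel acts on learned deep features.
        def __init__(self, inducing_points, state_dim, feature_dim=16):
            variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
                inducing_points.size(0))
            variational_strategy = gpytorch.variational.VariationalStrategy(
                self, inducing_points, variational_distribution, learn_inducing_locations=True)
            super().__init__(variational_strategy)
            # Deep part of the deep kernel: maps a state to a low-dimensional feature.
            self.feature_extractor = torch.nn.Sequential(
                torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
                torch.nn.Linear(64, feature_dim))
            self.mean_module = gpytorch.means.ConstantMean()
            self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

        def forward(self, x):
            z = self.feature_extractor(x)  # inducing points and inputs share this mapping
            return gpytorch.distributions.MultivariateNormal(
                self.mean_module(z), self.covar_module(z))

    # Training would maximize the variational ELBO with a Gaussian likelihood, e.g.:
    # likelihood = gpytorch.likelihoods.GaussianLikelihood()
    # mll = gpytorch.mlls.VariationalELBO(likelihood, model, num_data=num_training_samples)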

Subsequently, we calculate the uncertainty-aware MDP similarity based on the KL divergence between $P_s$ and $P_t$. The similarity-based weight for a sample with state $s_i$ is calculated as follows:

$w_i = \exp\!\Big(-D_{\mathrm{KL}}\big(P_t(s_i)\,\|\,P_s(s_i)\big)\Big). \qquad (5)$
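
As an illustration, the following sketch computes the weight in Eq. (5) from the two Gaussian NSR predictions, using the closed-form KL divergence between univariate Gaussians; the exp(-KL) mapping follows the reconstruction above and should be read as an assumption rather than the paper's exact formula.

    import math

    def gaussian_kl(mu_t, var_t, mu_s, var_s):
        # Closed-form KL( N(mu_t, var_t) || N(mu_s, var_s) ) for univariate Gaussians.
        return 0.5 * (var_t / var_s + (mu_s - mu_t) ** 2 / var_s - 1.0 + math.log(var_s / var_t))

    def similarity_weight(mu_t, var_t, mu_s, var_s):
        # Larger weight when the target and source NSR distributions agree.
        return math.exp(-gaussian_kl(mu_t, var_t, mu_s, var_s))

    # Example: the two NSR models predict nearly the same distribution for a state.
    print(similarity_weight(1.2, 0.25, 1.1, 0.30))  # ~0.98, i.e., similar local dynamics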
Figure 2. The hybrid transfer includes instance transfer and strategy transfer. In instance transfer, we filter samples using similarity-based weight. In strategy transfer, we constrain the action space and guide the learning of target agent based on the pre-trained source agent. The red lines indicate gradient propagation.

3.2. Hybrid Transfer

The hybrid transfer method, which is shown in Figure 2, consists of instance transfer and strategy transfer. Next we will introduce each part separately.

3.2.1. Instance Transfer

The key challenge in instance transfer is to avoid negative transfer. In this paper, we propose an instance transfer method based on the uncertainty-aware MDP similarity. Specifically, given a sample from $\mathcal{D}_s$ and a weight threshold $\epsilon$, the local environmental dynamics around the sample's state in the source task are regarded as similar to those in the target task if the corresponding similarity-based weight $w_i \ge \epsilon$. In this case, the source-task sample is added directly into the mixed dataset. When $w_i < \epsilon$, which means that the local environmental dynamics related to the sample differ between the source and target tasks, the sample is filtered out. (We have also tried weighted transfer in instance transfer; the performance is close.) In this way, selective instance transfer based on the uncertainty-aware MDP similarity can effectively avoid negative transfer, as sketched below.
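
A minimal sketch of the filtering and merging step follows, assuming each source sample already carries its precomputed similarity-based weight; the data layout and names are illustrative only.

    def build_mixed_dataset(source_samples, target_samples, epsilon=0.8):
        # Keep a source sample only when its similarity-based weight clears the threshold.
        transferred = [s for s in source_samples if s["weight"] >= epsilon]
        # All target samples are kept; the union forms the mixed offline dataset.
        return transferred + list(target_samples)

    # Example usage with toy samples (each dict stands for one transition).
    source_samples = [{"transition": "t1", "weight": 0.95}, {"transition": "t2", "weight": 0.40}]
    target_samples = [{"transition": "t3", "weight": 1.00}]
    mixed = build_mixed_dataset(source_samples, target_samples, epsilon=0.8)  # keeps t1, drops t2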

3.2.2. Strategy Transfer

The pre-trained agent for the source task (hereinafter referred to as $Q_s$) can effectively guide the learning of the agent for the target task (hereinafter referred to as $Q_t$). Specifically, we first constrain the action space of $Q_t$'s target value function based on the Q-values output by $Q_s$. The RL loss for $Q_t$ is calculated as follows:

$\mathcal{L}_{\mathrm{RL}} = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1})}\Big[\big(r_t + \gamma \max_{a' \in \mathcal{A}_K(s_{t+1})} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\big)^2\Big], \qquad (6)$

where $\mathcal{A}_K(s_{t+1})$ is the set of actions corresponding to the top-$K$ highest values of $Q_s(s_{t+1}, \cdot)$ and $K$ is transformed linearly from the similarity-based weight. By restricting the action space, $Q_t$ is forced to behave similarly to $Q_s$.

Second, the Q-values output by $Q_s$ can also guide the learning process of $Q_t$. So we take $Q_s(s_t, a_t)$ as a learning target for $Q_t$, as follows:

$\mathcal{L}_{\mathrm{guide}} = \mathbb{E}_{(s_t, a_t)}\Big[\big(Q_s(s_t, a_t) - Q_t(s_t, a_t)\big)^2\Big]. \qquad (7)$

The similarity-based weight $w$ is further used to adjust the extent of strategy transfer. For each iteration, we sample a batch of transitions from the mixed offline dataset and update the agent by back-propagating the gradients of the following loss (a minimal sketch is given below):

$\mathcal{L} = \mathcal{L}_{\mathrm{RL}} + w \cdot \mathcal{L}_{\mathrm{guide}}. \qquad (8)$
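
To tie Eqs. (6)-(8) together, here is a minimal PyTorch sketch of the combined loss; the top-$K$ action constraint, the batch layout, and the per-sample weighting of the guidance term follow the reconstruction above and are assumptions, not the paper's exact implementation.

    import torch

    def hybrid_loss(q_target_net, q_online_net, q_source_net, batch, gamma=0.99, k=4):
        # batch: dict of tensors with keys "state", "action", "reward", "next_state", "weight".
        # Every Q network maps a state batch to per-action Q-values of shape [B, num_actions].
        s, a, r, s_next, w = (batch["state"], batch["action"], batch["reward"],
                              batch["next_state"], batch["weight"])

        with torch.no_grad():
            # Constrain the TD target to the top-k actions ranked by the source agent Q_s (Eq. 6).
            topk_actions = q_source_net(s_next).topk(k, dim=1).indices           # [B, k]
            q_next = q_target_net(s_next).gather(1, topk_actions).max(dim=1).values
            td_target = r + gamma * q_next
            guide_target = q_source_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q_s(s, a), Eq. 7

        q_sa = q_online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss_rl = torch.nn.functional.mse_loss(q_sa, td_target)
        loss_guide = (w * (q_sa - guide_target) ** 2).mean()  # weighted guidance term, Eq. 8
        return loss_rl + loss_guide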

3.3. Offline Training

We show the process of offline training in Algorithm 1. We first pre-train the NSR models for the source and target tasks as well as the source agent $Q_s$. Then we train $Q_t$ through our hybrid transfer method.

Input: source dataset $\mathcal{D}_s$, target dataset $\mathcal{D}_t$ (generated by an online exploratory policy $\pi_e$)
Pre-train:
1: Train the NSR model for the source task on $\mathcal{D}_s$
2: Train the NSR model for the target task on $\mathcal{D}_t$
3: Train $Q_s$ on $\mathcal{D}_s$
Train:
4: Calculate the similarity-based weight $w_i$ for each sample in $\mathcal{D}_s$
5: Filter the samples in $\mathcal{D}_s$ based on the weight $w_i$ and the threshold $\epsilon$
6: Merge the filtered $\mathcal{D}_s$ and $\mathcal{D}_t$ into the mixed dataset $\mathcal{D}_m$
7: Initialize $Q_t$ with random weights
8: repeat
9:    Sample a batch of transitions from $\mathcal{D}_m$
10:   Update the parameters of $Q_t$ by minimizing the loss in Eq. (8)
11: until convergence
Algorithm 1: Offline training of SHTAA

4. Experiments

4.1. Experimental Settings

4.1.1. Dataset

We collect the dataset by running an exploratory policy on Meituan food delivery platform during January 2022. The dataset contains 12,411,532 requests in the source entrance and 813,271 requests in the target entrance. Each request contains several transitions.

4.1.2. Evaluation Metrics

We evaluate with the ads revenue $r^{\text{ad}}$ and the service fee $r^{\text{fee}}$; see the definitions in Section 2.

4.1.3. Parameters Settings

The hidden layer sizes of the NSR models is . The structure and parameters of the agents follow the work in CrossDQN (liao2021cross). The learning rate is , the optimizer is Adam (kingma2014adam) and the batch size is 8,192.

4.2. Offline Experiment

In this section, we validate our method on offline data and evaluate the performance using an offline estimator. Built with extensive engineering, the offline estimator models user preference and aligns well with the online service.

4.2.1. Baselines & Ablations

We study four baselines and three ablated variants to verify the effectiveness of SHTAA.


  • CrossDQN (liao2021cross) is an advanced method for ads allocation. Here we take it as the structure of $Q_t$ and train it on $\mathcal{D}_t$.

  • CrossDQN (w/ $\mathcal{D}_s$) transfers all samples in the source dataset into the training of $Q_t$ on top of the previous baseline.

  • IWFQI (tirinzoni2018importance) is an algorithm for transferring samples in batch RL that uses importance weighting to automatically account for the difference between the source and target distributions.

  • NSR-CrossDQN. liu2019value propose the NSR-based value function transfer method. Here we implement this transfer method on $\mathcal{D}_s$ and $\mathcal{D}_t$ based on CrossDQN.

  • SHTAA (w/o UA-Sim) does not use the uncertainty-aware MDP similarity and instead uses the MDP similarity defined in the NSR-based value function transfer method (liu2019value).

  • SHTAA (w/o AC) does not use the action constraint in SHTAA.

  • SHTAA (w/o $\mathcal{L}_{\mathrm{guide}}$) does not use the guidance loss $\mathcal{L}_{\mathrm{guide}}$ in SHTAA.

4.2.2. Performance Comparison

We present the experimental results in Table 1 and have the following findings: i) The performance of SHTAA is superior to CrossDQN and CrossDQN (w/ $\mathcal{D}_s$), which mainly justifies that SHTAA can selectively transfer the samples and knowledge from the source task to the target task. ii) Compared with IWFQI, the superior performance of our method mainly justifies the effectiveness of our strategy transfer. iii) Compared with NSR-CrossDQN, the superior performance of our method mainly justifies the effectiveness of our uncertainty-aware MDP similarity and action constraint.

4.2.3. Ablation Study

To verify the impact of our designs, we study three ablated variants of our method and have the following findings: i) The performance of SHTAA is superior to CrossDQN, which verifies the effectiveness of all our designs. ii) The performance of SHTAA is superior to SHTAA (w/o UA-Sim), which mainly verifies the effectiveness of the uncertainty-aware MDP similarity. iii) The performance gap between SHTAA and SHTAA (w/o AC) indicates the effectiveness of the action constraint. iv) The performance gap between SHTAA and SHTAA (w/o $\mathcal{L}_{\mathrm{guide}}$) indicates the guiding significance of $\mathcal{L}_{\mathrm{guide}}$.

4.2.4. Hyperparameter Analysis

We analyze the sensitivity of two hyperparameters: the NSR step number $N$ and the weight threshold $\epsilon$. The optimal hyperparameter values are obtained by grid search, and we have the following findings: i) $N$ is the step number of the NSR. Due to the discount factor $\gamma$, rewards sampled at larger time steps contribute less to the NSR. Therefore, if an appropriate step number is chosen, the corresponding NSR contains the most information about the local environmental dynamics. ii) $\epsilon$ is the weight threshold for instance transfer. A smaller $\epsilon$ leads to more negative transfer, while a larger $\epsilon$ leads to less efficient transfer learning.

4.3. Online Results

We deploy the agent trained by SHTAA on the data-poor entrance of the Meituan food delivery platform through an online A/B test. As a result, the ads revenue and service fees increase by 2.72% and 3.31%, respectively, which demonstrates that our method substantially increases the platform revenue.

Model                            $r^{\text{ad}}$       $r^{\text{fee}}$
CrossDQN                         0.4441 (0.0004)       0.2492 (0.0007)
CrossDQN (w/ $\mathcal{D}_s$)    0.4450 (0.0005)       0.2492 (0.0009)
IWFQI                            0.4452 (0.0004)       0.2505 (0.0007)
NSR-CrossDQN                     0.4462 (0.0009)       0.2518 (0.0009)
SHTAA                            0.4565 (0.0003)       0.2596 (0.0005)
 - w/o UA-Sim                    0.4464 (0.0004)       0.2521 (0.0001)
 - w/o AC                        0.4502 (0.0001)       0.2561 (0.0002)
 - w/o $\mathcal{L}_{\mathrm{guide}}$    0.4527 (0.0006)       0.2572 (0.0006)
Improvement                      2.31%                 3.12%
Table 1. The experimental results. Each result is reported as mean (standard deviation). The improvement denotes the improvement of our method over the best baseline.

5. Conclusions

In this paper, we propose Similarity-based Hybrid Transfer for Ads Allocation (SHTAA) to effectively transfer samples and knowledge from a data-rich source entrance to a data-poor entrance. Specifically, we present a novel uncertainty-aware MDP similarity in which the similarity is calculated from the predicted NSR and the distribution over the NSR is modeled explicitly instead of only its mean being estimated. Based on this similarity, we design a hybrid transfer method consisting of instance transfer and strategy transfer to effectively transfer samples and knowledge from the source task to the target task. Both offline experiments and an online A/B test demonstrate the superior performance and efficiency of our method.

References