Ads and organic items are mixed together and displayed to users in e-commerce feeds nowadays (yan2020LinkedInGEA; Ghose2009AnEA; li2020deep), and how to allocate limited ad slots reasonably and effectively to maximize platform revenue has gained growing attention (Wang2011LearningTA; Mehta2013OnlineMA; zhang2018whole). Recent dynamic ads allocation strategies model the problem as a Markov Decision Process (MDP) (sutton1998introduction) and solve it using reinforcement learning (RL) (zhang2018whole; liao2021cross; zhao2019deep; Feng2018LearningTC; zhao2020jointly). For instance, xie2021hierarchical propose a hierarchical RL-based framework that first decides the type of item to present and then determines the specific item for each slot. liao2021cross propose CrossDQN, which takes the items crossed according to the action as input and allocates the slots one screen at a time.
However, these excellent RL-based algorithms face a major challenge when applied to e-commerce platforms. As shown in Figure 1, the Meituan food delivery platform has multiple entrances for different categories (e.g., Homepage, Food, Dessert, and so on), facilitating users' access to different categories of content. Some entrances receive few visits, which can hardly support the learning of a good agent for ads allocation. Therefore, it is desirable for a learning algorithm to transfer samples and leverage knowledge acquired in a data-rich entrance to improve the performance of the ads allocation agent in a data-poor entrance.
To this end, we present Similarity-based Hybrid Transfer for Ads Allocation (SHTAA), which can effectively transfer samples and knowledge from a data-rich entrance to the learning of an ads allocation agent in other, data-poor entrances. (The code and a data example will be made accessible after review.) Specifically, we quantify the environmental dynamics of an MDP by the predicted distributional N-step return (NSR) and define a concept of uncertainty-aware MDP similarity based on it. We then design a hybrid transfer framework (consisting of instance transfer and strategy transfer) based on the similarity value, which can effectively transfer samples and knowledge from the data-rich source task to the data-poor target task. This is a meaningful attempt at applying transfer learning to ads allocation. We have conducted several offline experiments and evaluated our approach on a real-world food delivery platform. The experimental results show that SHTAA effectively improves the performance of the ads allocation agent in a data-poor entrance and obtains significant improvements in platform revenue.
2. Problem Formulation
On the Meituan food delivery platform, we present slots one screen at a time and allocate them for each screen in the feed of a request sequentially. The ads allocation problem for different entrances can therefore be formulated as an MDP (S, A, R, P, γ), the elements of which are defined as follows:
State space S. A state s_t consists of the information of the candidate items (i.e., the ads sequence and the organic items sequence available at the t-th step), the user (e.g., age, gender), and the context (e.g., order time, order location). For different entrances, we use the same set of features for the different types of items to facilitate transfer.
Action space A. An action a_t is the decision of whether to display an ad on each slot of the current screen, which can be formulated as a binary vector with one entry per slot (1 if an ad is displayed on that slot, 0 otherwise).
In our scenarios, the action spaces of different entrances are the same. Note that we do not change the order of the items within the ads sequence or the organic items sequence.
Reward R. After the system takes an action at a state, the user browses the mixed list and gives feedback. The reward r_t is calculated based on this feedback and consists of the ads revenue r_t^ad and the service fee r_t^fee, i.e., r_t = r_t^ad + r_t^fee.
Besides, we define the N-step return (NSR) (liu2019value) after the current state s_t as R_t^N = Σ_{i=0}^{N-1} γ^i r_{t+i}.
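As a concrete illustration, the NSR is simply the discounted sum of the next N rewards along a trajectory. The helper below is a minimal sketch; the function name and the list-of-rewards representation are ours, not from the paper:

```python
def n_step_return(rewards, t, n, gamma):
    """N-step return after state s_t: sum of gamma^i * r_{t+i}
    for i = 0 .. N-1, truncated if the episode ends earlier."""
    return sum(gamma ** i * rewards[t + i]
               for i in range(min(n, len(rewards) - t)))

# Example: rewards over four screens, N = 3, gamma = 0.9
nsr = n_step_return([1.0, 0.5, 0.2, 0.1], t=0, n=3, gamma=0.9)
# 1.0 + 0.9 * 0.5 + 0.81 * 0.2 = 1.612
```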
Transition probability P. P(s_{t+1} | s_t, a_t) is the state transition probability from s_t to s_{t+1} after taking the action a_t, where t is the index of the screen. When the user pulls down, the state transits to the state s_{t+1} corresponding to the next screen. The items selected by a_t for display are removed from the candidate list in the next state s_{t+1}. If the user no longer pulls down, the transition terminates.
Discount factor γ. The discount factor γ balances the short-term and long-term rewards.
We collect a dataset for the source task and a dataset for the target task. Given the MDP formulated above, the objective is to find an ads allocation policy for the target task based on both datasets that maximizes the cumulative reward.
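To make the formulation concrete, the transitions stored in the source and target datasets can be represented as below. This is an illustrative sketch; the class and field names are our own, not from the paper:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Transition:
    """One screen-level transition (state, action, reward, next state).
    The action is a binary vector with one entry per slot on the
    screen: 1 means an ad is displayed on that slot."""
    state: Dict            # candidate items, user profile, context features
    action: List[int]      # e.g. [1, 0, 0, 1] for a four-slot screen
    reward: float          # ads revenue plus service fee for this screen
    next_state: Dict       # candidates minus the items just displayed

# A source dataset and a target dataset are simply lists of such
# transitions collected at the respective entrances.
t = Transition(state={"user": "u1"}, action=[1, 0, 0, 1],
               reward=0.44, next_state={"user": "u1"})
```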
3. Method
We now introduce the Similarity-based Hybrid Transfer for Ads Allocation (SHTAA) in detail. The two main ideas are: i) using the predicted uncertainty-aware N-step return (NSR) to measure MDP similarity, and ii) a hybrid transfer approach (consisting of instance transfer and strategy transfer) for selective transfer and avoidance of negative transfer.
3.1. Uncertainty-Aware MDP Similarity
We propose an uncertainty-aware MDP similarity concept in which the distribution over the NSR is modeled explicitly instead of estimating only its mean.
We combine the high-dimensional representation capability of deep learning models with the distribution modeling capability of Gaussian processes (GP) (rasmussen2003gaussian) to construct the NSR model, which outputs the distribution of the NSR of an MDP. Under a Bayesian perspective of function learning, we impose a prior distribution on the function and (approximately) infer the posterior distribution over functions as the learned model, which yields distributional NSR predictions and can thus provide uncertainty estimates along with the predictions.
The prior distribution expresses our prior belief; here we assume a GP prior over the NSR function, specified by a mean function and a covariance kernel function. Similar to du2021exploration, we combine a basic GP kernel with a deep learning model to generate a deep kernel function, and use the sparse variational GP (SVGP) method (hensman2013gaussian; hensman2015scalable; titsias2009variational) to approximate the posterior distribution. Following salimbeni2017doubly, we can then calculate the predicted mean and variance of the NSR, yielding a Gaussian predictive distribution of the NSR for each input. We pre-train two such NSR models, one on the source dataset and one on the target dataset, which give the source and target predictive distributions, respectively.
Subsequently, we calculate the uncertainty-aware MDP similarity based on the KL divergence between the source and target predictive NSR distributions, and the similarity-based weight for a sample is obtained as a monotonically decreasing function of this KL divergence.
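Since both predictive NSR distributions are Gaussian, the KL divergence has a closed form. The sketch below uses that closed form and maps it to a weight via exp(-KL); the exp(-KL) mapping is our assumption for illustration, not necessarily the paper's exact weighting function:

```python
import math

def kl_gaussian(mu_p, var_p, mu_q, var_q):
    """Closed-form KL divergence KL(p || q) between two
    univariate Gaussians p = N(mu_p, var_p), q = N(mu_q, var_q)."""
    return (0.5 * math.log(var_q / var_p)
            + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q) - 0.5)

def similarity_weight(mu_s, var_s, mu_t, var_t):
    """Hypothetical similarity-based weight in (0, 1]: identical
    source/target NSR distributions give 1, dissimilar ones decay
    toward 0."""
    return math.exp(-kl_gaussian(mu_s, var_s, mu_t, var_t))

# Identical predictive distributions yield the maximum weight 1.0.
w = similarity_weight(0.5, 0.1, 0.5, 0.1)
```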
3.2. Hybrid Transfer
The hybrid transfer method, which is shown in Figure 2, consists of instance transfer and strategy transfer. Next we will introduce each part separately.
3.2.1. Instance Transfer
The key challenge in instance transfer is to avoid negative transfer. In this paper, we propose an instance transfer method based on the uncertainty-aware MDP similarity. Specifically, given a source-task sample and a weight threshold, the local environmental dynamics around the sample in the source task are regarded as similar to those in the target task if the corresponding similarity-based weight exceeds the threshold. In this case, the sample from the source task is added directly into the mixed dataset. When the weight falls below the threshold, which means that the local environmental dynamics around the sample differ between the source and target tasks, the sample is filtered out. (We have also tried weighted transfer in instance transfer; the performances are close.) In this way, selective instance transfer based on the uncertainty-aware MDP similarity can effectively avoid negative transfer.
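The filtering step above amounts to simple thresholding of the similarity-based weights. A minimal sketch, with names of our own choosing:

```python
def select_transfer_samples(source_samples, weights, threshold):
    """Keep only the source samples whose similarity-based weight
    exceeds the threshold; the rest are filtered out to avoid
    negative transfer."""
    return [s for s, w in zip(source_samples, weights) if w > threshold]

# Toy example: three source transitions with their weights.
source = ["t1", "t2", "t3"]
weights = [0.9, 0.3, 0.7]
kept = select_transfer_samples(source, weights, threshold=0.5)
# kept == ["t1", "t3"]; these are merged with the target dataset.
```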
3.2.2. Strategy Transfer
The pre-trained agent for the source task (hereinafter the source agent) can well guide the learning of the agent for the target task (hereinafter the target agent). Specifically, we first constrain the action space of the target agent's target value function based on the Q-values output by the source agent when calculating the RL loss for the target agent.
Here the constrained action set consists of the actions with the highest Q-values under the source agent, and the size of this set is obtained by a linear transformation of the similarity-based weight. By restricting the action space, the target agent is forced to behave similarly to the source agent.
Second, the Q-values output by the source agent can also guide the learning process of the target agent, so we take the source agent's Q-value as an auxiliary learning target for the target agent.
The similarity-based weight is further used to adjust the extent of strategy transfer. In each iteration, we sample a batch of transitions from the mixed offline dataset and update the agent via gradient back-propagation w.r.t. the total loss.
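Putting the two strategy-transfer ingredients together, the sketch below restricts the Bellman max to the source agent's top-ranked actions and adds a similarity-weighted guidance term. The squared-error form, the set size k passed explicitly, and the loss weight beta are our assumptions for illustration:

```python
def constrained_target(q_source_next, q_target_next, r, gamma, k):
    """Bellman target where the max over next-state actions is
    restricted to the k actions ranked highest by the pre-trained
    source agent (the action constraint)."""
    top_k = sorted(range(len(q_source_next)),
                   key=lambda a: q_source_next[a])[-k:]
    return r + gamma * max(q_target_next[a] for a in top_k)

def hybrid_loss(q_pred, td_target, q_source, w, beta):
    """Squared TD error plus a similarity-weighted guidance term
    that pulls the target agent's Q-value toward the source agent's."""
    return (q_pred - td_target) ** 2 + beta * w * (q_pred - q_source) ** 2

# Toy example: three candidate actions on the next screen.
target = constrained_target([1.0, 3.0, 2.0], [5.0, 0.0, 4.0],
                            r=1.0, gamma=0.9, k=2)
# Source agent ranks actions 1 and 2 highest, so action 0 (target
# Q-value 5.0) is excluded and the max is taken over {0.0, 4.0}.
```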
3.3. Offline Training
We show the process of offline training in Algorithm 1. We first pre-train the NSR models on the source and target datasets. Then we train the target agent through our hybrid transfer method.
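Algorithm 1 itself is not reproduced here, but its outer loop can be sketched as follows, with the pre-trained NSR models abstracted behind a per-sample weight function and the agent update behind a callback; both stand-ins and all names are ours:

```python
import random

def train_shtaa(d_source, d_target, weight_fn, update_fn,
                threshold, n_iters, batch_size):
    """Offline training skeleton: score each source transition with
    its similarity-based weight, merge the retained ones with the
    target data, then update the agent on mini-batches drawn from
    the mixed dataset."""
    mixed = list(d_target)
    mixed += [s for s in d_source if weight_fn(s) > threshold]
    for _ in range(n_iters):
        batch = random.sample(mixed, min(batch_size, len(mixed)))
        update_fn(batch)   # one gradient step of the target agent
    return mixed

# Toy run with stand-in transitions and a dummy weight function.
updates = []
mixed = train_shtaa([1, 2, 3], [10], lambda s: s / 3.0,
                    updates.append, threshold=0.5,
                    n_iters=4, batch_size=2)
```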
4.1. Experimental Settings
We collect the dataset by running an exploratory policy on the Meituan food delivery platform during January 2022. The dataset contains 12,411,532 requests in the source entrance and 813,271 requests in the target entrance. Each request contains several transitions.
4.1.2. Evaluation Metrics
We evaluate with the ads revenue and the service fee; see their definitions in Section 2.
4.1.3. Parameters Settings
The hidden layer sizes of the NSR models are . The structure and parameters of the agents follow CrossDQN (liao2021cross). The learning rate is , the optimizer is Adam (kingma2014adam), and the batch size is 8,192.
4.2. Offline Experiment
In this section, we validate our method on offline data and evaluate the performance using an offline estimator. Through extensive engineering, the offline estimator models user preference and aligns well with the online service.
4.2.1. Baselines & Ablations
We study four baselines and three ablated variants to verify the effectiveness of SHTAA.
CrossDQN (liao2021cross) is an advanced method for ads allocation. Here we take it as the structure of the target agent and train it on the target dataset only.
CrossDQN (w/ ) transfers all samples in the source dataset into the training of the target agent, on top of the previous baseline.
IWFQI (tirinzoni2018importance) is an algorithm for transferring samples in batch RL that uses importance weighting to automatically account for the difference between the source and target distributions.
NSR-CrossDQN. liu2019value propose an NSR-based value function transfer method. Here we implement this transfer method between the source and target tasks on top of CrossDQN.
SHTAA (w/o UA-Sim) does not use the uncertainty-aware MDP similarity, and instead uses the MDP similarity concept defined in the NSR-based value function transfer method (liu2019value).
SHTAA (w/o AC) does not use action constraint in SHTAA.
SHTAA (w/o ) does not use in SHTAA.
4.2.2. Performance Comparison
We present the experimental results in Table 1 and have the following findings: i) The performance of SHTAA is superior to CrossDQN and CrossDQN (w/ ), which mainly justifies that SHTAA can selectively transfer samples and knowledge from the source task to the target task. ii) Compared with IWFQI, the superior performance of our method mainly justifies the effectiveness of our strategy transfer. iii) Compared with NSR-CrossDQN, the superior performance of our method mainly justifies the effectiveness of our uncertainty-aware MDP similarity concept and the action constraint.
4.2.3. Ablation Study
To verify the impact of our designs, we study three ablated variants of our method and have the following findings: i) The performance of SHTAA is superior to CrossDQN, which verifies the effectiveness of all our designs. ii) The performance of SHTAA is superior to SHTAA (w/o UA-Sim), which mainly verifies the effectiveness of uncertainty-aware MDP similarity. iii) The performance gap between SHTAA and SHTAA (w/o AC) indicates the effectiveness of action constraint. iv) The performance gap between SHTAA and SHTAA (w/o ) indicates the guiding significance of .
4.2.4. Hyperparameter Analysis
We analyze the sensitivity of two hyperparameters: the step number N of the NSR and the weight threshold for instance transfer. The optimal hyperparameter values are obtained by grid search, and we have the following findings: i) Due to the discount factor γ, rewards sampled at larger time steps contribute less to the NSR. Therefore, if an appropriate step number N is chosen, the corresponding NSR may contain the most information about the local environmental dynamics. ii) A smaller weight threshold would lead to more negative transfer, while a larger one would lead to less efficient transfer learning.
4.3. Online Results
We deploy the agent trained by SHTAA in the Meituan food delivery platform's data-poor entrance through an online A/B test. As a result, we find that the ads revenue and service fees increase by 2.72% and 3.31%, respectively, which demonstrates that our method greatly increases the platform revenue.
| Method | Ads revenue | Service fee |
|---|---|---|
| CrossDQN | 0.4441 (0.0004) | 0.2492 (0.0007) |
| CrossDQN (w/ ) | 0.4450 (0.0005) | 0.2492 (0.0009) |
| IWFQI | 0.4452 (0.0004) | 0.2505 (0.0007) |
| NSR-CrossDQN | 0.4462 (0.0009) | 0.2518 (0.0009) |
| SHTAA | 0.4565 (0.0003) | 0.2596 (0.0005) |
| - w/o UA-Sim | 0.4464 (0.0004) | 0.2521 (0.0001) |
| - w/o AC | 0.4502 (0.0001) | 0.2561 (0.0002) |
| - w/o | 0.4527 (0.0006) | 0.2572 (0.0006) |
In this paper, we propose Similarity-based Hybrid Transfer for Ads Allocation (SHTAA) to effectively transfer samples and knowledge from a data-rich source entrance to other, data-poor entrances. Specifically, we present a novel uncertainty-aware MDP similarity concept in which the similarity is calculated based on the predicted NSR, and the distribution over the NSR is modeled explicitly instead of estimating only its mean. Based on this similarity, we design a hybrid transfer method consisting of instance transfer and strategy transfer to effectively transfer samples and knowledge from the source task. Both offline experiments and an online A/B test have demonstrated the superior performance and efficiency of our method.