1 Introduction
Real-time bidding (RTB) is a leading online ad inventory trading mechanism in which each ad display is sold through real-time auctions. It allows advertisers to target potential users with click or purchase interest at the impression level. In RTB, each bidding agent operates in a highly dynamic bidding environment with an unknown number of competitors. In real-time bidding, second-price auctions [Krishna2009] are usually held, where the winner is the bidder with the highest bid and pays the second highest price. In theory, to reach the Nash equilibrium of the static second-price auction, bidders are encouraged to submit their estimate of the impression's true value as the bid price [Krishna2009]. However, in practice, the market may not always maintain this ideal equilibrium, for various reasons. For instance, bidders are constrained by their budgets. To avoid running out of money quickly before observing more valuable impressions, the optimal bid price usually deviates from the true value. In addition, the number of participants in each auction is unknown, and from each bidder's perspective, it may compete with different opponents at every step during its lifetime. Obtaining the optimal bidding strategy in such a stochastic game with a large number of unknown participants is the major challenge in RTB.

The dynamics in the RTB market mostly come from the hybrid behavior of each bidder. It is important for an intelligent bidding agent to infer its opponents' strategies while optimizing its own. In [He et al.2016]
, a deep Q-Network (DQN) based multi-task reinforcement learning architecture is proposed to jointly learn a policy for an agent and the behavior of its opponents in a multi-player game. The model requires features from the opponents as input. Similar ideas have been studied in the RTB domain: in [Jin et al.2018], the authors investigate the optimal bidding strategy in a fully observable multi-agent bidding environment, where each agent knows every other agent's budget and obtained reward at every step. However, in practice, each bidder is not aware of the configuration of its competitors, and the competitor set varies in every auction.

As the number of opponents grows, modeling every opponent's action becomes implausible and computationally expensive. In addition, given the nature of second-price auctions, only part of the opponents' actions can be observed by each agent. To analyze such highly dynamic games with incomplete information and a large number of participants, the Mean Field Theory [Stanley1971] has been employed. A scalable policy learning solution for multi-player games is proposed in [Yang et al.2018]. The core idea is to find the optimal actions for one agent in response to the mean action of its neighbors. In this way, instead of modeling the actions of all the agents in the environment, the mean action of the neighbors represents the action distribution of all neighboring agents. In addition, in [Gummadi et al.2012, Iyer et al.2011], the authors extensively analyzed and proved the existence of the Mean Field Equilibrium (MFE) in the dynamic bidding environment.
Inspired by these prior studies, we first address the partially observable opponent actions by adopting a Deep Attentive Survival model, which greatly outperforms state-of-the-art survival models. Furthermore, our solution integrates the opponent model into the policy learning framework for actor-critic based bidding agents. We take the second highest price as the aggregated action of all the other bidders, which enables each bidder to optimize its policy while modeling the uncertainty of the market. Our experiments show the equilibrium under different budget constraints and faster convergence in the multi-agent environment.
2 Related Work
Bid optimization is one of the key components of the decision making process in RTB. It aims to optimize the bid price at the impression level so as to maximize the potential profit under a certain budget constraint. Many research works have formulated it as a functional optimization problem [Perlich et al.2012, Zhang et al.2014a]. However, these functional methods make strong assumptions about the model form and fail to incorporate the dynamics of the bidding environment and the bidder's budget spending status into the model.
To address the above shortcomings of the prior studies, much research effort has been focused on Reinforcement Learning (RL) based RTB [Cai et al.2017, Du et al.2017, Wu et al.2018]. These studies mainly address the bid optimization problem for a single agent, neglecting the stochastic behaviors of the other bidders in the market. [Jin et al.2018] and [Zhao et al.2018] extend single-agent learning in RTB to a multi-agent bidding scenario. [Jin et al.2018] adopt the Deep Deterministic Policy Gradient (DDPG) algorithm at the advertiser cluster level and demonstrate the profit gain per bidder under competing or collaborating reward settings. However, one strong assumption in this work is that each agent knows every other agent's state. In practice, the only information each bidder has about its opponents is the market price, and only in case of winning. In second-price auctions, the winning prices of lost auctions are censored, which makes the bidding environment partially observable. In this case, the bidder only knows that the market price must be higher than its own bid.
To address the censorship problem, the survival model has been widely used in the medical domain for time-to-death analysis [Miller Jr2011]. Market price estimation has commonly been addressed by applying a non-parametric Kaplan-Meier estimator on an aggregated level [Wang et al.2016]. However, one aggregated distribution over all bid requests fails to capture the divergence in the data. In [Ren et al.2019], the authors propose a recurrent network that models the sequential pattern in the feature space of the individual user and directly estimates the market price probability. However, the features may not be limited to sequential dependencies only.
The Transformer [Vaswani et al.2017] is the first model relying entirely on self-attention to compute representations of its input, without using convolutions or sequence-aligned recurrent neural networks [Graves et al.2013]. It has fueled much of the latest advancement, such as pre-trained contextualized word embeddings [Peters et al.2018, Devlin et al.2018, Radford et al.2018], crucial to the success of sequential tasks in natural language processing. In this work, we adopt the Transformer model as a non-linear approximation of the survival function in our opponent model.
In a multi-agent stochastic game, it is essential for each agent to model its opponents' actions. Opponent actions are usually modeled either as i.i.d. [Brown1951] or as sequential actions with a short-term history [Mealing and Shapiro2013]. [Hernandez-Leal and Kaisers2017] assume the opponent redraws its strategy during a two-player repeated stochastic game, and the agent updates its belief about the opponent model from its observations. Similar to the repeated stochastic game setting in [Hernandez-Leal and Kaisers2017], our work focuses on repeated second-price auctions for the same ad campaign with unknown opponents. The key difference is that we model RTB auctions as a multi-player stochastic game and the opponents are not restricted to a limited memory bound.
In repeated auctions, the existence of a Mean Field Equilibrium (MFE) under budget constraints has been proved theoretically in [Iyer et al.2014, Gummadi et al.2012]. Both studies show that the value function for an agent to reach the MFE takes as inputs the known fixed market price, the budget, and the observed utility distribution, for example the estimated click-through rate (CTR). In practice, the market price is only partially observable to each bidder. In addition, conventional RL algorithms such as DDPG do not explicitly model the opponents' strategies. Therefore, in our work, we first extend the MFE setting into the Q-function estimation of the DDPG algorithm. The opponent model is integrated into the Q values using an indicator function, enforcing gradients to flow only through the actions that will result in a reward over the estimated opponent actions. We demonstrate the performance improvements of the optimal MFE strategy from a single agent's perspective.
3 Problem Formulation
In this section, we formulate the sequential second-price auctions as a multi-player stochastic game. Under the classic RL setting, the real-time bidding process is usually formalized as a Markov Decision Process (MDP) [Cai et al.2017], which is defined by four elements (S, A, R, T). A state s_t ∈ S describes the status of the agent at step t. The bid price is usually considered the action to take. However, the bid price can be set to any number in the range [0, b_max]. To generalize and limit the range of the action a_t ∈ A, like in [Jin et al.2018], in this paper a_t is normalized to range over [0, 1]. The optimal policy π is a mapping from S to A which optimizes the reward function r(s_t, a_t) of the agent, in which r ∈ R. The transition function T defines the probability distribution over the state space: T(s_{t+1} | s_t, a_t).

Different from conventional RL, where a single agent learns to react to the environment, a stochastic game describes the strategic interaction of all the agents in the environment. In such games, all the agents take actions at the same time, and their joint actions drive the complex change of the environment. As defined in a stochastic game, at each time step, all the agents choose their parts of the joint action a_t in the overall game state s_t. In a two-player game, for example, the joint action is defined as a_t = (a_t, a_t⁻), where a_t is the action of the player and a_t⁻ is the action of its opponent. The immediate reward is represented as r(s_t, a_t, a_t⁻). Correspondingly, the transition probability becomes T(s_{t+1} | s_t, a_t, a_t⁻), where s_{t+1} is the next joint state of all agents.
In RTB, from a single bidding agent's perspective, each auction involves an unknown number of other bidders, a.k.a. opponents. Each bidder has a different budget and target preferences. Therefore, the set of opponents in every auction is highly stochastic. The winner of an auction observes only the highest price among its opponents, regardless of the bid prices of the other bidders. In this way, the stochastic formulation of RTB can be greatly simplified to a two-player game. The winning price can be modeled as the joint action of all the other opponents and is partially observable. Unlike other games where the status of the opponents can usually be seen, in RTB auctions the opponents' attributes remain unknown.
In the ideal MFE scenario, it is supposed that all agents take a fixed, steady bid distribution and their own belief of the bid valuation as prior knowledge to optimize their strategies [Iyer et al.2014]. The policy each agent follows is stationary. In practice, the bid valuation is estimated by the CTR prediction model and the opponent bid distribution by the market price model. Thus, in this work, we adopt two pre-trained models into the framework to fulfill the above assumption.
We consider the bidding process as an episodic task, where each episode consists of T auctions. Each episode has a fixed budget B = c0 · CPM_train · T / 1000, where CPM_train is the cost per mille impressions in the training data and c0 is the budget constraint ratio.
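The budget computation above can be sketched in a few lines. This is a hedged illustration: the function and variable names (episode_budget, c0) are our own, and the exact scaling the paper uses is reconstructed from the text, not stated explicitly.

```python
def episode_budget(total_train_cost: float, total_train_impressions: int,
                   n_auctions: int, c0: float) -> float:
    """Episode budget proportional to the training-data cost level.

    CPM_train is the cost per mille (1000) impressions observed in the
    training log; c0 is the budget constraint ratio (e.g. 1/16, 1/8, 1/4).
    """
    cpm_train = 1000.0 * total_train_cost / total_train_impressions
    # Spend level for n_auctions impressions, scaled down by c0.
    return c0 * cpm_train * n_auctions / 1000.0

# e.g. a log with total cost 50,000 over 1,000,000 impressions has CPM 50;
# an episode of 1000 auctions at c0 = 1/8 then gets a budget of 6.25
```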
Fig. 1 depicts the architecture of the components used in this work. The CTR model takes as input the feature vector x of the historical bid requests, with binary labels 0 and 1 indicating an impression and a click respectively. The predicted click-through rate (pCTR) is later used to construct the agent state and the reward in the DDPG with Opponent Modeling (DDPG-OM) model. In the following sections, the opponent model and the DDPG-OM model are described in detail.

3.1 Opponent Modeling
In this section, we focus on modeling the opponents' actions, a.k.a. the market price distribution. The opponent model is defined as the market price distribution at the impression level. We use z to represent the action taken by the opponents, a.k.a. the highest price among all the other participants in the auction, where z is the market price. The probability density function (P.D.F.) of z is:

p(z) = h(z) · ∏_{l<z} (1 − h(l)).    (1)

As shown in Eq. 1, the P.D.F. of the market price can be calculated from the instant hazard function h(z), which indicates the probability of the instant occurrence of the event at price z conditioned on the event not having happened prior to z. In the RTB setting, the survival function S(z) = ∏_{l≤z} (1 − h(l)) represents the losing probability when bidding less than the market price, and p(z) gives the probability of observing the market price z.
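The hazard-to-P.D.F. conversion of Eq. 1 is a simple cumulative product; a minimal sketch over a discretized price space (the function name and the toy hazard values are our own):

```python
import numpy as np

def market_price_pdf(hazard: np.ndarray) -> np.ndarray:
    """Turn a discrete hazard function h(l) into the market-price P.D.F.

    hazard[l] = P(z = l | z >= l).  Following Eq. 1, the probability of
    observing market price z is p(z) = h(z) * prod_{l<z} (1 - h(l)):
    the event occurs at z and has not occurred at any earlier price.
    """
    # Survival probability just before each price level.
    survival_before = np.concatenate(([1.0], np.cumprod(1.0 - hazard[:-1])))
    return hazard * survival_before

h = np.array([0.1, 0.2, 0.5, 1.0])   # hazard forced to 1 at the last bucket
p = market_price_pdf(h)
assert abs(p.sum() - 1.0) < 1e-9     # a proper distribution when h[-1] = 1
```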
We take the features in the bid request as input and predict the hazard function over the discretized bid price space at the impression level. The P.D.F. p(z) can easily be derived from Eq. 1. For the uncensored data, the true label is a one-hot encoded vector over the discretized price space, with the element indexed by the market price set to 1.

We follow the loss functions in [Ren et al.2019]. For the uncensored data, the loss of the observed market price z is defined as:

L_obs = −log p(z) = −log h(z) − ∑_{l<z} log(1 − h(l)).

For the censored data, it is certain that bidding lower than the current bid price b still loses the auction. The corresponding loss is defined as:

L_lose = −∑_{l≤b} log(1 − h(l)).

In addition, for the winning auctions, bidding at any price higher than the observed market price z is guaranteed to win the auction. Such information can be shared with the censored data. The loss function is defined as follows:

L_win = −log(1 − ∏_{l≤z} (1 − h(l))).

The total loss of the model is the combination of the above losses, L = α L_obs + (1 − α)(L_lose + L_win), where α balances the loss values.
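The loss terms can be computed directly from the predicted hazard vector. The sketch below is our reconstruction of the losses described above (following the structure of [Ren et al.2019]); the function name and the epsilon smoothing are assumptions, not the paper's exact implementation.

```python
import numpy as np

def survival_losses(hazard, market_price=None, bid_price=None):
    """Loss terms for one auction from a predicted hazard vector.

    hazard[l] is the predicted h(l) over the discretized price space.
    Uncensored (won, market price z observed):
        L_obs  = -log p(z) = -log h(z) - sum_{l<z} log(1 - h(l))
        L_win  = -log P(z' <= z): bidding above z guarantees a win.
    Censored (lost with bid b, so z > b):
        L_lose = -log P(z > b) = -sum_{l<=b} log(1 - h(l))
    """
    eps = 1e-12
    log_surv = np.log(1.0 - np.asarray(hazard) + eps)   # log(1 - h(l))
    if market_price is not None:                        # uncensored auction
        z = market_price
        l_obs = -np.log(hazard[z] + eps) - log_surv[:z].sum()
        # chance of the market price falling at or below the observed z
        l_win = -np.log(1.0 - np.exp(log_surv[:z + 1].sum()) + eps)
        return l_obs, l_win
    b = bid_price                                       # censored auction
    return -log_surv[:b + 1].sum()
```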
3.2 Bidding Model
Under our repeated second-price auction setting, in every auction all the agents face the same bid request. The agents bid for the same ad campaign upon different requests, with an unknown number of opponents at each auction. The RL agent adopts the Deep Deterministic Policy Gradient (DDPG) framework [Lillicrap et al.2015] to learn a policy in a continuous action space.
State. For the DDPG agent, we take the budget left in an episode and the pCTR as the state s_t.
Action. Following the settings in [Jin et al.2018], the action a_t is a scalar which controls the bid price and is bounded to the range [0, 1]. The final bid price is calculated by b_t = a_t · b_max, where b_max is the upper bound of the bid price. The market price, or the aggregated action of the opponents, is denoted as z.
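The action-to-bid mapping can be sketched as follows; the function name and the default upper bound of 300 are illustrative assumptions (the paper only states that the action is scaled by an upper bound).

```python
def to_bid_price(action: float, max_bid: int = 300) -> int:
    """Map the normalized actor output a in [0, 1] to an integer bid price.

    max_bid is an assumed constant standing in for the upper bound of the
    bid price; the final bid is the bounded action scaled by this bound.
    """
    a = min(max(action, 0.0), 1.0)   # clip the actor output into [0, 1]
    return int(round(a * max_bid))

assert to_bid_price(0.5) == 150
assert to_bid_price(1.2) == 300      # out-of-range actions are clipped
```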
Reward. The reward is usually the Key Performance Indicator (KPI) defined by the advertisers, for instance a click, a purchase, or the profit. But such a reward signal is usually too sparse for the agent to learn from. Therefore, in this study, we assign the pCTR as the reward for all the winning auctions, even without a real click. For the losing auctions, since no price is paid, the reward remains zero. For an agent, the actor network takes the state s_t, which consists of the predicted CTR and the budget left in the current episode, and is parameterized with θ^μ; the deep neural network provides an action to take in the range [0, 1].
Action function:

a_t = μ(s_t; θ^μ).    (2)
In the vanilla version of the DDPG algorithm, the critic function takes the state-action pair of a single agent. In our model, the Q-function is approximated via mean field theory by integrating the opponent's action distribution. As shown in Eq. 3, p(z|x) is the market distribution obtained from the opponent model. The opponent action z is not directly observed from the environment, since the result of the auction can only be seen after placing a bid price. The market distribution provides the agent's belief about the opponents' actions. The indicator function allows the agent to account for the Q value only in the case of bidding higher than the market price: when the bid is lower than z, the agent cannot win the auction, and thus the Q value should be zero.
Critic function:

Q(s_t, a_t) = E_{z∼p(z|x)} [ 1(b_t > z) · Q(s_t, a_t; θ^Q) ].    (3)
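Masking the critic by the opponent model's belief reduces, in expectation, to weighting the Q value by the probability of winning. A minimal sketch (the function name and the toy distribution are our own; the paper's networks are not reproduced here):

```python
import numpy as np

def expected_q(q_value: float, bid_price: int, market_pdf: np.ndarray) -> float:
    """Mask the critic value by the belief of winning.

    market_pdf[z] is the opponent-model probability that the highest
    opposing bid equals z.  Taking the expectation of the indicator
    1(bid > z) zeroes out the Q value whenever the agent would lose the
    auction, so gradients only flow through actions that are believed to
    beat the estimated market price.
    """
    win_prob = market_pdf[:bid_price].sum()   # P(z < bid_price)
    return win_prob * q_value

p = np.array([0.1, 0.2, 0.3, 0.4])            # market price over 4 buckets
assert abs(expected_q(10.0, 2, p) - 3.0) < 1e-9   # P(z < 2) = 0.3
```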
The pseudo-code of the DDPG with Opponent Model (DDPG-OM) algorithm is shown in Alg. 1.
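As a rough illustration of the control flow behind Alg. 1, the skeleton below runs one episode of replay-buffer-based DDPG-OM training. This is a hedged sketch: env_step, actor, and critic_update are assumed interfaces standing in for the environment and the actor/critic networks, and the batch sizes are arbitrary.

```python
import random
from collections import deque

def ddpg_om_episode(env_step, actor, critic_update, opponent_pdf,
                    n_auctions=1000, buffer_size=10000, batch_size=32):
    """Minimal skeleton of one DDPG-OM training episode.

    env_step(action) -> (next_state, market_price, reward, done);
    actor(state) -> normalized action in [0, 1];
    critic_update(batch, opponent_pdf) performs one gradient step with
    the critic target masked by the opponent model's win belief.
    """
    replay = deque(maxlen=buffer_size)
    state = env_step(None)[0]                 # observe the initial state
    for _ in range(n_auctions):
        action = actor(state)                 # bid scale in [0, 1]
        next_state, z, reward, done = env_step(action)
        replay.append((state, action, reward, next_state))
        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)
            critic_update(batch, opponent_pdf)
        state = next_state
        if done:                              # budget exhausted
            break
```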
3.3 Single Agent Steady Market Distribution
We start from the simplest scenario: a single agent bidding against a steady market price distribution. In this setting, we assume the linear bidders have fixed strategies, meaning they do not update their strategies upon the other bidders' actions. In addition, given the dynamic attributes of the bidders, from the agent's point of view, the bids from its opponents are independently and identically distributed. As discussed in Section 1, in practice the opponent set in every auction changes over time. Here we assume the departure and arrival rates of bidders remain steady, which guarantees the stationarity of the opponents' bid price distribution. Even though the opponent bids are only partially observable, this allows us to approximate a fixed opponent model and use it in the mean field model.
3.4 MultiAgent Mean Field Approximation
From the above single agent scenario, we now extend the discussion to the mean field equilibrium. Instead of considering only one agent, we focus on the multi-agent environment, where all agents are assumed to share one steady bid distribution ρ. Taking ρ as prior knowledge, each agent optimizes its bidding strategy, which in turn induces dynamics in the overall bid distribution. The mean field equilibrium requires a consistency check of the bid distribution [Iyer et al.2014]. Let ρ be a bid distribution and π denote a stationary policy for an agent facing a bidding decision. The mean field equilibrium is achieved if it satisfies the following definition:
Definition 1
The repeated second-price auction mean field games admit at least one mean field equilibrium (MFE) [Iyer et al.2014], with strategy π and bid distribution ρ, if:

1. π is an optimal strategy given ρ.

2. ρ is the steady-state bid distribution given π.
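The consistency check in Definition 1 is a fixed-point condition, which can be sketched as an iteration between the two maps. Both callables below (best_response, induced_distribution) are assumed interfaces, not the paper's algorithm; at an MFE the bid distribution maps to itself.

```python
import numpy as np

def mfe_fixed_point(best_response, induced_distribution, init_pdf,
                    tol=1e-6, max_iter=500):
    """Fixed-point iteration sketch of the MFE consistency check.

    best_response(pdf) -> a policy optimal against bid distribution pdf;
    induced_distribution(policy) -> the steady-state bid distribution
    that policy generates.  The iteration stops once the distribution
    reproduces itself, i.e. the consistency check passes.
    """
    pdf = np.asarray(init_pdf, dtype=float)
    for _ in range(max_iter):
        new_pdf = induced_distribution(best_response(pdf))
        if np.abs(new_pdf - pdf).sum() < tol:
            return new_pdf                    # consistency check passed
        pdf = new_pdf
    return pdf
```

With a contractive induced_distribution the iteration converges geometrically; in general, convergence is not guaranteed and the loop only approximates the equilibrium.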
In this game, we assume that the number of competing agents is large. For each auction, a finite set of agents is effectively selected at random (Gaussian noise is added to the bids, so the agent with the largest noise-perturbed bid is effectively the one selected). Each agent has a random lifetime, exponentially distributed with unit mean, and optimizes its utility over this lifetime. The unit mean reflects the fact that each agent starts with the same budget, while the varying lifetime depends on the estimated pCTR values and the exponentially distributed additive noise. At the end of an episode, or when the budget has been exhausted, agents are replaced by new ones whose initial budget, valuation distribution, and income are sampled. In most experiments, instead of randomly sampling the budget we initialized it to the same value, and noticed no difference in convergence. Due to learning rate decay, the DDPG agent eventually converges to a stationary agent (learning rate → 0), thus the standard theorem of [Iyer et al.2014] holds.

In the MFE, each bidder faces i.i.d. highest opponent bids and has no incentive to change its bidding strategy. However, it is important to note that before the equilibrium is reached, the bid distribution changes as the market evolves. Thus it is important for the agents to infer the bid distribution over time.
4 Experiments
In this work, the experiments are conducted on the public real-world dataset from iPinYou, one of the leading ad companies in China. The dataset contains the original bid logs and the click and purchase labels. We follow the data preprocessing and feature engineering procedure in [Zhang et al.2014b]. Since the iPinYou dataset records the original market price of the impressions, we initialize all agents with a budget proportional to the total cost in the training data: c0 = 1/16, 1/8, and 1/4. This allows us to simulate the auctions offline. Given each bid request, each agent in the environment places a bid price and follows the second-price auction principles.² The original market price in the log is not included in the environment.

²The experiment code will be available for the final version.
In Fig. 1, the CTR estimator is trained offline using the widely used FTRL logistic regression model [McMahan et al.2013]. In both the single and multi-agent scenarios, we begin by running the bidding simulation over the training set, logging the bid price of each agent, and selecting the second highest price as the market price. The opponent model in Figure 1 takes the simulated bid log and the features of the original bid requests as input to predict the impression-level market distribution, as described in Section 3.1.

Once the CTR model and the opponent model are trained, we repeat the bidding simulation on both the train and test sets. In this round, the DDPG agent learns its policy while having a prediction of its opponents' distribution. We begin by placing one DDPG agent in the environment and keeping the other bidders on simple, static bidding strategies, for instance a linear bidding function. In this setting, we demonstrate the advantage of the learning agent over static agents without a learning process. Furthermore, we extend to the multi-agent scenario where all the agents learn their strategies with their own estimated opponent models.
In this study, every 1000 auctions is defined as one epoch. The budget resets at the beginning of each epoch.
4.1 Opponent Model
In this section, we compare the general behaviour of the DASA model with other survival analysis models. The experiments are conducted on three datasets: Clinic [Knaus et al.1995], Music [J. and S.2017], and our Bidding dataset. The statistics of the datasets can be found in [Ren et al.2019]. The data is processed in the same way using the publicly available code³ and datasets⁴ of [Ren et al.2019]; the results marked with * in Table 1 are reproduced accordingly and serve as baselines in this study. The evaluation metric is the average negative log probability (ANLP) of the market price, which corresponds to the true market price likelihood loss. The results show that the DASA model significantly outperforms the other methods across all three datasets. It is therefore selected as our opponent model for the experiments in the following sections.

³https://github.com/rk2900/drsa
⁴https://www.dropbox.com/s/q5x1q0rnqs7otqn/drsadata.zip?dl=0
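The ANLP metric is the mean negative log-likelihood of the observed market prices under the predicted distributions; a minimal sketch (the function name and the epsilon smoothing are our own):

```python
import numpy as np

def anlp(pdf_per_auction: np.ndarray, true_prices: np.ndarray) -> float:
    """Average negative log probability of the true market price.

    pdf_per_auction[i] is the predicted market-price distribution for
    auction i; true_prices[i] is the observed market price, used as an
    index into the discretized price space.  Lower is better.
    """
    eps = 1e-12
    probs = pdf_per_auction[np.arange(len(true_prices)), true_prices]
    return float(-np.mean(np.log(probs + eps)))
```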
Models                       | CLINIC | MUSIC  | BIDDING
KM*                          |  9.012 |  7.270 |  14.012
Lasso-Cox*                   |  5.307 | 28.983 |  34.941
Gamma*                       |  4.610 |  6.326 |   5.941
STM*                         |  3.780 |  5.707 |   4.977
MTLSA*                       | 17.759 | 25.121 |   9.979
DeepSurv*                    |  5.345 | 29.002 |  35.405
DeepHit*                     |  5.027 |  5.523 |   5.513
DRSA*                        |  3.337 |  5.132 |   4.598
DASA (One Stack Transformer) |  2.786 |  4.912 |   3.465

Table 1: ANLP of the market price on the three datasets (lower is better).
4.2 Single DDPG agent with Steady Market Distribution
In this section, we assume there is one learning agent running the DDPG algorithm and competing against N bidders with fixed strategies, for example a linear bidding function. In practice, N is always unknown, and for each auction a random subset of the bidders is selected. In addition, the bidders may be at different stages of their lifetimes, with different budgets left. In the second-price auction, the most important opponent is the bidder with the second highest price among all the bidders. In this study, we set up two linear bidders and one DDPG bidder. The bidders take the same pCTR from the CTR prediction model. By injecting Gaussian noise into the pCTR, we simulate the stochastic environment of random bidders with different valuations at each auction. The bid price of the DDPG agent is compared with the prices generated by the linear bidders, and the new market price is logged and used as input for training the survival model. The survival model is trained offline, taking the features in the bid log and the winning or losing signal from the environment, as described in Sec. 4.1.
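The noise-perturbed linear bidder can be sketched as below. This is an illustration only: the base_bid / base_ctr constants and the multiplicative form of the noise are our assumptions; the paper specifies only a linear bidding function with Gaussian noise injected into the pCTR.

```python
import numpy as np

def linear_bid(pctr, base_bid=80, base_ctr=0.001, noise_std=0.1, rng=None):
    """Linear bidding with Gaussian-perturbed pCTR.

    bid = base_bid * pctr / base_ctr is the classic linear strategy;
    the multiplicative noise on the pCTR simulates a random, changing
    set of static opponents all sharing one CTR model.
    """
    rng = rng or np.random.default_rng(0)
    noisy_pctr = pctr * (1.0 + rng.normal(0.0, noise_std))
    return max(0.0, base_bid * noisy_pctr / base_ctr)
```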
In the next round, we replay the bidding game to train the same DDPG agent from scratch with the opponent model integrated. In Figure 2, the numbers of clicks obtained by the three bidders are listed for one selected ad campaign, 2259.
Figure 3 shows the number of impressions each agent won in each epoch. The rows represent three budget settings, c0 = 1/8, 1/4, and 1/2. The left column shows the results of the DDPG agent without the opponent model as the baseline, while the right column shows the DDPG agent with the opponent model. With the opponent model, the DDPG agent starts dominating the bidding game. Since the other agents are unaware of the market change, they always bid proportionally to the predicted CTR.
We need to note that the budget was set by referring to the original market prices in the iPinYou bidding log, but the new market prices generated by the agents are different and lower. This is why for some campaigns, such as 3358 and 2821, even a small budget ratio is sufficient for the DDPG agent to dominate the other linear bidders without the opponent model. When the DDPG agent dominates the game, the market price model is approximately fully observable, so the gain becomes insignificant. At the beginning of the campaign lifetime, without any information about the market, the learning agent converges to a steady but suboptimal strategy. However, if it infers the opponent model quickly and the opponents have fixed strategies, the DDPG-OM model facilitates the bidder's convergence to a more dominant strategy in the market. If the other agents adopt learning processes into their strategies, which evolves the bid distribution, the challenge becomes showing the asynchronous best responses of all the agents and the convergence to the MFE, which is shown in the next section.
4.3 Multiagent game
In this section, the experiment is extended to multiple learning agents in the same environment. As shown in the first row of Figure 4, the three agents start bidding by learning only from their own rewards, without referring to the other bidders' behaviour. After 200 epochs, the game converges to an equilibrium where the number of impressions won by each agent is roughly evenly distributed. In the second row, the market distribution per impression is sampled from a uniform distribution. In this case, the variance of the number of impressions won per episode increases, and some agents may converge to a dominating strategy. We take the bidding log generated by the first game and train a market model separately for each agent based on the set of impressions it won. With this information about the market, we reset the game on the training set, as shown in the third row of Figure 4. The agents converge to their optimal strategies within 100 epochs, which is 50% fewer than in the first row. We further test the model on the test set, which shows that the model generalizes well and the equilibrium is reached.

5 Conclusions
In this paper, we propose a general opponent-aware bidding algorithm with no prior assumptions on the opponents' bidding distribution. To the best of our knowledge, it is the first experimental implementation in the real-time bidding domain to infer the partially observable opponents within the policy learning process. We propose a deep attentive survival model as the impression-level opponent model. The multi-agent bidding simulations show improved convergence rates for the DDPG model across all budgets when augmented with an opponent model. In future work, we will investigate online training of the opponent model in the multi-agent bidding game.
Acknowledgments
This research has been supported by the National Research Fund (FNR) of Luxembourg under the AFR PPP scheme. The experiments presented in this paper were carried out using the HPC facilities of the University of Luxembourg [VBCG_HPCS14] – see https://hpc.uni.lu, and the facilities provided by MediaGamma Ltd.
References
 [Brown1951] G. W. Brown. Iterative solution of games by fictitious play. Activity analysis of production and allocation, 13(1), 1951.
 [Cai et al.2017] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo. Realtime bidding by reinforcement learning in display advertising. In WSDM. ACM, 2017.
 [Devlin et al.2018] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova. Bert: Pretraining of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 [Du et al.2017] M. Du, R. Sassioui, G. Varisteas, M. Brorsson, O. Cherkaoui, and R. State. Improving realtime bidding using a constrained markov decision process. In ADMA. Springer, 2017.
 [Graves et al.2013] A. Graves, A.r. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on. IEEE, 2013.
 [Gummadi et al.2012] R. Gummadi, P. Key, and A. Proutiere. Repeated auctions under budget constraints: Optimal bidding strategies and equilibria. In the Eighth Ad Auction Workshop, 2012.
 [He et al.2016] H. He, J. Boyd-Graber, K. Kwok, and H. Daumé III. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning, 2016.
 [Hernandez-Leal and Kaisers2017] P. Hernandez-Leal and M. Kaisers. Learning against sequential opponents in repeated stochastic games. In The 3rd Multidisciplinary Conference on Reinforcement Learning and Decision Making, Ann Arbor, 2017.
 [Iyer et al.2011] K. Iyer, R. Johari, and M. Sundararajan. Mean field equilibria of dynamic auctions with learning. ACM SIGecom Exchanges, 10(3), 2011.
 [Iyer et al.2014] K. Iyer, R. Johari, and M. Sundararajan. Mean field equilibria of dynamic auctions with learning. Management Science, 60(12), 2014.
 [J. and S.2017] H. J. and A. J. S. Neural survival recommender. In WSDM, 2017.
 [Jin et al.2018] J. Jin, C. Song, H. Li, K. Gai, J. Wang, and W. Zhang. Realtime bidding with multiagent reinforcement learning in display advertising. arXiv preprint arXiv:1802.09756, 2018.
 [Knaus et al.1995] W. A. Knaus, F. E. Harrell, J. Lynn, L. Goldman, R. S. Phillips, A. F. Connors, N. V. Dawson, W. J. Fulkerson, R. M. Califf, N. Desbiens, and others. The support prognostic model: objective estimates of survival for seriously ill hospitalized adults. Annals of internal medicine, 122(3):191–203, 1995.
 [Krishna2009] V. Krishna. Auction theory. Academic press, 2009.
 [Lillicrap et al.2015] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
 [McMahan et al.2013] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, and others. Ad click prediction: a view from the trenches. In KDD. ACM, 2013.
 [Mealing and Shapiro2013] R. Mealing and J. L. Shapiro. Opponent modelling by sequence prediction and lookahead in two-player games. In International Conference on Artificial Intelligence and Soft Computing. Springer, 2013.
 [Miller Jr2011] R. G. Miller Jr. Survival analysis, volume 66. John Wiley & Sons, 2011.
 [Perlich et al.2012] C. Perlich, B. Dalessandro, R. Hook, O. Stitelman, T. Raeder, and F. Provost. Bid optimizing and inventory scoring in targeted online advertising. In KDD. ACM, 2012.
 [Peters et al.2018] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
 [Radford et al.2018] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever. Improving language understanding with unsupervised learning. Technical report, OpenAI, 2018.
 [Ren et al.2019] K. Ren, J. Qin, L. Zheng, Z. Yang, W. Zhang, L. Qiu, and Y. Yu. Deep recurrent survival analysis. AAAI, 2019.
 [Stanley1971] H. E. Stanley. Phase transitions and critical phenomena. Clarendon Press, Oxford, 1971.
 [Vaswani et al.2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
 [Wang et al.2016] Y. Wang, K. Ren, W. Zhang, J. Wang, and Y. Yu. Functional bid landscape forecasting for display advertising. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016.
 [Wu et al.2018] D. Wu, X. Chen, X. Yang, H. Wang, Q. Tan, X. Zhang, J. Xu, and K. Gai. Budget constrained bidding by modelfree reinforcement learning in display advertising. In CIKM. ACM, 2018.
 [Yang et al.2018] Y. Yang, R. Luo, M. Li, M. Zhou, W. Zhang, and J. Wang. Mean field multiagent reinforcement learning. arXiv preprint arXiv:1802.05438, 2018.
 [Zhang et al.2014a] W. Zhang, S. Yuan, and J. Wang. Optimal realtime bidding for display advertising. In KDD. ACM, 2014.
 [Zhang et al.2014b] W. Zhang, S. Yuan, and J. Wang. Realtime bidding benchmarking with ipinyou dataset. CoRR, abs/1407.7073, 2014.
 [Zhao et al.2018] J. Zhao, G. Qiu, Z. Guan, W. Zhao, and X. He. Deep reinforcement learning for sponsored search realtime bidding. arXiv preprint arXiv:1803.00259, 2018.