1. Introduction
Since its advent in 2009, Real-Time Bidding (RTB), a.k.a. programmatic media buying, has been growing very fast. In 2018, more than 80% of digital display ads were bought programmatically in the US (programmatic2018).
RTB allows advertisers (buyers) to buy ad spaces from digital publishers (sellers) at a very fine-grained level, down to the user and the particular ad impression. The problem of using all the information available about each user exposed to some ad placement to deliver a certain number of impressions, clicks or viewable impressions in an optimal way is called the optimal bidding problem.
The optimal bidding problem may come in different flavors. It may be about maximizing a given Key Performance Indicator (KPI): impressions, clicks or views, given a certain budget, or about minimizing the cost of reaching some KPI goal. It is often formulated in a second price auction setup, but different setups, like first price auction or other exotic setups, are common on the market.
In this paper, we focus on the problem of optimizing one campaign, competing with the market in second price auctions. The campaign is aiming at a daily KPI goal, with a penalty for falling short of the goal. This restriction does not harm the generality of this work as most of it is generalizable to other sorts of goals and time spans. The campaign is small enough so that the impact of its delivery on the market is negligible.
Market data is eventually observable, which means it is possible to know, after some time (possibly hours), at what price a given impression would have been bought even if it was lost. This last assumption is valid in our practical setup where a company controls both an inventory and a bidder buying its own inventory on behalf of external advertisers, possibly in competition with third party bidders. This assumption can be relaxed though, at the cost of a more complex training process.
This paper puts a strong emphasis on the way market uncertainty is handled in a context where a fixed goal is to be achieved despite the stochastic nature of the market. Without uncertainty, our problem reduces to a relatively simple optimal control problem; adding randomness makes it an intractable stochastic control problem. In this paper we propose a characterization of the solution in terms of a Partial Differential Equation (PDE) and an approximate solution using a Recurrent Neural Network (RNN) representation.
The main contributions of this paper can be summarized as follows:

It formalizes in section 3 the optimal bidding problem as a stochastic control problem, where market volume and prices are stochastic,

It numerically solves a simple case in section 4 and comments qualitatively on the solutions,

It builds a practical RNN that approximates the theoretical solution in section 5,

The RNN is trained and tested at scale on a major adexchange as described in section 6.
2. Related papers
The various challenges brought by impression-level, user-centric bidding compared to bulk, inventory-centric buying are described in (yuan2013real).
(zhang2014optimal; Zhang2016) give a very broad overview of the optimal bidding problem.
(chen2011real) solves a bidding problem with multiple campaigns from the perspective of the publisher, using linear programming and duality. A similar question is solved in (balseiro2014yield; jauvion2018optimal). In those papers, the publisher wants to allocate impressions to campaigns in competition with third party RTB campaigns. (jauvion2018optimal) allows for underdelivery by introducing a penalty for underdelivery in its optimization program. (ghosh2009adaptive) describes a solution to the bidding problem with budget constraints and a partially observed exchange.
To account for market uncertainty, the optimal bidding problem is solved using a Markov Decision Process (MDP), constantly adapting to the new state of the campaign on the market. (gallego1994optimal) proposes a heuristic in the field of yield management.
(Karlsson2014; Karlsson2016; Karlsson2018) propose to use a Proportional Integral (PI) controller to control the bidding process and add some randomness to the bid to help exploration in a partially observed market and alleviate the exploration-exploitation dilemma. (cai2017real) uses dynamic programming to derive an optimal policy auction by auction. Modelling the problem auction by auction makes the proposed methodology slightly impractical. (fernandez2016optimal) gives a very rigorous statement of the problem and solves it in cases where impressions are generated by homogeneous Poisson processes and market prices are independent and identically distributed (IID).
The general bidding problem with non-stationary stochastic volume and a partially observed market is a complex Reinforcement Learning (RL) problem, tackled in (wu2018budget) using tools from the deep reinforcement learning literature. (wu2018budget) uses, as is done in this paper, the common approach of bidding proportionally to the predicted KPI probability and solves a control problem over this proportionality factor every few minutes instead of optimizing for every impression. This makes the approach practical for real uses.
(wu2018budget) finds the use of immediate reward misleading during the training, pushing to solutions neglecting the budget constraint. The approach proposed in this paper introduces budget constraints in the reward by simply adding a linear penalty. The bidder may explore the costly scenarios where it falls short of its goal and avoid them.
Also, the MDP trained in (wu2018budget) uses a state engineered by the authors, mainly: the current time step, the remaining budget, the budget consumption rate, the cost per mille at the last period, the last win rate and the last reward. This choice is reasonable but the memory of the MDP is reduced to the remaining budget and what can be inferred from the last period. The approach proposed in this paper does not specify the state space and state transition; instead, the Recurrent Neural Network (RNN) state is learned from the data. In particular it can learn and encode the type of day or the type of shocks the market is undergoing and react accordingly.
3. The bidding problem under uncertainty
In this section, the bidding problem is considered in the specific context of a bidder aiming at delivering campaigns in competition with the market, on media owned by itself. This does not harm the generality of the work, but explains the availability of sell-side data for training and the kind of objective considered: a number of impressions at minimum cost. Without sell-side data, the training of the model presented below would be made more complex by the censorship of market data for lost auctions.
3.1. Formal statement of the bidding problem
In this presentation, let be a probability space equipped with a filtration . A bidder is assumed to bid against the market representing all the competition. Let denote the subfiltration encoding the restricted information accessible to the bidder.
All bidders receive ad requests modeled by the jumps of a Poisson process with intensity . For each impression opportunity , happening at the bidder receives [Footnote 1: Variables are indifferently noted or .] the current features of the impression: timestamp, auction_id, user_id, placement_id, url, device, os, browser or geoloc.
Based on and, more generally, on all past history a bid is chosen . The bidder wins the auction whenever , where is the highest of the other bids in the market.
Each impression has some value to the bidder. When trying to buy clicks, this value is 1 or 0 depending on the occurrence of a click. The bidder is assumed to know its expected value .
In this paper, the bidder is assumed not to have a significant influence on the market. The bidder has also access to some distribution of conditional on .
This paper characterizes an optimal bidding strategy .
3.2. The bidding problem over a short period of time
The bidder’s spend follows the process:
and the cumulative value follows:
Let us consider a short period of time such that conditionally on , is predictable with average value and , , are independent, identically distributed (IID) over . Let us consider that and are such that is large and its relative standard deviation small [Footnote 2: In practice would be in the order of magnitude of 100 seconds while close to 1000 events per second, so the relative error would be around 0.3% (one over the square root of the roughly 100,000 expected events).].
Over a period of time , the set of impressions is noted and the number of impressions is almost deterministic . Because , , are IID, each auction is brand new and the only relevant information for the bidder is , that is . In a Second Price Auction setup [Footnote 3: See, e.g., (roughgarden2016twenty) for an introduction to second price auctions.], the spend is
and the value is
Because the , and are IID and the values summed over a large number of impressions, everything becomes deterministic and reduces to
and
where is the Probability Density Function (PDF) of conditional on and is the associated Cumulative Distribution Function (CDF).
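The two deterministic limits above can be illustrated on a discretized price distribution. The following is a minimal sketch under stated assumptions: the function name `expected_spend_and_volume`, the price grid and the probabilities are hypothetical and only serve the illustration. The bidder wins when its bid is strictly above the highest competing bid and then pays that competing bid.

```python
from bisect import bisect_left


def expected_spend_and_volume(bid, prices, probs, n_auctions):
    """Expected spend and volume for a constant bid in second-price auctions.

    prices: sorted grid of highest competing bids (hypothetical values)
    probs:  probability mass of each grid point (assumed to sum to 1)
    """
    # Number of grid points strictly below the bid: these auctions are won.
    k = bisect_left(prices, bid)
    # Discrete analogue of the CDF evaluated at the bid.
    win_rate = sum(probs[:k])
    # Discrete analogue of the integral of price times PDF up to the bid:
    # in a second-price auction the winner pays the competing price.
    mean_payment = sum(p * q for p, q in zip(prices[:k], probs[:k]))
    return n_auctions * mean_payment, n_auctions * win_rate


spend, volume = expected_spend_and_volume(
    2.5, [1.0, 2.0, 3.0, 4.0], [0.25, 0.25, 0.25, 0.25], 1000
)
```

With this toy distribution, a bid of 2.5 wins the auctions priced at 1 and 2, that is half of the 1000 auctions, and pays the competing price on each of them.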
The optimization program of the bidder can be written
It can be read: the bidder chooses a bidding strategy such that its overall cost is minimized, while its goal (in number of impressions or clicks) is reached. The cost is composed of the spend incurred by the purchase of impressions, and a possible linear penalty paid if one falls short of the goal [Footnote 4: Note that the penalty does not have to be paid at the end of the short period of time: because the goal is additive, this short period of time can be combined with other periods of time as in the next section.]. This is not the most common formulation of the problem but it fits the practical need described above [Footnote 5: A more common formulation would be to maximize the value, with a penalty for exceeding some budget.]
with .
This means that the optimal strategy is to bid a value proportional to . If we restrict to the case where the bidder tries to buy a certain number of impressions at the best possible price, then the optimal strategy is to bid a constant bid. For the rest of this work, and without loss of generality, the bidder aims at buying a certain number of impressions, hence all values are equal to 1, and
Most of these results do not hold for longer than if the auctions are no longer IID or predictable. In practice, random external factors affect the total volume of impressions in unpredictable ways. Market conditions are also prone to large shifts, e.g. when a very large campaign crowds out the other campaigns on a particular inventory.
While it is usually safe to assume that everything is predictable over the next few minutes, this is no longer the case over longer timescales.
3.3. The bidding problem over a full day
Because the market is unpredictable, the bidder knows that no matter what it plans based on , it has to adjust constantly to newly available information to reach its daily goal. For this reason, the bidding strategy is now modeled as a Markov Decision Process (MDP) as in (gallego1994optimal; Karlsson2014; Karlsson2016; Karlsson2018; wu2018budget; cai2017real), see Figure 1.
The full day is split in periods of duration . For each period, the bidder sets based on , the remaining goal and a state coding for everything relevant it has learnt from past experience, and it observes the common distribution of all whose PDF and CDF are noted and . Note that the bid distribution is fully determined by the current state .
The expected spend given knowledge of the state is
and the expected volume of impression is:
Let be the lowest cost achievable to deliver impressions over given knowledge .
is simply the sum of all spends plus the penalty times the shortfall given the optimal bidding strategy.
The optimal control is therefore fully characterized by the following Bellman equation:
(1) 
where is the transition function, taking the current state and returning the next state . Because has no impact on the market, the transition function can be noted .
The first order condition on gives:
It can be noted that at each period , the bidder optimizes for the current period, knowing it has to optimize for the remaining periods, up to . Also, the optimal is chosen equal to the marginal expected cost. When the goal is far from being reached, will be high, else it will be low.
It can also be proved that is in the interval . It is clearly the case for , but it is also the case for as long as it is true for , because as a function of is a mixture of , which suffices to show
(2) 
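The backward recursion behind the Bellman equation can be sketched on a toy case. The sketch below is an illustration under strong simplifying assumptions that are not taken from the paper: competing prices are uniform on [0, 1], every period contains the same number of auctions, and nothing is random, so it shows the recursion only and not the stochastic problem. Under these assumptions, bidding `k / n_auctions` wins exactly `k` impressions at a cost of `k**2 / (2 * n_auctions)`. The function name `solve_bellman` is hypothetical.

```python
def solve_bellman(n_periods, goal, n_auctions, penalty):
    """Backward induction on V_t(g) = min_k spend(k) + V_{t+1}(g - k).

    g is the remaining goal and k the number of impressions bought in the
    current period (equivalently, a bid of k / n_auctions).
    """
    # Terminal condition: each missing impression costs the penalty.
    value = [penalty * g for g in range(goal + 1)]
    policy = []  # policy[t][g]: impressions to buy in period t with g remaining
    for _ in range(n_periods):
        new_value, best_k = [], []
        for g in range(goal + 1):
            # Buying more than the remaining goal can only add cost.
            candidates = [
                (k * k / (2 * n_auctions) + value[g - k], k)
                for k in range(min(g, n_auctions) + 1)
            ]
            cost, k = min(candidates)
            new_value.append(cost)
            best_k.append(k)
        value = new_value
        policy.insert(0, best_k)
    return value, policy


value, policy = solve_bellman(n_periods=4, goal=20, n_auctions=10, penalty=0.3)
first_bid = policy[0][20] / 10
```

In this toy case the first bid never exceeds the penalty level, in line with the bound discussed above: once the marginal cost of an extra impression exceeds the penalty, it is cheaper to pay the penalty.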
In the special case where is small enough and continuous, the problem can be usefully expressed in continuous time.
3.4. Solution in continuous time
Let us solve the optimal bidder problem in a continuous time setting with a simple Brownian motion model of available volume:
(3) 
where is a onedimensional Wiener process. The spend intensity at is
and the volume intensity is
In this model, the available volume is stochastic but the bid distribution is constant.
The minimization problem writes
The HamiltonJacobiBellman (HJB) equation states that
with the limit condition .
At the minimum verifies the first order condition
which reduces to:
The HJB equation can be solved in :
with the limit conditions:
4. A numerical resolution
In the special case where , , and is the CDF of a Gaussian distribution, solves the partial differential equation
(4) 
which can be solved numerically for various levels of uncertainty.
The numerical solutions in Fig. 2 and Fig. 3 show that the introduction of uncertainty in the market induces some provisioning behavior in the optimal strategy. This provisioning for risk is materialized by a decreasing bid with time whenever the risk does not materialize, which is the case in those simulations where is constant except for a shock at the end of the day.
The MDP approach solves the bidding problem under uncertainty in a satisfactory way, but in the general case the relevant part of all the available information is not obviously observable, and using all of it requires working in spaces too large to be practical.
The general case can be approximated. Such an approximation needs to be chosen in a functional space rich enough to capture the desired features. In Section 5, RNNs are tested to this end.
5. Practical MDP bidding
As demonstrated above, a robust bidding strategy should adjust its bidding behavior continuously based on the last information available, but also based on the fact it has to adjust in the future.
Building such a system is complex. Let us say a bidder records the history of all the bids , spends , purchased volumes and any other relevant information for each auction . This sequence is noted
where .
A bidding strategy should be a function of time, remaining goal and the finite sequences to . Solving a minimization problem on such a space is largely intractable, even numerically, so we rely on some finite dimensional representation of , enabling a fair approximation of the solution:
The state is not updated for every auction, but instead at a regular pace. It is computed based on , the remaining time to deliver , the remaining volume to reach and the last spend :
The transition function is trained to minimize the cost of the campaign (cf. Fig. 4).
In the next two sections, two different practical implementations of a bid controller to provide an approximation of the solution are presented: the Proportional Integral controller in Section 5.1, and the Recurrent Neural Network controller in Section 5.2.
5.1. The Proportional Integral controller
The Proportional Integral (PI) controller [Footnote 7: See (Astrom2008, Chapter 10) for an introduction to PI control.] is widely used in various industries (Desborough2002). (Karlsson2014; Zhang2016) propose to apply it to the bidding problem.
The interaction between the bidding agent and the market can be modeled as a feedback system composed of a feedback controller and a block representing the RTB market. The system receives as input a reference signal : a target volume for the next time period.
From the feedback received from the market, the controller computes a control error . Based on it, the controller maintains a state and uses it to generate a new control variable (or action of bidding at a specific bid level)
(5) 
where , are two parameters called the proportional and integral gains.
In the PI setup, the state and its transition function , where , can be expressed
(6) 
and
(7) 
Training the PI controller is done in two steps: a reference forecasted volume process is defined and trained, and then the gains are tuned using Stochastic Gradient Descent.
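The feedback loop described above can be sketched as follows. This is the textbook discrete-time PI form, which may differ in detail from Eqs. (5) to (7); the class name, the gain values and the toy market response (uniform competing prices on [0, 1], so that the win rate equals the bid) are assumptions made for the illustration.

```python
class PIController:
    """Textbook discrete PI controller: u = kp * e + ki * sum(e)."""

    def __init__(self, kp, ki):
        self.kp = kp
        self.ki = ki
        self.integral = 0.0  # controller state: accumulated control error

    def step(self, reference, measured):
        error = reference - measured
        self.integral += error
        return self.kp * error + self.ki * self.integral


def simulate(controller, target, n_auctions, n_periods):
    """Closed loop against a toy market (hypothetical response curve)."""
    volume = 0.0  # no feedback is available before the first period
    bid = 0.0
    for _ in range(n_periods):
        bid = controller.step(target, volume)
        bid = min(max(bid, 0.0), 1.0)  # keep the bid in the price range
        volume = n_auctions * bid  # uniform prices: win rate equals the bid
    return bid, volume


bid, volume = simulate(PIController(kp=0.001, ki=0.005), 30.0, 100, 50)
```

With these illustrative gains the loop is stable and the delivered volume settles on the per-period target. In the paper the reference comes from a separate forecaster and the gains are tuned against the lifetime cost rather than chosen by hand.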
Although simple and robust, this approach comes with some flaws.

It depends on a separate forecaster .

It is designed to target a current value, not to optimize a lifetime cost function; this is mitigated by the fact the parameters are tuned against the lifetime cost.

The uncertainty about the market is not modeled: it is only indirectly taken into account through the cost function, and no component of the state really reflects anything about risk.

The important gap between the small number of parameters of the PI model and the large amount of data available suggests probable underfitting. Capacity can be added to the model by allowing adaptive gains, setting thresholds and special cases, but those are merely local patches.
To overcome these flaws, we introduce in the next section a new approach leveraging a Recurrent Neural Network to approximate the bidding problem solution. A PI controller is used as benchmark to the RNN approach.
5.2. The RNN controller
The Recurrent Neural Network (RNN) [Footnote 8: See (Goodfellowetal2016, Chapter 10).] controller unit used in all the experiments presented in this paper is a Gated Recurrent Unit (GRU, see (cho2014learning)) [Footnote 9: Long Short-Term Memory (LSTM, cf. (hochreiter1997long)) was also experimented with but the results were similar to the GRU ones.], with:

input: a vector ,

state: a vector with dimension 16 [Footnote 10: Simple trials were also conducted to assess the interest of using more neurons in recurrent units in the RNN architecture, be it wider or deeper. No significant gain was found, and a detailed assessment lies beyond the scope of this paper.],

activation: a hyperbolic tangent function rescaled for the first component of the state to be between 0 and the penalty level [Footnote 11: The penalty level is the highest possible bid in an optimal strategy, cf. Eq. (2).],

and where the bid level is given by the first component of the state of the GRU layer:
(8) 
Through the recurrent connections the model can retain information about the past in its state, enabling it to discover temporal correlations between events and its interactions with the environment even when these are far away from each other in the data. Using an RNN makes it possible to take advantage of a much richer family of functions to learn an approximate solution to the bidding problem.
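To make the recurrent update concrete, here is a minimal pure-Python sketch of one GRU step with a 16-dimensional state, followed by a mapping of the first state component to a bid. Several elements are assumptions made for the illustration: the three input features (remaining time, remaining volume and last spend, as listed in Section 5), the random untrained weights, the gate convention, and the affine rescaling of the first component from (-1, 1) to (0, penalty), which is only one possible reading of the rescaled activation described above.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(w, u, b, x, h):
    """Return W x + U h + b for matrices stored as lists of rows."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x))
        + sum(ui * hi for ui, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(w, u, b)
    ]


def gru_step(params, x, h):
    """One GRU update: gates z and r, candidate state, convex combination."""
    z = [sigmoid(a) for a in affine(*params["z"], x, h)]  # update gate
    r = [sigmoid(a) for a in affine(*params["r"], x, h)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a) for a in affine(*params["h"], x, gated)]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def bid_from_state(h, penalty):
    """Map the first state component from (-1, 1) to (0, penalty)."""
    return penalty * (h[0] + 1.0) / 2.0


def random_params(n_in, n_state, rng):
    """Untrained weights, for illustration only."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        gate: (mat(n_state, n_in), mat(n_state, n_state), [0.0] * n_state)
        for gate in "zrh"
    }


rng = random.Random(0)
params = random_params(n_in=3, n_state=16, rng=rng)
state = [0.0] * 16
# Hypothetical inputs: (remaining time, remaining volume, last spend), normalized.
for features in ([1.0, 1.0, 0.0], [0.5, 0.6, 0.2]):
    state = gru_step(params, features, state)
bid = bid_from_state(state, penalty=2.0)
```

Because each new state is a convex combination of the previous state and a hyperbolic tangent, every component stays in (-1, 1) when starting from zero, so the resulting bid always lies strictly between 0 and the penalty level.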
6. Experiments
6.1. Practical setting
In practice, the massive number of auctions occurring simultaneously makes it unrealistic to solve the optimal control problem (1) for each auction and campaign. Fortunately, taking periodic control decisions (e.g. every 5 minutes) on aggregated feedback is sufficient. It is thus possible to handle a very large number of campaigns, with the following steps:

at the beginning of each period, choose a level for the control variable ,

get an aggregated feedback (realized volume and spend) from the previous period in response to the level .
This kind of architecture introduces discontinuity in the response of the controlled advertising system. Yet, the problem can be turned into a continuous control problem (cf. (Karlsson2016; Karlsson2018)).
Furthermore, response curves exhibit discontinuities (cf. Fig. 5, left plot). These discontinuities can be smoothed out (right part of Fig. 5) by not bidding a constant bid level during the time period but by drawing bid prices according to some distribution (e.g. lognormal, Gamma, etc.) around the control variable , as proposed in (Karlsson2014; Karlsson2018).
This leads to the following loop:

at the beginning of each period, choose a level for the control variable ,

for each auction occurring in the period, draw a bid price according to a distribution based on ,

get an aggregated feedback (realized volume and spend) from the previous period.
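The loop above can be sketched for one period as follows. This is a minimal illustration assuming a lognormal distribution whose mean equals the control level; the dispersion parameter `sigma`, the function names and the list of competing prices are hypothetical.

```python
import math
import random


def draw_bid(level, sigma, rng):
    """Draw a lognormal bid whose mean equals the control level."""
    # The mean of a lognormal is exp(mu + sigma**2 / 2), hence this mu.
    mu = math.log(level) - 0.5 * sigma ** 2
    return rng.lognormvariate(mu, sigma)


def run_period(level, competing_prices, sigma, rng):
    """Bid on every auction of the period and aggregate the feedback."""
    volume, spend = 0, 0.0
    for price in competing_prices:
        if draw_bid(level, sigma, rng) > price:
            volume += 1
            spend += price  # second-price rule: pay the competing bid
    return volume, spend


rng = random.Random(42)
volume, spend = run_period(1.0, [0.5, 0.9, 1.1, 2.0], sigma=0.3, rng=rng)
```

Randomizing around the control level means that an auction priced slightly above the level is still won some of the time, which is what smooths the aggregated response curve.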
Note that contrary to the general setup of the bidding problem that suffers from censorship due to the ad auction selection (cf. (Zhang2016)), one can alleviate this particular issue in this practical setup since publisher data is available. In the absence of uncensored data, the bid randomization would further help, by realizing some of the exploration effort in the explore vs exploit dilemma introduced by biddependent censorship.
6.2. Data used
Two types of datasets are used in the numerical experiments presented in this paper:

Simulated synthetic datasets, generated by applying various transformations (shocks or random walk) throughout time to a base linear volume curve, which help in appreciating salient features of the RNN models,

Production RTB data, consisting of 5-minute snapshots of the actual price-volume mapping for all display ad placements with significant daily volume [Footnote 12: Restricted to the placements with a minimum of 1000 daily impressions.] from large publishers on one of the leading Supply Side Platforms (SSP) and Ad Exchanges globally. The ad placements here can be seen as proxies of segments targeted by campaigns. In production, the RNN would have to be trained on currently running campaigns.
The production dataset is created using logs from actual RTB auctions run on around 1000 ad placements over 8 days, containing about 115M won impressions. All these impressions are used to build winning bid distributions for 5-minute periods over a full day () of each ad placement. The winning bid distributions are discretized on a CPM bid scale with 100 exponential increments between and .
For offline training and evaluation purposes on production data, a bidding problem instance consists of a random draw of an actual bid-volume mapping process and of a random volume goal, uniformly drawn between 10 and 1000. The controller is therefore exposed to scenarios without enough volume to meet the target given the penalty level, as well as scenarios where enough volume is available.
The production data is split into nonoverlapping training, validation and evaluation datasets using different days. The training set of models on production data contains 1 million different bidding problems and evaluation is performed on a set of around 110K bidding problems of increasing difficulty. For the simulated case study, given the simplified setting, training is stopped after learning from 20K bidding problems.
6.3. Training and evaluation
The implementation of both the benchmark (PI controller) and the RNN controller is done in TensorFlow (tensorflow2015whitepaper). The input data instances are randomly shuffled and processed by batches of size 100.
The aim is to minimize the total cost of a campaign, so the training loss is composed of the sum of the spend and the penalty terms over a full day:
(9) 
where the spend and volume won are computed from the bid level of the MDP bidding controller by simulating the feedback using the input bid distribution at each time step and propagating the state over the full sequence of time periods.
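The structure of this loss, total spend plus the penalty times the final shortfall, can be sketched on discretized bid landscapes. The sketch below is an illustration under stated assumptions: the function name `campaign_cost` and the representation of each period as a list of (price, volume) pairs are hypothetical, and the state propagation of the controller is left out, the bids being given directly.

```python
def campaign_cost(bids, periods, goal, penalty):
    """Total cost of a day: spend over all periods plus penalty on shortfall.

    bids:    one bid level per period
    periods: for each period, a list of (price, volume) pairs giving the
             available volume whose highest competing bid equals price
    """
    spend, won = 0.0, 0.0
    for bid, landscape in zip(bids, periods):
        for price, volume in landscape:
            if price < bid:  # auctions won at this price point
                spend += price * volume  # second-price rule
                won += volume
    shortfall = max(goal - won, 0.0)
    return spend + penalty * shortfall


landscape = [(1.0, 10), (2.0, 10)]
cost = campaign_cost([1.5, 2.5], [landscape, landscape], goal=40, penalty=5.0)
```

In this toy day the two bids win 30 impressions for a spend of 40, and the 10 missing impressions add a penalty of 50.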
Models are trained using Stochastic Gradient Descent (SGD) with an inverse-time decay of the learning rate [Footnote 13: The learning rate schedule is the following: with initial learning rate , decay rate , and decay steps .]. To help alleviate possible exploding gradient issues, gradient clipping is used as described in (Pascanu2012). Other more sophisticated optimization methods have been tried without significant impact on the results.
Cross-validation is performed regularly during the training on a fixed set of validation bidding problems, and the model parameters with the best evaluation seen on the validation set during the training are retained. In practice, no model presented any overfitting issue as performance results generalized well to unseen data.
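The two training devices mentioned above can be sketched in a few lines. The schedule below is the usual inverse-time decay form, assumed here to match the one used in the paper, and the clipping rescales a gradient whose norm exceeds a threshold. The function names and all numerical values are hypothetical and are not the paper's hyperparameters.

```python
import math


def inverse_time_decay(initial_rate, decay_rate, decay_steps, step):
    """Learning rate after `step` updates under inverse-time decay."""
    return initial_rate / (1.0 + decay_rate * step / decay_steps)


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


rate = inverse_time_decay(initial_rate=0.1, decay_rate=0.5, decay_steps=100, step=200)
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
```

Clipping by norm keeps the direction of the gradient and only shortens it, which limits the size of a single update when gradients explode along a long sequence of time periods.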
6.4. Numerical results
6.4.1. Simulated case study
A first experimental setup on simulated data goes back to the simplified setting from Section 4. Figure 6 displays the bid level, volume and spend during the day, along with the final delivery cost decomposed into the final spend and penalty for five different RNN models. Each RNN model is trained on a simulated dataset for which the volume process follows Eq. 3 with a different level of noise (and no drift term). Each column evaluates the same model on six cases without noise, for which a permanent shock in the available volume happens for all dates . A shock factor of means that after the shock the available volume given the same bid level is divided by .
Note that the RNN model is able to learn a good approximation of the optimal strategy derived theoretically. The RNN controller exhibits the same behavior as the one evidenced in Figures 2 and 3 by exact resolutions of the control problem. Indeed, the bid strategy is constant when noise is absent during the training. Hence the volume acquired and the spend grow linearly through time in the scenario without any shock in the available volume. In scenarios with permanent shock factors, this model carries on bidding at the same level, thus falling short of the volume target of impressions. As the final volume shortfall grows with the shock factor, the bigger the shock, the bigger the final penalty.
One can observe that the more noise an RNN controller model was trained with, the higher it starts bidding and the easier it absorbs shocks in the available volume. The models trained with more uncertainty in the available volume tend to provision for the shortfall risk by bidding higher and earlier than the optimal level in the absence of noise. In the extreme, the optimal solution is to buy as fast as one can (without bidding higher than the penalty). This obviously entails a higher spend when the risk does not materialize. However, the more risk was anticipated by a model through its training, the lower the final spend and penalty are when risk does materialize, as demonstrated here by increasing levels of permanent shocks.
In a second experimental setup on simulated data, two RNN models were trained with either a low or high [Footnote 14: The high noise scenario is calibrated to be consistent with the average level of daily noise observed on our real data, its interquartile range being . So the low noise scenario does represent a situation where risk is deeply underestimated.] level of noise and evaluated under both a low and a high noise scenario. Main results are shown in Table 1. The experiment shows, as expected, that for both models the cost of delivery and the bidding problem difficulty (probability of shortfall) are higher in the evaluation scenario with high noise. However, given an evaluation scenario, i.e. an amount of noise that realizes, the delivery cost is lower for the model that has been trained on data with the same amount of uncertainty.
Table 1. Mean final cost (std. dev.), shortfall probability
Learning scenario | Evaluation scenario: low noise | Evaluation scenario: high noise |
Low noise | $271 ($13), 5% | $915 ($516), 82% |
High noise | $431 ($9), 0% | $764 ($405), 57% |
6.4.2. Training on production data
Performance results on actual market data are available in Table 2. As a benchmark, a PI controller is tuned to follow a reference pacing curve fitted on the training data. Indeed, a good approximation of the internet traffic intraday seasonality can be obtained using a model with only two harmonics (Karlsson2014).
Table 2 details the average total cost of delivering campaigns of increasing daily volume goals for both the PI model and the RNN model. Overall, the RNN model is able to reduce delivery cost by about 20% compared to the PI model. As the volume target increases, so does the bidding problem difficulty as the total available volume under the penalty level is constraining the bid strategy for a larger share of the dataset. Eventually, for very large volume goals the optimal strategy is to bid the penalty level, capturing all the volume below this level and paying the penalty for each missed impression. Thus lower performance improvements are expected for the larger goals, relative to the size of the targeting.
Goal (imps) | PI delivery cost ($CPM) | RNN delivery cost ($CPM) | Ratio RNN/PI |
100  1.02  0.82  0.80 
500  1.28  1.02  0.79 
1000  1.60  1.27  0.80 
1500  2.06  1.76  0.85 
7. Conclusion and future works
The RNN controller model proposed in this paper provides both an effective and practical method to solve the optimal bidding problem. Its advantage is that it does not rely on manually engineered features to represent knowledge about the current state or history that could be leveraged in a bidding strategy, but instead infers it from the data. For instance, a more advanced, adaptive PI controller could be employed to tackle the control problem, e.g. by using splines modeling the price-volume mapping to efficiently store the response gain at various price points. However, such a model would still lack useful elements from the very complex state space it evolves in, mainly because it overlooks the impact that uncertainty about future market volume and bid landscape has on the optimal strategy.
Numerical experiments demonstrate that the proposed approach is able to improve significantly on existing bidding controllers, while being trainable and usable at production scale. The approximation of the state space and transition provided by the RNN leads to a solution that captures a key aspect of the optimal strategy, namely provisioning against the risk of underdelivery.
This work could be extended in many ways:

The observability of all bids including those of lost auctions is convenient in the case of our work. This assumption could nevertheless be relaxed. The reconstruction of bid distributions for training would probably be more complex and the noise added to the bid would need to be used as an exploration device.

In practice, setting multiple goals would be an interesting feature to add, e.g. buying impressions with some guarantee of viewability. The equations would be marginally changed.

The first price and exotic auction cases add a significant amount of complexity to this approach. However, those questions would be resolved at the impression scale, while the macroscopic (5 min) scale control problem would probably hold in a similar way.

Giving the RNN some more feedback, based on the noise injected in the bid, could probably help.
More generally, this paper shows how RNNs can be applied to complex control problems with success.