In 2017, the revenue generated from ads for mobile apps was more than billion USD [1]. This is not surprising given that there are over 5 million apps (in Google PlayStore and Apple App Store) [2], and most apps struggle to achieve a large user base. To attract more users, apps naturally resort to online advertising platforms. Such platforms (e.g., Yahoo’s Gemini [3]) drive app install campaigns by showing ads on owned-and-operated properties (e.g., Yahoo Mail, Tumblr and Yahoo Finance in the case of Gemini) as well as on third-party publishers via an external real-time bidding (RTB) ad exchange (e.g., MoPub). Such RTB ad exchanges offer diversity and scale for app install advertisers, but they also introduce new challenges, as described below.
In an RTB ad exchange, multiple bidders participate in an auction for each ad display request issued by a publisher (via the exchange). Each bidder could be managing multiple campaigns at the same time, and showing ads across multiple publishers through the exchange. In practice, there is considerable heterogeneity across publishers in terms of ad request volume, audience quality and auction floor prices. The bidder’s profit also keeps evolving over time, and so do the efficiencies across campaigns (depending on costs charged to advertisers). So when a bidder selects a campaign for an ad request and decides how much to bid and how much to charge the advertiser, it naturally faces the question: are my bid and cost for the campaign worth it given the publisher, my current profit and the campaign’s current efficiency? Sensitivity to the advertiser’s budget adds yet another constraint for the bidder. Intuitively, feedback from past decisions and outcomes can help the bidder make better future decisions amidst the challenges mentioned above. However, unlike regular cost-per-click (CPC) ads [4], where feedback (click/no-click) is fast, in the case of app install ads there can be considerable delays in knowing whether the user installed the app after clicking on the ad. The delay arises as follows: after clicking an app install ad, the user is typically taken to the Google PlayStore or Apple App Store (external to the bidder, publisher and advertiser). The bidder gets to know about the ad conversion (install) only when the user opens the installed app for the first time (typically conveyed by the app advertiser or third parties). Such delays can span days, as shown in Figure 1 (for app install ads managed by Yahoo Gemini).
Current RTB literature focuses on just some of the above-mentioned objectives. In [5, 6] the focus is purely on profit maximization, whereas [7] focuses purely on campaign efficiency. In the context of learning from past decisions and outcomes, [8] employed value iteration (a form of reinforcement learning) to solely optimize for campaign efficiency for CPC ads. In fact, the notion of campaign efficiency (i.e., the discrepancy in cost-per-action delivered versus the advertiser’s target cost-per-action) in current RTB literature is mostly oriented towards clicks as actions; this does not involve feedback delays as described above for app installs. To the best of our knowledge, our work is the first to address both profit and campaign efficiency in a coupled manner, specifically in the context of app install ads. We develop a state space framework, and leverage Q-learning [9] (also a form of reinforcement learning) to learn from the outcomes of past decisions. Our main contributions can be summarized as follows:
a state space approach which encompasses campaign’s efficiency, advertiser’s budget and bidder’s profit,
a Q-learning algorithm to learn a state space based policy for determining the bid at the exchange and cost charged to advertiser for each ad request. The novelty lies in our design of the reward function which accounts for feedback delays.
The remainder of the paper is organized as follows. Section II covers the paper’s setup and problem formulation. In Section III we discuss our state space approach, and in Section IV we explain the proposed Q-learning algorithm. Finally, in Section V, we describe our experimental results based on mobile app install ads data from Yahoo Gemini.
In this section, we first provide some background on online advertising via RTB. This is followed by a description of the exact setup considered in this paper and the underlying objectives. Our setup is fairly standard in the online advertising industry, and resembles the Yahoo Gemini offering for app install advertisers, which is the primary motivation behind this paper.
II-A Online advertising via RTB
In RTB, several bidders participate at an ad exchange, and bid for ad display opportunities provided by publishers affiliated with the exchange. The typical sequence of events that takes place during an RTB auction can be described as follows. When a user visits a publisher (e.g., website, app), the publisher conveys an ad display opportunity to the exchange; this includes details like the user’s identifier (e.g., mobile IDFA or AAID) and the floor price for the auction. The ad exchange then relays this information to the bidders. At this stage, an interested bidder finds the best matching ad (from its current list of campaigns). Such a match might be based on the predicted click-through-rate (pCTR, i.e., $P(\text{click} \mid \text{impression})$), and the predicted conversion/install rate (pCVR, i.e., $P(\text{install} \mid \text{click})$) associated with the display opportunity; in particular, details on ranking app install ads via pCTR and pCVR models can be found in [10]. Having selected the best ad for the opportunity, the bidder decides: (i) its bid at the auction, and (ii) the corresponding cost to be charged to the advertiser. If the bidder wins the (second-price) auction, it has to pay the exchange only if its ad is shown to the user (i.e., receives an impression). However, under the CPC pricing model, the bidder can charge the advertiser (i.e., the determined cost) only if a user clicks on the shown ad.
Although an app install advertiser is charged for clicks, it is typically interested in campaign efficiency. Efficiency compares the campaign’s delivered CPI with the advertiser’s target CPI, where CPI stands for cost-per-install (i.e., the app-install equivalent of cost-per-action). The CPI is the total cost charged to the advertiser divided by the total installs received. To capture a notion of satisfaction across all app-install advertisers associated with a bidder, we define happy campaigns as the count of campaigns with CPI below a threshold set by a tolerance relative to the target CPI provided by advertisers. The differences in when the bidder is charged versus when the advertiser is charged, and the observation that the app-install advertiser is more concerned about CPI than CPC, leave considerable room for the bidder to optimize its bid and cost decisions for each display opportunity.
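The CPI and happy-campaign definitions above can be sketched in a few lines of Python. This is only a minimal sketch: the `CampaignStats` container, the function names, and in particular the rule that a campaign is happy when its CPI is at most (1 + tolerance) times its target CPI are assumptions made for illustration, not definitions taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CampaignStats:
    """Hypothetical per-campaign aggregates kept by the bidder."""
    target_cpi: float  # advertiser-provided target cost-per-install
    total_cost: float  # total cost charged to the advertiser so far
    installs: int      # total installs received so far


def cpi(stats: CampaignStats) -> float:
    """Total cost charged divided by total installs received."""
    if stats.installs == 0:
        # No installs yet: CPI is unbounded once any cost has been charged.
        return float("inf") if stats.total_cost > 0 else 0.0
    return stats.total_cost / stats.installs


def count_happy(campaigns, tolerance: float) -> int:
    """Assumed rule: happy if CPI <= (1 + tolerance) * target CPI."""
    return sum(
        1 for c in campaigns if cpi(c) <= (1.0 + tolerance) * c.target_cpi
    )
```

For instance, with a 10% tolerance and a target CPI of 2.0, a campaign that was charged 100 for 50 installs counts as happy under this assumed rule, while one charged 100 for 40 installs does not.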
II-B Problem formulation
We consider a bidder which is handling multiple app-install ad campaigns; each campaign has a target CPI and a budget for a time horizon that is the same across campaigns. The (sole) ad exchange with which the bidder interacts is associated with multiple publishers. For each publisher, the bidder tracks its cumulative spend up to the current time, i.e., the amount paid by the bidder to the exchange for the impressions shown on that publisher. Similarly, it tracks the cumulative advertiser cost (across all advertisers) charged for ads shown on that publisher. The publisher’s margin is defined from these two quantities, and is indicative of the relative profit/loss being made on that publisher. The bidder’s goal is to maximize the number of happy campaigns and the overall margin at the end of the time horizon.
III State space approach
We first describe an intuitive approach (in Section III-A) which worked reasonably well in our experiments. Drawing insights from this intuitive approach, we then describe the detailed state space formulation in Section III-B; the proposed Q-learning algorithm (in Section IV) is based on this state space.
III-A An intuitive approach
Consider a point in time at which the bidder is placing a bid for a campaign on an opportunity from a publisher. Assume that the bidder has already computed the bid (at the exchange) and the cost (to the advertiser) for this particular opportunity based on data from the past. This could be the standard expected-cost-per-impression (eCPM) bid [3, 4] based on the pCTR and pCVR associated with the opportunity (i.e., $\text{bid} = \text{eCPM} = \text{Target CPI} \times \text{pCVR} \times \text{pCTR}$, and $\text{cost} = \text{Target CPI} \times \text{pCVR}$). But at this point in time, the bidder gets to know the current publisher margin and campaign efficiency, and has the option to update the bid and cost. An intuitive approach to do so based on the current margin and efficiency (and also keeping in mind the goals outlined in Section II) would be as shown in Figure 2.
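As a small illustration of the baseline described above, the following Python sketch computes the starting bid and cost from the target CPI, pCTR and pCVR of an opportunity. The function and parameter names are hypothetical, and the example values are made up.

```python
def base_bid_and_cost(target_cpi: float, p_ctr: float, p_cvr: float):
    """Baseline (pre-update) bid and cost for one display opportunity.

    cost = Target CPI * pCVR          (charged to the advertiser per click)
    bid  = Target CPI * pCVR * pCTR   (eCPM-style bid per impression)
    """
    cost = target_cpi * p_cvr
    bid = cost * p_ctr
    return bid, cost


# Example with made-up numbers: target CPI of 2.0, pCTR of 5%, pCVR of 10%.
bid, cost = base_bid_and_cost(target_cpi=2.0, p_ctr=0.05, p_cvr=0.10)
```

Any state-based adjustment would then be applied on top of these two baseline values.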
For example, when the current efficiency is bad and margin is negative, the bidder is better off reducing both its bid and cost. Increasing the bid would hurt the margin, while increasing the cost would hurt the efficiency. But when the margin is negative and efficiency is good, the bidder can afford to charge the advertiser more (by increasing the cost) while decreasing its bid for better margin. At a high level, Figure 2 defines a discrete state space based on current efficiency and margin, and then takes an intuitive step based on the state. In Section III-B we generalize such a state space approach, and later in Section IV, we show how one can learn the best action (i.e., whether to decrease/increase and the magnitude of change) for a particular state.
III-B State space formulation
Drawing motivation from the intuitive approach in Section III-A, we define a discrete state space whose three components represent the quantized publisher margin, campaign efficiency, and campaign budget, respectively. The main features of such a state space are described below.
Quantization is done on the basis of domain knowledge. For margin, we consider a bin around zero, and uniformly sized bins to the left and to the right of this ’zero’ bin (similar to the quantization in Figure 2). For efficiency, we partition two intervals into uniform bins each, with a predetermined upper bound on the efficiency values considered. For budget, the fraction of budget remaining is binned into uniformly spaced bins in the interval $[0, 1]$. As a result of the above quantizations, we obtain a discrete state space which not only captures our objectives, but also simplifies the policy learning process (as described in Section IV). Also, the granularity of the state space is such that it is not expensive to infer the current state of the system; in a large scale setup with thousands of campaigns and publishers, maintaining near real time aggregates of publisher-wise margins and campaign-wise costs is far simpler than maintaining aggregates for each (publisher, campaign) pair.
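The quantization described above might be sketched as follows. All numeric defaults (zero-bin half-width, bin widths, bin counts, the efficiency upper bound `e_max`) are hypothetical placeholders, and the split of the efficiency range at 1 is an assumption made for illustration.

```python
import math


def quantize_margin(margin, zero_halfwidth=0.05, width=0.1, bins_per_side=3):
    """Bin 0 is the 'zero' bin; negative/positive indices lie left/right."""
    if abs(margin) <= zero_halfwidth:
        return 0
    steps = math.ceil((abs(margin) - zero_halfwidth) / width)
    steps = min(steps, bins_per_side)  # clip extremes into the outermost bin
    return steps if margin > 0 else -steps


def quantize_uniform(x, lo, hi, n_bins):
    """Index of the uniform bin of [lo, hi] containing x, clipped to range."""
    frac = (x - lo) / (hi - lo)
    return max(0, min(n_bins - 1, int(frac * n_bins)))


def quantize_efficiency(eff, e_max=3.0, n_bins=4):
    """Assumed split: n_bins bins on [0, 1), n_bins more on [1, e_max]."""
    if eff < 1.0:
        return quantize_uniform(eff, 0.0, 1.0, n_bins)
    return n_bins + quantize_uniform(eff, 1.0, e_max, n_bins)


def quantize_budget(fraction_left, n_bins=5):
    """Uniform bins over the fraction of budget remaining in [0, 1]."""
    return quantize_uniform(fraction_left, 0.0, 1.0, n_bins)


def to_state(margin, eff, fraction_left):
    """Discrete state: (margin bin, efficiency bin, budget bin)."""
    return (
        quantize_margin(margin),
        quantize_efficiency(eff),
        quantize_budget(fraction_left),
    )
```

Because the state only needs one publisher-level aggregate and two campaign-level aggregates, `to_state` can be evaluated cheaply at bid time.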
A natural question that arises in any state space based dynamical system is whether it is possible to drive the system from any initial state to any final state via a (finite) sequence of inputs (i.e., controllability [11]). In our setup, at least two factors make the system inherently uncontrollable: (i) finite advertiser budget, and (ii) variability in the volume of ad requests from a publisher. As a result, we cannot employ any generic control system which assumes controllability. However, some states are reachable from any initial state, and hence our setup satisfies a weaker version of controllability called reachability [11]. The challenging part in our setup is to drive a (publisher, campaign) pair to a desirable state to eventually meet our objectives.
III-B3 State based bid and cost updates
To drive the current state of a (publisher, campaign) pair to desirable states, a simple strategy is to have additive updates to the bid and cost depending on the current state. In this paper, we consider additive bid and cost updates of the following form:
where , , and are constants. Discrete functions , , map the quantized versions of current margin (), leftover budget fraction (), and current efficiency () to discrete values. For further simplification, we assume that: (i) , and (ii) the range (i.e., the discrete set of possible output values) of and is fixed for our setup (determined using domain knowledge). For example, the set of possible output values for could be . Hence, the remaining task at hand is basically to ’learn’ which state should map to which output value for functions and , so that our end goals are met; this is precisely what we cover in Section IV. For consistency with standard methods in the online advertising industry [3, 4, 10], we assume that, before each update, the bid equals eCPM ($= \text{Target CPI} \times \text{pCVR} \times \text{pCTR}$) and the cost equals $\text{Target CPI} \times \text{pCVR}$; note that the pCVR and pCTR can vary for each display opportunity due to differences in features derived from the campaign, publisher and online user.
In general, there are two ways one can go about learning the update functions described in Section III-B3. One way is to mimic the entire RTB setup via a state transition model, and optimize the functions around it. But building such a complex model is not practically feasible, and the learnt functions would suffer seriously from modelling errors. Another way is a model-free approach, e.g., Q-learning, which learns the functions directly from past decisions and outcomes. In Section IV-A, we describe the proposed Q-learning algorithm for our setup; this is followed by a detailed description of the associated Q-learning reward function in Section IV-B.
As mentioned in Section III-B3, at each update step, we take a compound action (in the form of changes in bid and cost). Due to the discrete nature of the update functions, there are only a finite number of actions that can be taken in each state. Thus, the number of possible actions for any state is the product of the numbers of possible output values of the individual functions. The standard Q-learning update step [9] can now be stated in our context as follows:
$$Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left( r_t + \gamma \max_{a \in \mathcal{A}} Q(s_{t+1}, a) \right), \tag{3}$$
where $s_t$ and $a_t$ represent the state and action at time $t$, $r_t$ is the reward at time $t$, $\alpha$ is the learning rate, $\gamma$ is the forgetting factor, and $\mathcal{A}$ is the space of possible actions. At each time, the bidder selects an action, observes the corresponding reward (to be defined in Section IV-B), and then updates the $Q$ value. We describe below some important properties associated with this update in our context.
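As a minimal sketch of the tabular update described above, the following Python snippet applies the standard Q-learning step to a dictionary-backed Q table indexed by (state, action). The step sets `BID_STEPS` and `COST_STEPS`, the default values of `alpha` and `gamma`, and the function names are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical discrete bid and cost changes; a compound action is a pair.
BID_STEPS = (-0.1, 0.0, 0.1)
COST_STEPS = (-0.1, 0.0, 0.1)
ACTIONS = list(product(BID_STEPS, COST_STEPS))  # 3 x 3 = 9 compound actions


def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One standard Q-learning step for the transition (s, a, r, s_next)."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]


def greedy_action(Q, s):
    """Action with the largest Q value in state s (used after learning)."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


# Q values default to 0.0 for unseen (state, action) pairs.
Q = defaultdict(float)
```

States here can be any hashable object, for example a tuple of quantized margin, efficiency and budget bins.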
IV-A1 Learning rate and forgetting factor
The learning rate $\alpha$ determines to what extent the newly acquired information is weighed in comparison to the old information. Theoretically, a decaying $\alpha$ ensures convergence, but results in very slow convergence rates; hence a small but constant value of $\alpha$ suffices. The forgetting factor $\gamma$ determines the importance of future rewards. A low $\gamma$ makes the agent myopic as it then considers only current rewards, while $\gamma$ close to $1$ makes the agent strive for long-term high rewards.
IV-A2 Exploration vs. exploitation
Note that the maximization step in (3), i.e., $\max_{a \in \mathcal{A}} Q(s_{t+1}, a)$, is a greedy procedure. The convergence of the algorithm depends on the balance between exploration and exploitation. To converge faster, a natural step is to resort to an $\epsilon$-greedy policy, where with probability $1 - \epsilon$ one chooses the ’max’ action in (3), and with probability $\epsilon$ one chooses a random action from the action space $\mathcal{A}$. In particular, we resort to Boltzmann sampling in our setup. This means that, during exploration, the probability of selecting an action $a$ given a state $s$ is given by $\exp(Q(s, a)/\tau) / \sum_{a' \in \mathcal{A}} \exp(Q(s, a')/\tau)$, where the temperature parameter $\tau$ is decayed slowly over time so as to slowly reduce the exploration. In addition, the learning rate can be chosen in a systematic way for different states, so as to quicken the learning pace for states which are not visited often.
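Boltzmann (softmax) sampling over the Q values of a state can be sketched as below. This is a generic illustration rather than the authors’ implementation; the function names and the `temperature` parameter name are assumptions, and the maximum Q value is subtracted before exponentiation purely for numerical stability (it does not change the resulting probabilities).

```python
import math
import random


def boltzmann_probs(q_values, temperature):
    """Softmax of Q values at the given temperature (must be > 0)."""
    m = max(q_values)  # shift by the max to avoid overflow in exp
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, temperature, rng=random):
    """Draw an action index with probability given by boltzmann_probs."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A high temperature makes the choice close to uniform (more exploration), while a temperature near zero concentrates the probability on the action with the largest Q value, which is why decaying it gradually reduces exploration.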
IV-A3 Deterministic vs. stochastic policy
It is crucial to learn a deterministic policy for our setup, as a stochastic policy might result in states drifting off from the actual objective.
Once the Q-learning algorithm converges, the best action corresponding to a given state $s$ is given by $a^{*}(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$.
IV-B Q-learning: Reward Functions
Given the state space example in Figure 2, the following challenges are encountered while designing a suitable reward function for Q-learning.
IV-B1 Unobservable spend and cost
Keeping in mind a large scale setup with many publishers and campaigns, we assume that the bidder maintains data only at a publisher and campaign granularity. This means, the bidder tracks only publisher-wise spend across all campaigns, and campaign-wise cost and installs across all publishers. The bidder does not track publisher-wise spend at a campaign-level, or the advertiser-wise cost and installs at a publisher level. This brings in some sense of unobservability regarding the effectiveness of cost and bid update actions in our setup. For instance, the margin of a publisher is affected by all bid and cost update actions undertaken for all campaigns associated with the publisher. Similarly, the efficiency of a campaign is affected by all bid and cost update actions involving publishers where the campaign is being bid for. Hence, a suitable reward function should be able to attribute margin and efficiency changes of a publisher and campaign respectively to individual actions.
IV-B2 Sparse and delayed rewards
The transition of a publisher’s margin from a negative margin state to a neutral/positive margin state in the interval between two consecutive actions is very unlikely. Similarly, the efficiency of a campaign is unlikely to change state between two consecutive actions. Such a transition is brought about by a sequence of actions, rather than by a single action. Hence, a suitable reward function should reward actions through intermediate rewards rather than only on a change of state, i.e., it should reward the change in margin and efficiency of a publisher and campaign respectively.
Keeping the above points in mind, we propose a reward function of the following form:
where two weights determine the amount of attribution to be assigned to the compound action, and a hyper-parameter trades off the importance given to the change in margin versus the change in efficiency. The publisher weight incorporates the spend ratio of the publisher as compared to the total spend into the reward, and hence tries to approximately attribute the change in the publisher’s margin to the compound action local to the publisher and campaign. Similarly, the campaign weight incorporates the ratio of the budget of the campaign as compared to the total unutilized budget into the reward, and hence tries to approximately attribute the change in the campaign’s efficiency to the compound action local to the publisher and campaign. The reward function proposed above not only considers both the objectives of our setup, but also rewards actions at each step.
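To make the structure of such a reward concrete, the following Python sketch combines a spend-weighted margin change with a budget-weighted efficiency change. It is a hedged illustration only: the exact functional form, the symbol `lam` for the trade-off hyper-parameter, the convention that `lam = 1` means margin only, and the convention that a positive `d_efficiency` denotes an improvement are all assumptions, not the paper’s definition.

```python
def reward(d_margin, d_efficiency, pub_spend, total_spend,
           camp_budget, total_unutilized_budget, lam=0.5):
    """Per-step reward for a (publisher, campaign) compound action.

    d_margin:      change in the publisher's margin since the last action
    d_efficiency:  improvement in the campaign's efficiency (assumed sign)
    lam:           assumed trade-off in [0, 1]; 1.0 weighs margin only
    """
    # Attribute the publisher-level margin change by the publisher's spend share.
    w_pub = pub_spend / total_spend if total_spend > 0 else 0.0
    # Attribute the campaign-level efficiency change by the campaign's budget
    # relative to the total unutilized budget.
    w_camp = (camp_budget / total_unutilized_budget
              if total_unutilized_budget > 0 else 0.0)
    return lam * w_pub * d_margin + (1.0 - lam) * w_camp * d_efficiency
```

Since the reward depends on changes rather than on state transitions, every action receives a (possibly small) signal at each step.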
In this section, we discuss experimental results based on data from mobile app install campaigns managed by Yahoo Gemini. The state space was quantized as follows: bins for efficiency with efficiency threshold , bins for margin, bins for left-over budget. The cardinality of range of functions and was set to and . We considered a sample of publishers and campaigns. Our training data covered one week of impression level data spanning the selected publishers and campaigns, and the testing data covered the following week. The policy evaluations were carried out using the RTB simulator that we describe in Appendix A. The baseline for the performance improvements stated below is a PI controller that optimizes only for the margin.
As described in Section IV-B2, the hyper-parameter in the reward function provides a way to trade off performance lifts in margin versus the number of happy campaigns. Table I clearly shows this performance trade-off (as well as the budget utilization) for different values of the hyper-parameter. Note that one extreme value of the hyper-parameter corresponds to a policy which focuses only on margin improvements, while the other extreme corresponds to a policy focused on efficiency improvement.
Figure 3 shows the trade-off between lift in margin versus the lift in number of happy campaigns; it also shows there is a sweet spot on the curve where a bidder might like to operate (leading to good margin without much efficiency loss).
In our experiments, we observed that driving the margin and efficiency of a (publisher, campaign) pair to a favorable state becomes relatively easier with a higher number of publishers and campaigns. It is interesting to note that campaigns which start off as unhappy have a higher chance of turning happy if they are bid for in multiple publishers. Campaigns which are bid for in only a few publishers, and are unhappy to start with, are less likely to turn into happy campaigns. The above observation leads to the question: in how many publishers does a campaign need to be bid for so that it is controllable, i.e., so that the efficiency of that campaign can be driven from any state to any desired state? This is indeed a question that needs further exploration and is of practical interest. We believe our state space approach provides a reasonable framework for pursuing such questions, and our Q-learning approach is able to capture the performance trade-offs (e.g., efficiency vs. margin maximization) inherent to the setup.
- [1] Business of Apps, “App revenues 2017,” http://www.businessofapps.com/data/app-revenues.
- [2] Statista, “Number of apps available in leading app stores as of March 2017,” https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/.
- [3] Yahoo Gemini, “Drive app installs on mobile,” https://developer.yahoo.com/gemini/advertiser/guide/adcreation/drive-app-installs-mobile/.
- [4] A. Z. Broder, “Computational advertising,” in SODA, vol. 8, 2008, pp. 992–992.
- [5] C.-C. Lin, K.-T. Chuang, W. C.-H. Wu, and M.-S. Chen, “Combining powers of two predictors in optimizing real-time bidding strategy under constrained budget,” in CIKM. ACM, 2016, pp. 2143–2148.
- [6] J. Fernandez-Tapia, O. Guéant, and J.-M. Lasry, “Optimal real-time bidding strategies,” Applied Mathematics Research eXpress, pp. 1–42, 2016.
- [7] W. Zhang, Y. Rong, J. Wang, T. Zhu, and X. Wang, “Feedback control of real-time display advertising,” in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, 2016, pp. 407–416.
- [8] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo, “Real-time bidding by reinforcement learning in display advertising,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017, pp. 661–670.
- [9] A. Gosavi, “Reinforcement learning: A tutorial survey and recent advances,” INFORMS Journal on Computing, vol. 21, no. 2, pp. 178–192, 2009.
- [10] N. Bhamidipati, R. Kant, S. Mishra, and M. Zhu, “A large scale prediction engine for app install clicks and conversions,” in Proceedings of the 26th ACM International Conference on Information and Knowledge Management (CIKM 2017).
- [11] P. Antsaklis and A. Michel, Linear Systems. Birkhäuser Boston, 2005.