# Response Prediction for Low-Regret Agents

Companies like Google and Microsoft run billions of auctions every day to sell advertising opportunities. Any change to the rules of these auctions can have a tremendous effect on the revenue of the company and the welfare of the advertisers and the users. Therefore, any change requires careful evaluation of its potential impacts. Currently, such impacts are often evaluated by running simulations or small controlled experiments. This, however, misses an important factor: advertisers respond to changes. Our goal is to build a theoretical framework for predicting the actions of an agent (the advertiser) who is optimizing her actions in an uncertain environment. We model this problem using a variant of the multi-armed bandit setting where playing an arm is costly. The cost of each arm changes over time and is publicly observable. The value of playing an arm is drawn stochastically from a static distribution and is observed by the agent but not by us. We, however, observe the actions of the agent. Our main result is that, assuming the agent is playing a strategy with a regret of at most f(T) within the first T rounds, we can learn to play the multi-armed bandit game (without observing the rewards) in such a way that the regret of our selected actions is at most O(k^4(f(T)+1)log(T)), where k is the number of arms.


## 1 Introduction

In this paper, our goal is to build a theoretical framework for predicting advertiser response based on observations of their past actions. Our model is driven by a few important considerations. First, the advertisers face an uncertain environment and optimize their objective in the presence of uncertainty. As in [17], we capture this by modeling the advertiser as an agent solving a regret minimization problem in a multi-armed bandit setting. In our motivating application, each arm can correspond to an ad slot the agent can purchase or to a discretized value of the bid the agent submits. We make no assumption on the type of algorithm the agent is using, except that it has bounded regret. Second, we are concerned with an environment that is changing and therefore requires the agent to respond to this change. We model this by assuming each arm has a cost, and in each round the agent is informed about the costs before she has to choose which arm to play. This is the main point of difference between our model and the model in [17], and it is an essential element of our model: without it, to predict which arm an agent is going to play, it would be enough to look at her past history and select the arm that is played most often. The assumption that the cost of each arm is observed before the agent picks which arm to play is not entirely accurate in our motivating application, since advertisers only learn the cost of their ad after it is placed. However, given that in practice costs change continuously over time, advertisers can use the recent cost of each arm as a proxy for its current cost. Therefore, we feel this assumption is a justified approximation of the real scenario.

Finally, we model the objective of our prediction problem. In our model, once the agent decides which arm to play, she receives a reward from that arm that is drawn stochastically from a static distribution. (In our motivating application, the reward can be the profit the advertiser makes if the user clicks on their ad and makes a purchase, or zero otherwise. In this case, the assumption that the reward distribution is static means that the profit per conversion and the conversion probability are fixed over time. This is not entirely accurate, but it is a reasonable approximation of reality: while these parameters change over time, they tend to change at a slow pace.) This reward is observed by the agent but not by us. All we observe is the cost of the arms and the arm that the agent plays. Over time, we would like to be able to “predict” which arm the agent plays. We need to be careful about the way we capture this in our model. For example, if two of the arms always have the same cost and the same reward, the agent’s choice between them is arbitrary and can never be predicted. Also, if an arm has never been played (e.g., because its cost has been infinite so far), we cannot be expected to predict the first time it is played. For these reasons, we evaluate our prediction algorithm by the regret of its actions. Our main result is an algorithm that, by observing the actions of the agent, learns to play the multi-armed bandit problem with a regret that is close to that of the agent. Furthermore, we show that if the optimal arm, i.e., the arm with the highest utility (value minus cost), is unique at every step, the number of predictions of our algorithm that do not exactly match the agent’s actions is upper bounded. Our upper bound depends on the gap between the optimal arm and the second-best arm at every step.

Since we evaluate our algorithm by the regret of its actions, it can be seen as a regret minimization algorithm, a very well studied subject. The distinguishing point between our work and previous work in regret minimization is that in our setting the algorithm does not observe the payoffs (not even the payoff of the arm it selects), which is the essential input for the regret minimization algorithms in the literature [6].

## 2 Related Work

The closest previous work to this paper is [17], where the authors study a model for learning an agent’s valuations based on the agent’s responses. Similar to this paper, [17] does not assume that the agent always chooses a myopically optimal action, but assumes that the agent chooses its actions using a no-regret learning algorithm. There are two main differences between the model in [17] and ours. The first is that [17] studies a single-parameter setting where each agent reports a single bid, whereas we study a multi-parameter setting where the agent can pick one of many actions and the utility of each action might not be related to the others. Hence, as a model, [17] can be reduced to ours by discretization. The second key difference is the metric. The goal of [17] is to study the sample complexity of computing a set whose Hausdorff distance from the “rationalizable set” of valuations is not large. In the current paper, the metric is the regret of the algorithm with respect to the agent’s valuation. Another related work is [10], where the authors study the problem of mimicking an opponent in a two-player game when the payoffs cannot be observed and the only observable information is the opponent’s actions.

As we discussed in the introduction, our results can be used for bid prediction if the arms correspond to discretized values of the bids the agent submits. There are a number of papers [21, 8, 18, 5] on this subject that model different objectives and behaviors of the agents. However, most of them rely on an estimate of the agent’s private values in order to predict bids. Also, most of these papers ignore the fact that the agents often face an uncertain environment that they learn over time, and that their optimization happens in the presence of uncertainty.

Another line of related work is on designing mechanisms for agents that follow no-regret strategies. For example [4] studies an auction design problem in such a model.

Outside of computer science, there is also a rich literature in economics studying inference in auctions under equilibrium assumptions. A survey of this literature can be found in [2]. This approach has been used to study a wide variety of settings, such as arbitrary normal form games [14], static first-price auctions [11], extensions to risk-averse bidders [12, 7], sequential auctions [13], and sponsored search auctions [20, 1].

## 3 Model

In this section we describe our theoretical framework for predicting advertiser response based on observations of their past actions. In our model, an agent (representing an advertiser in our motivating application) plays a multi-armed bandit game with $k$ arms. In each of the $T$ time steps $t \in [T]$, each arm $i$ has a cost $c^t_i$. These costs can be different in each time step, but they are observed by the agent and by us at the beginning of each time step. The reward (also called the value) of playing arm $i$ in any time step is drawn from a distribution with expected value $v^*_i$. The agent does not know the distributions or the values $v^*_i$, but after playing an arm, she privately observes its reward. In our motivating application, each arm can correspond to a bid value the advertiser can submit. The reward of an arm is the value the advertiser receives (e.g., by selling a product through the click-through on their ad), and the cost corresponds to the amount they have to pay for their ad. In this context, the assumptions that the costs are observed by the advertiser as well as the auctioneer, that the distribution is unknown, and that the reward is observed by the advertiser but not by the auctioneer all make sense.

As the costs are different at each time step, the optimal arm $o_t = \arg\max_i (v^*_i - c^t_i)$ for the agent can also be different. Since the agent does not know the $v^*_i$’s, she might play an arm that is not necessarily optimal. Let $a_t$ be the arm that the agent picks at step $t$. As a result of this choice, the agent accrues a regret of $ar_t = (v^*_{o_t} - c^t_{o_t}) - (v^*_{a_t} - c^t_{a_t})$ at time step $t$. We assume that the agent uses an arbitrary bounded-regret strategy, i.e., her total regret up to time $t$ is bounded by a function $f(t)$ for each time step $t \in [T]$.

The goal is to design an algorithm that, in each time step $t$, given the history of the agent’s actions up to this time step (i.e., the costs and the actions of the agent, but not the rewards the agent has received) and the costs of the arms in this time step, picks an arm $p_t$. Because of this choice, the algorithm accrues a regret of $pr_t = (v^*_{o_t} - c^t_{o_t}) - (v^*_{p_t} - c^t_{p_t})$ at step $t$. Our metric for the algorithm’s performance is its total regret compared to the regret of the agent.

Our main result is that there exists an algorithm with a regret bound of $O\big(k^4 (f(T)+1)\log(T)\big)$.
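To make the interaction concrete, here is a minimal simulation sketch of the model. The uniform cost distribution, Bernoulli rewards, and the agent's epsilon-greedy strategy are all illustrative assumptions of ours; the framework only requires the agent to have bounded regret.

```python
import numpy as np

rng = np.random.default_rng(1)
k, T = 4, 500
v_star = rng.random(k)  # hidden expected values v*_i (unknown to agent and predictor)

def agent_arm(costs, means, counts, eps=0.1):
    """Illustrative epsilon-greedy agent: explore with probability eps,
    otherwise play the arm with the best empirical utility."""
    if counts.min() == 0 or rng.random() < eps:
        return int(rng.integers(k))
    return int(np.argmax(means - costs))

means, counts = np.zeros(k), np.zeros(k)
agent_regret = 0.0
for t in range(T):
    costs = rng.random(k)                     # costs c^t, announced publicly
    a = agent_arm(costs, means, counts)       # arm a_t played by the agent
    reward = float(rng.random() < v_star[a])  # Bernoulli(v*_a) reward, seen only by the agent
    counts[a] += 1
    means[a] += (reward - means[a]) / counts[a]
    o = int(np.argmax(v_star - costs))        # optimal arm o_t at this step
    agent_regret += (v_star[o] - costs[o]) - (v_star[a] - costs[a])
```

The predictor in our setting sees only `costs` and `a` at each step, never `reward`.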

## 4 Prediction Algorithm

In this section, we describe our prediction algorithm. A key step in designing the algorithm is our assumption that the agent’s regret is bounded by $f(t)$ for each time step $t$. This allows us to define a set of values for the agent that are consistent with her actions so far and her regret bound. A value vector $v = (v_1, \ldots, v_k)$ is consistent with the actions up to time $t$ if there exists a regret vector $r = (r_1, \ldots, r_{t-1})$ such that:

$$v_{a_\ell} - c^\ell_{a_\ell} \ \ge\ v_i - c^\ell_i - r_\ell \qquad \forall \ell \in [t-1],\ \forall i \in [k]$$
$$\sum_{j \le \ell} r_j \ \le\ f(\ell) \qquad \forall \ell \in [t-1] \tag{1}$$

We denote the set of consistent values at time $t$ by $CV(t)$. Note that for every $v \in CV(t)$, the optimal arm is $\arg\max_i (v_i - c^t_i)$. The main idea of the algorithm is to pick an arm which is the optimal arm for the largest portion of $CV(t)$. Formally, for each arm $i$, define the weight $w_i$ as the probability that $i$ is the optimal arm for a vector $v$ chosen uniformly at random from $CV(t)$. At every time step $t$, our algorithm picks the arm $p_t$ with the highest weight $w_i$.

The time complexity of our algorithm at each time step is equivalent to the time complexity of computing the volumes of polynomially many $k$-dimensional polytopes.
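For intuition, the arm-selection rule can be sketched with Monte Carlo sampling in place of the exact volume computation. This is our own illustrative stand-in, assuming values lie in [0, 1]; it uses the observation that the componentwise-smallest regret vector satisfying the first set of constraints in (1) is r_l = max_i (v_i - c^l_i) - (v_{a_l} - c^l_{a_l}), so membership in the consistent set can be checked directly.

```python
import numpy as np

def is_consistent(v, costs, actions, f):
    """Check whether value vector v lies in the consistent set CV(t).
    costs: (t-1, k) cost history; actions: arms a_1..a_{t-1} played by the agent.
    v is consistent iff the prefix sums of the minimal per-step regrets
    stay below the regret bound f."""
    utils = v[None, :] - costs                                   # (t-1, k) utilities
    r = utils.max(axis=1) - utils[np.arange(len(actions)), actions]
    prefix = np.cumsum(r)
    return all(prefix[l] <= f(l + 1) for l in range(len(actions)))

def predict_arm(costs_hist, actions, cost_now, f, n_samples=2000, seed=0):
    """Pick the arm that is optimal for the largest sampled portion of CV(t):
    rejection-sample value vectors from [0,1]^k and vote."""
    rng = np.random.default_rng(seed)
    k = len(cost_now)
    wins = np.zeros(k)
    for _ in range(n_samples):
        v = rng.random(k)
        if len(actions) == 0 or is_consistent(v, costs_hist, actions, f):
            wins[np.argmax(v - cost_now)] += 1
    return int(np.argmax(wins))
```

Rejection sampling degrades as $CV(t)$ shrinks; the algorithm analyzed below instead computes the weights $w_i$ exactly as polytope volumes.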

### 4.1 Regret Analysis

In this section we analyze the regret bound of Algorithm 1. In the main theorem of this section, Theorem 4.1, we show that Algorithm 1’s predictions for the first $T$ rounds have a regret bound of $O\big(k^4(f(T)+1)\log(T)\big)$. Note that after each action $a_t$ by the agent, the set of consistent values must satisfy the following new constraints:

$$\forall j \ne a_t: \quad v_{a_t} - v_j + r_t \ \ge\ c^t_{a_t} - c^t_j.$$

Lemma 1 will be used later in the proof of Theorem 4.1 to show that each time the prediction of the algorithm is wrong (meaning $p_t \ne a_t$), the set of consistent values shrinks. Before stating the lemma, we need the following notation:

$$U_{ij}(t) = \max_{v \in CV(t)} \{v_i - v_j\},$$
$$L_{ij}(t) = \min_{v \in CV(t)} \{v_i - v_j\}.$$
###### Lemma 1

If the predicted arm $p_t$ is not the arm $a_t$ played by the agent, then

$$c^t_{a_t} - c^t_{p_t} \ \ge\ L_{a_t p_t}(t) + \frac{1}{8k}\Big(U_{a_t p_t}(t) - L_{a_t p_t}(t)\Big).$$
###### Proof

Let us simplify notation by omitting some of the indices: $a = a_t$, $p = p_t$, $c = c^t_a - c^t_p$, $L = L_{ap}(t)$, and $U = U_{ap}(t)$. Suppose, for the sake of contradiction, that

$$c < L + \frac{1}{8k}(U - L). \tag{2}$$

Using Inequality (2), we show that an arm $i \ne p$ exists whose weight $w_i$ is higher than the weight $w_p$ of the arm $p$. This is a contradiction because the algorithm chooses the arm $p$ with the highest weight. Lemma 1 follows from this contradiction.

Let us define $G(z) = \Pr_{v \sim \mathrm{Unif}(CV(t))}[\,v_a - v_p \le z\,]$ and $g(z) = G'(z)$. We first show that $g$ is concave and non-negative in $[L, U]$.

###### Claim

$g$ is concave and non-negative in $[L, U]$.

###### Proof

For simplicity and without loss of generality we suppose $CV(t)$ is full dimensional. Following the definition, $G(z)$ is the probability that a randomly drawn point from $CV(t)$ lies in the half space $\{v : v_a - v_p \le z\}$. In other words, $G(z)$ is the ratio of the volume of the intersection of $CV(t)$ with this half space to the volume of $CV(t)$, i.e.,

$$G(z) = \frac{\mathrm{Vol}\big(CV(t) \cap \{v : v_a - v_p \le z\}\big)}{\mathrm{Vol}(CV(t))}.$$

Now it is easy to see that the derivative of $G$, $g(z)$, is the surface area of the intersection of the hyperplane $\{v : v_a - v_p = z\}$ with $CV(t)$, normalized by $\mathrm{Vol}(CV(t))$. Therefore, the claim follows from the convexity of $CV(t)$.

Considering Inequality (2), the following claim proves an upper bound on the weight $w_p$ of arm $p$, and the claim after it shows a lower bound on the sum of the weights of all arms except arm $p$, i.e., $\sum_{i \ne p} w_i$. These claims will lead to the contradiction we need.

###### Claim

$$w_p \ \le\ 2g(c)\,(c - L).$$

###### Proof

Note that

$$w_p \le G(c) \tag{3}$$

because $p$ being the optimal arm implies $v_p - c^t_p \ge v_a - c^t_a$, and so

$$w_p \ \le\ \Pr_{v \sim \mathrm{Unif}(CV(t))}\big[v_p - c^t_p \ge v_a - c^t_a\big] \ =\ G(c).$$

Since $G(L) = 0$, we have $w_p \le G(c) = \int_L^c g(x)\,dx$, so it suffices for the proof to show that $g(x) \le 2g(c)$ for every $x \in [L, c]$. By the claim above we know that $g$ is a non-negative and concave function on $[L, U]$. Therefore, we have

$$\forall x \in [L, c], \quad g(x) \ \le\ g(c) - \gamma\,(c - x),$$

where $\gamma = g'(c)$ is the derivative of $g$ at point $c$. By concavity of $g$, we have $\gamma \ge \frac{g(U) - g(c)}{U - c}$. Therefore, for every $x \in [L, c]$, we have

$$g(x) \ \le\ g(c) - \frac{g(U) - g(c)}{U - c}(c - x) \ \le\ g(c) + g(c)\cdot\frac{c - x}{U - c} \ \le\ 2g(c),$$

where the second inequality follows from the non-negativity of $g(U)$, and the last inequality holds because, by Inequality (2), $c - L \le U - c$, and therefore $c - x \le U - c$ for every $x \in [L, c]$.

###### Claim

$$\sum_{i \ne p} w_i \ \ge\ \frac{g(c)}{2}\,(U - c).$$

###### Proof

Note that $\sum_{i \ne p} w_i = 1 - w_p$ and $G(U) = 1$. Therefore, by Inequality (3), we have

$$\sum_{i \ne p} w_i \ =\ 1 - w_p \ \ge\ 1 - G(c) \ =\ G(U) - G(c). \tag{4}$$

Since $g$ is a non-negative concave function on $[c, U]$, we have

$$\forall x \in [c, U], \quad g(x) \ \ge\ g(c) + \frac{g(U) - g(c)}{U - c}(x - c).$$

Therefore,

$$G(U) - G(c) \ =\ \int_c^U g(x)\,dx \ \ge\ \int_c^U \left(g(c) + \frac{g(U) - g(c)}{U - c}(x - c)\right) dx \ =\ \frac{g(c) + g(U)}{2}(U - c) \ \ge\ \frac{g(c)}{2}(U - c).$$

This, together with Inequality (4), completes the proof of the claim.

Now we derive the contradiction using the two claims above and Inequality (2). By the first claim, $w_p \le 2g(c)(c - L)$, and by Inequality (2), $c - L \le \frac{1}{8k}(U - L)$. Therefore,

$$w_p \ \le\ 2g(c)(c - L) \ \le\ \frac{g(c)}{4k}(U - L).$$

On the other hand, the second claim bounds the total weight of the arms other than $p$ from below, so there exists an arm $i \ne p$ such that

$$w_i \ \ge\ \frac{g(c)}{2k}(U - c).$$

Since $c - L \le \frac{1}{8k}(U - L)$ implies $U - c \ge \frac{7}{8}(U - L)$, we have $w_i > w_p$, which contradicts the way $p_t$ is selected by Algorithm 1.

###### Theorem 4.1

The total regret of Algorithm 1 for the first $T$ rounds is bounded by $O\big(k^4(f(T)+1)\log(T)\big)$.

###### Proof

To prove the theorem, we show that

$$\sum_{t \le T} pr_t \ \le\ f(T) + k^2 \lambda H(T)(f(T)+1) \tag{5}$$

for suitable parameters $\lambda = O(k^2)$ and $\delta \in (0,1)$ that depend only on $k$. Here $H(T) = \sum_{i \le T} \frac{1}{i}$ denotes the $T$-th harmonic number. Let $v^*$ denote the actual value vector of the arms. By the definition of regret we have

$$\begin{aligned} pr_t &= (v^*_{o_t} - c^t_{o_t}) - (v^*_{p_t} - c^t_{p_t}) \\ &= \big((v^*_{o_t} - c^t_{o_t}) - (v^*_{a_t} - c^t_{a_t})\big) + \big((v^*_{a_t} - c^t_{a_t}) - (v^*_{p_t} - c^t_{p_t})\big) \\ &= ar_t + \big((v^*_{a_t} - c^t_{a_t}) - (v^*_{p_t} - c^t_{p_t})\big). \end{aligned}$$

Let us define $er_t = \max\big(0, (v^*_{a_t} - c^t_{a_t}) - (v^*_{p_t} - c^t_{p_t})\big)$. Therefore,

$$\sum_{t \le T} pr_t \ \le\ \sum_{t \le T} ar_t + \sum_{t \le T} er_t \ \le\ f(T) + \sum_{t \le T} er_t.$$

Therefore, to prove Inequality (5), it is enough to show $\sum_{t \le T} er_t \le k^2 \lambda H(T)(f(T)+1)$. We define $B_{\alpha\beta}(T) = \{t \le T : a_t = \alpha,\ p_t = \beta\}$. Note that we have

$$\sum_{t \le T} er_t \ =\ \sum_{\alpha,\beta} \sum_{t \in B_{\alpha\beta}(T)} er_t \ \le\ k^2 \cdot \max_{\alpha,\beta}\Big\{\sum_{t \in B_{\alpha\beta}(T)} er_t\Big\}. \tag{6}$$

Therefore, to prove Inequality (5), it is enough to show that for every $\alpha, \beta$,

$$\sum_{t \in B_{\alpha\beta}(T)} er_t \ \le\ \lambda H(T)(f(T)+1).$$

Let us fix $\alpha$ and $\beta$ with $\alpha \ne \beta$ (for $\alpha = \beta$ we have $er_t = 0$), and suppose $B_{\alpha\beta}(T) = \{t_1, \ldots, t_l\}$ where $t_1 < \cdots < t_l$. To simplify notation, write $L(t) = L_{\alpha\beta}(t)$ and $U(t) = U_{\alpha\beta}(t)$. Using Lemma 1 we know $c^{t_i}_\alpha - c^{t_i}_\beta \ge L(t_i) + \frac{1}{8k}\big(U(t_i) - L(t_i)\big) \ge L(t_i)$. That gives

$$er_{t_i} \ =\ \max\big(0, (v^*_\alpha - v^*_\beta) - (c^{t_i}_\alpha - c^{t_i}_\beta)\big) \ \le\ \max\big(0, (v^*_\alpha - v^*_\beta) - L(t_i)\big).$$

In the following claim we show that $(v^*_\alpha - v^*_\beta) - L(t_i)$ is bounded by $\frac{\lambda(f(t_i)+1)}{i}$.

###### Claim

For every $i \le l$, we have

$$(v^*_\alpha - v^*_\beta) - L(t_i) \ \le\ \frac{\lambda(f(t_i)+1)}{i}.$$
###### Proof

The proof is by contradiction. Suppose there is an $i$ such that

$$(v^*_\alpha - v^*_\beta) - L(t_i) \ >\ \frac{\lambda(f(t_i)+1)}{i}. \tag{7}$$

Let $i$ be the smallest such index. Therefore,

$$\forall j < i: \quad (v^*_\alpha - v^*_\beta) - L(t_j) \ \le\ \frac{\lambda(f(t_j)+1)}{j}. \tag{8}$$

Let $\hat{v} \in CV(t_i)$ be a point that minimizes $v_\alpha - v_\beta$, i.e., $\hat{v}_\alpha - \hat{v}_\beta = L(t_i)$. Note that $(v^*_\alpha - v^*_\beta) - L(t_i) \le 2$ because all the values are bounded by 1. Let us recall the definition of $CV(t_i)$ here. A vector $v$ is in $CV(t_i)$ if there exists a regret vector $r$ such that:

$$\forall t \in [t_i - 1]\ \forall j: \quad v_{a_t} - c^t_{a_t} \ \ge\ v_j - c^t_j - r_t$$
$$\forall t \in [t_i - 1]: \quad \sum_{h \le t} r_h \ \le\ f(t)$$

This can be written as:

$$\forall t \in [t_i - 1]\ \forall j: \quad (v_j - c^t_j) - (v_{a_t} - c^t_{a_t}) \ \le\ r_t$$
$$\forall t \in [t_i - 1]: \quad \sum_{h \le t} r_h \ \le\ f(t)$$

Since $\hat{v} \in CV(t_i)$, we have

$$\sum_{t \le t_i - 1} \max\Big(0, \max_j (\hat{v}_j - c^t_j) - (\hat{v}_{a_t} - c^t_{a_t})\Big) \ \le\ f(t_i - 1).$$

Note that $a_{t_j} = \alpha$ for every $t_j \in B_{\alpha\beta}(t_i - 1)$, and each term of the sum is non-negative. Restricting the sum to the steps in $B_{\alpha\beta}(t_i - 1)$ and to arm $\beta$, we get

$$\sum_{t_j \in B_{\alpha\beta}(t_i - 1)} \max\big(0, (\hat{v}_\beta - c^{t_j}_\beta) - (\hat{v}_\alpha - c^{t_j}_\alpha)\big) \ \le\ f(t_i - 1).$$

Note that we can write $(\hat{v}_\beta - c^{t_j}_\beta) - (\hat{v}_\alpha - c^{t_j}_\alpha)$ as

$$\big((v^*_\alpha - v^*_\beta) - L(t_i)\big) - \big((v^*_\alpha - v^*_\beta) - (c^{t_j}_\alpha - c^{t_j}_\beta)\big)$$

because $\hat{v}_\alpha - \hat{v}_\beta = L(t_i)$. If we combine the above equations we get

$$f(t_i - 1) \ \ge\ \sum_{j < i} \max\Big(0, \big((v^*_\alpha - v^*_\beta) - L(t_i)\big) - \big((v^*_\alpha - v^*_\beta) - (c^{t_j}_\alpha - c^{t_j}_\beta)\big)\Big) \ \ge\ \sum_{j < i} \max\Big(0, \frac{\lambda(f(t_i)+1)}{i} - \big((v^*_\alpha - v^*_\beta) - (c^{t_j}_\alpha - c^{t_j}_\beta)\big)\Big), \tag{9}$$

where the second inequality follows from Inequality (7). On the other hand, we have

$$(v^*_\alpha - v^*_\beta) - (c^{t_j}_\alpha - c^{t_j}_\beta) \ \le\ (v^*_\alpha - v^*_\beta) - \Big(\big(1 - \tfrac{1}{8k}\big)L(t_j) + \tfrac{1}{8k}U(t_j)\Big) \ \le\ \big(1 - \tfrac{1}{8k}\big)\big((v^*_\alpha - v^*_\beta) - L(t_j)\big), \tag{10}$$

where the first inequality follows from Lemma 1 and the second inequality follows from the fact that $v^*_\alpha - v^*_\beta \le U(t_j)$. Inequalities (9) and (10) imply:

$$f(t_i - 1) \ \ge\ \sum_{j < i} \max\Big(0, \frac{\lambda(f(t_i)+1)}{i} - \big(1 - \tfrac{1}{8k}\big)\big((v^*_\alpha - v^*_\beta) - L(t_j)\big)\Big). \tag{11}$$

Recall the parameter $\delta$ fixed above. If we apply Inequality (8) to Inequality (11), restricting the sum to the indices $\lfloor \delta i \rfloor \le j < i$, we get:

$$f(t_i - 1) \ \ge\ \sum_{\lfloor \delta i \rfloor \le j < i} \Big(\frac{\lambda(f(t_i)+1)}{i} - \big(1 - \tfrac{1}{8k}\big)\frac{\lambda(f(t_i)+1)}{j}\Big),$$

where the last inequality also uses the fact that $f$ is monotone increasing. With some straightforward calculations on the above we get:

$$1 \ \ge\ \lambda\Big(1 - \delta\big(1 + \sum_{\lfloor \delta i \rfloor \le j < i} \tfrac{1}{j}\big)\Big).$$

It is easy to see that $\sum_{\lfloor \delta i \rfloor \le j < i} \frac{1}{j} \le \ln\frac{1}{\delta}$, since the sum is at most $\ln\frac{i}{\lfloor \delta i \rfloor}$. Therefore,

$$1 \ \ge\ \lambda\Big((1 - \delta) - \delta \ln\tfrac{1}{\delta}\Big),$$

which is a contradiction for a suitable choice of the parameters: any $\delta$ with $(1-\delta) - \delta\ln\frac{1}{\delta} > 0$, and $\lambda$ large enough that $\lambda\big((1-\delta) - \delta\ln\frac{1}{\delta}\big) > 1$. The claim follows from this contradiction.

By the claim above,

$$\sum_{t_i \in B_{\alpha\beta}(T)} er_{t_i} \ \le\ \sum_{t_i \in B_{\alpha\beta}(T)} \frac{\lambda(f(t_i)+1)}{i} \ \le\ \lambda(f(T)+1)\sum_{i \le l} \frac{1}{i} \ =\ \lambda(f(T)+1)H(l) \ \le\ \lambda(f(T)+1)H(T),$$

which completes the proof of Theorem 4.1.

### 4.2 Bounding the number of wrong predictions

Note that predicting the exact arm an advertiser would choose is not always feasible: if there is more than one optimal arm, it is impossible to tell which one the advertiser will choose. Therefore, in this section we assume that the optimal arm is unique in every time step.

The following theorem is a corollary of Theorem 4.1. It bounds the number of wrong predictions of Algorithm 1. In this theorem, the utility of an arm is defined as the value of the arm minus the cost of playing it.

###### Theorem 4.2

If the utility of the optimal arm is higher than the utility of every other arm by at least $\delta$ in every time step, then the number of wrong predictions is bounded by $O\big(\frac{k^4(f(T)+1)\log(T)}{\delta}\big)$.

###### Proof

Let $m_p$ be the number of wrong predictions in which the algorithm chooses the optimal arm, i.e., $p_t = o_t \ne a_t$. Note that in such time steps the agent has a regret of at least $\delta$. Therefore, the overall regret of the agent is lower bounded by $m_p \delta$, and so $m_p \le \frac{f(T)}{\delta}$.

Let $m_a$ be the number of wrong predictions in which the algorithm does not choose the optimal arm. In such time steps the algorithm has a regret of at least $\delta$. Therefore, the overall regret of the algorithm is at least $m_a \delta$. Using Theorem 4.1, we get

$$m_a \ \le\ \frac{k^4(f(T)+1)\log(T)}{\delta}.$$

The total number of wrong predictions up to time step $T$ is therefore at most $m_p + m_a = O\big(\frac{k^4(f(T)+1)\log(T)}{\delta}\big)$.

## 5 Lower bound

In this section, we show a lower bound on the prediction regret that holds even when the regret of the agent is zero, that is, $f(T) = 0$. We prove that no algorithm can predict the agent’s actions with a total regret lower than $\Omega(k)$, even when $f(T) = 0$.

###### Theorem 5.1

Given any algorithm, there exists a sequence of costs for which $\mathbb{E}\big[\sum_t pr_t\big] = \Omega(k)$.

###### Proof

For simplicity suppose $k$ is even. Consider the following sequence of cost vectors:

$$c^1 = (0, 0, H, H, \ldots, H, H)$$
$$c^2 = (H, H, 0, 0, H, \ldots, H)$$
$$\vdots$$
$$c^{k/2} = (H, H, \ldots, H, H, 0, 0)$$

where $H$ is any constant bigger than 1. Formally, for $t \le k/2$,

$$c^t_i = \begin{cases} 0 & i \in \{2t, 2t-1\} \\ H & \text{otherwise.} \end{cases} \tag{12}$$

Note that at each time step $t$, the algorithm has no information about arms $2t$ and $2t-1$: neither has appeared with a low cost before, and the agent has not yet played either of them. Therefore, the algorithm cannot do better than choosing between them at random. If we set the rewards of the arms as

$$v^*_i = \begin{cases} 1 & i \text{ is even} \\ 0 & i \text{ is odd,} \end{cases} \tag{13}$$

then an agent who plays the even-indexed free arm in every step has zero regret, while the algorithm has an expected regret of $\frac{1}{2}$ at every step. Therefore, the total regret will be at least $\frac{k}{4}$.
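The construction in the proof is easy to write down; the sketch below builds the cost matrix of Equation (12) and the values of Equation (13). The choice H = 2 is an arbitrary constant bigger than 1.

```python
import numpy as np

def lower_bound_costs(k, H=2.0):
    """Cost matrix of Eq. (12): at step t (1-indexed), the two arms 2t-1 and 2t
    (1-indexed) are free and every other arm costs H > 1."""
    costs = np.full((k // 2, k), H)
    for t in range(k // 2):
        costs[t, 2 * t] = 0.0      # arm 2t-1 in 1-indexed notation
        costs[t, 2 * t + 1] = 0.0  # arm 2t
    return costs

def lower_bound_values(k):
    """Values of Eq. (13): 1 for even-indexed arms, 0 for odd-indexed arms."""
    return np.array([1.0 if (i + 1) % 2 == 0 else 0.0 for i in range(k)])

costs = lower_bound_costs(6)
v_star = lower_bound_values(6)
# At step t both free arms are fresh, so any predictor is guessing between a
# value-1 and a value-0 arm: expected per-step regret 1/2, total at least k/4.
```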

## 6 Conclusion

In this paper, we studied a multi-armed bandit setting where in each step, a cost for playing each arm is announced to the agent. We proved that if we observe an agent that achieves a regret of at most $f(T)$, then even without observing any rewards, we can learn to play with a regret of at most $O\big(k^4(f(T)+1)\log(T)\big)$, where $k$ is the number of arms.

We used this model to capture applications like ad auctions, where the goal is to understand and predict the behavior of an advertiser with unknown utility and unobserved rewards.

There are several problems that are left open. The most natural open question is to find the best regret bound achievable in our setting; the only lower bound we know is $\Omega(k)$, which holds even when $f(T) = 0$. Also, the broader question of predicting a selfish agent’s actions in a dynamic environment without observing her rewards is open in more complicated settings.

## References

• [1] S. Athey and D. Nekipelov (2010) A structural model of sponsored search advertising auctions. Sixth ad auctions workshop. Cited by: §2.
• [2] S. Athey and P. A. Haile (2007) Nonparametric Approaches to Auctions. In Handbook of Econometrics, Handbook of Econometrics. Cited by: §2.
• [3] L. Backstrom and J. Kleinberg (2011) Network bucket testing. In Proceedings of the 20th international conference on World wide web, pp. 615–624. Cited by: §1.
• [4] M. Braverman, J. Mao, J. Schneider, and M. Weinberg (2018) Selling to a no-regret buyer. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 523–538. Cited by: §2.
• [5] A. Broder, E. Gabrilovich, V. Josifovski, G. Mavromatis, and A. Smola (2011) Bid generation for advanced match in sponsored search. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, New York, NY, USA, pp. 515–524. Cited by: §2.
• [6] S. Bubeck and N. Cesa-Bianchi (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. CoRR abs/1204.5721. Cited by: §1.
• [7] S. Campo, E. Guerre, I. Perrigne, and Q. Vuong (2003) Semiparametric Estimation of First-price Auctions with Risk Averse Bidders. Working Papers Centre de Recherche en Economie et Statistique. Cited by: §2.
• [8] M. Cary, A. Das, B. Edelman, I. Giotis, K. Heimerl, A. R. Karlin, C. Mathieu, and M. Schwarz (2007) Greedy bidding strategies for keyword auctions. In Proceedings of the 8th ACM Conference on Electronic Commerce, EC ’07, New York, NY, USA, pp. 262–271. External Links: ISBN 978-1-59593-653-0, Document Cited by: §2.
• [9] S. Chawla, J. Hartline, and D. Nekipelov (2014) Mechanism design for data science. In Proceedings of the fifteenth ACM conference on Economics and computation, pp. 711–712. Cited by: footnote 2.
• [10] M. Feldman, A. Kalai, and M. Tennenholtz (2010) Playing games without observing payoffs.. In ICS, pp. 106–110. Cited by: §2.
• [11] E. Guerre, I. Perrigne, and Q. Vuong (2000-05) Optimal Nonparametric Estimation of First-Price Auctions. Econometrica 68 (3), pp. 525–574. External Links: Document Cited by: §2.
• [12] E. Guerre, I. Perrigne, and Q. Vuong (2009) Nonparametric Identification of Risk Aversion in First-Price Auctions Under Exclusion Restrictions. Econometrica. Cited by: §2.
• [13] M. Jofre-Bonet and M. Pesendorfer (2003) Estimation of a Dynamic Auction Game. Econometrica. Cited by: §2.
• [14] V. Kuleshov and O. Schrijvers (2015) Inverse game theory: learning utilities in succinct games. In WINE. Cited by: §2.
• [15] K. J. Lang and R. Andersen (2007) Finding dense and isolated submarkets in a sponsored search spending graph. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pp. 613–622. Cited by: footnote 1.
• [16] H. Mizuta and K. Steiglitz (2000) Agent-based simulation of dynamic online auctions. In Simulation Conference, 2000. Proceedings. Winter, Vol. 2, pp. 1772–1777. Cited by: §1.
• [17] D. Nekipelov, V. Syrgkanis, and E. Tardos (2015) Econometrics for learning agents. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, pp. 1–18. Cited by: §1, §2.
• [18] F. Pin and P. Key (2011) Stochastic variability in sponsored search auctions: observations and models. In Proceedings of the 12th ACM Conference on Electronic Commerce, pp. 61–70. Cited by: §2.
• [19] D. Tang, A. Agarwal, D. O’Brien, and M. Meyer (2010) Overlapping experiment infrastructure: more, better, faster experimentation. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 17–26. Cited by: §1.
• [20] H. R. Varian (2007) Position auctions. International Journal of Industrial Organization. Cited by: §2.
• [21] H. Xu, B. Gao, D. Yang, and T. Liu (2013-05) Predicting advertiser bidding behaviors in sponsored search by rationality modeling. In Proceedings of the 22nd international conference on World Wide Web, Cited by: §2.