1 Introduction
Consider a bidder trying to decide how much to bid in an auction (for example, a sponsored search auction). If the auction happens to be the truthful Vickrey-Clarke-Groves auction [Vic61, Cla71, Gro73], then the bidder’s decision is easy: simply bid your value. If instead the bidder is participating in a Generalized First-Price (GFP) or Generalized Second-Price (GSP) auction, the optimal strategy is less clear. Bidders can certainly attempt to compute a Bayes-Nash equilibrium of the associated game and play accordingly, but this is unrealistic due to the need for accurate priors and extensive computation.
Alternatively, the bidders may try to learn a best response over time (possibly offloading the learning to commercial bid optimizers). We specifically consider bidders who no-regret learn, as empirical work of Nekipelov et al. [NST15] shows that bidder behavior on Bing is largely consistent with no-regret learning (i.e. for most bidders, there exists a per-click value such that their behavior guarantees no regret for this value). From the perspective of a revenue-maximizing auction designer, this motivates the following question: If a seller knows that buyers are no-regret learning over time, how should they maximize revenue?
This question is already quite interesting even when there is just a single item for sale to a single buyer. [Footnote 1: And also surprisingly relevant: search engines don’t generally publish their formulas for setting reserves. So even if you are the only bidder for a certain keyword (e.g. the name of your new startup), you’re likely participating in a GSP/GFP auction with no additional bidders, but against a seller who adaptively sets the reserve price based on past bids. Anecdotal evidence indeed suggests that the reserve prices in such single-bidder auctions will change over time.] We consider a model where in every round $t$, the seller solicits a bid $b \in [0,1]$ from the buyer, then allocates the item according to some allocation rule $x_t(\cdot)$ and charges the bidder according to some pricing rule $p_t(\cdot)$ (satisfying $p_t(b) \le b \cdot x_t(b)$ for all $t, b$). Note that the allocation and pricing rules (henceforth, auction) can differ from round to round, and that the auction need not be truthful. Each round, the bidder has a value $v_t$ drawn independently from $\mathcal{D}$, and uses some no-regret learning algorithm to decide which bid to place in round $t$, based on the outcomes in rounds $1, \ldots, t-1$ (we will make clear exactly what it means for a buyer with changing valuation to play no-regret in Section 2, but one can think of $v_t$ as providing a “context” for the bidder during round $t$).
One default strategy for the seller is simply to set Myerson’s revenue-optimal reserve price for $\mathcal{D}$, $r(\mathcal{D})$, in every round (that is, $x_t(b) = I(b \ge r(\mathcal{D}))$ and $p_t(b) = r(\mathcal{D}) \cdot I(b \ge r(\mathcal{D}))$ for all $t$, where $I(\cdot)$ is the indicator function). It’s not hard to see that any no-regret learning algorithm will eventually learn to submit a winning bid during all rounds where $v_t > r(\mathcal{D})$, and a losing bid whenever $v_t < r(\mathcal{D})$. So if $\mathrm{Mye}(\mathcal{D})$ denotes the expected revenue of the optimal reserve price when a single buyer is drawn from $\mathcal{D}$, the default strategy guarantees the seller revenue $\mathrm{Mye}(\mathcal{D}) \cdot T - o(T)$ over $T$ rounds. The question then becomes whether or not the seller can beat this benchmark, and if so by how much.
The answer to this question isn’t a clear-cut yes or no, so let’s start with the following instantiation: how much revenue can the seller extract if the buyer runs EXP3 [ACBFS03]? In Theorem 3.1, we show that the seller can actually do much better than the default strategy: it’s possible to extract revenue per round equal to (almost) the full expected welfare! That is, if $\mathrm{Val}(\mathcal{D}) = \mathbb{E}_{v \sim \mathcal{D}}[v]$, then for any constant $\varepsilon > 0$ there exists an auction that extracts revenue $(1-\varepsilon) \cdot \mathrm{Val}(\mathcal{D}) \cdot T - o(T)$ for all $\mathcal{D}$. [Footnote 2: The order of quantifiers in this sentence is correct: it is actually the same auction format that works for all $\mathcal{D}$.] It turns out this result holds not only for EXP3, but for any learning algorithm with the following (roughly stated) property: if at time $t$, the mean reward of action $a$ is significantly larger than the mean reward of action $b$, the learning algorithm will choose action $b$ with negligible probability. We call a learning algorithm with this property a “mean-based” learning algorithm and note that many commonly used learning algorithms (EXP3, Multiplicative Weights Update [AHK12], and Follow-the-Perturbed-Leader [Han57, KV02, KV05]) are “mean-based” (see Section 2 for a formal definition).
We postpone all intuition until Section 3.1 with a worked-through example, but just note here that the auction format is quite unnatural: it “lures” the bidder into submitting high bids early on by giving away the item for free, and then charging very high prices (but still bounded in $[0,1]$) near the end. The transition from “free” to “high-price” is carefully coordinated across different bids to achieve the revenue guarantee.
This result motivates two further directions. First, do there exist other no-regret algorithms for which full surplus extraction is impossible for the seller? In Theorem 3.2, we show that the answer is yes. In fact, there is a simple no-regret algorithm $\mathcal{A}$, such that when the bidder uses algorithm $\mathcal{A}$ to bid, the default strategy (set the Myerson reserve every round) is optimal for the seller. We again postpone a formal statement and intuition to Section 3.2, but just note here that the algorithm is a natural adaptation of EXP3 (or in fact, any existing no-regret algorithm) to our setting.
Finally, it is reasonable to expect that bidders might use off-the-shelf no-regret learning algorithms like EXP3, so it is still important to understand what the seller can hope to achieve if the buyer is specifically using such a “mean-based” algorithm (formal definition in Section 2). Theorem 3.1 is perhaps unsatisfying in this regard because the proposed auction is so unnatural, and looks nothing like the GSP or GFP auctions that initially motivated this study. It turns out that the key property separating GFP/GSP from the unnatural auction above is whether overbidding is a dominated strategy. That is, in our unnatural auction, if the bidder truly hopes to guarantee low regret they must seriously consider overbidding (and this is how the auction lures them into bidding way above their value). In both GSP and GFP, overbidding is dominated, so the bidder can guarantee no regret while overbidding with probability $0$ in every round.
The final question we ask is the following: if the buyer is using EXP3 (or any “mean-based” algorithm), but only considering undominated strategies, how much revenue can the seller extract using an auction where overbidding is dominated in every round? It turns out that the auctioneer can still outperform the default strategy, but not extract full welfare. Instead, we identify a linear program (as a function of $\mathcal{D}$) that tightly characterizes the optimal revenue the seller can achieve in this setting when the buyer’s values are drawn from $\mathcal{D}$. Moreover, we show that the auction that achieves this guarantee is natural, and can be thought of as a first-price auction with decreasing reserves over time. Finally, we show that this “mean-based revenue” benchmark, $\mathrm{MBRev}(\mathcal{D})$, lies truly in between the Myerson revenue and the expected welfare: for all $c$, there exists a distribution $\mathcal{D}$ over values such that $c \cdot \mathrm{Mye}(\mathcal{D}) < \mathrm{MBRev}(\mathcal{D}) < \frac{1}{c} \cdot \mathrm{Val}(\mathcal{D})$. In other words, the seller’s mean-based revenue may be unboundedly better than the default strategy, yet simultaneously unboundedly far from the expected welfare. We provide formal statements and a detailed proof overview of these results in Section 3.3. To briefly recap, our main results are the following:

• If the buyer uses a “mean-based” learning algorithm like EXP3, the seller can extract revenue $(1-\varepsilon) \cdot \mathrm{Val}(\mathcal{D}) \cdot T - o(T)$ for any constant $\varepsilon > 0$ (Theorem 3.1).
• There exists a natural no-regret algorithm $\mathcal{A}$ such that when the buyer bids according to $\mathcal{A}$, the seller’s default strategy (charging the Myerson reserve every round) is optimal (Theorem 3.2).
• If the buyer uses a “mean-based” algorithm only over undominated strategies, the seller can extract revenue approximately $\mathrm{MBRev}(\mathcal{D}) \cdot T$ using an auction where overbidding is dominated in every round. Moreover, we characterize $\mathrm{MBRev}(\mathcal{D})$ as the value of a linear program, and show it can be simultaneously unboundedly better than $\mathrm{Mye}(\mathcal{D})$ and unboundedly worse than $\mathrm{Val}(\mathcal{D})$ (Theorems 3.3, 3.4 and 3.5).
Our plan for the remaining sections is as follows. Below, we overview our connection to related work. Section 2 formally defines our model. Section 3 works through a concrete example, providing intuition for all three results. Section 4 discusses conclusions and open problems.
1.1 Related Work
There are two lines of work that are most related to ours. The first is that of dynamic auctions, such as [PPPR16, ADH16, MLTZ16a, MLTZ16b, LP17]. Like our model, there are $T$ rounds where the seller has a single item for sale to a single buyer, whose value is drawn from some distribution every round. However, the buyer is fully strategic and processes fully how their choices today affect the seller’s decisions tomorrow (e.g. they engage with deals of the form “pay today to get the item tomorrow”). Additional closely related work is that of Devanur et al. studying the Fishmonger problem [DPS15, ILPT17]. Here, there is again a single buyer and seller, and $T$ rounds of sale. Unlike our model, the buyer draws a value from $\mathcal{D}$ only once, and that value is fixed through all rounds (so the seller could try to learn the buyer’s value over time). Also unlike our model, they study perfect Bayesian equilibria (where again the buyer is fully strategic, and reasons about how their actions today affect the seller’s behavior tomorrow).
In contrast to these works, while buyers in our model do care about the future (e.g. they value learning), they don’t reason about how their actions today might affect the seller’s decisions tomorrow. Our model is more realistic for sponsored search auctions, where search engines rarely release proprietary algorithms for setting reserves based on past data (and fully strategic reasoning is simply impossible without the necessary information).
Other related work considers the Price of Anarchy of simple combinatorial auctions when bidders no-regret learn [Rou12, ST13, NST15, DS16]. One key difference between this line of work and ours is that these all study welfare maximization for combinatorial auctions with rich valuation functions. In contrast, our work studies revenue maximization while selling a single item. Additionally, in these works the seller commits to a publicly known auction format, and the only reason for learning is due to the strategic behavior of other buyers. In contrast, buyers in our model have to learn even when they are the only buyer, due to the strategic nature of the seller.
Recent work has also considered learning from the perspective of the seller. In these works, the buyer’s (or buyers’) valuations are drawn from an unknown distribution, and the seller’s goal is to learn an approximately optimal auction with as few samples as possible [CR14, DHP16, MR15, MR16, GN17, CD17, DHL17]. These works consider numerous different models and achieve a wide range of guarantees, but all study the learning problem from the perspective of the seller, whereas the buyer is simply myopic and participates in only one round. In contrast, it is the buyer in our model who does the learning (and there is no information for the seller to learn: the buyer’s values are drawn fresh in every round).
Finally, no-regret learning in online decision problems is an extremely well-studied problem. When feedback is revealed for every possible action, one well-known solution is the multiplicative weights update (MWU) rule, which has been rediscovered and applied in many fields (see the survey [AHK12] for more details). Another algorithmic scheme for the online decision problem is known as Follow the Perturbed Leader [Han57, KV02, KV05]. When only feedback for the selected action is revealed, the problem is referred to as the multi-armed bandit problem. Here, similar ideas to the MWU rule are used in developing the EXP3 algorithm [ACBFS03] for the adversarial bandit model, and also for the contextual bandit problem [LZ08]. Our algorithm in Theorem 3.2 bears some similarities to the low swap regret algorithm introduced in [BM07]. See the survey [BC12] for more details about the multi-armed bandit problem. Our results hold in both models (i.e. whether the buyer receives feedback for every bid they could have made, or only the bid they actually make), so we will make use of both classes of algorithms.
In summary, while there is already extensive work related to repeated sales in auctions, and even no-regret learning with respect to auctions (from both the buyer and seller perspective), our work is the first to address how a seller might adapt their selling strategy when faced with a no-regret buyer.
2 Model and Preliminaries
We consider a setting with 1 buyer and 1 seller. There are $T$ rounds, and in each round the seller has one item for sale. At the start of each round $t$, the buyer’s value $v_t$ (known only to the buyer) for the item is drawn independently from some distribution $\mathcal{D}$ (known to both the seller and the buyer). For simplicity, we assume $\mathcal{D}$ has a finite support of size $m$, supported on values $0 \le v_1 < v_2 < \cdots < v_m \le 1$. [Footnote 3: If instead $\mathcal{D}$ has infinite support, all our results hold approximately after discretization to multiples of $\varepsilon$. If $\mathcal{D}$ is bounded in $[0,H]$, then all our results hold after normalizing $\mathcal{D}$ by dividing by $H$.] For each $i \in [m]$, $v_i$ has probability $q_i$ of being drawn under $\mathcal{D}$.
The seller then presents $K$ options for the buyer, which can be thought of as “possible bids” (we will interchangeably refer to these as options, bids, or arms throughout the paper, depending on context). Each arm $i$ is labelled with a bid value $b_i \in [0,1]$, with $b_1 < b_2 < \cdots < b_K$. Upon pulling this arm at round $t$, the buyer receives the item with some allocation probability $a_{i,t}$, and must pay a price $p_{i,t} \in [0, a_{i,t} \cdot b_i]$. These values $a_{i,t}$ and $p_{i,t}$ are chosen by the seller during time $t$, but remain unknown to the buyer until he plays an arm, upon which he learns the values for that arm. All of our positive results (i.e. strategies for the seller) are non-adaptive (in some places called oblivious), in the sense that $a_{i,t}$ and $p_{i,t}$ are set before the first round starts. All of our negative results (i.e. upper bounds on how much a seller can possibly attain) hold even against fully adaptive sellers, where $a_{i,t}$ and $p_{i,t}$ can be set even after learning the distribution of arms the buyer intends to pull in round $t$.
In order for the selling strategies to possibly represent sponsored search auctions, we require the allocation/price rules to be monotone. That is, if $i > j$, then for all $t$, $a_{i,t} \ge a_{j,t}$ and $p_{i,t} \ge p_{j,t}$. In other words, bidding higher should result in a (weakly) higher probability of receiving the item and (weakly) higher expected payment. We’ll also insist on the existence of an arm $0$ with bid $b_0 = 0$ and $a_{0,t} = 0$ for all $t$; i.e., an arm which charges nothing but does not give the item. Playing this arm can be thought of as not participating in the auction.
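As a small illustration of the two requirements above, the following sketch checks a single round's arms for monotonicity and for the presence of a null arm. The function names and the sample numbers are hypothetical and chosen only for illustration; arms are assumed to be listed in increasing order of bid.

```python
from typing import Sequence


def is_monotone(alloc: Sequence[float], price: Sequence[float]) -> bool:
    """True if allocation and price are both weakly increasing in the arm index."""
    pairs = list(zip(alloc, price))
    return all(a2 >= a1 and p2 >= p1
               for (a1, p1), (a2, p2) in zip(pairs, pairs[1:]))


def has_null_arm(bids: Sequence[float], alloc: Sequence[float],
                 price: Sequence[float]) -> bool:
    """True if some arm has bid 0, never awards the item, and charges nothing."""
    return any(b == 0 and a == 0 and p == 0
               for b, a, p in zip(bids, alloc, price))


# Hypothetical single-round rule with three arms (illustrative numbers only).
bids = [0.0, 0.25, 0.5]
alloc = [0.0, 0.5, 1.0]
price = [0.0, 0.125, 0.5]
```

A seller rule that awarded the item more often to a lower bid than to a higher one would fail `is_monotone` and so fall outside the model.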
We’ll be interested in one final property of allocation/price rules that we call critical, and buyer behavior that we call clever. We won’t require that all auctions considered be critical, but this is an important property that greatly affects the optimal revenue that a seller can extract (see Theorems 3.1 and 3.3).
Definition 2.1 (Clever Bidder).
We say that a bidder is clever if they never play a dominated strategy. That is, they still no-regret learn, but only over the set of bids which are not dominated.
Definition 2.2 (Critical Auction).
A vector of allocation/price rules (over all rounds $t$) is critical if for all $t$, overbidding is a dominated strategy.
The above definition captures the property that in many auctions like GFP and GSP (both of which are critical), it makes no sense for a buyer to ever play dominated strategies; they need only learn over the undominated strategies. Note that if overbidding is strictly dominated, any low-regret or mean-based learning algorithm will quickly learn not to overbid, and therefore play similarly to clever bidders in critical auctions.
2.1 Bandits and experts
Our goal is to understand the behavior of such mechanisms when the buyer plays according to some no-regret strategy for the multi-armed bandit problem. In the classic multi-armed bandit problem a learner (in our case, the buyer) chooses one of $K$ arms per round, over $T$ rounds. On round $t$, the learner receives a reward $r_{i,t}$ for pulling arm $i$ (where the values $r_{i,t}$ are possibly chosen adversarially). The learner’s goal is to maximize his total reward.
Let $I_t$ denote the arm pulled by the learner at round $t$. The regret of an algorithm $\mathcal{A}$ for the learner is the random variable $\mathrm{Reg}(\mathcal{A}) = \max_{i} \sum_{t=1}^{T} r_{i,t} - \sum_{t=1}^{T} r_{I_t,t}$. We say an algorithm $\mathcal{A}$ for the multi-armed bandit problem is $\delta$-no-regret if $\mathbb{E}[\mathrm{Reg}(\mathcal{A})] \le \delta T$ (where the expectation is taken over the randomness of $\mathcal{A}$). We say an algorithm is no-regret if it is $\delta$-no-regret for some $\delta = o(1)$.
In the multi-armed bandits setting, the learner only learns the value $r_{i,t}$ for the arm $i$ which he pulls on round $t$. In our setting, the learner will learn $a_{i,t}$ and $p_{i,t}$ explicitly (from which they can compute $r_{i,t}$). Our results (both positive and negative) also hold when the learner learns the value $r_{i,t}$ for all arms $i$ (we refer to this full-information setting as the experts setting, in contrast to the partial-information bandits setting). Simple no-regret algorithms exist in both the experts setting and the bandits setting. Of special interest in this paper will be a class of learning algorithms for the bandits problem and experts problem which we term “mean-based”.
Definition 2.3 (MeanBased Learning Algorithm).
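Let $\sigma_{i,t} = \sum_{s=1}^{t} r_{i,s}$. An algorithm for the experts problem or multi-armed bandits problem is $\gamma$-mean-based if it is the case that whenever $\sigma_{i,t} < \sigma_{j,t} - \gamma T$, then the probability $p_{i,t}$ that the algorithm pulls arm $i$ on round $t$ is at most $\gamma$. We say an algorithm is mean-based if it is $\gamma$-mean-based for some $\gamma = o(1)$.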
Intuitively, “mean-based” algorithms will rarely pick an arm whose current mean is significantly worse than the current best mean. Many no-regret algorithms, including commonly used variants of EXP3 (for the bandits setting), the Multiplicative Weights algorithm (for the experts setting) and the Follow-the-Perturbed-Leader algorithm (experts setting), are mean-based (Appendix D).
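To make the mean-based property concrete, here is a minimal sketch of the sampling distribution used by a Multiplicative Weights (Hedge) learner in the experts setting. The learning rate `eta`, the horizon `T`, and the fixed slack `gamma` are illustrative assumptions (the definition asks for a slack that is $o(1)$, which a fixed number only approximates); the point is just that an arm trailing the leader by `gamma * T` in cumulative reward is sampled with very small probability.

```python
import math


def hedge_probabilities(cumulative_rewards, eta):
    """Hedge sampling rule: P(arm i) is proportional to exp(eta * cumulative reward of i)."""
    top = max(cumulative_rewards)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(eta * (s - top)) for s in cumulative_rewards]
    total = sum(weights)
    return [w / total for w in weights]


T = 10_000                                  # horizon (assumed)
K = 2                                       # number of arms
eta = math.sqrt(math.log(K) / T)            # a standard learning-rate choice (assumed)
gamma = 0.1                                 # illustrative slack, not o(1)
sigma = [0.5 * T, 0.5 * T - gamma * T]      # arm 1 trails arm 0 by gamma * T
probs = hedge_probabilities(sigma, eta)
```

With these numbers the trailing arm is chosen with probability far below `gamma`, which is the behavior the definition asks for.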
Contextual bandits
In our setting, the buyer has the additional information of their current value for the item, and hence is actually facing a contextual bandits problem. In (our variant of) the contextual bandits problem, each round $t$ the learner is additionally provided with a context $c_t$ drawn from some distribution $\mathcal{D}$ supported on a finite set $\mathcal{C}$ (in our setting, $c_t = v_t$, the buyer’s valuation for the item at time $t$). The adversary now specifies rewards $r_{i,t}(c)$, the reward the learner receives if he pulls arm $i$ on round $t$ while having context $c$. If we are in the full-information (experts) setting, the learner learns the values of $r_{i,t}(c_t)$ for all arms $i$ after round $t$, whereas if we are in the partial-information (bandits) setting, the learner only learns the value of $r_{i,t}(c_t)$ for the arm $i$ that he pulled.
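In the contextual bandits setting, we now define the regret of an algorithm $\mathcal{A}$ in terms of regret against the best “context-specific” policy $\pi : \mathcal{C} \to [K]$; that is, $\mathrm{Reg}(\mathcal{A}) = \max_{\pi} \sum_{t=1}^{T} r_{\pi(c_t),t}(c_t) - \sum_{t=1}^{T} r_{I_t,t}(c_t)$, where again $I_t$ is the arm pulled by $\mathcal{A}$ on round $t$. As before, we say an algorithm is $\delta$-no-regret if $\mathbb{E}[\mathrm{Reg}(\mathcal{A})] \le \delta T$, and say an algorithm is no-regret if it is $\delta$-no-regret for some $\delta = o(1)$.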
If the size of the context set $\mathcal{C}$ is constant with respect to $T$, then there is a simple way to construct a no-regret algorithm for the contextual bandits problem from a no-regret algorithm $\mathcal{A}$ for the classic bandits problem: simply maintain a separate instance of $\mathcal{A}$ for every different context (in the contextual bandits literature, this is sometimes referred to as the S-EXP3 algorithm [BC12]). We call the algorithm we obtain this way its contextualization, and denote it as $\mathrm{cont}(\mathcal{A})$.
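The construction above (one independent learner per context) can be sketched as a thin wrapper. The `Contextualized` and `FollowTheLeader` classes below are hypothetical names introduced for illustration; follow-the-leader is used only as a toy full-information base learner and is not itself a no-regret algorithm against adversarial rewards.

```python
from collections import defaultdict


class FollowTheLeader:
    """Toy full-information learner: plays the arm with the highest cumulative reward."""

    def __init__(self, num_arms):
        self.totals = [0.0] * num_arms

    def choose(self):
        # Ties are broken toward the lowest arm index.
        return max(range(len(self.totals)), key=lambda i: self.totals[i])

    def update(self, rewards):
        for i, r in enumerate(rewards):
            self.totals[i] += r


class Contextualized:
    """Keeps a separate, independent base learner for every context seen so far."""

    def __init__(self, make_learner):
        self.learners = defaultdict(make_learner)

    def choose(self, context):
        return self.learners[context].choose()

    def update(self, context, rewards):
        self.learners[context].update(rewards)


learner = Contextualized(lambda: FollowTheLeader(2))
learner.update("low", [1.0, 0.0])    # in context "low", arm 0 was better
learner.update("high", [0.0, 1.0])   # in context "high", arm 1 was better
```

Because each context has its own state, what the learner observes under one value never influences its play under another; this separation is what the seller's strategy in Section 3.1 takes advantage of.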
If we start with a mean-based learning algorithm, then we can show that its contextualization satisfies an analogue of the mean-based property for the contextual bandits problem (proof in Appendix D).
Definition 2.4 (MeanBased Contextual Learning Algorithm).
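Let $\sigma_{i,t}(c) = \sum_{s=1}^{t} r_{i,s}(c)$. An algorithm for the contextual bandits problem is $\gamma$-mean-based if it is the case that whenever $\sigma_{i,t}(c) < \sigma_{j,t}(c) - \gamma T$, then the probability $p_{i,t}(c)$ that the algorithm pulls arm $i$ on round $t$ if it has context $c$ is at most $\gamma$. We say an algorithm is mean-based if it is $\gamma$-mean-based for some $\gamma = o(1)$.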
Theorem 2.5.
If an algorithm for the experts problem or multi-armed bandits problem is mean-based, then its contextualization is also a mean-based algorithm for the contextual bandits problem.
2.2 Welfare and monopoly revenue
In order to evaluate the performance of our mechanisms for the seller, we will compare the revenue the seller obtains to two benchmarks from the single-round setting of a seller selling a single item to a buyer with value drawn from distribution $\mathcal{D}$.
The first benchmark we consider is the welfare of the buyer, the expected value the buyer assigns to the item. This quantity clearly upper bounds the expected revenue that the seller can hope to extract per round.
Definition 2.6.
The welfare, $\mathrm{Val}(\mathcal{D})$, is equal to $\mathbb{E}_{v \sim \mathcal{D}}[v]$.
The second benchmark we consider is the monopoly revenue, the maximum possible revenue attainable by the seller in one round against a rational buyer. Seminal work of Myerson [Mye81] shows that this revenue is attainable by setting a fixed price (“monopoly/Myerson reserve”) for the item, and hence can be characterized as follows.
Definition 2.7.
The monopoly revenue (alternatively, Myerson revenue) $\mathrm{Mye}(\mathcal{D})$ is equal to $\max_{p} \; p \cdot \Pr_{v \sim \mathcal{D}}[v \ge p]$.
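Both benchmarks are easy to compute for a finitely supported distribution. The sketch below uses exact fractions; the function names are hypothetical, and the sample distribution (values $1/4$, $1/2$, $1$ with probabilities $1/2$, $1/4$, $1/4$) is an assumption chosen for illustration. For finite support, the best posted price can be taken from the support, since raising a price up to the next support point never loses a sale.

```python
from fractions import Fraction as F


def welfare(dist):
    """Expected value of the buyer: sum of v * Pr[v]."""
    return sum(v * q for v, q in dist.items())


def myerson_revenue(dist):
    """Best posted-price revenue: max over prices p in the support of p * Pr[v >= p]."""
    return max(p * sum(q for v, q in dist.items() if v >= p) for p in dist)


# Assumed example distribution: value -> probability.
example = {F(1, 4): F(1, 2), F(1, 2): F(1, 4), F(1): F(1, 4)}
```

For this assumed distribution every support point earns the same posted-price revenue, so the monopoly revenue is half of the welfare.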
2.3 A final note on the model
For concreteness, we chose to phrase our problem as one where a single bidder whose value is repeatedly drawn independently from $\mathcal{D}$ each round engages in no-regret learning with their value as context. Alternatively, we could imagine a population of $m$ different buyers, each with a fixed value $v_i$. Each round, exactly one buyer arrives at the auction, and it is buyer $i$ with probability $q_i$. The buyers are indistinguishable to the seller, and each buyer no-regret learns (without context, because their value is always $v_i$). This model is mathematically equivalent to ours, so all of our results hold in this model as well if the reader prefers this interpretation instead.
3 An Illustrative Example
In this section, we overview an illustrative example to show the difference between mean-based and non-mean-based learning algorithms, and between critical and arbitrary auctions. We will not prove all claims in this section (nor carry out all calculations) as it is only meant to illustrate and provide intuition. Throughout this section, the running example will be when $\mathcal{D}$ samples $1/4$ with probability $1/2$, $1/2$ with probability $1/4$, and $1$ with probability $1/4$. Note that $\mathrm{Val}(\mathcal{D}) = 1/2$ and $\mathrm{Mye}(\mathcal{D}) = 1/4$.
3.1 MeanBased Learning and Arbitrary Auctions
Let’s first consider what the seller can do with an arbitrary (not critical) auction when the buyer is running a mean-based learning algorithm like EXP3. The seller will let the buyer bid $0$ or $1$. If the buyer bids $0$, they pay nothing but do not receive the item (recall that an arm of this form is required). If the buyer bids $1$ in round $t$, they receive the item and pay some price $p_t$ as follows: for the first half of the game ($1 \le t \le T/2$), the seller sets $p_t = 0$. For the second half of the game ($T/2 < t \le T$), the seller sets $p_t = 1$.
Let’s examine the behaviour of the buyer, recalling that they run a mean-based learning algorithm, and therefore (almost) always pull the arm with highest cumulative utility. The buyer with value $1$ will happily bid $1$ all the way through, since he is always offered the item for less than or equal to his value for the item. The buyer with value $1/2$ will bid $1$ for the first $T/2$ rounds, accumulating a surplus (i.e., negative regret) of $1/2$ per round. For the next $T/2$ rounds, this surplus slowly disappears at the rate of $1/2$ per round until it disappears at time $T$, so the bidder with value $1/2$ will bid $1$ all the way through. Finally, the bidder with value $1/4$ will bid $1$ for the first $T/2$ rounds, accumulating surplus at a rate of $1/4$ per round. After round $T/2$, this surplus decreases at a rate of $3/4$ per round, until at round $2T/3$ his cumulative utility from bidding $1$ reaches $0$ and he switches to bidding $0$.
Now let’s compute the revenue. From round $T/2$ through $2T/3$, the buyer always buys the item at a price of $1$, so the seller obtains $T/6$ revenue. Finally, from round $2T/3$ through $T$, the buyer purchases the item with probability $1/2$ and pays $1$. The total revenue is $T/6 + T/6 = T/3$. Note that if the seller used the default strategy, they would extract revenue only $T/4$.
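The revenue calculation above can be checked with a short simulation. This is a sketch under stated assumptions, not the construction from the appendix: the distribution is taken to be values $1/4$, $1/2$, $1$ with probabilities $1/2$, $1/4$, $1/4$; the “bid 1” arm is free in the first half and costs $1$ in the second half; and the buyer is an idealized mean-based learner with full-information feedback who, for each value, bids $1$ exactly when that arm’s cumulative utility so far is non-negative (ties go to bidding $1$). The function name `simulate_lure_auction` is hypothetical.

```python
from fractions import Fraction as F


def simulate_lure_auction(T, dist):
    """Expected seller revenue against an idealized mean-based buyer (see assumptions above)."""
    half = T // 2
    # Cumulative utility of the "bid 1" arm, tracked separately for each value (context).
    cum = {v: F(0) for v in dist}
    revenue = F(0)
    for t in range(1, T + 1):
        price = F(0) if t <= half else F(1)
        for v, q in dist.items():
            if cum[v] >= 0:          # "bid 1" is (weakly) the leader against the zero arm
                revenue += q * price
            cum[v] += v - price      # full-information feedback on the "bid 1" arm
    return revenue


example = {F(1, 4): F(1, 2), F(1, 2): F(1, 4), F(1): F(1, 4)}
T = 1200
revenue = simulate_lure_auction(T, example)
```

Under these assumptions the simulated revenue lands within one unit of $T/3$, compared with $T/4$ for the default strategy; the small discrepancy comes from tie-breaking at the switching round.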
Where did our extra revenue come from? First, note that the welfare of the buyer in this example is quite high: the bidder gets the item the whole way through when $v \in \{1/2, 1\}$, and two-thirds of the way through when $v = 1/4$. One reason why the welfare is so high is because we give the item away for free in the early rounds. But notice also that the utility of the buyer is quite low: the buyer actually has zero utility when $v \in \{1/4, 1/2\}$, and average utility $1/2$ per round when $v = 1$. The reason we’re able to keep the utility low, despite giving the item away for free in the early rounds, is because we overcharge the bidders in later rounds (and they choose to overpay, exactly because their learning is mean-based).
In fact, by offering additional options to the buyer, we show that it is possible for the seller to extract up to the full welfare from the buyer (e.g. a net revenue of $T/2 - o(T)$ for this example). As in the above example, our mechanism makes use of arms which are initially very good for the buyer (giving the item away for free, accumulating negative regret), followed by a period where they are very bad for the buyer (where they pay more than their value). The trick in the construction is making sure that the good/bad intervals line up so that: a) the buyer purchases the item in every round, no matter their value (this is necessary in order to possibly extract full welfare) and b) by round $T$, the buyer has zero (arbitrarily small) utility, no matter their value.
Getting the intervals to line up properly so that any mean-based learner will pick the desired arms still requires some work. But interestingly, our constructed mechanism is non-adaptive and prior-independent (i.e. the same mechanism extracts full welfare for all $\mathcal{D}$). Theorem 3.1 below formally states the guarantees. The construction itself and the proof appear in Appendix B.
Theorem 3.1.
If the buyer is running a mean-based algorithm, for any constant $\varepsilon > 0$, there exists a strategy for the seller which obtains revenue at least $(1-\varepsilon) \cdot \mathrm{Val}(\mathcal{D}) \cdot T - o(T)$.
Two properties should jump out as key in enabling the result above. The first is that the buyer only has no regret towards fixed arms and not towards the policy they would have used with a lower value (this is what leads the buyer to continue bidding $1$ with value $1/2$ even though they have already learned to bid $0$ with value $1/4$). This suggests an avenue towards an improved learning algorithm: have the bidder attempt to have no regret not only towards each fixed arm, but also towards the policy of play produced when having different values. This turns out to be exactly the right idea, and is discussed in the following subsection.
The second key property is that we were able to “lure” the bidders into playing an arm with a free item, then overcharge them later to make up for lost revenue. This requires that the bidder consider pulling an arm with maximum bid exceeding their value, which will never happen in a critical auction with clever bidders. It turns out it is still possible to do better than the default strategy with a critical auction against clever bidders, but not as well as with an arbitrary auction. Section 3.3 explores critical auctions for this example.
3.2 Better Learning and Arbitrary Auctions
In our bad example above, the buyer with value $1/2$ for the item slowly spends the second half of the game losing utility. While his behaviour is still no-regret (he ends up with zero net utility, which indeed is at least as good as only bidding $0$), he would have been much happier to follow the actions of the buyer with value $1/4$, who started bidding $0$ at $2T/3$.
Using this idea, we show how to construct a no-regret algorithm for the buyer such that the seller receives at most the Myerson revenue every round. We accomplish this by extending an arbitrary no-regret algorithm (e.g. EXP3) by introducing “virtual arms” for each value, so that each buyer with value $v$ has low regret not just with respect to every fixed bid, but also with respect to the policy of play as if they had a different value $v'$ for the item (for all $v' < v$). In some ways, our construction is very similar to the construction of low internal-regret (or swap-regret) algorithms from low external-regret algorithms. The main difference is that instead of having low regret with respect to swapping actions, we have low regret with respect to swapping contexts (i.e. values). Theorem 3.2 below states that the seller cannot outperform the default strategy against buyers who use such algorithms to learn.
Theorem 3.2.
There exists a no-regret algorithm for the buyer against which every seller strategy extracts no more than $\mathrm{Mye}(\mathcal{D}) \cdot T + o(T)$ revenue.
The algorithm’s description and proof appear in Appendix A. The key observation in the proof is that “not regretting playing as if my value were $v'$” sounds a lot like “not preferring to report value $v'$ instead of $v$.” This suggests that the aggregate allocation probabilities and prices paid by any buyer using our algorithm should satisfy the same constraints as a truthful auction, proving that the resulting revenue cannot exceed the default strategy (and indeed the proof follows this approach).
3.3 MeanBased Learning and Critical Auctions
Recall in our example that to extract revenue $T/3$, bidders with values $1/4$ and $1/2$ had to consider bidding $1$. If the seller is using a critical auction, overbidding is dominated, so there is no reason for bidders to do this. In fact, the analysis and results of this section hold as long as the bidders never consider overbidding (even if the auction isn’t critical).
Although the auction in Section 3.1 is no longer viable, consider the following auction instead: in addition to the zero arm, the bidder can bid $1/4$ or $1/2$. If they bid $1/2$ in any round, they will get the item with probability $1$ and pay $1/2$. If they bid $1/4$ in round $t \le T/2$, they get nothing. If they bid $1/4$ in round $t > T/2$, they get the item and pay $1/4$. Let’s again see what the bidder will choose to do, remembering that they will always pull the arm that has provided highest cumulative utility (due to being mean-based).
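Clearly, the bidder with value $1/4$ will bid $1/4$ every round (since they are clever, they won’t even consider bidding $1/2$), making a total payment of $\frac{T}{2} \cdot \frac{1}{2} \cdot \frac{1}{4} = \frac{T}{16}$. The bidder with value $1/2$ will bid $1/2$ for the first $T/2$ rounds, and then immediately switch to bidding $1/4$, making a total payment of $\frac{1}{4} \cdot \left( \frac{T}{2} \cdot \frac{1}{2} + \frac{T}{2} \cdot \frac{1}{4} \right) = \frac{3T}{32}$.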
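The bidder with value $1$ will actually bid $1/2$ for the entire $T$ rounds. To see this, observe that their cumulative surplus through round $t$ from bidding $1/2$ is $t \cdot \frac{1}{2} \cdot \frac{1}{4} = \frac{t}{8}$ ($t$ rounds by utility per round by probability of having value $1$). Their cumulative surplus through round $t$ from bidding $1/4$ is instead $\left(t - \frac{T}{2}\right) \cdot \frac{3}{4} \cdot \frac{1}{4} = \frac{3t}{16} - \frac{3T}{32}$ (for $t > T/2$), which stays below $\frac{t}{8}$ for all $t \le T$. Because they are mean-based, they will indeed bid $1/2$ for the entire duration due to its strictly higher utility. So their total payment will be $T \cdot \frac{1}{4} \cdot \frac{1}{2} = \frac{T}{8}$. The total revenue is then $\frac{T}{16} + \frac{3T}{32} + \frac{T}{8} = \frac{9T}{32}$, again surpassing the default strategy (but not reaching the $T/3$ achieved by our non-critical auction).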
Let’s again see where our extra revenue comes from in comparison to a truthful auction. Notice that the bidder receives the item with probability $1$ conditioned on having value $1/2$, and also conditioned on having value $1$. Yet somehow the bidder pays an average of $3/8$ conditioned on having value $1/2$, but an average of $1/2$ conditioned on having value $1$. This could never happen in a truthful auction, as the bidder would strictly prefer to pretend their value was $1/2$ rather than $1$. But it is entirely possible when the buyer does mean-based learning, as evidenced by this example.
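The critical-auction example can also be replayed numerically. The sketch below rests on several assumptions: the distribution is values $1/4$, $1/2$, $1$ with probabilities $1/2$, $1/4$, $1/4$; the arms are bids $0$, $1/4$, $1/2$, where bid $1/2$ always wins at price $1/2$ and bid $1/4$ wins at price $1/4$ only in the second half; the buyer is clever (considers only bids at most their value) and is an idealized mean-based learner with full-information feedback; and ties in cumulative utility are broken toward the higher bid. The names `arm_outcome` and `simulate_critical_auction` are hypothetical.

```python
from fractions import Fraction as F


def arm_outcome(bid, t, T):
    """(allocation probability, payment) of the arm with this bid in round t."""
    if bid == F(1, 2):
        return F(1), F(1, 2)
    if bid == F(1, 4) and t > T // 2:
        return F(1), F(1, 4)
    return F(0), F(0)


def simulate_critical_auction(T, dist, bids=(F(0), F(1, 4), F(1, 2))):
    """Expected seller revenue against a clever, idealized mean-based buyer."""
    # Per value, cumulative utility of every arm the buyer is willing to consider.
    cum = {v: {b: F(0) for b in bids if b <= v} for v in dist}
    revenue = F(0)
    for t in range(1, T + 1):
        for v, q in dist.items():
            # Play the leading arm; ties are broken toward the higher bid (assumption).
            best = max(cum[v], key=lambda b: (cum[v][b], b))
            revenue += q * arm_outcome(best, t, T)[1]
            for b in cum[v]:
                alloc, pay = arm_outcome(b, t, T)
                cum[v][b] += alloc * v - pay
    return revenue


example = {F(1, 4): F(1, 2), F(1, 2): F(1, 4), F(1): F(1, 4)}
T = 3200
revenue = simulate_critical_auction(T, example)
```

Under these assumptions the simulated revenue is within one unit of $9T/32$, which sits strictly between the $T/4$ of the default strategy and the $T/3$ of the non-critical example.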
In Appendix C, we define $\mathrm{MBRev}(\mathcal{D})$ as the value of the LP in Figure 1. In Theorems 3.3 and 3.4, we show that $\mathrm{MBRev}(\mathcal{D}) \cdot T$ tightly characterizes (up to $o(T)$) the optimal revenue a seller can extract with a critical auction against a clever buyer. We state the theorem statements more generally to remind the reader that they hold as long as the buyer never overbids (even if the auction is arbitrary). The proofs can be found in Appendix C.1.
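$$\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{m} q_i \cdot (v_i \cdot x_i - u_i) \\
\text{subject to} \quad & u_i \ge (v_i - v_j) \cdot x_j, && \forall\, i, j \in [m] : i > j \\
& u_i \ge 0, \quad 0 \le x_i \le 1, && \forall\, i \in [m]
\end{aligned}$$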
Before stating our theorems, let’s parse this LP. $q_i$ is a constant representing the probability that the buyer has value $v_i$ (also a constant). $x_i$ is a variable representing the average probability that the bidder gets the item with value $v_i$, and $u_i$ is a variable representing the average utility of the bidder when having value $v_i$. Therefore, this bidder’s average value is $v_i \cdot x_i$, the average price they pay is $v_i \cdot x_i - u_i$, and the objective function is simply the average revenue. The second constraints are just normalization, ensuring that everything lies in $[0,1]$. The first line of constraints are the interesting ones. These look a lot like IC constraints that a truthful auction must satisfy, but something’s missing: the LHS is clearly the utility of the buyer with value $v_i$ for “telling the truth,” but the utility of the buyer for “reporting $v_j$ instead” is $u_j + (v_i - v_j) \cdot x_j$. So the $u_j$ term is missing on the RHS.
Let’s also see a very brief proof outline for why no seller can extract more revenue than :

1. Because the buyer has no regret conditioned on having value , their utility is at least as high as playing arm every round.

2. Because the auction never charges arm more than (conditioned on awarding the item), the buyer’s utility for playing arm every round is at least , where is the average probability that arm awards the item.

3. Because the auction is monotone, and the buyer never considers overbidding, if the buyer gets the item with probability conditioned on having value , we must have .
These three facts together show that no seller can extract more than against a no-regret buyer who doesn’t overbid. Observe also that step 3 is exactly the step that doesn’t hold for buyers who consider overbidding (and is exactly what’s violated in our example in Section 3.1): if the buyer ever overbids, then they might receive the item with higher probability than had they just played their own arm every round.
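The LP objective can be evaluated by hand for small instances. The following is a minimal sketch, assuming the LP takes the form described above (utility lower bounds of the form $(v_i - v_j) x_j$ for lower values $v_j$, plus non-negativity). All function names and the three-point distribution are hypothetical and chosen only for illustration. For a fixed allocation vector `x`, the objective is largest when each utility is set to the smallest value the constraints allow, so the sketch computes exactly that.

```python
from fractions import Fraction


def smallest_utilities(values, x):
    """Smallest u_i satisfying u_i >= 0 and u_i >= (v_i - v_j) * x_j for all j < i."""
    utilities = []
    for i, v_i in enumerate(values):
        lower_bounds = [(v_i - values[j]) * x[j] for j in range(i)]
        utilities.append(max([Fraction(0)] + lower_bounds))
    return utilities


def lp_objective(values, probs, x):
    """Average revenue sum_i q_i * (v_i * x_i - u_i) at the smallest feasible utilities."""
    u = smallest_utilities(values, x)
    return sum(q * (v * xi - ui) for q, v, xi, ui in zip(probs, values, x, u))


def monopoly_revenue(values, probs):
    """Best posted price: price values[k] sells whenever the value is at least values[k]."""
    return max(v * sum(probs[k:]) for k, v in enumerate(values))


def expected_welfare(values, probs):
    return sum(q * v for q, v in zip(probs, values))


# Hypothetical instance: a three-point equal-revenue distribution (values sorted ascending).
values = [Fraction(1, 4), Fraction(1, 2), Fraction(1)]
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

# A feasible allocation vector that serves the lowest value only part of the time.
x = [Fraction(2, 3), Fraction(1), Fraction(1)]

print(monopoly_revenue(values, probs))   # 1/4
print(lp_objective(values, probs, x))    # 7/24
print(expected_welfare(values, probs))   # 1/2
```

On this instance the feasible point already yields 7/24, strictly between the monopoly revenue 1/4 and the expected welfare 1/2. Exact rational arithmetic via `fractions.Fraction` avoids floating-point noise in such comparisons.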
Theorem 3.3.
Any strategy for the seller achieves revenue at most against a buyer running a no-regret algorithm who overbids with probability .
Theorem 3.4.
For any constant , there exists a strategy for the seller that gets revenue at least against a buyer running a mean-based algorithm who overbids with probability . The strategy sets a decreasing cutoff and for all awards the item with probability to any bid for price , and with probability to any bid .
Theorem 3.5.
For distributions supported on , , and there exist supported on such that . For this same , . (Footnote 4: The promised is the equal-revenue curve truncated at .)
3.4 A Final Note on the Example
While reading through our examples, the reader may think that the mean-based learner’s behavior is clearly irrational: why would you continue paying above your value? Why would you continue paying more than necessary, when you can safely get the item for less?
But this is exactly the point: a more thoughtful learner can indeed do better (for instance, by using the algorithm of Section 3.2). It is also perhaps misleading to believe that the bidder should “obviously” stop overpaying: we only know this because we know the structure of the example. But in principle, how is the bidder supposed to know that the overcharged rounds are the new norm and not an anomaly? Given that most standard no-regret algorithms are mean-based, it’s important to nail down the seller’s options for exploiting this behavior.
4 Conclusion and Future Directions
Motivated by the prevalence of bidders no-regret learning to play non-truthful auctions in practice [NST15], we consider a revenue-maximizing seller with a single item (each round) to sell to a single buyer. We show that when the buyer uses mean-based algorithms like EXP3, the seller can extract revenue equal to the expected welfare with an unnatural auction. We then provide a modified no-regret algorithm such that the seller cannot extract revenue exceeding the monopoly revenue when the buyer bids according to . Finally, we consider a mean-based buyer who never overbids. We tightly characterize the seller’s optimal revenue with a linear program, and show that a pay-your-bid auction with decreasing reserves over time achieves this guarantee. Moreover, we show that the mean-based revenue can be unboundedly better than the monopoly revenue while simultaneously worse than the expected welfare. In particular, for the equal revenue curve truncated at , the monopoly revenue is , the expected welfare is , and the mean-based revenue is .
While our work has already shown the single-buyer problem is quite interesting, the most natural direction for future work is understanding revenue maximization with multiple learning buyers. Of our three main results, only Theorem 3.2 extends easily (that if every buyer uses our modified learning, the default strategy, which now runs Myerson’s optimal auction every round, is optimal; see Theorem A.5 for details). Our work certainly provides good insight into the multi-bidder problem, but there are still clear barriers. For example, in order to obtain revenue equal to the expected welfare, the auction must necessarily also maximize welfare. In our single-bidder model, this means that we can give away the item for free for rounds, but with multiple bidders, such careless behavior would immediately make it impossible to achieve the optimal welfare. Regarding the mean-based revenue, while there is a natural generalization of our LP to multiple bidders, it’s no longer clear how to achieve this revenue with a critical auction, as all the relevant variables now implicitly depend on the actions of the other bidders. These are just examples of concrete barriers, and there are likely interesting conceptual barriers for this extension as well.
Another interesting direction is understanding the consequences of our work from the perspective of the buyer. Aside from certain corner configurations (e.g. the seller extracting the buyer’s full welfare), it’s not obvious how the buyer’s utility changes. For instance, is it possible that the buyer’s utility actually increases as the seller switches from the default strategy to the optimal mean-based revenue? Does the buyer ever benefit from using an “exploitable” learning strategy, so that the seller can exploit it and make them both happier?
References
 [ACBFS03] Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48–77, January 2003.
 [ADH16] Itai Ashlagi, Constantinos Daskalakis, and Nima Haghpanah. Sequential mechanisms with ex-post participation guarantees. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, Maastricht, The Netherlands, July 24-28, 2016, pages 213–214, 2016.
 [AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(6):121–164, 2012.

 [BC12] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
 [BM07] Avrim Blum and Yishay Mansour. From external to internal regret. Journal of Machine Learning Research, 8:1307–1324, 2007.
 [CD17] Yang Cai and Constantinos Daskalakis. Learning multi-item auctions with (or without) samples. In FOCS, 2017.
 [Cla71] Edward H. Clarke. Multipart Pricing of Public Goods. Public Choice, 11(1):17–33, 1971.
 [CR14] Richard Cole and Tim Roughgarden. The sample complexity of revenue maximization. In Proceedings of the Forty-sixth Annual ACM Symposium on Theory of Computing, STOC ’14, pages 243–252, New York, NY, USA, 2014. ACM.
 [DHL17] Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E. Schapire, Vasilis Syrgkanis, and Jennifer Wortman Vaughan. Oracle-efficient learning and auction design. In FOCS, 2017.
 [DHP16] Nikhil R. Devanur, Zhiyi Huang, and Christos-Alexandros Psomas. The sample complexity of auctions with side information. In Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, STOC ’16, pages 426–439, New York, NY, USA, 2016. ACM.
 [DPS15] Nikhil R. Devanur, Yuval Peres, and Balasubramanian Sivan. Perfect bayesian equilibria in repeated sales. In Proceedings of the Twenty-sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’15, pages 983–1002, Philadelphia, PA, USA, 2015. Society for Industrial and Applied Mathematics.
 [DS16] Constantinos Daskalakis and Vasilis Syrgkanis. Learning in auctions: Regret is hard, envy is easy. In IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 219–228, 2016.
 [DW12] Constantinos Daskalakis and S. Matthew Weinberg. Symmetries and Optimal Multi-Dimensional Mechanism Design. In the 13th ACM Conference on Electronic Commerce (EC), 2012.
 [GN17] Yannai A. Gonczarowski and Noam Nisan. Efficient empirical revenue maximization in single-parameter auction environments. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, pages 856–868, New York, NY, USA, 2017. ACM.
 [Gro73] Theodore Groves. Incentives in Teams. Econometrica, 41(4):617–631, 1973.
 [Han57] James Hannan. Approximation to bayes risk in repeated play. In Contributions to the Theory of Games, pages 3:97–139, 1957.
 [ILPT17] Nicole Immorlica, Brendan Lucier, Emmanouil Pountourakis, and Samuel Taggart. Repeated sales with multiple strategic buyers. In Proceedings of the 2017 ACM Conference on Economics and Computation, pages 167–168. ACM, 2017.
 [KV02] Adam Kalai and Santosh Vempala. Geometric algorithms for online optimization. In Journal of Computer and System Sciences, pages 26–40, 2002.
 [KV05] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291–307, October 2005.
 [LP17] Siqi Liu and Christos-Alexandros Psomas. On the competition complexity of dynamic mechanism design. CoRR, abs/1709.07955, 2017.

 [LZ08] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 817–824. Curran Associates, Inc., 2008.
 [MLTZ16a] Vahab S. Mirrokni, Renato Paes Leme, Pingzhong Tang, and Song Zuo. Dynamic auctions with bank accounts. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 387–393, 2016.
 [MLTZ16b] Vahab S. Mirrokni, Renato Paes Leme, Pingzhong Tang, and Song Zuo. Optimal dynamic mechanisms with ex-post IR via bank accounts. CoRR, abs/1605.08840, 2016.
 [MR15] Jamie Morgenstern and Tim Roughgarden. The pseudo-dimension of near-optimal auctions. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pages 136–144, Cambridge, MA, USA, 2015. MIT Press.
 [MR16] Jamie Morgenstern and Tim Roughgarden. Learning simple auctions. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1298–1318, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
 [Mye81] Roger B. Myerson. Optimal Auction Design. Mathematics of Operations Research, 6(1):58–73, 1981.
 [NST15] Denis Nekipelov, Vasilis Syrgkanis, and Eva Tardos. Econometrics for learning agents. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC ’15, pages 1–18, New York, NY, USA, 2015. ACM.
 [PPPR16] Christos Papadimitriou, George Pierrakos, Christos-Alexandros Psomas, and Aviad Rubinstein. On the complexity of dynamic mechanism design. In Proceedings of the Twenty-seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’16, pages 1458–1475, Philadelphia, PA, USA, 2016. Society for Industrial and Applied Mathematics.
 [Rou12] Tim Roughgarden. The price of anarchy in games of incomplete information. In Proceedings of the 13th ACM Conference on Electronic Commerce, EC ’12, pages 862–879, New York, NY, USA, 2012. ACM.
 [ST13] Vasilis Syrgkanis and Eva Tardos. Composable and efficient mechanisms. In Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, STOC ’13, pages 211–220, New York, NY, USA, 2013. ACM.
 [Vic61] William Vickrey. Counterspeculations, Auctions, and Competitive Sealed Tenders. Journal of Finance, 16(1):8–37, 1961.
Appendix A Good no-regret algorithms for the buyer
In this section we show that there exists a (contextual) no-regret algorithm for the buyer which guarantees that the seller receives at most the Myerson revenue per round (i.e., in total). As mentioned earlier, it does not suffice for the buyer to simply run the contextualization for some no-regret learning algorithm (and in fact, if is mean-based, the seller can extract strictly more than , as we will see later). However, by modifying so that it has not just no-regret with respect to the best stationary policy, but so that it additionally does not regret playing as if it had some other context, we obtain a no-regret algorithm for the buyer which guarantees the seller receives no more than per round.
The details of the algorithm are presented in Algorithm 1. Recall that the distribution is supported over values , where for each , has probability under . The algorithm takes a no-regret algorithm for the classic multi-armed bandit problem, and runs instances of it, one per possible value . Each instance of learns not only over the possible actions, but also over virtual actions corresponding to values through . Picking the virtual action associated with corresponds to the buyer pretending they have value , and playing accordingly (i.e., querying ).
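The recursive structure just described (one bandit instance per value, each with extra virtual arms for lower values) can be sketched as follows. This is an illustrative sketch and not a transcription of Algorithm 1. It assumes EXP3 as the base no-regret learner, assumes that only the instance for the buyer's true value is updated each round, and rescales utilities from [-1, 1] to [0, 1]. The class names, the bid grid, and the fixed-reserve pay-your-bid auction in `simulate` are all hypothetical.

```python
import math
import random


class Exp3:
    """Standard EXP3 bandit learner for rewards in [0, 1] (log-weights avoid overflow)."""

    def __init__(self, n_arms, gamma, rng):
        self.n_arms = n_arms
        self.gamma = gamma
        self.rng = rng
        self.log_weights = [0.0] * n_arms

    def probabilities(self):
        top = max(self.log_weights)
        weights = [math.exp(lw - top) for lw in self.log_weights]
        total = sum(weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms for w in weights]

    def sample(self):
        return self.rng.choices(range(self.n_arms), weights=self.probabilities())[0]

    def update(self, arm, reward):
        # Importance-weighted reward estimate for the arm that was pulled.
        estimate = reward / self.probabilities()[arm]
        self.log_weights[arm] += self.gamma * estimate / self.n_arms


class ContextSwapBuyer:
    """One learner per value index; learner i has n_bids bid arms plus i value arms."""

    def __init__(self, n_values, n_bids, gamma=0.1, seed=0):
        self.n_bids = n_bids
        rng = random.Random(seed)
        self.instances = [Exp3(n_bids + i, gamma, rng) for i in range(n_values)]

    def choose(self, value_index):
        """Return (arm pulled by the top-level instance, bid arm finally played)."""
        top_arm = self.instances[value_index].sample()
        arm = top_arm
        while arm >= self.n_bids:
            pretended = arm - self.n_bids  # value arm: a strictly lower value index
            arm = self.instances[pretended].sample()
        return top_arm, arm

    def update(self, value_index, top_arm, utility):
        # Assumption: utility lies in [-1, 1]; map it to [0, 1] for EXP3.
        self.instances[value_index].update(top_arm, (utility + 1) / 2)


def simulate(rounds=200, seed=0):
    """Total revenue of a hypothetical pay-your-bid auction with a fixed reserve."""
    values = [0.25, 0.5, 1.0]
    bids = [0.0, 0.25, 0.5, 0.75, 1.0]
    reserve = 0.5
    env = random.Random(seed + 1)
    buyer = ContextSwapBuyer(len(values), len(bids), seed=seed)
    revenue = 0.0
    for _ in range(rounds):
        i = env.randrange(len(values))
        top_arm, bid_arm = buyer.choose(i)
        wins = bids[bid_arm] >= reserve
        price = bids[bid_arm] if wins else 0.0
        buyer.update(i, top_arm, (values[i] - price) if wins else 0.0)
        revenue += price
    return revenue
```

Because every value arm points to a strictly lower value index, the loop in `choose` makes at most `value_index` hops before reaching a bid arm.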
This algorithm is very similar in structure to the construction of a low swap-regret bandits algorithm from a generic no-regret bandits algorithm (see [BM07]). The main difference is that whereas swap regret guarantees no-regret with respect to swapping actions (i.e. always playing action instead of action ), this algorithm guarantees no-regret with respect to swapping contexts (i.e., always pretending you have context when you actually have context ). In addition, the auction structure of our problem allows us to only consider contexts with valuations smaller than our current valuation ; this puts a limit of on the number of recursive calls per round, as opposed to the low swap regret algorithm where one must solve for the stationary distribution of a Markov chain over states each round.

We now proceed to show that Algorithm 1 has our desired guarantees.
Theorem A.1.
Let . If the buyer plays according to Algorithm 1 then the seller (even if they play an adaptive strategy) receives no more than revenue.
Proof.
For each , define to be the expected number of rounds the buyer receives the item when they have value . For each define to be the expected total payment from the buyer to the seller when the buyer has value . Our goal is to upper bound , the total revenue the seller receives.
Recall that every strategy must contain a zero option in its menu, where the buyer pays nothing and doesn’t receive the item (and hence receives zero utility). Since each is a no-regret algorithm, we know that the buyer does not regret always choosing the zero option when they have value . It follows that, for all , we have that
(1) 
The following lemma shows that when , the buyer does not regret pretending to have value when they have value .
Lemma A.2.
For all ,
Proof.
From the algorithm, we know that does not regret always playing the value arm corresponding to . We define the following notation. For all and any history of rounds (including for each round which option is chosen and the utility of that round), define to be the probability of getting the item in round given history when the buyer has value and define to be the expected price paid in round when the buyer has value given history .
Let be the distribution of histories at round , for . The no-regret guarantee tells us that
(2) 
Note that
Dividing (2) through by and substituting in these relations, we arrive at the statement of the lemma. ∎
Now define , and define
(3) 
(5) 
We will argue from these constraints that . To do this, we will construct a single-round mechanism for selling an item to a buyer with value distribution such that this mechanism has expected revenue ; the result then follows from the optimality of the Myerson mechanism ([Mye81]).
To construct this mechanism, first find a sequence of indices via the following algorithm.
It is easy to verify that following this algorithm results in . For any (assuming ), .
Lemma A.3.
For a bidder with value distribution , the following menu of options will achieve revenue at least : for each , the buyer has the choice of paying , and receiving the item with probability .
Proof.
Consider some value in . We will show that the buyer with value will pay at least , thus proving the lemma. Assume .
We have (from (5) and the monotonicity of ) that
This means the buyer with value receives nonnegative utility by choosing option . For any , we have (from (4)) that
Since , the above inequality implies that
It follows that
This means the buyer with value prefers option to all options . Therefore this buyer will choose an option from . Since , we know that this buyer will pay at least , as desired. ∎
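The lemma reasons about which menu option a utility-maximizing buyer selects. Below is a minimal sketch of that selection rule, assuming each option is a (price, probability) pair and that a zero option (pay nothing, receive nothing) is always available. The function names, the tie-breaking rule (earliest option wins), and the example menu are assumptions made purely for illustration.

```python
from fractions import Fraction


def chosen_option(value, menu):
    """Return the (price, probability) pair maximizing value * probability - price.

    The zero option is always included; ties go to the earliest option in the list.
    """
    zero_option = (Fraction(0), Fraction(0))
    options = [zero_option] + list(menu)
    return max(options, key=lambda option: value * option[1] - option[0])


def menu_revenue(values, probs, menu):
    """Expected payment when each value type picks its utility-maximizing option."""
    return sum(q * chosen_option(v, menu)[0] for v, q in zip(values, probs))


# Hypothetical two-option menu and two-point value distribution.
menu = [(Fraction(1, 4), Fraction(1, 2)), (Fraction(2, 3), Fraction(1))]
values = [Fraction(3, 5), Fraction(1)]
probs = [Fraction(1, 2), Fraction(1, 2)]

print(chosen_option(values[0], menu))      # the lower type picks the lottery option
print(chosen_option(values[1], menu))      # the higher type picks the sure option
print(menu_revenue(values, probs, menu))   # 11/24
```

In this example the lower type prefers the cheaper lottery (utility 1/20) and the higher type prefers the sure allocation (utility 1/3), so each type pays at least the price of the option designed for it.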
It follows from the optimality of the Myerson auction that , and therefore that . Expanding out via (3), we have that
from which the theorem follows. ∎
We can remove the explicit dependence on by filtering out all values which occur with small enough probability.
Corollary A.4 (Restatement of Theorem 3.2).
There exists a no-regret algorithm for the buyer where the seller receives no more than revenue.
Proof.
Ignore all values with (whenever a round with this value arises, choose an arbitrary action for this round). There are total values, so this happens with at most probability , and therefore modifies the regret and revenue in expectation by at most .
The regret bound from Theorem A.1 then holds with , from which the result follows. ∎
A.1 Multiple bidders
Interestingly, we show that by slightly modifying Algorithm 1, we obtain an algorithm (Algorithm 2) that works for the case where there are multiple bidders. In the multiple bidder setting, there are bidders with independent valuations for the item. Each round , bidder receives a value for the item drawn from a distribution (independently of all other values). Each distribution is supported over values, , where occurs under with probability . Every round each bidder submits a bid , and the auctioneer decides on an allocation rule , which maps tuples of bids to tuples of probabilities and a pricing rule , which maps tuples of bids to tuples of prices . The allocation rule must additionally obey the supply constraint that . Bidder wins the item with probability and pays .
We show that if every bidder plays the no-regret algorithm Algorithm 2, then the auctioneer (even if playing adaptively) is guaranteed to receive no more than revenue, where is the optimal revenue obtainable by an auctioneer selling a single item to bidders with valuations drawn independently from distributions . In other words, if every bidder plays according to Algorithm 2, the seller can do nothing better than running the single-round optimal Myerson auction every round.
The only difference between Algorithm 1 and Algorithm 2 is that instance in Algorithm 2 has a value arm for every possible value, not only the values less than . This means that the recursion depth of this algorithm is potentially unlimited; however, it will still terminate in finite expected time since we insist that has a positive probability of picking any arm (in particular, it will eventually pick a bid arm). We can optimize the runtime of step 11 of Algorithm 2 by eliciting a probability distribution over arms from each instance , constructing a Markov chain, and solving for the stationary distribution. This takes time per step of this algorithm.

Theorem A.5.
Let . If every bidder plays according to Algorithm 2 then the auctioneer (even if they play an adaptive strategy) receives no more than revenue.
Proof.
As before, let equal the expected number of rounds bidder receives the item while having value , and let equal the expected total amount bidder pays to the auctioneer while having value . Again, our goal is to upper bound , the total expected revenue the seller receives.
Note that, as before, since every strategy contains a zero option in its menu, we have that (for all and )
(6) 
Repeating the argument of Lemma A.2 (which still holds in the multiple bidder setting), we additionally have that (for all and ),
(7) 
We will now (as in the proof of Theorem A.1) construct a mechanism for the single-round instance of the problem of an auctioneer selling a single item to bidders with valuations independently drawn from . Our mechanism will work as follows:

1. The auctioneer will begin by asking each of the bidders for their valuations. Assume that bidder reports valuation (we will insist that belongs to the support of ).

2. The auctioneer will then sample a uniformly at random.

3. For each bidder , the auctioneer will calculate and , the expected allocation probability and price bidder has to pay in round of the dynamic round mechanism, conditioned on for all .

4. The auctioneer will then give the item to bidder with probability , and charge bidder a price .

Note that since the allocation rules must always satisfy the supply constraint, the probabilities we sample also obey this supply constraint, and therefore this is a valid mechanism for the single-round problem. We will now show it is approximately incentive compatible.
Lemma A.6.
Mechanism is Bayesian incentive compatible and ex-interim individually rational.
Proof.
To begin, we claim that if bidder reports valuation (and everyone else reports truthfully), then the expected probability bidder receives the item (under this single-round mechanism) is equal to . Likewise, we claim that, if bidder reports valuation (and everyone else reports truthfully), the expected payment bidder pays is equal to .
To see why this is true, let equal the probability bidder gets the item (in the multiround mechanism) at time conditioned on for all . By construction, the probability bidder receives the item (in mechanism ) after reporting valuation is equal to
On the other hand, we can write in terms of our function a
It follows that . A similar calculation shows that if is the expected payment of bidder (if they report valuation and everyone else reports truthfully), then .
Now, recall that a mechanism is BIC if misreporting your value increases your expected utility by at most (assuming everyone else reports truthfully). To show that mechanism is BIC, it therefore suffices to show that for all ,
But for , this follows from equation (7). Similarly, is ex-interim IR if for all ,
Again, this follows from equation (6), and the result therefore follows. ∎
We now apply the following lemma from [DW12], which lets us transform an BIC mechanism into a BIC mechanism at the cost of revenue.
Lemma A.7.
If is an BIC, ex-interim IR mechanism for selling a single item to several bidders with independent valuations, then there exists a BIC, ex-interim IR mechanism for the same problem that satisfies .
Proof.
See Theorem 3.3 in [DW12]. ∎
Applying Lemma A.7 to our mechanism, we obtain a mechanism that satisfies . Finally, note that since the Myerson auction is the optimal Bayesian incentive compatible mechanism for this problem, . On the other hand, since (from the proof of Lemma A.6) the expected payment bidder pays under mechanism when being truthful is equal to:
It follows that