Introduction
In recent years, numerous sharing economy platforms with a variety of goods and services have emerged. These platforms are shaped by users that primarily act in their own interest to maximize their utility. However, such behavior might interfere with the usefulness of the platforms. For example, users of mobility sharing systems typically prefer to drop off rentals at the location in closest proximity, while a more balanced distribution would allow the mobility sharing service to operate more efficiently.
Undesirable user behavior in the sharing economy is in many cases even self-reinforcing. For example, users in the apartment rental marketplace Airbnb are less likely to select infrequently reviewed apartments and are therefore unlikely to provide reviews for these apartments [Fradkin2014]. This is also reflected in the distribution of reviews: in many cities, a small share of apartments accounts for a disproportionately large share of customer reviews (data from insideairbnb.com).
Such dynamics create a need for platforms in the sharing economy to actively engage users to shape demand and improve efficiency. Several previous papers have proposed the idea of using monetary incentives to encourage desirable behavior in such systems. One example is [Frazier et al.2014], who studied the problem in a multi-armed bandit setting, where a principal (e.g. a marketplace) attempts to maximize utility by incentivizing agents to explore arms other than the myopically preferred one. In their setting, the optimal amount is known to the system, and the main goal is to quantify the required payments to achieve an optimal policy with myopic agents. The idea of shaping demand through monetary incentives in the sharing economy has also been tested in practice. For example, [Singla et al.2015] use monetary incentives to encourage users of bike sharing systems to return bikes at beneficial locations, making automatic offers through the bike sharing app.
In this context, an important question is what amounts a platform should offer to maximize its utility. [Singla et al.2015] introduce a simple protocol for learning optimal incentives in the bike sharing system to make users switch from the preferred station to a more beneficial one, ignoring information about specific switches and additional context. Building on these ideas, we explore a general online learning protocol for efficiently learning optimal incentives.
Our Contributions
We provide the following main contributions in this paper:

Structural information: We consider structural information in user preferences to speed up learning of incentives, and provide a general framework to model structure across tasks via convex constraints. Our algorithm, Coordinated Online Learning (CoOL), is also of interest for related multitask learning problems.

Computational efficiency: We introduce two novel ideas of sporadic and approximate projections to increase the computational efficiency of our algorithm. We derive formal guarantees on the performance of the CoOL algorithm and achieve no-regret bounds in this setting.

User study on Airbnb: We collect a unique data set through a user study with apartments on Airbnb and test the viability and benefit of the CoOL algorithm on this dataset.
Preliminaries
In the following, we introduce the general problem setting of this paper.
Platform. We investigate a general platform in the sharing economy, such as the apartment rental marketplace Airbnb. On this platform, users can choose from a set of goods and services, referred to as items. Each arriving user chooses an item, and if the user buys that item, the platform gains the utility associated with it.
Incentivizing exploration. The user's initial choice might not maximize the platform's utility, and the platform might be interested in offering a different item with a higher utility instead. For example, the alternative could be an infrequently reviewed item that the platform wants to explore. To motivate the user to select the alternative item, the platform can offer an incentive, for example in the form of a monetary discount on that item. The user accepts or rejects the offer depending on a private cost of switching: the user accepts the offer if the incentive is at least as large as the private cost and rejects the offer otherwise. If the user accepts the offer, the utility gain of the platform is the utility of the alternative item minus the utility of the initial choice and the incentive paid.
Objective. In this setting, two tasks need to be solved to achieve a high utility gain: finding good switches (pairs of an initial and an alternative item) and finding good incentives. Good switches are those for which the achievable utility gain is positive. To realize a positive utility gain, the offer needs to be greater than or equal to the user's private cost, since the offer would otherwise be rejected.
In this paper, we focus on learning optimal incentives over time, while the platform chooses relevant switches independently.
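The acceptance rule and the resulting utility gain can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `Offer`, `user_accepts`, and `utility_gain` are hypothetical, and two modeling choices are assumptions, namely that the gain of an accepted offer is the utility difference between the two items minus the incentive, and that a rejected offer yields zero gain because the user keeps the initial choice.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    utility_initial: float      # platform utility of the user's initial choice
    utility_alternative: float  # platform utility of the suggested alternative
    incentive: float            # discount offered for switching


def user_accepts(incentive: float, private_cost: float) -> bool:
    """The user switches only if the incentive covers the private cost."""
    return incentive >= private_cost


def utility_gain(offer: Offer, private_cost: float) -> float:
    """Platform's gain from one offer (assumed to be zero when rejected)."""
    if not user_accepts(offer.incentive, private_cost):
        return 0.0
    return offer.utility_alternative - offer.utility_initial - offer.incentive


if __name__ == "__main__":
    offer = Offer(utility_initial=0.0, utility_alternative=40.0, incentive=30.0)
    print(utility_gain(offer, private_cost=25.0))  # accepted offer
    print(utility_gain(offer, private_cost=35.0))  # rejected offer
```

The sketch makes the trade-off explicit: an incentive below the private cost earns nothing, while every unit offered above the private cost reduces the gain, which is why the platform wants to learn incentives close to the private cost.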
Methodology
In this section, we present our methodology for learning optimal incentives, starting with a single pair of items. We allow for natural constraints on the incentive by restricting it to a feasible set that is convex and non-empty. For example, the incentive might be bounded from below by a minimum value and from above by the maximum discount that the platform is willing to offer.
Single Pair of Items
We consider the popular algorithmic framework of online convex programming (OCP) [Zinkevich2003] to learn optimal incentives for a single pair of items. The OCP algorithm is a gradient-descent-style algorithm that updates the incentive with an adaptive learning rate and performs a projection after every gradient step to maintain feasibility within the constraints. We keep track of the number of times a pair of items has been observed, which determines the learning rate. To measure the performance of the algorithm, we use a loss that captures the difference between the optimal prediction and the prediction provided by the algorithm. (Note that this loss function is non-convex; a convex version is presented in the case study.)
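A minimal sketch of such a projected gradient update for a single pair of items is given below. It rests on assumptions that are not taken from the text: the learning rate is set to 1/sqrt(t) after t observations (a standard choice in online convex programming), the feasible set is an interval so that the projection reduces to clipping, and the class name `SinglePairLearner` is hypothetical.

```python
import math


class SinglePairLearner:
    """Projected online gradient descent for one pair of items (sketch)."""

    def __init__(self, lower: float, upper: float, initial: float):
        self.lower = lower
        self.upper = upper
        self.count = 0  # number of times this pair has been observed
        self.incentive = self._project(initial)

    def _project(self, value: float) -> float:
        # Projection onto the interval [lower, upper] is simple clipping.
        return min(max(value, self.lower), self.upper)

    def update(self, gradient: float) -> float:
        """Take one gradient step, then project back onto the feasible set."""
        self.count += 1
        rate = 1.0 / math.sqrt(self.count)  # assumed learning-rate schedule
        self.incentive = self._project(self.incentive - rate * gradient)
        return self.incentive


if __name__ == "__main__":
    learner = SinglePairLearner(lower=0.0, upper=50.0, initial=20.0)
    print(learner.update(-4.0))    # negative gradient raises the incentive
    print(learner.update(-100.0))  # large step is clipped to the upper bound
```

Because the learning rate shrinks with the number of observations, early feedback moves the incentive quickly while later feedback only fine-tunes it.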
Multiple Pairs of Items
We now relax the assumption of a fixed pair of items and return to our original problem of learning optimal incentives for multiple pairs of items, i.e. the algorithm receives a specific pair of items as input for each user. If we consider all items on the platform, the total number of pairs grows quadratically with the number of items.
For learning the optimal incentive for each pair of items, the algorithm maintains a separate learning rate for each pair, determined by the number of times that pair has been observed, and performs one gradient update step using Algorithm 1. We refer to this straightforward adaptation of the OCP algorithm as Independent Online Learning (IOL) and use this algorithm as a baseline for our analysis. Using regret bounds of [Zinkevich2003] and denoting the number of pairs of items as , we can upper bound the regret of IOL as
(2) 
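The IOL baseline can be sketched as one independent learner per ordered pair of items. The code below is an illustration under assumptions: the class name `IndependentOnlineLearner` is hypothetical, each pair uses a learning rate of 1/sqrt(t) based on its own observation count, and the feasible set is an interval shared by all pairs.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Pair = Tuple[int, int]  # (initial item, suggested alternative)


class IndependentOnlineLearner:
    """Keeps a separate incentive and observation count per pair (sketch)."""

    def __init__(self, lower: float, upper: float, initial: float):
        self.lower = lower
        self.upper = upper
        self.incentives: Dict[Pair, float] = defaultdict(lambda: initial)
        self.counts: Dict[Pair, int] = defaultdict(int)

    def predict(self, pair: Pair) -> float:
        return self.incentives[pair]

    def update(self, pair: Pair, gradient: float) -> float:
        # Only the observed pair is updated; all other pairs are untouched.
        self.counts[pair] += 1
        rate = 1.0 / math.sqrt(self.counts[pair])
        stepped = self.incentives[pair] - rate * gradient
        self.incentives[pair] = min(max(stepped, self.lower), self.upper)
        return self.incentives[pair]


if __name__ == "__main__":
    iol = IndependentOnlineLearner(lower=0.0, upper=50.0, initial=20.0)
    iol.update((0, 1), gradient=-4.0)
    print(iol.predict((0, 1)), iol.predict((1, 0)))
```

Feedback on the pair (0, 1) leaves the pair (1, 0) at its initial value, which illustrates why this baseline learns slowly when the number of pairs is large.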
Structural Information
In a real-world setting, incentives for different pairs of items typically are not independent, and in some cases, certain structural information may help to speed up learning of optimal incentives. In the following, we discuss several relevant types of structural information.
Independent learning. In this baseline setting each pair of items is learned individually. Thus, the number of incentives that need to be learned grows quadratically with the number of items on a platform. While applicable for a small number of items, this approach is not favorable on typical platforms in the sharing economy.
Shared learning. Another commonly studied setting is shared learning. In this setting, all pairs of items are considered equivalent, and only one global incentive is learned. While allowing the platform to learn about many pairs of items at the same time, this approach fails to consider natural asymmetries in the problem. For example, the required incentive for switching from one item to another often differs from the required incentive for switching in the opposite direction, as can also be observed in the case study of this paper.
Metric/hemimetric structure. Assuming that the required incentives are related to the dissimilarity of the two items in a pair, metrics are a natural choice to model structural dependencies, as they capture the property of triangle inequalities in dissimilarity functions. However, incentives for pairs of items are not necessarily required to be symmetric. For example, the required incentive for switching from a highly reviewed apartment on Airbnb to one without reviews is likely higher than vice versa. Therefore, we use hemimetrics, a relaxed form of metrics that satisfies only non-negativity constraints and triangle inequalities, capturing asymmetries in preferences (cf. [Singla, Tschiatschek, and Krause2016]). The usefulness of the hemimetric structure for learning optimal incentives is demonstrated in the experiments.
In the following section, we introduce a general-purpose algorithm for learning with structural information, where the structure is defined by convex constraints on the solution space. The key idea of our algorithm is to coordinate between individual pairs of items by projecting onto the resulting convex set. We generalize our approach for contextual learning, where additional features, such as information about users, may be available. Since projecting onto convex sets may be computationally expensive, we further extend our analysis to allow projections to be sporadic (i.e. only after certain gradient steps) and approximate (i.e. with some error compared to the optimal projection).
Learning with Structural Information
We begin this section by introducing a general framework for specifying structural information via convex constraints. We treat each pair of items as a distinct problem. Each problem may be associated with additional features, for example information about the current user. As is common in online learning, we maintain a weight vector for each problem, with the same dimension as the feature vector, to learn optimal incentives. The prediction is equal to the inner product between the weight vector and the feature vector. In the previous section, we described the special case of a one-dimensional weight vector and a unit feature vector, in which the weight itself is the predicted incentive.
Specifying Structure via Convex Constraints
Similar to constraints on , we allow for convex constraints on , such that . We assume is a convex, non-empty, and compact set, where is the Euclidean norm of the solution space (the Euclidean norm is used throughout, unless otherwise specified). Further, we assume for some constant . We denote the joint solution space of the problems as and define as the concatenation of the problem-specific weight vectors at time , i.e.
The available structural information is modelled by a set of convex constraints, such that the joint competing weight vector , against which the loss at each round is measured, lies in a convex, nonempty, and closed set , representing a restricted joint solution space, i.e. . In the following, we provide several practical examples of how can be defined.
Independent learning. models the setting where the problems are unrelated/independent.
Shared learning. A shared parameter setting can be modeled as
Instead of sharing all parameters, another common scenario is to share only a few parameters. For a given , sharing parameters across the problems can be modeled as
where denotes the first entries in . This approach is useful for sharing certain parameters that do not depend on the specific problem. For example, in the case of apartments on Airbnb, a shared feature could be the distance between apartments.
Hemimetric structure. To model dissimilarities between items for learning optimal incentives, we use the hemimetric set. Specifically, we use r-bounded hemimetrics, which, in addition to the triangle inequalities, include non-negativity and upper bound constraints. For , the convex set representing r-bounded hemimetrics is given by
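The three families of constraints named above (non-negativity, the upper bound r, and the triangle inequalities) can be checked directly on a matrix of pairwise incentives. The following sketch is an illustration under assumptions: the function name `is_bounded_hemimetric` is hypothetical, the incentives are stored in a square NumPy array whose entry (i, j) is the value for switching from item i to item j, and no symmetry is required.

```python
import numpy as np


def is_bounded_hemimetric(d: np.ndarray, r: float, tol: float = 1e-9) -> bool:
    """Check non-negativity, the upper bound r, and all triangle inequalities."""
    d = np.asarray(d, dtype=float)
    if d.ndim != 2 or d.shape[0] != d.shape[1]:
        return False
    if np.any(d < -tol) or np.any(d > r + tol):
        return False
    # via[i, k, j] = d[i, k] + d[k, j]; the best detour from i to j is the
    # minimum over the intermediate item k.
    via = d[:, :, None] + d[None, :, :]
    best_detour = via.min(axis=1)
    return bool(np.all(d <= best_detour + tol))


if __name__ == "__main__":
    asymmetric = np.array([[0.0, 1.0, 3.0],
                           [2.0, 0.0, 2.0],
                           [1.0, 2.0, 0.0]])
    print(is_bounded_hemimetric(asymmetric, r=3.0))
```

The example matrix is not symmetric (switching from item 0 to item 1 costs 1, the reverse costs 2), yet it satisfies every triangle inequality, which is exactly the kind of asymmetry a metric could not represent.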
Our Algorithm
In the following, we introduce our algorithm, Coordinated Online Learning (CoOL).
Exploiting Structure via Weighted Projections. The CoOL algorithm exploits structural information in a principled way by performing weighted projections onto the restricted solution space, with the weight of each problem determined by how often it has been observed. Intuitively, the weights allow us to learn about problems that have been observed infrequently while avoiding “unlearning” problems that have been observed more frequently. A formal justification for using weighted projections is provided in the extended version of this paper [Hirnschall et al.2018].
We define as a square diagonal matrix of size with each represented times. In the one-dimensional case (), we can write as
(3) 
Using to jointly represent the current weight vectors of all the learners at time (cf. Line 2 in Algorithm 2), we compute the new joint weight vector (cf. Line 2 in Algorithm 2) by projecting onto , using
(4) 
We refer to this as the weighted projection onto . Since is convex and the projection is a special case of the Bregman projection, the projection onto is unique (cf. [CesaBianchi and Lugosi2006, Rakhlin and Tewari2009]).
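For the shared-learning structure, where all problems must have the same value, a weighted projection has a simple closed form: the weighted squared distance is minimized by the weighted mean. The sketch below illustrates the intuition behind the weights under assumptions that are not taken from the text: the function name `project_onto_shared` is hypothetical, the setting is one-dimensional, and the weights are taken to be the square roots of the observation counts.

```python
import numpy as np


def project_onto_shared(values, weights) -> np.ndarray:
    """Weighted projection onto the set where all entries are equal.

    Minimizes sum_i weights[i] * (w - values[i])**2 over a single shared
    value w, whose solution is the weighted mean of the entries.
    """
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    shared = np.dot(weights, values) / np.sum(weights)
    return np.full(values.shape, shared)


if __name__ == "__main__":
    current = np.array([10.0, 30.0])  # incentives of two problems
    counts = np.array([100, 1])       # first problem seen far more often
    print(project_onto_shared(current, np.sqrt(counts)))  # stays near 10
    print(project_onto_shared(current, np.ones(2)))       # plain average
```

With weights that grow with the observation count, the rarely observed problem moves most of the way toward the well-learned one, whereas the unweighted projection would pull the well-learned value halfway toward an essentially uninformed estimate.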
Sporadic and Approximate Projections. For large-scale applications (i.e. large or large ), projecting at every step could be computationally very expensive: a projection onto a generic convex set would require solving a quadratic program of dimension . To allow for computationally efficient updates, we introduce two novel algorithmic ideas: sporadic and approximate projections, defined by the above-mentioned sequences and . Here, denotes the desired accuracy at time and is given as input to Function 3, AProj, for computing approximate projections. This way, the accuracy can be efficiently controlled using the duality gap of the projections. As we shall see in our experimental results, these two algorithmic ideas of sporadic and approximate projections allow us to speed up the algorithm by an order of magnitude while retaining the improvements obtained through the projections.
Algorithm 2, when invoked with , corresponds to a variant of our algorithm with exact projections at every time step. When invoked with , our algorithm corresponds to the IOL baseline.
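The interplay of gradient steps and sporadic projections can be sketched as follows. This is not Algorithm 2 itself but a simplified illustration under stated assumptions: the setting is one-dimensional, the decision to project is drawn at random with a fixed probability `projection_prob`, the projection is exact and supplied by the caller, the projection weights are the square roots of the observation counts, and all names (`coordinated_learning_sketch`, `shared_projection`) are hypothetical. Approximate projections are not covered.

```python
import math
import random
from typing import Callable, Iterable, Tuple

import numpy as np

Projection = Callable[[np.ndarray, np.ndarray], np.ndarray]


def coordinated_learning_sketch(
    stream: Iterable[Tuple[int, Callable[[float], float]]],
    num_problems: int,
    project: Projection,
    projection_prob: float,
    lower: float,
    upper: float,
    initial: float,
    seed: int = 0,
) -> np.ndarray:
    """Per-problem gradient steps with an occasional joint projection."""
    rng = random.Random(seed)
    w = np.full(num_problems, float(initial))
    counts = np.zeros(num_problems)
    for problem, gradient_fn in stream:
        # Gradient step on the observed problem only.
        counts[problem] += 1
        rate = 1.0 / math.sqrt(counts[problem])
        stepped = w[problem] - rate * gradient_fn(w[problem])
        w[problem] = min(max(stepped, lower), upper)
        # Sporadic coordination step across all problems.
        if rng.random() < projection_prob:
            w = project(w, np.sqrt(counts))
    return w


def shared_projection(w: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Weighted projection onto the set where all entries are equal."""
    return np.full_like(w, np.dot(q, w) / np.sum(q))


if __name__ == "__main__":
    stream = [(0, lambda p: -4.0)]
    print(coordinated_learning_sketch(stream, 3, shared_projection, 0.0, 0.0, 50.0, 20.0))
    print(coordinated_learning_sketch(stream, 3, shared_projection, 1.0, 0.0, 50.0, 20.0))
```

With `projection_prob` set to zero the projection is never called and the sketch behaves like the independent baseline; with a value of one it projects after every step, and intermediate values trade coordination for computation.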
Relation to existing approaches. A related algorithm is the AdaGrad algorithm [Duchi, Hazan, and Singer2011], which uses the sum of the magnitudes of past gradients to determine the learning rate at each time step, where larger past gradients correspond to smaller learning rates. A key difference to the CoOL algorithm is that the AdaGrad algorithm enforces exact projections after every iteration. This is particularly problematic for large, complex structures, since projections onto these structures often rely on numerical approximations that are not guaranteed to converge to the exact solution in finite time.
Performance Guarantees and Analysis
In this section, we analyze worst-case regret bounds of the CoOL algorithm against a competing weight vector. The proofs are provided in the extended version of this paper [Hirnschall et al.2018].
General Bounds
We begin with a general result, without assumptions on the projection accuracy and rate.
Theorem 1.
The regret of the CoOL algorithm is bounded by
(R1)  
(R2)  
(R3)  
(R4) 
The regret in Theorem 1 has four components. R1 comes from the standard regret analysis in the OCP framework, R2 comes from sporadic projections, R3 comes from the allowed error in the projections, and R4 is a constant.
Note that when no projections are performed and the learning rate is chosen as in the IOL algorithm, we obtain the same regret bounds as for the IOL algorithm. This also reveals the worst-case nature of the regret bounds of Theorem 1, i.e. the proven bounds for CoOL are agnostic to the specific structure and the order of task instances.
Sporadic/Approximate Projection Bounds
To provide specific bounds for the practically useful setting of sporadic and approximate projections, we introduce and and the user chosen parameters and to control the frequency and accuracy of the projections.
Corollary 1.
Set . , define
where constants , , and . The expected regret (w.r.t. ) of the CoOL algorithm is bounded by
As shown in Corollary 1, projections are required to be more accurate for higher values of . Intuitively, this is required so that already learned weights are not unlearned through inaccurate projections. Using the definitions under Corollary 1, we can prove worst-case regret bounds proportional to for this setting.
Performance Analysis for Hemimetric Structure
We now test the performance of the CoOL algorithm on synthetic data with an underlying hemimetric structure.
Hemimetric projection. To be able to perform weighted projections onto the hemimetric polytope, we use the metric nearness algorithm [Sra, Tropp, and Dhillon2004] as a starting point. For our purposes, three modifications of the algorithm are required: First, we lift the requirement of symmetry to generalize from metrics to hemimetrics. Second, the metric nearness algorithm does not guarantee a solution in the metric set in finite time. However, to calculate the duality gap, the solution is required to be feasible. Thus, we apply the Floyd-Warshall algorithm [Floyd1962] after every iteration to obtain a solution in the hemimetric set. Third, we add weights to the triangle inequalities to allow for weighted projections and further add upper and lower bound constraints.
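The feasibility step of the second modification can be illustrated with a vectorized Floyd-Warshall pass, which replaces every entry by the cheapest path between the two items and thereby enforces all triangle inequalities. The sketch below is an assumption-laden illustration, not the modified metric nearness algorithm: the function name `hemimetric_closure` is hypothetical, entries are first clipped to the interval from 0 to r, the diagonal is set to zero, and the result is a feasible point rather than a (weighted) projection.

```python
import numpy as np


def hemimetric_closure(d, r: float) -> np.ndarray:
    """Return a matrix in the r-bounded hemimetric set via shortest paths."""
    out = np.clip(np.asarray(d, dtype=float), 0.0, r)
    np.fill_diagonal(out, 0.0)  # assumed convention: no cost to stay put
    n = out.shape[0]
    for k in range(n):
        # Allow item k as an intermediate stop: column k plus row k gives
        # the cost of going from i to j through k for every pair (i, j).
        out = np.minimum(out, out[:, k:k + 1] + out[k:k + 1, :])
    return out


if __name__ == "__main__":
    infeasible = np.array([[0.0, 1.0, 5.0],
                           [2.0, 0.0, 2.0],
                           [1.0, 2.0, 0.0]])
    print(hemimetric_closure(infeasible, r=10.0))
```

In the example, the entry for switching from item 0 to item 2 violates the triangle inequality through item 1 and is lowered from 5 to 3, while all other entries are left unchanged. Entries can only decrease in this step, so the upper bound and non-negativity remain satisfied.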
Data structure. To empirically test the performance of the CoOL algorithm on the hemimetric set, we synthetically generate data with and model the underlying structure as a set of bounded hemimetrics with , resulting in problems. We use a simple underlying ground-truth hemimetric , where the items belong to two equal-sized clusters, with if and are from the same cluster and otherwise. The results of our experiment in Figure 1 illustrate the potential runtime improvement using sporadic/approximate projections.
Random order of problems. Problem instances are chosen uniformly at random at every time step. The CoOL algorithm achieves a significantly lower regret than the IOL algorithm, benefiting from the weighted projections onto . At , the regret of CoOL is less than half of that of the IOL, cf. Figure 1(a).
Batches of problems. In the batch setting, a problem instance is chosen uniformly at random, then it is repeated five times before choosing a new problem instance. Compared to the above-mentioned random order, the IOL algorithm suffers a lower regret because of a higher probability that problems are repeatedly shown. Furthermore, the benefit of the projections for the CoOL algorithm is reduced, cf. Figure 1(b), showing that the benefit of the projections depends on the specific order of the problem instances for a given structure.
Single-problem setting. A single problem is repeated in every round. As illustrated, in this case the IOL algorithm and the CoOL algorithm have the same regret, cf. Figure 1(c). In order to get a better understanding of using weights for the weighted projection, we also show a variant uwCoOL that uses the identity matrix as the weight matrix. Unweighted projections or projections with the wrong weights can hinder the convergence of the learners, as shown in Figure 1(c) for this extreme case of a single-problem setting.
Varying the rate of projection (). The regret of the CoOL algorithm monotonically increases as decreases, and is equivalent to the regret of the IOL algorithm at , cf. Figure 1(d). In the range of values between and , the regret of the CoOL algorithm is relatively constant and increases strongly only as approaches . With as low as , the regret of the CoOL algorithm in this setting is still almost half of that of the IOL algorithm.
Varying the accuracy of projection (). The regret of the CoOL algorithm monotonically increases as decreases, and exceeds that of the IOL algorithm for values smaller than because of high errors in the projections, cf. Figure 1(e). In the range of values between and , the regret of the CoOL algorithm is relatively constant and less than half of that of the IOL algorithm.
Runtime vs. approximate projections. As expected, the runtime of the projection monotonically decreases as decreases, cf. Figure 1(f). For values of smaller than , the runtime of the projection is less than of that of the exact projection. Thus, with values in the range of to , the CoOL algorithm achieves the best of both worlds: the regret is significantly smaller than that of IOL, with an order-of-magnitude speed-up in the runtime compared to exact projections.
Airbnb Case Study
To test the viability and benefit of the CoOL algorithm in a realistic setting, we conducted a user study with data from the marketplace Airbnb.
Experimental Setup
We use the following setup in our user study:
Airbnb dataset. Using data of Airbnb apartments from insideairbnb.com, we created a dataset of 20 apartments as follows: we defined four types of apartments in New York City by location (Manhattan or Brooklyn) and number of reviews (high or low). From each type we chose 5 apartments, resulting in a total sample of 20 apartments.
Survey study on MTurk platform. In order to obtain real-world distributions of the users’ private costs, we collected data from Amazon’s Mechanical Turk marketplace. After several introductory questions about their preferences and familiarity with travel accommodations, participants were shown two randomly chosen apartments from the Airbnb dataset. To choose between the apartments, participants were given the price, location, picture, number of reviews and rating of each apartment, as shown in Figure 2. Participants were first asked to select their preferred choice between the two apartments. Next, they were asked to specify their private cost for choosing the other, less preferred apartment instead. The collected data consist of tuples comprising the preferred choice, the suggested alternative, and the private cost of the user.
Sample. In total, we received 943 responses, as summarized in Table 1. The sample for the performance analysis of the CoOL algorithm consists of the responses in which the preferred choice was a frequently reviewed apartment, the suggested alternative was an infrequently reviewed apartment, and participants were willing to explore the infrequently reviewed apartment for a discount (i.e. they did not select NA).
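Preferred choice (reviews) | Suggested alternative (reviews) | Responses | Accepted | Avg. discount (USD)
High | Low | 416 | 77.6% | 29.5
Low | Low | 228 | 83.3% | 28.1
High | High | 219 | 82.2% | 25.4
Low | High | 80 | 81.3% | 25.9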
Utility gain. The utility gain for getting a review for infrequently reviewed apartments is set to in our experiments, based on referral discounts given by Airbnb in the past.
Loss function. As introduced in the methodology section, we require a convex version of the true loss function for our online learning framework, ideally acting as a surrogate of the true loss. Additionally, the gradient of the loss function needs to be calculated from the binary feedback of acceptance/rejection of the offers. However, in the analyzed model with binary feedback, a loss function that satisfies both requirements cannot be constructed. Instead, we consider a simplified piecewise linear convex loss function given by , where denotes the magnitude of the gradient when a user rejects the offer. For the experiment, we use a delta value of . Due to this transformation, we use the utility gain rather than the loss as a useful measure of the performance of the CoOL algorithm.
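One piecewise linear convex loss that is consistent with this description is sketched below. The exact form is an assumption rather than the formula used in the experiments: the loss is taken to grow with slope 1 when the offer exceeds the private cost (overpayment) and with slope delta when it falls short (rejection), so that the gradient can be read off from the binary accept/reject feedback alone. The function names are hypothetical.

```python
def surrogate_loss(incentive: float, private_cost: float, delta: float) -> float:
    """Piecewise linear convex loss with its minimum at the private cost."""
    if incentive >= private_cost:
        return incentive - private_cost        # accepted: cost of overpaying
    return delta * (private_cost - incentive)  # rejected: scaled shortfall


def gradient_from_feedback(accepted: bool, delta: float) -> float:
    """Gradient of the surrogate loss, computed from binary feedback only."""
    # Accepted offers are lowered; rejected offers are raised with a
    # gradient of magnitude delta.
    return 1.0 if accepted else -delta


if __name__ == "__main__":
    print(surrogate_loss(30.0, 25.0, delta=0.5))
    print(gradient_from_feedback(accepted=False, delta=0.5))
```

The private cost itself never enters the gradient, which is what makes the update usable when the platform observes only whether an offer was accepted.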
Structure. Due to the small number of apartments, we consider a non-contextual setting with and use an r-bounded hemimetric structure to model the relationship of the tasks, where is set to to avoid recommending incentives . Using a setting with would allow for real-world applications with additional context.
Category | Example keywords | Mentions
Location | neighborhood, distance | 477
Reviews | rating, star | 309
Price | expensive, cheap | 182
Picture | image, photo | 169
Main Results
We now present and discuss the results of the user study.
Descriptive statistics. Out of all responses, 758 (80.4%) respondents were willing to accept an offer for their less preferred apartment, given a certain discount per night. Out of these respondents, the average required discount for accepting the alternative apartment was 27.9 USD per night. The average required discount for switching from a frequently reviewed apartment to an infrequently reviewed apartment was higher.
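The aggregate figures above can be cross-checked against the per-category rows of Table 1. The short script below is a worked check under one assumption, namely that the average discount in each row is computed over the respondents who accepted an offer; the variable names are illustrative.

```python
# Rows of Table 1: (preferred, alternative, responses, accepted share, avg. discount in USD)
rows = [
    ("High", "Low", 416, 0.776, 29.5),
    ("Low", "Low", 228, 0.833, 28.1),
    ("High", "High", 219, 0.822, 25.4),
    ("Low", "High", 80, 0.813, 25.9),
]

total_responses = sum(n for _, _, n, _, _ in rows)

# Number of accepting respondents per row, rounded to whole persons.
accepted_counts = [round(n * share) for _, _, n, share, _ in rows]
total_accepted = sum(accepted_counts)

accepted_share = total_accepted / total_responses

# Average discount over all accepting respondents (weighted by row counts).
avg_discount = sum(
    count * discount for count, (_, _, _, _, discount) in zip(accepted_counts, rows)
) / total_accepted

if __name__ == "__main__":
    print(total_responses, total_accepted)
    print(round(100 * accepted_share, 1), round(avg_discount, 1))
```

The per-row counts add up to the reported number of accepting respondents, and the weighted averages reproduce the reported acceptance share and average discount after rounding.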
Out of the respondents who could choose between a frequently and an infrequently reviewed apartment, 83.9% chose the frequently reviewed apartment, while only 16.1% chose the infrequently reviewed apartment.
In the responses to an open question about the factors respondents considered to decide on the discount, we captured the frequency at which different factors were mentioned by defining several keywords for each factor. The number of times each factor was mentioned is shown in Table 2.
Algorithm performance. We use the cumulative utility gain to measure the performance of the IOL and CoOL algorithms. The utility gains of both algorithms after responses are shown in Figure 3(c). The utility gain in Figure 3(b) is almost 50% higher for the CoOL algorithm than for the IOL algorithm. Figure 3(c) reveals that this gain is mainly achieved due to a significant speed-up in learning over the first 50 problems.
Discussion. The results of the user study confirm several findings of [Fradkin2014], who studied the booking behavior on Airbnb. Similar to this study, we find that apartments with a high number of reviews are significantly more likely to be selected. We also find that the average required discount per night is higher when the alternative choice is an infrequently reviewed apartment. This also points toward a difference in willingness to pay between frequently and infrequently reviewed apartments. Similar results have been found in earlier studies on other marketplaces [Resnick et al.2006, Ye, Law, and Gu2009, Luca2011].
The user study also confirms that incentives influence buying behavior and can help increase exploration on online marketplaces [Avery, Resnick, and Zeckhauser1999, Robinson, Nightingale, and Mongrain2012]; when respondents chose a frequently reviewed apartment and were asked to instead choose an infrequently reviewed apartment, 77.6% of respondents were willing to accept a sufficiently large offer. More than 10% of those respondents were willing to accept a discount of 10 USD per night or less.
The performance of the IOL and CoOL algorithms in Figures 3(b) and 3(c) suggests that incentives can be learned via online learning, and that structural information can be used to significantly speed up the learning. Further, the speed-up in learning directly increases the marketplace’s utility gain from suggesting alternative items. To reduce the problem size in a real-world application such as Airbnb, items could be grouped by features such as location or number of reviews. Further, problem-specific features, such as the distance between apartments, could be added to increase the accuracy of the prediction.
Related Work
Multi-armed bandits / Bayesian games. A related line of research studies multi-armed bandits and Bayesian games, where a principal attempts to coordinate agents to maximize its utility. Research in this area mainly focuses on changing the behavior of agents through the way information is disclosed, rather than through the provision of payments. [Kremer, Mansour, and Perry2014] provide optimal information disclosure policies for deterministic utilities and only two possible actions. [Mansour, Slivkins, and Syrgkanis2015] generalize the results for stochastic utilities and a constant number of actions. Further, [Mansour et al.2016] consider the interaction of multiple agents, and [Chakraborty et al.2017] analyze multi-armed bandits in the presence of communication costs. Our problem differs from previous research in that utilities are not required to be stochastic, and additional structural information is available to the principal.
Recommender systems. A different approach to encouraging exploration in online marketplaces is the use of recommender systems, which are known to influence buyers’ purchasing decisions and can be used to encourage exploration [Resnick and Varian1997, Senecal and Nantel2004]. For example, ε-greedy recommender systems recommend the product closest to the buyer’s preferences with probability (1 − ε) and a random product with probability ε [Ten Hagen, Van Someren, and Hollink2003]. Such recommender systems can be extended using ideas studied in this paper.
Online/distributed multitask learning. Multitask learning has been increasingly studied in online and distributed settings recently. Inspired by wearable computing, a recent work by [Jin et al.2015] studied online multitask learning in a distributed setting. They considered a setup where tasks arrive asynchronously, and the relatedness among the tasks is maintained via a correlation matrix. However, there is no theoretical analysis of the regret bounds for the proposed algorithms. [Wang, Kolar, and Srerbo2016] recently studied multitask learning for distributed LASSO with shared support. Their work is different from ours: we consider general convex constraints to model task relationships and consider the adversarial online regret minimization framework.
Conclusions and Future Work
We highlighted the need in the sharing economy to actively shape demand by incentivizing users to deviate from their preferred choices and explore different options instead. To learn the incentives users require to choose different items, we developed a novel algorithm, CoOL, which uses structural information in user preferences to speed up learning. The key idea of our algorithm is to exploit structural information in a computationally efficient way by performing sporadic and approximate projections. We formally derived no-regret bounds for the CoOL algorithm and provided evidence for the increase in performance over the IOL baseline through several experiments. In a user study with apartments from the rental marketplace Airbnb, we demonstrated the practical applicability of our approach in a real-world setting. To conclude, we discuss several additional considerations for offering incentives in a sharing economy platform.
Safety/individual consumer loss. Generally, exploration in the sharing economy may be risky, and individuals can face severe losses while exploring. For example, new hosts might not be trustworthy, and new drivers in ridesharing systems might not be reliable. In our approach, the items to be explored are controlled by the platform, and appropriate preconditions would need to be implemented to minimize risks.
Reliability/Consistency. In order for platforms to implement an algorithmic provision of monetary incentives, it is important that incentives are reliable and consistent over time. Ideally, similar users should receive similar incentives, and offers should be consistent with the user’s preferences. Using the CoOL algorithm, consistency can be controlled through appropriate convex constraints.
Strategyproofness. Providing monetary incentives based on user preferences creates possibilities for opportunistic behavior. For example, users could attempt to repeatedly decline offers to receive higher offers in the future or browse certain items hoping to receive offers for similar items. To control for such behavior, markets need to be large enough so that behavior of individuals does not affect overall learning. Further, platforms can control the number and frequency with which individual users receive offers to minimize opportunistic possibilities.
Acknowledgments
This work was supported in part by the Swiss National Science Foundation, the Nano-Tera.ch program as part of the Opensense II project, ERC StG 307036, and a Microsoft Research Faculty Fellowship. Adish Singla acknowledges support by a Facebook Graduate Fellowship.
References
 [Avery, Resnick, and Zeckhauser1999] Avery, C.; Resnick, P.; and Zeckhauser, R. 1999. The market for evaluations. American Economic Review 564–584.
 [Beckenbach and Bellman2012] Beckenbach, E. F., and Bellman, R. 2012. Inequalities, volume 30. Springer Science & Business Media.
 [CesaBianchi and Lugosi2006] Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, learning, and games. Cambridge University Press.

 [Chakraborty et al.2017] Chakraborty, M.; Chua, K. Y. P.; Das, S.; and Juba, B. 2017. Coordinated versus decentralized exploration in multi-agent multi-armed bandits. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, 164–170.
 [Duchi, Hazan, and Singer2011] Duchi, J.; Hazan, E.; and Singer, Y. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
 [Floyd1962] Floyd, R. W. 1962. Algorithm 97: shortest path. Communications of the ACM 5(6):345.
 [Fradkin2014] Fradkin, A. 2014. Search frictions and the design of online marketplaces. NBER Working Paper.
 [Frazier et al.2014] Frazier, P.; Kempe, D.; Kleinberg, J.; and Kleinberg, R. 2014. Incentivizing exploration. In Proceedings of the fifteenth ACM conference on Economics and computation, 5–22. ACM.
 [Hirnschall et al.2018] Hirnschall, C.; Singla, A.; Tschiatschek, S.; and Krause, A. 2018. Learning user preferences to incentivize exploration in the sharing economy (extended version).
 [Jin et al.2015] Jin, X.; Luo, P.; Zhuang, F.; He, J.; and He, Q. 2015. Collaborating between local and global learning for distributed online multiple tasks. In CIKM.
 [Kremer, Mansour, and Perry2014] Kremer, I.; Mansour, Y.; and Perry, M. 2014. Implementing the “wisdom of the crowd”. Journal of Political Economy 122(5):988–1012.
 [Luca2011] Luca, M. 2011. Reviews, reputation, and revenue: The case of Yelp.com. Harvard Business School NOM Unit Working Paper.
 [Mansour et al.2016] Mansour, Y.; Slivkins, A.; Syrgkanis, V.; and Wu, Z. S. 2016. Bayesian exploration: Incentivizing exploration in bayesian games. In Proceedings of the 2016 ACM Conference on Economics and Computation, EC ’16, 661–661. New York, NY, USA: ACM.
 [Mansour, Slivkins, and Syrgkanis2015] Mansour, Y.; Slivkins, A.; and Syrgkanis, V. 2015. Bayesian incentivecompatible bandit exploration. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC ’15, 565–582. New York, NY, USA: ACM.
 [Rakhlin and Tewari2009] Rakhlin, A., and Tewari, A. 2009. Lecture notes on online learning. Draft, April.
 [Resnick and Varian1997] Resnick, P., and Varian, H. R. 1997. Recommender systems. Communications of the ACM 40(3):56–58.
 [Resnick et al.2006] Resnick, P.; Zeckhauser, R.; Swanson, J.; and Lockwood, K. 2006. The value of reputation on eBay: A controlled experiment. Experimental Economics 9(2):79–101.
 [Robinson, Nightingale, and Mongrain2012] Robinson, J. G.; Nightingale, T. R.; and Mongrain, S. A. 2012. Methods and systems for obtaining reviews for items lacking reviews. US Patent 8,108,255.
 [Senecal and Nantel2004] Senecal, S., and Nantel, J. 2004. The influence of online product recommendations on consumers’ online choices. Journal of Retailing 80(2):159–169.
 [Singla et al.2015] Singla, A.; Santoni, M.; Bartók, G.; Mukerji, P.; Meenen, M.; and Krause, A. 2015. Incentivizing users for balancing bike sharing systems. In AAAI.
 [Singla, Tschiatschek, and Krause2016] Singla, A.; Tschiatschek, S.; and Krause, A. 2016. Actively learning hemimetrics with applications to eliciting user preferences. In ICML.
 [Sra, Tropp, and Dhillon2004] Sra, S.; Tropp, J.; and Dhillon, I. S. 2004. Triangle fixing algorithms for the metric nearness problem. In Advances in Neural Information Processing Systems, 361–368.
 [Ten Hagen, Van Someren, and Hollink2003] Ten Hagen, S.; Van Someren, M.; and Hollink, V. 2003. Exploration/exploitation in adaptive recommender systems. Proceedings of Eunite 2003.
 [Wang, Kolar, and Srerbo2016] Wang, J.; Kolar, M.; and Srebro, N. 2016. Distributed multi-task learning. In AISTATS.
 [Ye, Law, and Gu2009] Ye, Q.; Law, R.; and Gu, B. 2009. The impact of online user reviews on hotel room sales. International Journal of Hospitality Management 28(1):180–182.
 [Zinkevich2003] Zinkevich, M. 2003. Online convex programming and generalized infinitesimal gradient ascent. In ICML.
Appendix A Outline of the Supplement
We start the supplement by introducing properties of the Bregman divergence and additional notation required for the proofs of the regret bounds. We further introduce two basic propositions and several lemmas. We then provide formal justification for using weighted projection in the CoOL algorithm, cf. Equation (4). Lastly, we provide the proof of the regret bound of the CoOL algorithm in Theorem 1 and Corollary 1.
Appendix B Preliminaries
Bregman Divergence
For any strictly convex function , the Bregman divergence between , is defined as the difference between the value of at and the first-order Taylor expansion of around evaluated at , i.e.
We use the following properties of the Bregman divergence, cf. [Rakhlin and Tewari2009]:

The Bregman divergence is nonnegative.

The Bregman projection
onto a convex set exists and is unique.

For defined as in the Bregman projection above and , by the generalized Pythagorean theorem, cf. [CesaBianchi and Lugosi2006], the Bregman divergence satisfies

The three-point equality
follows directly from the definition of the Bregman divergence.
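For reference, the definition and the four properties above can be written out in standard textbook notation. The symbols $R$, $u$, $v$, $w$, $\mathcal{K}$ and $\Pi_{\mathcal{K}}$ are chosen here for illustration and may differ from the notation used elsewhere in the paper; $R$ is assumed to be differentiable and $\mathcal{K}$ to be closed and convex.

```latex
% Assumed symbols: R strictly convex and differentiable, K a closed convex set.
\begin{align*}
D_R(u, v) &= R(u) - R(v) - \langle \nabla R(v),\, u - v \rangle
  && \text{(definition)} \\
D_R(u, v) &\ge 0
  && \text{(nonnegativity)} \\
\Pi_{\mathcal{K}}(v) &= \operatorname*{arg\,min}_{u \in \mathcal{K}} D_R(u, v)
  && \text{(Bregman projection)} \\
D_R(u, v) &\ge D_R\bigl(u, \Pi_{\mathcal{K}}(v)\bigr) + D_R\bigl(\Pi_{\mathcal{K}}(v), v\bigr)
  \quad \text{for all } u \in \mathcal{K}
  && \text{(generalized Pythagorean theorem)} \\
D_R(u, v) + D_R(v, w) &= D_R(u, w) + \langle \nabla R(w) - \nabla R(v),\, u - v \rangle
  && \text{(three-point equality)}
\end{align*}
```

As a sanity check, choosing $R(w) = \tfrac{1}{2}\lVert w \rVert_2^2$ gives $D_R(u, v) = \tfrac{1}{2}\lVert u - v \rVert_2^2$, so the Bregman projection reduces to the ordinary Euclidean projection.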
Notation
Throughout the supplement we use and as per Equation (3). Similar to the definition of , we also define , and as the concatenation of the task-specific feature and gradient vectors, i.e.
where for all , and are in all positions that do not correspond to task . We also use to refer to the concatenation of the updated task-specific weights, before any coordination, such that
where for and otherwise.
Appendix C Propositions
In the following we introduce two basic propositions that we need for the proof of Theorem 1.
Proposition 1.
If for all , and then
Proof.
Extending and applying the Cauchy-Schwarz inequality, we get
∎
Proposition 2.
The sum from is bounded by .
Proof.
∎
Appendix D Lemmas
In this section we introduce the lemmas required for the proof of the regret bounds of the CoOL algorithm. Applying Lemma 1 allows us to replace the loss function with its linearization, similar to [Zinkevich2003]. Lemmas 2 and 3 allow us to get an equivalent update procedure, using the Bregman divergence, and Lemma 4 gives a handle on the linearized regret bound, cf. [Rakhlin and Tewari2009]. Lemma 5 uses the duality gap to upper bound the Bregman divergence between the exact and approximate projection. Lemmas 6 and 7 provide different upper bounds on the Bregman divergence.
Lemma 1.
For all and there exists a such that can be replaced with without loss of generality.
Proof.
The loss function affects the regret in two ways: First, the loss function’s gradient is used in the update step, and second, the loss function is used to calculate the regret of the algorithm. Let and consider the linearized loss . Using the linearized loss, the behavior of the algorithm remains unchanged, since . Further, the regret either increases or remains unchanged, since the loss function is convex, such that for all
Rearranging, we get
such that using a linearized loss, the regret either remains constant or increases. ∎
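The linearization argument in this proof follows a standard pattern, sketched below in generic notation. The symbols $\ell_t$, $w_t$, $w^*$, $g_t$ and $\tilde{\ell}_t$ are assumptions made for this illustration and are not necessarily the ones used in the paper; for a non-differentiable loss, $g_t$ can be read as any subgradient.

```latex
% Assumed symbols: \ell_t convex loss, w_t the played point, w^* a comparator,
% g_t = \nabla \ell_t(w_t), and \tilde{\ell}_t(w) = \langle g_t, w \rangle.
\begin{align*}
\ell_t(w^*) &\ge \ell_t(w_t) + \langle g_t,\, w^* - w_t \rangle
  && \text{(convexity)} \\
\ell_t(w_t) - \ell_t(w^*) &\le \langle g_t,\, w_t - w^* \rangle
  = \tilde{\ell}_t(w_t) - \tilde{\ell}_t(w^*)
  && \text{(rearranging)}
\end{align*}
```

Summing the second line over all rounds shows that the regret under the original losses is at most the regret under the linearized losses, so a bound proved for the linearized losses carries over.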
Lemma 2.
For , the update rule
is equivalent to the update rule
Proof.
For the second update rule, inserting into the definition of the Bregman divergence and setting the derivative with respect to evaluated at to zero, we have
Rewriting, using that is nonzero only in entries that correspond to , and applying the definitions of and , we get
∎
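The kind of equivalence stated in Lemma 2 can be checked numerically in a simplified setting. The sketch below assumes a quadratic regularizer $R(w) = \tfrac{1}{2} w^\top Q w$ with a positive diagonal $Q$, a single step size `eta`, and no constraint set; these are illustrative assumptions rather than the paper's exact setup. Under them, minimizing `eta * <g, w> + D_R(w, w_t)` should give the scaled gradient step `w_t - eta * Q^{-1} g`.

```python
import random


def bregman_quadratic(u, v, q):
    """Bregman divergence of R(w) = 0.5 * sum_i q[i] * w[i]**2 (diagonal Q)."""
    return 0.5 * sum(qi * (ui - vi) ** 2 for ui, vi, qi in zip(u, v, q))


def objective(w, w_t, g, q, eta):
    """Linearized loss plus proximity term: eta * <g, w> + D_R(w, w_t)."""
    linear = eta * sum(gi * wi for gi, wi in zip(g, w))
    return linear + bregman_quadratic(w, w_t, q)


def gradient_step(w_t, g, q, eta):
    """Scaled gradient step w_t - eta * Q^{-1} g for diagonal Q."""
    return [wi - eta * gi / qi for wi, gi, qi in zip(w_t, g, q)]


def check_equivalence(dim=4, trials=1000, seed=0):
    """Check that no random perturbation of the gradient step lowers the objective."""
    rng = random.Random(seed)
    w_t = [rng.uniform(-1, 1) for _ in range(dim)]
    g = [rng.uniform(-1, 1) for _ in range(dim)]
    q = [rng.uniform(0.5, 2.0) for _ in range(dim)]
    eta = 0.1
    w_next = gradient_step(w_t, g, q, eta)
    best = objective(w_next, w_t, g, q, eta)
    for _ in range(trials):
        w_other = [wi + rng.uniform(-0.5, 0.5) for wi in w_next]
        if objective(w_other, w_t, g, q, eta) < best - 1e-12:
            return False
    return True
```

The check mirrors the proof: setting the derivative of the objective to zero gives `eta * g + Q (w - w_t) = 0`, which is the gradient step.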
Lemma 3.
For , the update rule
where , is equivalent to the update rule
Proof.
Applying the definition of , we can rewrite
∎
Lemma 4.
If is the constrained minimizer of the objective as stated in Lemma 3, then for any a in the solution space,
Proof.
Since is the constrained minimizer of the objective , any vector pointing away from into the solution space has a nonnegative inner product with the gradient of the objective at , such that
Rewriting and using the threepoint equality, we get
∎
Lemma 5.
If is the exact solution of
and is an approximate solution with duality gap less than , then
Proof.
The duality gap is defined as the difference between the primal and dual value of the solution. The dual value is upper bounded by the optimal solution and thus less than or equal to . Thus, for the primal solution with duality gap less than , we have
Note that is the projection of onto and . Thus, using the properties of the Bregman divergence, we can apply the generalized Pythagorean theorem such that
Inserting into the above inequality we get the result. ∎
Lemma 6.
For and and ,
Proof.
Lemma 7.
For any two , ,
Proof.
Applying our definition of , we can rewrite
Note that
Appendix E Idea of Weighted Projections
Intuitively, the CoOL algorithm restricts the solution to , such that the update can be rewritten as