1 Introduction
The theory of multiarmed bandits plays a key role in enabling personalization in the digital economy (Scott, 2015). Algorithms from this domain have successfully been deployed in a diverse array of applications including online advertising (Mehta and Mirrokni, 2011; Lu et al., 2010), crowdsourcing (TranThanh et al., 2014), content recommendation (Li et al., 2010), and selecting userspecific incentives (Ghosh and Hummel, 2013; Jain et al., 2014) (e.g., a retailer offering discounts). On the theoretical side, this has been complemented by a litany of nearoptimal regret bounds for multiarmed bandit settings with rich combinatorial structures and complex agent behavior models (Chen et al., 2016; Gai et al., 2011; Kveton et al., 2015; Sani et al., 2012). At a high level, the broad appeal of bandit approaches for allocating resources to human agents stems from its focus on balancing exploration with exploitation, thereby allowing a decisionmaker to efficiently identify users’ preferences without sacrificing shortterm rewards.
Implicit in most of these works is the notion that in largescale environments, a designer can simultaneously allocate resources to multiple users by running independent bandit instances. In reality, such independent decompositions do not make sense in applications where resources are subject to physical or monetary constraints. In simple terms, matching an agent to a resource immediately constrains the set of resources to which another agent can be matched. Such supply constraints may arise even when dealing with intangible products. For instance, social media platforms (e.g., Quora) seek to maximize user participation by offering incentives in the form of increased recognition—e.g., featured posts or badges (Immorlica et al., 2015). Of course, there are supply constraints on the number of posts or users that can be featured at a given time. As a consequence of these coupling constraints, much of the existing work on multiarmed bandits does not extend naturally to multiagent economies.
Yet, another important aspect not addressed by the literature concerns human behavior. Specifically, users’ preferences over the various resources may be dynamic—i.e. evolve in time as they are repeatedly exposed to the available options. The problem faced by a designer in such a dynamic environment is compounded by the lack of information regarding each user’s current state or beliefs, as well as how these beliefs influence their preferences and their evolution in time.
Bearing in mind these limitations, we study a multiarmed bandit problem for matching multiple agents to a finite set of incentives^{1}^{1}1We use the term incentive broadly to refer to any resource or action available to the agent. That is, incentives are not limited to monetary or financial mechanisms.: each incentive belongs to a category and global capacity constraints control the number of incentives that can be chosen from each category. In our model, each agent has a preference profile or a type
that determines its rewards for being matched to different incentives. The agent’s type evolves according to a Markov decision process (MDP), and therefore, the rewards vary over time
in a correlated fashion.Our work is primarily motivated by the problem faced by a technological platform that seeks to not just maximize user engagement but also to encourage users to make changes in their status quo decisionmaking process by offering incentives. For concreteness, consider a bikesharing service—an application we explore in our simulations—that seeks to identify optimal incentives for each user from a finite bundle of options—e.g., varying discount levels, free future rides, bulk ride offers, etc. Users’ preferences over the incentives may evolve with time depending on their current type, which in turn depends on their previous experience with the incentives. In addition to their marketing benefits, such incentives can serve as a powerful instrument for nudging users to park their bikes at alternative locations—this can lead to spatially balanced supply and consequently, lower rejection rates (Singla et al., 2015).
1.1 Contributions and Organization
Our objective is to design a multiarmed bandit algorithm that repeatedly matches agents to incentives in order to minimize the cumulative regret over a finite time horizon. Here, regret is defined as the difference in the reward obtained by a problem specific benchmark strategy and the proposed algorithm (see Definition 4). A preliminary impediment in achieving this goal is the fact that the capacitated matching problem studied in this work is NPHard even in the offline case. The major challenge therefore is whether we can achieve sublinear (in the length of the horizon) regret in the more general matching environment without any information on the users’ underlying beliefs or how they evolve?
Following preliminaries (Section 2), we introduce a simple greedy algorithm that provides a –approximation to the optimal offline matching solution (Section 3). Leveraging this first contribution, the central result in this paper (Section 4) is a new multiarmed bandit algorithm—MatchGreedyEpochUCB (MGEUCB)—for capacitated matching problems with timeevolving rewards. Our algorithm obtains logarithmic (and hence sublinear) regret even for this more general bandit problem. The proposed approach combines ideas from three distinct domains: (i) the –rd approximate greedy matching algorithm, (ii) the traditional UCB algorithm (Auer et al., 2002), and (iii) mixing times from the theory of Markov chains.
We validate our theoretical results (Section 5) by performing simulations on both synthetic and realistic instances derived using data obtained from a Bostonbased bikesharing service Hubway (hub, ). We compare our algorithm to existing UCBbased approaches and show that the proposed method enjoys favorable convergence rates, computational efficiency on large data sets, and does not get stuck at suboptimal matching solutions.
1.2 Background and Related Work
Two distinct features separate our model from the majority of work on the multiarmed bandit problem: (i) our focus on a capacitated matching problem with finite supply (every user cannot be matched to their optimal incentive), and (ii) the rewards associated with each agent evolve in a correlated fashion but the designer is unaware of each agent’s current state. Our work is closest to (Gai et al., 2011) which considers a matching problem with Markovian rewards. However, in their model the rewards associated with each edge evolve independently of the other edges; as we show via a simple example in Section 2.2, the correlated nature of rewards in our instance can lead to additional challenges and convergence to suboptimal matchings if we employ a traditional approach as in (Gai et al., 2011).
Our work also bears conceptual similarities to the rich literature on combinatorial bandits (Badanidiyuru et al., 2013; Chen et al., 2016; Kveton et al., 2014, 2015; Wen et al., 2015)
. However, unlike our work, these papers consider a model where the distribution of the rewards is static in time. For this reason, efficient learning algorithms leveraging oracles to solve generic constrained combinatorial optimization problems developed for the combinatorial semibandit setting
(Chen et al., 2016; Kveton et al., 2015) face similar limitations in our model as the approach of (Gai et al., 2011). Moreover, the rewards in our problem may not have a linear structure so the approach of (Wen et al., 2015) is not applicable.The novelty in this work is not the combinatorial aspect but the interplay between combinatorial bandits and the edge rewards evolving according to an MDP. When an arm is selected by an oracle, the reward of every edge in the graph evolves——how it evolves depends on which arm is chosen. If the change occurs in a suboptimal direction, this can affect future rewards. Indeed, the difficulties in our proofs do not stem from applying an oracle for combinatorial optimization, but from bounding the secondary regret that arises when rewards evolve in a suboptimal way.
Finally, there is a somewhat parallel body of work on singleagent reinforcement learning techniques
(Jaksch et al., 2010; Mazumdar et al., 2017; Azar et al., 2013; Ratliff et al., 2018) and expert selection where the rewards on the arms evolve in a correlated fashion as in our work. In addition to our focus on multiagent matchings, we remark that many of these works assume that the designer is aware (at least partially) of the agent’s exact state and thus, can eventually infer the nature of the evolution. Consequently, a major contribution of this work is the extension of UCBbased approaches to solve MDPs with a fully unobserved state and rewards associated with each edge that evolve in a correlated fashion.2 Preliminaries
A designer faces the problem of matching agents to incentives (more generally jobs, goods, content, etc.) without violating certain capacity constraints. We model this setting by means of a bipartite graph where is the set of agents, is the set of incentives to which the agents can be matched, and is the set of all pairings between agents and incentives. We sometimes refer to as the set of arms. In this regard, a matching is a set such that every agent and incentive is present in at most one edge belonging to .
Each agent is associated with a type or state , which influences the reward received by this agent when matched with some incentive . When agent is matched to incentive
, its type evolves according to a Markov process with transition probability kernel
. Each pairing or edge of the bipartite graph is associated with some reward that depends on the agent–incentive pair, , as well as the type .The designer’s policy (algorithm) is to compute a matching repeatedly over a finite time horizon in order to maximize the expected aggregate reward. In this work, we restrict our attention to a specific type of multiarmed bandit algorithm that we refer to as an epoch mixing policy. Formally, the execution of such a policy is divided into a finite number of time indices , where is the length of the time horizon. In each time index , the policy selects a matching and repeatedly ‘plays’ this matching for iterations within this time index. We refer to the set of iterations within a time index collectively as an epoch. That is, within the
–th epoch, for each edge
, agent is matched to incentive and the agent’s type is allowed to evolve for iterations. In short, an epoch mixing policy proceeds in two time scales—each selection of a matching corresponds to an epoch comprising of iterations for , and there are a total of epochs. It is worth noting that an epochbased policy was used in the UCB2 algorithm (Auer et al., 2002), albeit with stationary rewards.Agents’ types evolve based on the incentives to which they are matched. Suppose that denotes the type distribution on at epoch and is the incentive to which agent is matched by (i.e., ). Then,
For epoch , the rewards are averaged over the iterations in that epoch. Let denote the reward received by agent when it is matched to incentive given type . We assume that and is drawn from a distribution . The reward distributions for different edges and states in are assumed to be independent of each other. Suppose that an algorithm selects the edge for iterations within an epoch. The observed reward at the end of this epoch is taken to be the timeaveraged reward over the epoch. Specifically, suppose that the –th epoch proceeds for iterations beginning with time —i.e. one plus the total iterations completed before this—and ending at time , and that denotes agent ’s state at time . Then, the timeaveraged reward in the epoch is given by We use the state as a superscript to denote dependence of the reward on the agent’s type at the beginning of the epoch. Finally, the total (timeaveraged) reward due to a matching at the end of an epoch can be written as
We assume that the Markov chain corresponding to each edge is aperiodic and irreducible (Levin et al., 2009). We denote the stationary or steadystate distribution for this edge as . Hence, we define the expected reward for edge , given its stationary distribution, to be where the expectation is with respect to the distribution on the reward given .
2.1 Capacitated Matching
Given , the designer’s goal at the beginning of each epoch is to select a matching —i.e. a collection of edges—that satisfies some cardinality constraints. We partition the edges in into a mutually exclusive set of classes allowing for edges possessing identical characteristics to be grouped together. In the bikesharing example, the various classes could denote types of incentives—e.g., edges that match agents to discounts, freerides, etc. Suppose that denotes a partitioning of the edge set such that (i) for all , (ii) , and (iii) for all . We refer to each as a class and for any given edge , use to denote the class that this edge belongs to, i.e., and .
Given a capacity vector
indexed on the set of classes, we say that a matching is a feasible solution to the capacitated matching problem if:
[itemsep=5pt,topsep=5pt, leftmargin=15pt]

for every (resp., ), the matching must contain at most one edge containing this agent (resp., incentive)

and, the total number of edges from each class contained in the matching cannot be larger than .
In summary, the capacitated matching problem can be formulated as the following integer program:
(P1)  
s.t.  
We use the notation for a capacitated matching problem instance. In (P1), refers to the weight or the reward to be obtained from the given edge. The term is an indicator on whether the edge is included in the solution to (P1). Clearly, the goal is to select a maximum weight matching subject to the constraints. In our online bandit problem, the designer’s actual goal in a fixed epoch is to maximize the quantity , i.e., . However, since the reward distributions and the current user type are not known beforehand, our MGEUCB algorithm (detailed in Section 4.2) approximates this objective by setting the weights to be the average observed reward from the edges in combination with the corresponding confidence bounds.
2.2 Technical Challenges
There are two key obstacles involved in extending traditional bandit approaches to our combinatorial setting with evolving rewards, namely, cascading suboptimality and correlated convergence. The first phenomenon occurs when an agent is matched to a suboptimal arm (incentive) because its optimal arm has already been assigned to another agent. Such suboptimal pairings have the potential to cascade, e.g., when another agent who is matched to in the optimal solution can no longer receive this incentive and so on. Therefore, unlike the classical bandit analysis, the selection of suboptimal arms cannot be directly mapped to the empirical rewards.
Correlated Convergence. As mentioned previously, in our model, the rewards depend on the type or state of an agent, and hence, the reward distribution on any given edge
can vary even when the algorithm does not select this edge. As a result, a naïve application of a bandit algorithm can severely underestimate the expected reward on each edge and eventually converge to a suboptimal matching. A concrete example of the poor convergence effect is provided in Example
2.2. In Section 4.2, we describe how our central bandit algorithm limits the damage due to cascading while simultaneously avoiding the correlated convergence problem.[Failure of Classical UCB]
Consider a problem instance with two agents , two incentives and identical state space i.e., . The transition matrices and deterministic rewards for the agents for being matched to each incentive are depicted pictorially below: we assume that is a sufficiently small constant.
Clearly, the optimal strategy is to repeatedly chose the matching achieving a reward of (almost) two in each epoch. An implementation of traditional UCB for the matching problem—e.g., the approach in (Gai et al., 2011; Chen et al., 2016; Kveton et al., 2015)—selects a matching based on the empirical rewards and confidence bounds for a total of iterations, which are then divided into epochs for convenience. This approach converges to the suboptimal matching of . Indeed, every time the algorithm selects this matching, both the agents’ states are reset to and when the algorithm explores the optimum matching, the reward consistently happens to be zero since the agents are in state . Hence, the rewards for the (edges in the) optimum matching are grossly underestimated.
3 Greedy Offline Matching
In this section, we consider the capacitated matching problem in the offline case, where the edge weights are provided as input. The techniques developed in this section serve as a base in order to solve the more general online problem in the next section. More specifically, we assume that we are given an arbitrary instance of the capacitated matching problem Given this instance, the designer’s objective is to solve (P1). Surprisingly, this problem turns out to be NPHard and thus cannot be optimally solved in polynomial time (Garey and Johnson, 1979)—this marks a stark contrast with the classic maximum weighted matching problem, which can be solved efficiently using the Hungarian method (Kuhn, 1955).
In view of these computational difficulties, we develop a simple greedy approach for the capacitated matching problem and formally prove that it results in a onethird approximation to the optimum solution. The greedy method studied in this work comes with a multitude of desirable properties that render it suitable for matching problems arising in largescale economies. Firstly, the greedy algorithm has a running time of , where is the number of agents—this nearlinear execution time in the number of edges makes it ideal for platforms comprising of a large number of agents. Secondly, since the output of the greedy algorithm depends only on the ordering of the edge weights and is not sensitive to their exact numerical value, learning approaches tend to converge faster to the ‘optimum solution’. This property is validated by our simulations (see Figure (c)c). Finally, the performance of the greedy algorithm in practice (e.g., see Figure (b)b) appears to be much closer to the optimum solution than the 1/3 approximation guaranteed by Theorem 3.1 below.
3.1 Analysis of Greedy Algorithm
The greedy matching is outlined in Algorithm 1. Given an instance , Algorithm 1 ‘greedily’ selects the highest weight feasible edge in each iteration—this step is repeated until all available edges that are feasible are added to . Our main result in this section is that for any given instance of the capacitated matching problem, the matching returned by Algorithm 1 has a total weight that is at least 1/3–rd that of the maximum weight matching.
For any given capacitated matching problem instance , let denote the output of Algorithm 1 and be any other feasible solution to the optimization problem in (P1) including the optimum matching. Then, The proof is based on a charging argument that takes into account the capacity constraints and can be found in Section B.1 of the supplementary material. At a high level, we take each edge belonging to the benchmark and identify a corresponding edge in whose weight is larger than that of the benchmark edge. This allows us to charge the weight of the original edge to an edge in . During the charging process, we ensure that no more than three edges in are charged to each edge in . This gives us an approximation factor of three.
3.2 Properties of Greedy Matchings
We conclude this section by providing a hierarchical decomposition of the edges in for a fixed instance . In Section 4.1, we will use this property to reconcile the offline version of the problem with the online bandit case. Let denote the matching computed by Algorithm 1 for the given instance such that without loss of generality^{2}^{2}2If , we abuse notation and let .. Next, let for all —i.e. the highestweight edges in the greedy matching.
For each , we define the infeasibility set as the set of edges in that when added to violates the feasibility constraints of (P1). Finally, we use to denote the marginal infeasibility sets—i.e. and
(1) 
We note that the marginal infeasibility sets denote a mutually exclusive partition of the edge set minus the greedy matching—i.e., . Moreover, since the greedy matching selects its edges in the decreasing order of weight, for any , and every , we have that .
Armed with our decomposition of the edges in , we now present a crucial structural lemma. The following lemma identifies sufficient conditions on the local ordering of the edge weights for two different instances under which the outputs of the greedy matching for the instances are nonidentical.
Given instances and of the capacitated matching problem, let and denote the output of Algorithm 1 for these instances, respectively. Let be conditions described as follows:
If , then at least one of or must be true. Lemma 3.2 is fundamental in the analysis of our MGEUCB algorithm because it provides a method to map the selection of each suboptimal edge to a familiar condition comparing empirical rewards to stationary rewards.
4 Online Matching—bandit Algorithm
In this section, we propose a multiarmed bandit algorithm for the capacitated matching problem and analyze its regret. For concreteness, we first highlight the information and action sets available to the designer in the online problem. The designer is presented with a partial instance of the matching problem without the weights, i.e., along with a fixed time horizon of epochs but has the ability to set the parameters , where is the number of iterations under epoch . The designer’s goal is to design a policy that selects a matching in the –th epoch that is a feasible solution for (P1). At the end of the –th epoch, the designer observes the average reward for each but not the agent’s type. We abuse notation and take to be the agent’s state at the beginning of epoch . The designer’s objective is to minimize the regret over the finite horizon.
The expected regret of a policy is the difference in the expected aggregate reward of a benchmark matching and that of the matching returned by the policy, summed over epochs. Owing to its favorable properties (see Section 3), we use the greedy matching on the stationary state rewards as our benchmark. Measuring the regret with respect to the unknown stationarydistribution is standard with MDPs (e.g., see (Tekin and Liu, 2010, 2012; Gai et al., 2011)). Formally, let denote the output of Algorithm 1 on the instance —i.e., with the weights equal the stationary state rewards .
The expected regret of a policy with respect to the greedy matching is given by
where the expectation is with respect to the reward and the state of the agents during each epoch.
4.1 Regret Decomposition
As is usual in this type of analysis, we start by decomposing the regret in terms of the number of selections of each suboptimal arm (edge). We state some assumptions and define notation before proving our generic regret decomposition theorem. A complete list of the notation used can be found in Section A of the supplementary material.

[itemsep=5pt, topsep=5pt,leftmargin=15pt]

For analytic convenience, we assume that the number of agents and incentives is balanced and therefore, . WLOG, every agent is matched to some incentive in ; if this is not the case, we can add dummy incentives with zero reward.

Suppose that such that and let denote the incentive that is matched to in . Let be the marginal infeasibility sets as defined in (1).

Suppose that and for some nonnegative integer .
Let be the indicator function—e.g., is one when the edge belongs to the matching , and zero otherwise. Define
to be the random variable that denotes the number of epochs in which an edge is selected under an algorithm
. By relating to the regret , we are able to provide bounds on the performance of .By adding and subtracting from the equation in Definition 4, we get that
To further simplify the regret, we separate the edges in by introducing the notion of a suboptimal edge. Formally, for any given , define and . Then, the regret bound in the above equation can be simplified by ignoring the contribution of the terms in . That is, since for all ,
(2) 
Recall from the definition of the marginal infeasibility sets in (1) that for any given , there exists a unique edge such that . Define such that . Now, we can define the reward gap for any given edge as follows:
This leads us to our main regret decomposition result which leverages mixing times for Markov chains (Fill, 1991) along with Equation (2) in deriving regret bounds. For an aperiodic, irreducible Markov chain , using the notion that it convergences to its stationary state under repeated plays of a fixed action, we can prove that for every arm , there exists a constant such that —in fact, this result holds for all type distributions of the agent. Suppose for each , is an aperiodic, irreducible Markov chain with corresponding constant . Then, for a given algorithm where for some fixed , we have that
The proof of this proposition is in Section B.2 of the supplementary material.
belonging to either a Bernoulli, Uniform, or Beta distribution; (iv)
and .4.2 MgEucb Algorithm and Analysis
In the initialization phase, the algorithm computes and plays a sequence of matchings for a total of epochs. The initial matchings ensure that every edge in is selected at least once—the computation of these initial matchings relies on a greedy covering algorithm that is described in Section C.1 of the supplementary material. Following this, our algorithm maintains the cumulative empirical reward for every . At the beginning of (say) epoch , the algorithm computes a greedy matching for the instance where , i.e., the average empirical reward for the edge added to a suitably chosen confidence window. The incent function (Algorithm 4, described in the supplementary material since it is a trivial function) plays each edge in the greedy matching for iterations, where increases linearly with . This process is repeated for epochs. Prior to theoretically analyzing MGEUCB, we return to Example 2.2 in order to provide intuition for how the algorithm overcomes correlated convergence of rewards.
Revisiting Example 1: Why does MGEUCB work? In Example 1, the algorithm initially estimates the empirical reward of and to be zero respectively. However, during the UCB exploration phase, the matching is played again for epoch length and the state of agent moves from to during the epoch. Therefore, the algorithm estimates the average reward of each edge within the epoch to be , and the empirical reward increases. This continues as the epoch length increases, so that eventually the empirical reward for exceeds that of and the algorithm correctly identifies the optimal matching as we move from exploration to exploitation.
In order to characterize the regret of the MGEUCB algorithm, Proposition 4.1 implies that it is sufficient to bound the expected number of epochs in which our algorithm selects each suboptimal edge. The following theorem presents an upper bound on this quantity.
Consider a finite set of agents and incentives with corresponding aperiodic, irreducible Markov chains for each . Let be the MGEUCB algorithm with mixing time sequence where , , and . Then for every ,
where , and is a constant specific to edge .
The full proof of the theorem is provided can be found in the supplementary material.
Proof.
(sketch.) There are three key ingredients to the proof: (i) linearly increasing epoch lengths, (ii) overcoming cascading errors, and (iii) application of the AzumaHoeffding concentration inequality.
By increasing the epoch length linearly, MGEUCB ensures that as the algorithm converges to the optimal matching, it also plays each arm for a longer duration within an epoch. This helps the algorithm to progressively discard suboptimal arms without selecting them too many times when the epoch length is still small. At the same time, the epoch length is long enough to allow for sufficient mixing and separation between multiple nearoptimal matchings. If we fix the epoch length as a constant, the resulting regret bounds are considerably worse because the agent states may never converge to the steadystate distributions.
To address cascading errors, we provide a useful characterization. For a given , suppose that refers to the average empirical reward obtained from edge up to epoch plus the upper confidence bound parameter, given that edge has been selected for exactly times in epochs to . For any given epoch where the algorithm selects a suboptimal matching, i.e., , we can apply Lemma 3.2 and get that at least one of the following conditions must be true:

[itemsep=5pt,topsep=2pt, leftmargin=15pt]


This is a particularly useful characterization because it maps the selection of each suboptimal edge to a familiar condition that compares the empirical rewards to the stationary rewards. Therefore, once each arm is selected for epochs, the empirical rewards approach the ‘true’ rewards and our algorithm discards suboptimal edges. Mathematically, this can be written as
where is some carefully chosen constant, and .
With this characterization, for each , we find an upper bound on the probability of the event . However, this is a nontrivial task since the reward obtained in any given epoch is not independent of the previous actions. Specifically, the underlying Markov process that generates the rewards is common across the edges connected to any given agent in the sense, that the initial distribution for each Markov chain that results from pulling an edge is the distribution at the end of the preceding pull. Therefore, we employ AzumaHoeffding (Azuma, 1967; Hoeffding, 1963), a concentration inequality that does not require independence in the armbased observed rewards. Moreover, unlike the classical UCB analysis, the empirical reward can differ from the expected stationary reward due to the distributions and . To account for this additional error term, we use bounds on the convergence rates of Markov chains to guide the choice of the confidence parameter in Algorithm 2. Applying the AzumaHoeffding inequality, we can show that with high probability, the difference between the empirical reward and the stationary reward of edge is no larger than . ∎
5 Experiments
In this section, we present a set of illustrative experiments with our algorithm (MGEUCB) on synthetic and real data. We observe much faster convergence with the greedy matching as compared to the Hungarian algorithm. Moreover, as is typical in the bandit literature (e.g., (Auer et al., 2002)), we show that a tuned version of our algorithm (MGEUCB+), in which we reduce the coefficient on the term in the UCB ‘confidence parameter’ from six to three, further improves the convergence of our algorithm. Finally we show that our algorithm can be effectively used as an incentive design scheme to improve the performance of a bikeshare system.
5.1 Synthetic Experiments
We first highlight the failure of classical UCB approaches (CUCB)—e.g., as in (Gai et al., 2011)—for problems with correlated reward evolution. In Figure (a)a, we demonstrate that CUCB converges almost immediately to a suboptimal solution, while this is not the case for our algorithm (MGEUCB+). In Figure (b)b, we compare MGEUCB and MGEUCB+ with a variant of Algorithm 2 that uses the Hungarian method (HEUCB) for matchings. While HEUCB does have a ‘marginally’ higher mean reward, Figure (c)c reveals that the MGEUCB and MGEUCB+ algorithms converge much faster to the optimum solution of the greedy matching than the Hungarian alternatives.
5.2 BikeShare Experiments
In this problem, we seek to incentivize participants in a bikesharing system; our goal is to alter their intended destination in order to balance the spatial supply of available bikes appropriately and meet future user demand. We use data from the Bostonbased bikesharing service Hubway (hub, ) to construct the example. Formally, we consider matching each agent to an incentive , meaning the algorithm proposes that agent travel to station as opposed to its intended destination (potentially, for some monetary benefit). The agent’s state controls the probability of accepting the incentive by means of a distance threshold parameter and a parameter of a Bernouilli distribution, both of which are drawn uniformly at random. More details on the data and problem setup can be found in Section D of the supplementary material.
Our bike-share simulations, presented in Figure 3, show a marked percentage improvement in system performance when compared to an environment without incentives, along with convergence toward an upper bound on system performance. Moreover, our algorithm achieves this significant performance increase while, on average, matching only a small fraction of the users in the system to an incentive.
6 Conclusion
We combine ideas from greedy matching, the UCB multi-armed bandit strategy, and the theory of Markov chain mixing times to propose a bandit algorithm for matching incentives to users whose preferences are unknown a priori and evolve dynamically over time, in a resource-constrained environment. For this algorithm, we derive logarithmic gap-dependent regret bounds despite the additional technical challenges of cascading suboptimality and correlated convergence. Finally, we demonstrate its empirical performance on synthetic and real-world examples.
Acknowledgments
This work is supported by NSF Awards CNS-1736582 and CNS-1656689. T. Fiez was also supported in part by an NDSEG Fellowship.
References
 (1) Hubway: Metro-Boston's bike-share program. Available online: https://thehubway.com.
 Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002. doi: 10.1023/A:1013689704352.
 Azar et al. (2013) M. G. Azar, A. Lazaric, and E. Brunskill. Regret bounds for reinforcement learning with policy advice. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97–112, 2013.
 Azuma (1967) K. Azuma. Weighted sums of certain dependent random variables. Tohoku Math. J., 19(3):357–367, 1967. doi: 10.2748/tmj/1178243286.
 Badanidiyuru et al. (2013) A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In Proc. 54th Annual IEEE Symp. Foundations of Computer Science, pages 207–216, 2013.
 Chen et al. (2016) W. Chen, Y. Wang, Y. Yuan, and Q. Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. J. Machine Learning Research, 17:50:1–50:33, 2016. URL http://jmlr.org/papers/v17/14-298.html.
 Fill (1991) J. Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process. Ann. Appl. Probab., 1(1):62–87, 1991.
 Folland (2007) G. Folland. Real Analysis. Wiley, 2nd edition, 2007.
 Gai et al. (2011) Y. Gai, B. Krishnamachari, and M. Liu. On the combinatorial multi-armed bandit problem with Markovian rewards. In Proc. Global Communications Conf., pages 1–6, 2011. doi: 10.1109/GLOCOM.2011.6134244.
 Garey and Johnson (1979) M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. ISBN 0716710447.
 Ghosh and Hummel (2013) A. Ghosh and P. Hummel. Learning and incentives in user-generated content: multi-armed bandits with endogenous arms. In Proc. of ITCS 2013, pages 233–246, 2013.
 Hoeffding (1963) W. Hoeffding. Probability inequalities for sums of bounded random variables. J. American Statistical Association, 58(301):13–30, 1963. doi: 10.2307/2282952.
 Immorlica et al. (2015) N. Immorlica, G. Stoddard, and V. Syrgkanis. Social status and badge design. In Proc. 24th International Conference on World Wide Web (WWW 2015), Florence, Italy, May 18–22, 2015, pages 473–483, 2015.
 Jain et al. (2014) S. Jain, B. Narayanaswamy, and Y. Narahari. A multi-armed bandit incentive mechanism for crowdsourcing demand response in smart grids. In Proc. of AAAI 2014, pages 721–727, 2014.
 Jaksch et al. (2010) T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. J. Machine Learning Research, 11:1563–1600, 2010.
 Kuhn (1955) H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955.
 Kveton et al. (2014) B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In Proc. of UAI 2014, pages 420–429, 2014.
 Kveton et al. (2015) B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015.
 Levin et al. (2009) D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
 Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. 19th Intern. Conf. World Wide Web, pages 661–670, 2010.
 Lu et al. (2010) T. Lu, D. Pál, and M. Pál. Contextual multi-armed bandits. In Proc. of AISTATS 2010, pages 485–492, 2010.
 Mazumdar et al. (2017) E. Mazumdar, R. Dong, V. Rúbies Royo, C. Tomlin, and S. S. Sastry. A multi-armed bandit approach for online expert selection in Markov decision processes. arXiv:1707.05714, 2017.
 Mehta and Mirrokni (2011) A. Mehta and V. Mirrokni. Online ad serving: Theory and practice, 2011.
 Ratliff et al. (2018) L. J. Ratliff, S. Sekar, L. Zheng, and T. Fiez. Incentives in the dark: Multi-armed bandits for evolving users with unknown type. arXiv preprint, 2018.
 Sani et al. (2012) A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. In Proc. of NIPS 2012, pages 3284–3292, 2012.
 Scott (2015) S. L. Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
 Singla et al. (2015) A. Singla, M. Santoni, G. Bartók, P. Mukerji, M. Meenen, and A. Krause. Incentivizing users for balancing bike sharing systems. In Proc. of AAAI 2015, pages 723–729, 2015.
 Tekin and Liu (2010) C. Tekin and M. Liu. Online algorithms for the multi-armed bandit problem with Markovian rewards. In Proc. 48th Annual Allerton Conference on Communication, Control, and Computing, pages 1675–1682, 2010.
 Tekin and Liu (2012) C. Tekin and M. Liu. Online learning of rested and restless bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, 2012.
 Tran-Thanh et al. (2014) L. Tran-Thanh, S. Stein, A. Rogers, and N. R. Jennings. Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artif. Intell., 214:89–111, 2014.
 Wen et al. (2015) Z. Wen, B. Kveton, and A. Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pages 1113–1122, 2015.
Appendix A Notational Table
notation  meaning
set of agents
set of incentives
allowed agent–incentive pairs
state (type) space of an agent
transition probability kernel
an agent's type distribution at a given epoch
stationary distribution of an agent's chain
expected reward from an edge
number of iterations a matching is offered in an epoch
random reward
an agent's reward distribution
time-averaged reward during an epoch
maximum number of edges of a given class
greedy matching on a set of weights
the edge having a given rank (by largest weight) in the matching
the incentive an agent is matched to in a matching
set of edges that become infeasible when an edge is added to the matching, but not before that
set of edges satisfying a given condition
number of agents & incentives
the total number of epochs
state of an agent at the beginning of an epoch
constants specific to each edge
regret of a given matching policy at the end of a number of epochs
number of times an edge is selected in the first epochs
reward on an edge when selected for a given time
average reward over the first times an edge is selected
an agent's state at the beginning of an epoch