Combinatorial Bandits for Incentivizing Agents with Dynamic Preferences

07/06/2018 ∙ by Tanner Fiez, et al. ∙ 0

The design of personalized incentives or recommendations to improve user engagement is gaining prominence as digital platform providers continually emerge. We propose a multi-armed bandit framework for matching incentives to users, whose preferences are unknown a priori and evolving dynamically in time, in a resource constrained environment. We design an algorithm that combines ideas from three distinct domains: (i) a greedy matching paradigm, (ii) the upper confidence bound algorithm (UCB) for bandits, and (iii) mixing times from the theory of Markov chains. For this algorithm, we provide theoretical bounds on the regret and demonstrate its performance via both synthetic and realistic (matching supply and demand in a bike-sharing platform) examples.




1 Introduction

The theory of multi-armed bandits plays a key role in enabling personalization in the digital economy (Scott, 2015). Algorithms from this domain have been deployed successfully in a diverse array of applications including online advertising (Mehta and Mirrokni, 2011; Lu et al., 2010), crowdsourcing (Tran-Thanh et al., 2014), content recommendation (Li et al., 2010), and selecting user-specific incentives (Ghosh and Hummel, 2013; Jain et al., 2014), e.g., a retailer offering discounts. On the theoretical side, this has been complemented by a litany of near-optimal regret bounds for multi-armed bandit settings with rich combinatorial structures and complex agent behavior models (Chen et al., 2016; Gai et al., 2011; Kveton et al., 2015; Sani et al., 2012). At a high level, the broad appeal of bandit approaches for allocating resources to human agents stems from their focus on balancing exploration with exploitation, thereby allowing a decision-maker to efficiently identify users’ preferences without sacrificing short-term rewards.

Implicit in most of these works is the notion that in large-scale environments, a designer can simultaneously allocate resources to multiple users by running independent bandit instances. In reality, such independent decompositions do not make sense in applications where resources are subject to physical or monetary constraints. In simple terms, matching an agent to a resource immediately constrains the set of resources to which another agent can be matched. Such supply constraints may arise even when dealing with intangible products. For instance, social media platforms (e.g., Quora) seek to maximize user participation by offering incentives in the form of increased recognition—e.g., featured posts or badges (Immorlica et al., 2015). Of course, there are supply constraints on the number of posts or users that can be featured at a given time. As a consequence of these coupling constraints, much of the existing work on multi-armed bandits does not extend naturally to multi-agent economies.

Yet another important aspect not addressed by the literature concerns human behavior. Specifically, users’ preferences over the various resources may be dynamic, i.e., they may evolve in time as users are repeatedly exposed to the available options. The problem faced by a designer in such a dynamic environment is compounded by the lack of information regarding each user’s current state or beliefs, as well as how these beliefs influence their preferences and their evolution in time.

Bearing in mind these limitations, we study a multi-armed bandit problem for matching multiple agents to a finite set of incentives: each incentive belongs to a category, and global capacity constraints control the number of incentives that can be chosen from each category. We use the term incentive broadly to refer to any resource or action available to the agent; that is, incentives are not limited to monetary or financial mechanisms. In our model, each agent has a preference profile or a type that determines its rewards for being matched to different incentives. The agent’s type evolves according to a Markov decision process (MDP), and therefore, the rewards vary over time in a correlated fashion.

Our work is primarily motivated by the problem faced by a technological platform that seeks to not just maximize user engagement but also to encourage users to make changes in their status quo decision-making process by offering incentives. For concreteness, consider a bike-sharing service—an application we explore in our simulations—that seeks to identify optimal incentives for each user from a finite bundle of options—e.g., varying discount levels, free future rides, bulk ride offers, etc. Users’ preferences over the incentives may evolve with time depending on their current type, which in turn depends on their previous experience with the incentives. In addition to their marketing benefits, such incentives can serve as a powerful instrument for nudging users to park their bikes at alternative locations—this can lead to spatially balanced supply and consequently, lower rejection rates (Singla et al., 2015).

1.1 Contributions and Organization

Our objective is to design a multi-armed bandit algorithm that repeatedly matches agents to incentives in order to minimize the cumulative regret over a finite time horizon. Here, regret is defined as the difference between the reward obtained by a problem-specific benchmark strategy and that obtained by the proposed algorithm (see Definition 4). A preliminary impediment in achieving this goal is the fact that the capacitated matching problem studied in this work is NP-Hard even in the offline case. The major challenge therefore is whether we can achieve sub-linear (in the length of the horizon) regret in the more general matching environment without any information on the users’ underlying beliefs or how they evolve.

Following preliminaries (Section 2), we introduce a simple greedy algorithm that provides a 1/3-approximation to the optimal offline matching solution (Section 3). Leveraging this first contribution, the central result in this paper (Section 4) is a new multi-armed bandit algorithm, MatchGreedy-EpochUCB (MG-EUCB), for capacitated matching problems with time-evolving rewards. Our algorithm obtains logarithmic (and hence sub-linear) regret even for this more general bandit problem. The proposed approach combines ideas from three distinct domains: (i) the 1/3-approximate greedy matching algorithm, (ii) the traditional UCB algorithm (Auer et al., 2002), and (iii) mixing times from the theory of Markov chains.

We validate our theoretical results (Section 5) by performing simulations on both synthetic and realistic instances derived using data obtained from a Boston-based bike-sharing service Hubway (hub, ). We compare our algorithm to existing UCB-based approaches and show that the proposed method enjoys favorable convergence rates, is computationally efficient on large data sets, and does not get stuck at sub-optimal matching solutions.

1.2 Background and Related Work

Two distinct features separate our model from the majority of work on the multi-armed bandit problem: (i) our focus on a capacitated matching problem with finite supply (not every user can be matched to their optimal incentive), and (ii) the rewards associated with each agent evolve in a correlated fashion but the designer is unaware of each agent’s current state. Our work is closest to that of Gai et al. (2011), which considers a matching problem with Markovian rewards. However, in their model the rewards associated with each edge evolve independently of the other edges; as we show via a simple example in Section 2.2, the correlated nature of rewards in our instance can lead to additional challenges and convergence to sub-optimal matchings if we employ a traditional approach as in Gai et al. (2011).

Our work also bears conceptual similarities to the rich literature on combinatorial bandits (Badanidiyuru et al., 2013; Chen et al., 2016; Kveton et al., 2014, 2015; Wen et al., 2015). However, unlike our work, these papers consider a model where the distribution of the rewards is static in time. For this reason, efficient learning algorithms leveraging oracles to solve generic constrained combinatorial optimization problems developed for the combinatorial semi-bandit setting (Chen et al., 2016; Kveton et al., 2015) face similar limitations in our model as the approach of Gai et al. (2011). Moreover, the rewards in our problem may not have a linear structure, so the approach of Wen et al. (2015) is not applicable.

The novelty in this work is not the combinatorial aspect but the interplay between combinatorial bandits and the edge rewards evolving according to an MDP. When an arm is selected by an oracle, the reward of every edge in the graph evolves, and how it evolves depends on which arm is chosen. If the change occurs in a sub-optimal direction, this can affect future rewards. Indeed, the difficulties in our proofs do not stem from applying an oracle for combinatorial optimization, but from bounding the secondary regret that arises when rewards evolve in a sub-optimal way.

Finally, there is a somewhat parallel body of work on single-agent reinforcement learning techniques (Jaksch et al., 2010; Mazumdar et al., 2017; Azar et al., 2013; Ratliff et al., 2018) and expert selection where the rewards on the arms evolve in a correlated fashion as in our work. In addition to our focus on multi-agent matchings, we remark that many of these works assume that the designer is aware (at least partially) of the agent’s exact state and thus can eventually infer the nature of the evolution. Consequently, a major contribution of this work is the extension of UCB-based approaches to solve MDPs with a fully unobserved state and rewards associated with each edge that evolve in a correlated fashion.

2 Preliminaries

A designer faces the problem of matching agents to incentives (more generally jobs, goods, content, etc.) without violating certain capacity constraints. We model this setting by means of a bipartite graph $(\mathcal{A}, \mathcal{I}, \mathcal{P})$ where $\mathcal{A}$ is the set of agents, $\mathcal{I}$ is the set of incentives to which the agents can be matched, and $\mathcal{P} = \mathcal{A} \times \mathcal{I}$ is the set of all pairings between agents and incentives. We sometimes refer to $\mathcal{P}$ as the set of arms. In this regard, a matching is a set $M \subseteq \mathcal{P}$ such that every agent $a \in \mathcal{A}$ and incentive $i \in \mathcal{I}$ is present in at most one edge belonging to $M$.

Each agent $a \in \mathcal{A}$ is associated with a type or state $\theta_a \in \Theta_a$, which influences the reward received by this agent when matched with some incentive $i \in \mathcal{I}$. When agent $a$ is matched to incentive $i$, its type evolves according to a Markov process with transition probability kernel $P_{a,i} : \Theta_a \times \Theta_a \to [0,1]$. Each pairing or edge of the bipartite graph is associated with some reward that depends on the agent-incentive pair, $(a,i)$, as well as the type $\theta_a$.

The designer’s policy (algorithm) is to compute a matching repeatedly over a finite time horizon in order to maximize the expected aggregate reward. In this work, we restrict our attention to a specific type of multi-armed bandit algorithm that we refer to as an epoch mixing policy. Formally, the execution of such a policy $\alpha$ is divided into a finite number of time indices $[n] = \{1, 2, \ldots, n\}$, where $n$ is the length of the time horizon. In each time index $k \in [n]$, the policy selects a matching $\alpha(k)$ and repeatedly ‘plays’ this matching for $\tau_k > 0$ iterations within this time index. We refer to the set of iterations within a time index collectively as an epoch. That is, within the $k$-th epoch, for each edge $(a,i) \in \alpha(k)$, agent $a$ is matched to incentive $i$ and the agent’s type is allowed to evolve for $\tau_k$ iterations. In short, an epoch mixing policy proceeds in two time scales: each selection of a matching corresponds to an epoch comprising $\tau_k$ iterations for $k \in [n]$, and there are a total of $n$ epochs. It is worth noting that an epoch-based policy was used in the UCB2 algorithm (Auer et al., 2002), albeit with stationary rewards.

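Agents’ types evolve based on the incentives to which they are matched. Suppose that $\beta_a^{(k)}$ denotes the type distribution on $\Theta_a$ at epoch $k$ and $i$ is the incentive to which agent $a$ is matched by $\alpha$ (i.e., $(a,i) \in \alpha(k)$). Then, writing $P_{a,i}(\theta, \theta')$ for the probability of moving from type $\theta$ to type $\theta'$,

$$\beta_a^{(k+1)}(\theta') = \sum_{\theta \in \Theta_a} P_{a,i}^{\tau_k}(\theta, \theta')\, \beta_a^{(k)}(\theta).$$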
For epoch $k$, the rewards are averaged over the $\tau_k$ iterations in that epoch. Let $r^{\theta}_{a,i}$ denote the reward received by agent $a$ when it is matched to incentive $i$ given type $\theta \in \Theta_a$. We assume that $r^{\theta}_{a,i} \in [0,1]$ and is drawn from a distribution $\mathcal{T}_r(a,i,\theta)$. The reward distributions for different edges and states in $\Theta_a$ are assumed to be independent of each other. Suppose that an algorithm $\alpha$ selects the edge $(a,i)$ for $\tau$ iterations within an epoch. The observed reward at the end of this epoch is taken to be the time-averaged reward over the epoch. Specifically, suppose that the $k$-th epoch proceeds for $\tau_k$ iterations beginning with time $t_k$ (i.e., one plus the total iterations completed before this) and ending at time $t_{k+1} - 1$, and that $\theta_a(t)$ denotes agent $a$’s state at time $t$. Then, the time-averaged reward in the epoch is given by

$$r^{\theta_a(t_k)}_{a,i}(t_k, \tau_k) = \frac{1}{\tau_k} \sum_{t = t_k}^{t_{k+1} - 1} r^{\theta_a(t)}_{a,i}(t).$$

We use the state as a superscript to denote dependence of the reward on the agent’s type at the beginning of the epoch. Finally, the total (time-averaged) reward due to a matching $\alpha(k)$ at the end of an epoch can be written as

$$\sum_{(a,i) \in \alpha(k)} r^{\theta_a(t_k)}_{a,i}(t_k, \tau_k).$$

We assume that the Markov chain corresponding to each edge $(a,i) \in \mathcal{P}$ is aperiodic and irreducible (Levin et al., 2009). We denote the stationary or steady-state distribution for this edge as $\pi_{a,i} : \Theta_a \to [0,1]$. Hence, we define the expected reward for edge $(a,i)$, given its stationary distribution, to be $\mu_{a,i} = \mathbb{E}\big[\sum_{\theta \in \Theta_a} r^{\theta}_{a,i}\, \pi_{a,i}(\theta)\big]$, where the expectation is with respect to the distribution on the reward given $\theta$.
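To make the stationary reward and the epoch-averaged reward concrete, the following is a minimal sketch for a single edge. The two-state transition matrix `P`, the per-state rewards in `mean_reward`, and both function names are hypothetical and chosen only for illustration; rewards are taken to be deterministic given the state, which is a simplification of the model above.

```python
import random

def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration on a row-stochastic matrix P (aperiodic, irreducible)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi

def epoch_average_reward(P, mean_reward, start_state, tau, rng):
    """Play one edge for tau iterations; return (time-averaged reward, final state)."""
    state, total = start_state, 0.0
    for _ in range(tau):
        total += mean_reward[state]  # reward depends on the current (hidden) state
        state = rng.choices(range(len(P)), weights=P[state])[0]
    return total / tau, state

# Hypothetical edge: state 1 is rewarding, state 0 is not.
P = [[0.9, 0.1], [0.5, 0.5]]
mean_reward = [0.0, 1.0]

pi = stationary_distribution(P)
mu = sum(p * r for p, r in zip(pi, mean_reward))  # stationary expected reward
avg, _ = epoch_average_reward(P, mean_reward, 0, 20_000, random.Random(0))
```

For this toy chain the stationary distribution is (5/6, 1/6), so the stationary reward is 1/6; a long epoch started from state 0 yields a time-averaged reward close to that value, while a short epoch would be biased toward the reward of the starting state.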

2.1 Capacitated Matching

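Given $\mathcal{P} = \mathcal{A} \times \mathcal{I}$, the designer’s goal at the beginning of each epoch is to select a matching $M \subseteq \mathcal{P}$, i.e., a collection of edges, that satisfies some cardinality constraints. We partition the edges in $\mathcal{P}$ into a mutually exclusive set of classes, allowing for edges possessing identical characteristics to be grouped together. In the bike-sharing example, the various classes could denote types of incentives, e.g., edges that match agents to discounts, free rides, etc. Suppose that $\mathcal{C} = \{\xi_1, \xi_2, \ldots, \xi_q\}$ denotes a partitioning of the edge set such that (i) $\xi_j \subseteq \mathcal{P}$ for all $1 \le j \le q$, (ii) $\bigcup_{j=1}^{q} \xi_j = \mathcal{P}$, and (iii) $\xi_j \cap \xi_{j'} = \emptyset$ for all $j \neq j'$. We refer to each $\xi_j$ as a class and, for any given edge $(a,i) \in \mathcal{P}$, use $c(a,i)$ to denote the class that this edge belongs to, i.e., $c(a,i) \in \mathcal{C}$ and $(a,i) \in c(a,i)$.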

Given a capacity vector $b = (b_{\xi_1}, \ldots, b_{\xi_q})$ indexed on the set of classes, we say that a matching $M \subseteq \mathcal{P}$ is a feasible solution to the capacitated matching problem if:

  • for every $a \in \mathcal{A}$ (resp., $i \in \mathcal{I}$), the matching contains at most one edge containing this agent (resp., incentive), and

  • the total number of edges from each class $\xi_j$ contained in the matching is no larger than $b_{\xi_j}$.

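In summary, the capacitated matching problem can be formulated as the following integer program:

$$
\begin{aligned}
\max_{x} \quad & \sum_{(a,i) \in \mathcal{P}} w(a,i)\, x(a,i) \\
\text{s.t.} \quad & \sum_{i \in \mathcal{I}} x(a,i) \le 1 \quad \forall a \in \mathcal{A}, \\
& \sum_{a \in \mathcal{A}} x(a,i) \le 1 \quad \forall i \in \mathcal{I}, \\
& \sum_{(a,i) \in \xi_j} x(a,i) \le b_{\xi_j} \quad \forall \xi_j \in \mathcal{C}, \\
& x(a,i) \in \{0,1\} \quad \forall (a,i) \in \mathcal{P}.
\end{aligned} \tag{P1}
$$

We use the notation $\{\mathcal{P}, \mathcal{C}, b, (w(a,i))_{(a,i) \in \mathcal{P}}\}$ for a capacitated matching problem instance. In (P1), $w(a,i)$ refers to the weight or the reward to be obtained from the given edge. The term $x(a,i)$ is an indicator of whether the edge $(a,i)$ is included in the solution to (P1). Clearly, the goal is to select a maximum weight matching subject to the constraints. In our online bandit problem, the designer’s actual goal in a fixed epoch $k$ is to maximize the expected total time-averaged reward of the selected matching, i.e., $w(a,i) = \mathbb{E}\big[r^{\theta_a(t_k)}_{a,i}(t_k, \tau_k)\big]$. However, since the reward distributions and the current user type are not known beforehand, our MG-EUCB algorithm (detailed in Section 4.2) approximates this objective by setting the weights to be the average observed reward from the edges in combination with the corresponding confidence bounds.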
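The constraints of (P1) can be checked mechanically. The sketch below is illustrative only: the function names, the edge labels, and the class names `discount` and `free_ride` are hypothetical, with classes standing in for incentive types as in the bike-sharing example.

```python
from collections import Counter

def is_feasible(matching, edge_class, capacity):
    """Check the constraint families of (P1) for a list of (agent, incentive) edges."""
    agents = [a for a, _ in matching]
    incentives = [i for _, i in matching]
    # Each agent and each incentive may appear in at most one edge.
    if len(set(agents)) < len(agents) or len(set(incentives)) < len(incentives):
        return False
    # The number of chosen edges per class may not exceed that class's capacity.
    used = Counter(edge_class[e] for e in matching)
    return all(used[c] <= capacity[c] for c in used)

def total_weight(matching, w):
    """Objective of (P1): sum of the weights of the selected edges."""
    return sum(w[e] for e in matching)

# Hypothetical instance with two agents, two incentives, and two classes.
edge_class = {("a1", "i1"): "discount", ("a2", "i2"): "discount",
              ("a1", "i2"): "free_ride", ("a2", "i1"): "free_ride"}
capacity = {"discount": 1, "free_ride": 2}
w = {("a1", "i1"): 0.9, ("a2", "i2"): 0.6, ("a1", "i2"): 0.5, ("a2", "i1"): 0.4}

ok = is_feasible([("a1", "i2"), ("a2", "i1")], edge_class, capacity)
too_many_discounts = is_feasible([("a1", "i1"), ("a2", "i2")], edge_class, capacity)
shared_agent = is_feasible([("a1", "i1"), ("a1", "i2")], edge_class, capacity)
```

In this instance the heaviest perfect matching uses two `discount` edges and is ruled out by the class capacity, which is the kind of coupling that separates the capacitated problem from ordinary maximum-weight matching.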

2.2 Technical Challenges

There are two key obstacles involved in extending traditional bandit approaches to our combinatorial setting with evolving rewards, namely, cascading sub-optimality and correlated convergence. The first phenomenon occurs when an agent $a$ is matched to a sub-optimal arm (incentive) $i$ because its optimal arm has already been assigned to another agent. Such sub-optimal pairings have the potential to cascade, e.g., when another agent who is matched to $i$ in the optimal solution can no longer receive this incentive, and so on. Therefore, unlike the classical bandit analysis, the selection of sub-optimal arms cannot be directly mapped to the empirical rewards.

Correlated Convergence. As mentioned previously, in our model, the rewards depend on the type or state of an agent, and hence, the reward distribution on any given edge can vary even when the algorithm does not select this edge. As a result, a naïve application of a bandit algorithm can severely under-estimate the expected reward on each edge and eventually converge to a sub-optimal matching. A concrete example of the poor convergence effect is provided in Example 2.2. In Section 4.2, we describe how our central bandit algorithm limits the damage due to cascading while simultaneously avoiding the correlated convergence problem.

Example 2.2 (Failure of Classical UCB).

Consider a problem instance with two agents $\mathcal{A} = \{a_1, a_2\}$, two incentives $\mathcal{I} = \{i_1, i_2\}$, and identical state spaces, i.e., $\Theta_{a_1} = \Theta_{a_2} = \{\theta_1, \theta_2\}$. The transition matrices and deterministic rewards for the agents for being matched to each incentive are depicted pictorially below; we assume that $\epsilon > 0$ is a sufficiently small constant.










Figure 1: (a) State transition diagram and reward for each edge: note that the state is associated with the agent and not the edge.

Clearly, the optimal strategy is to repeatedly choose the matching $\{(a_1, i_1), (a_2, i_2)\}$, achieving a reward of (almost) two in each epoch. An implementation of traditional UCB for the matching problem, e.g., the approach in (Gai et al., 2011; Chen et al., 2016; Kveton et al., 2015), selects a matching in every iteration based on the empirical rewards and confidence bounds, with the iterations divided into epochs for convenience. This approach converges to the sub-optimal matching $\{(a_1, i_2), (a_2, i_1)\}$. Indeed, every time the algorithm selects this matching, both agents’ states are reset to $\theta_1$, and when the algorithm explores the optimum matching, the reward consistently happens to be zero since the agents are in state $\theta_1$. Hence, the rewards for the (edges in the) optimum matching are grossly underestimated.

3 Greedy Offline Matching

In this section, we consider the capacitated matching problem in the offline case, where the edge weights are provided as input. The techniques developed in this section serve as a base in order to solve the more general online problem in the next section. More specifically, we assume that we are given an arbitrary instance of the capacitated matching problem $\{\mathcal{P}, \mathcal{C}, b, (w(a,i))_{(a,i) \in \mathcal{P}}\}$. Given this instance, the designer’s objective is to solve (P1). Surprisingly, this problem turns out to be NP-Hard and thus cannot be solved optimally in polynomial time unless P = NP (Garey and Johnson, 1979). This marks a stark contrast with the classic maximum weighted matching problem, which can be solved efficiently using the Hungarian method (Kuhn, 1955).

In view of these computational difficulties, we develop a simple greedy approach for the capacitated matching problem and formally prove that it results in a one-third approximation to the optimum solution. The greedy method studied in this work comes with a multitude of desirable properties that render it suitable for matching problems arising in large-scale economies. Firstly, the greedy algorithm has a running time of $O(m^2 \log m)$, where $m$ is the number of agents; this near-linear execution time in the number of edges makes it ideal for platforms comprising a large number of agents. Secondly, since the output of the greedy algorithm depends only on the ordering of the edge weights and is not sensitive to their exact numerical value, learning approaches tend to converge faster to the ‘optimum solution’. This property is validated by our simulations (see Figure 2(c)). Finally, the performance of the greedy algorithm in practice (e.g., see Figure 2(b)) appears to be much closer to the optimum solution than the 1/3 approximation guaranteed by Theorem 3.1 below.

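function MG($w$, $\{\mathcal{P}, \mathcal{C}, b\}$)
    $G \leftarrow \emptyset$, $E' \leftarrow \mathcal{P}$
    while $E' \neq \emptyset$:
        Select $(a,i) = \arg\max_{(a',i') \in E'} w(a',i')$
        if $|G \cap c(a,i)| < b_{c(a,i)}$ then
            $G \leftarrow G \cup \{(a,i)\}$
            $E' \leftarrow E' \setminus \{(a',i') : a' = a \text{ or } i' = i\}$
        else $E' \leftarrow E' \setminus \{(a,i)\}$
    return $G$
end function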
Algorithm 1 Capacitated-Greedy Matching Algorithm

3.1 Analysis of Greedy Algorithm

The greedy matching is outlined in Algorithm 1. Given an instance $\{\mathcal{P}, \mathcal{C}, b, (w(a,i))_{(a,i) \in \mathcal{P}}\}$, Algorithm 1 ‘greedily’ selects the highest weight feasible edge in each iteration; this step is repeated until no feasible edge remains to be added to the matching $G$. Our main result in this section is that for any given instance of the capacitated matching problem, the matching returned by Algorithm 1 has a total weight that is at least one-third of that of the maximum weight matching.
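The greedy rule described above can be sketched in a few lines. This is an illustrative implementation rather than the authors' code: it sorts the edges once by decreasing weight and skips any edge that would violate a constraint, which selects the same edges as repeatedly picking the heaviest feasible edge. The function name, the tie-breaking by edge label, and the instance are hypothetical.

```python
def greedy_capacitated_matching(weights, edge_class, capacity):
    """Return a feasible matching built by scanning edges in decreasing weight."""
    remaining = dict(capacity)          # capacity still available per class
    used_agents, used_incentives = set(), set()
    matching = []
    # Ties are broken by edge label so that the output is deterministic.
    for a, i in sorted(weights, key=lambda e: (-weights[e], e)):
        c = edge_class[(a, i)]
        if a in used_agents or i in used_incentives or remaining[c] <= 0:
            continue                    # adding this edge would be infeasible
        matching.append((a, i))
        used_agents.add(a)
        used_incentives.add(i)
        remaining[c] -= 1
    return matching

# Hypothetical instance: one class per incentive, capacity one each.
weights = {("a1", "i1"): 0.9, ("a1", "i2"): 0.8,
           ("a2", "i1"): 0.7, ("a2", "i2"): 0.1}
edge_class = {e: e[1] for e in weights}
capacity = {"i1": 1, "i2": 1}

G = greedy_capacitated_matching(weights, edge_class, capacity)
greedy_value = sum(weights[e] for e in G)
best_value = weights[("a1", "i2")] + weights[("a2", "i1")]  # the other perfect matching
```

Here greedy commits to the heaviest edge and ends with total weight 1.0, while the other perfect matching has weight 1.5; the output is sub-optimal but well within the one-third guarantee.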

Theorem 3.1. For any given capacitated matching problem instance $\{\mathcal{P}, \mathcal{C}, b, (w(a,i))_{(a,i) \in \mathcal{P}}\}$, let $G$ denote the output of Algorithm 1 and $M$ be any other feasible solution to the optimization problem in (P1), including the optimum matching. Then, $\sum_{(a,i) \in M} w(a,i) \le 3 \sum_{(a,i) \in G} w(a,i)$.

The proof is based on a charging argument that takes into account the capacity constraints and can be found in Section B.1 of the supplementary material. At a high level, we take each edge belonging to the benchmark $M$ and identify a corresponding edge in $G$ whose weight is at least that of the benchmark edge. This allows us to charge the weight of the original edge to an edge in $G$. During the charging process, we ensure that no more than three edges in $M$ are charged to each edge in $G$. This gives us an approximation factor of three.

3.2 Properties of Greedy Matchings

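We conclude this section by providing a hierarchical decomposition of the edges in $\mathcal{P}$ for a fixed instance $\{\mathcal{P}, \mathcal{C}, b, (w(a,i))_{(a,i) \in \mathcal{P}}\}$. In Section 4.1, we will use this property to reconcile the offline version of the problem with the online bandit case. Let $G = \{g_1, g_2, \ldots, g_{|G|}\}$ denote the matching computed by Algorithm 1 for the given instance such that $w(g_1) \ge w(g_2) \ge \cdots \ge w(g_{|G|})$ without loss of generality (if $g = (a,i)$, we abuse notation and let $w(g) = w(a,i)$). Next, let $G_j = \{g_1, \ldots, g_j\}$ for all $1 \le j \le |G|$, i.e., the $j$ highest-weight edges in the greedy matching.

For each $1 \le j \le |G|$, we define the infeasibility set $H_j^G$ as the set of edges in $\mathcal{P} \setminus G$ that, when added to $G_j$, violate the feasibility constraints of (P1). Finally, we use $L_j^G$ to denote the marginal infeasibility sets, i.e.,

$$L_1^G = H_1^G \quad \text{and} \quad L_j^G = H_j^G \setminus H_{j-1}^G \quad \text{for } 2 \le j \le |G|. \tag{1}$$

We note that the marginal infeasibility sets form a mutually exclusive partition of the edge set minus the greedy matching, i.e., $\bigcup_{j} L_j^G = \mathcal{P} \setminus G$. Moreover, since the greedy matching selects its edges in decreasing order of weight, for any $1 \le j \le |G|$ and every $(a,i) \in L_j^G$, we have that $w(g_j) \ge w(a,i)$.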
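The decomposition can be computed directly from a greedy matching. The sketch below is a hypothetical helper, not part of the paper: it assumes `greedy` lists the greedy edges in decreasing order of weight, and it assigns every non-greedy edge to the first prefix of the greedy matching that blocks it.

```python
def marginal_infeasibility_sets(greedy, edges, edge_class, capacity):
    """Return [L_1, ..., L_|G|]: edges first blocked when g_j joins the matching."""
    def blocked(e, prefix):
        a, i = e
        # Blocked if e shares an agent or incentive with an edge in the prefix ...
        if any(a == a2 or i == i2 for a2, i2 in prefix):
            return True
        # ... or if the prefix already exhausts the capacity of e's class.
        c = edge_class[e]
        return sum(1 for g in prefix if edge_class[g] == c) >= capacity[c]

    others = [e for e in edges if e not in greedy]
    assigned, marginal = set(), []
    for j in range(1, len(greedy) + 1):
        h_j = {e for e in others if blocked(e, greedy[:j])}
        marginal.append(h_j - assigned)   # H_j minus H_{j-1}, since the H_j are nested
        assigned |= h_j
    return marginal

# Hypothetical instance: one class per incentive, capacity one each.
edges = [("a1", "i1"), ("a1", "i2"), ("a2", "i1"), ("a2", "i2")]
edge_class = {e: e[1] for e in edges}
capacity = {"i1": 1, "i2": 1}
greedy = [("a1", "i1"), ("a2", "i2")]     # greedy output, heaviest edge first

L = marginal_infeasibility_sets(greedy, edges, edge_class, capacity)
```

In this toy instance both remaining edges conflict with the first greedy edge, so they land in the first marginal set and the second is empty; together the sets cover every edge outside the greedy matching.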

Armed with our decomposition of the edges in , we now present a crucial structural lemma. The following lemma identifies sufficient conditions on the local ordering of the edge weights for two different instances under which the outputs of the greedy matching for the instances are non-identical.

Given instances and of the capacitated matching problem, let and denote the output of Algorithm 1 for these instances, respectively. Let be conditions described as follows:

If , then at least one of or must be true. Lemma 3.2 is fundamental in the analysis of our MG-EUCB algorithm because it provides a method to map the selection of each sub-optimal edge to a familiar condition comparing empirical rewards to stationary rewards.

4 Online Matching: Bandit Algorithm

In this section, we propose a multi-armed bandit algorithm for the capacitated matching problem and analyze its regret. For concreteness, we first highlight the information and action sets available to the designer in the online problem. The designer is presented with a partial instance of the matching problem without the weights, i.e., $\{\mathcal{P}, \mathcal{C}, b\}$, along with a fixed time horizon of $n$ epochs, but has the ability to set the parameters $(\tau_k)_{k=1}^{n}$, where $\tau_k$ is the number of iterations under epoch $k$. The designer’s goal is to design a policy $\alpha$ that selects a matching $\alpha(k)$ in the $k$-th epoch that is a feasible solution for (P1). At the end of the $k$-th epoch, the designer observes the average reward $r^{\theta_k}_{a,i}(t_k, \tau_k)$ for each $(a,i) \in \alpha(k)$ but not the agent’s type. We abuse notation and take $\theta_k$ to be the agent’s state at the beginning of epoch $k$. The designer’s objective is to minimize the regret over the finite horizon.

The expected regret of a policy is the difference in the expected aggregate reward of a benchmark matching and that of the matching returned by the policy, summed over epochs. Owing to its favorable properties (see Section 3), we use the greedy matching on the stationary state rewards as our benchmark. Measuring the regret with respect to the unknown stationary distribution is standard with MDPs (see, e.g., Tekin and Liu, 2010, 2012; Gai et al., 2011). Formally, let $G^*$ denote the output of Algorithm 1 on the instance $\{\mathcal{P}, \mathcal{C}, b, (\mu_{a,i})_{(a,i) \in \mathcal{P}}\}$, i.e., with the weights equal to the stationary state rewards $\mu_{a,i}$.

Definition 4. The expected regret of a policy $\alpha$ with respect to the greedy matching $G^*$ is given by

$$R^{\alpha}(n) = \sum_{k=1}^{n} \Big( \sum_{(a,i) \in G^*} \mu_{a,i} - \sum_{(a,i) \in \alpha(k)} \mathbb{E}\big[ r^{\theta_k}_{a,i}(t_k, \tau_k) \big] \Big),$$

where the expectation is with respect to the reward and the state of the agents during each epoch.

4.1 Regret Decomposition

As is usual in this type of analysis, we start by decomposing the regret in terms of the number of selections of each sub-optimal arm (edge). We state some assumptions and define notation before proving our generic regret decomposition theorem. A complete list of the notation used can be found in Section A of the supplementary material.

  1. For analytic convenience, we assume that the number of agents and incentives is balanced and therefore, $|\mathcal{A}| = |\mathcal{I}| = m$. WLOG, every agent is matched to some incentive in $G^*$; if this is not the case, we can add dummy incentives with zero reward.

  2. Suppose that $G^* = \{g_1^*, \ldots, g_m^*\}$ such that $\mu_{g_1^*} \ge \cdots \ge \mu_{g_m^*}$ and let $i_a^*$ denote the incentive that $a$ is matched to in $G^*$. Let $L_1^*, \ldots, L_m^*$ be the marginal infeasibility sets as defined in (1).

  3. Suppose that $\tau_k = \tau_0 + \zeta k$ for some $\tau_0 > 0$ and some non-negative integer $\zeta$.

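Let $\mathbb{1}\{\cdot\}$ be the indicator function, e.g., $\mathbb{1}\{(a,i) \in \alpha(k)\}$ is one when the edge $(a,i)$ belongs to the matching $\alpha(k)$, and zero otherwise. Define

$$T^{\alpha}_{a,i}(n) = \sum_{k=1}^{n} \mathbb{1}\{(a,i) \in \alpha(k)\}$$

to be the random variable that denotes the number of epochs in which an edge is selected under an algorithm $\alpha$. By relating $\mathbb{E}[T^{\alpha}_{a,i}(n)]$ to the regret $R^{\alpha}(n)$, we are able to provide bounds on the performance of $\alpha$.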

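By adding and subtracting $\sum_{k=1}^{n} \sum_{(a,i) \in \alpha(k)} \mu_{a,i}$ from the equation in Definition 4, we get that

$$R^{\alpha}(n) = \sum_{k=1}^{n} \Big( \sum_{(a,i) \in G^*} \mu_{a,i} - \sum_{(a,i) \in \alpha(k)} \mu_{a,i} \Big) + \sum_{k=1}^{n} \sum_{(a,i) \in \alpha(k)} \Big( \mu_{a,i} - \mathbb{E}\big[ r^{\theta_k}_{a,i}(t_k, \tau_k) \big] \Big).$$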
To further simplify the regret, we separate the edges in by introducing the notion of a sub-optimal edge. Formally, for any given , define and . Then, the regret bound in the above equation can be simplified by ignoring the contribution of the terms in . That is, since for all ,


Recall from the definition of the marginal infeasibility sets in (1) that for any given , there exists a unique edge such that . Define such that . Now, we can define the reward gap for any given edge as follows:

This leads us to our main regret decomposition result, which leverages mixing times for Markov chains (Fill, 1991) along with Equation (2) in deriving regret bounds. For an aperiodic, irreducible Markov chain $P_{a,i}$, using the notion that it converges to its stationary state under repeated plays of a fixed action, we can prove that for every arm $(a,i)$, there exists a constant $C_{a,i}$ such that $\big| \mathbb{E}[r^{\theta}_{a,i}(t, \tau)] - \mu_{a,i} \big| \le C_{a,i}/\tau$; in fact, this result holds for all type distributions of the agent.

Proposition 4.1. Suppose for each $(a,i) \in \mathcal{P}$, $P_{a,i}$ is an aperiodic, irreducible Markov chain with corresponding constant $C_{a,i}$. Then, for a given algorithm $\alpha$ where $\tau_k = \tau_0 + \zeta k$ for some fixed non-negative integer $\zeta$, we have that

The proof of this proposition is in Section B.2 of the supplementary material.
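The role of mixing in the constant above can be made explicit with a short, hedged derivation. It is a sketch under assumptions that go beyond the text: rewards lie in $[0,1]$, $\bar{r}_{a,i}(\theta)$ denotes the mean reward in state $\theta$, $\beta$ is the agent's type distribution at the start of the epoch, and the chain for edge $(a,i)$ converges geometrically in total variation with hypothetical constants $K_{a,i} > 0$ and $\rho_{a,i} \in (0,1)$.

```latex
\begin{align*}
\left| \mathbb{E}\!\left[ \frac{1}{\tau} \sum_{s=0}^{\tau-1} r^{\theta_a(t+s)}_{a,i} \right] - \mu_{a,i} \right|
  &= \left| \frac{1}{\tau} \sum_{s=0}^{\tau-1} \sum_{\theta \in \Theta_a}
       \bigl( (\beta P_{a,i}^{s})(\theta) - \pi_{a,i}(\theta) \bigr)\, \bar{r}_{a,i}(\theta) \right| \\
  &\le \frac{1}{\tau} \sum_{s=0}^{\tau-1} \bigl\| \beta P_{a,i}^{s} - \pi_{a,i} \bigr\|_{TV}
   \le \frac{1}{\tau} \sum_{s=0}^{\infty} K_{a,i}\, \rho_{a,i}^{s}
   = \frac{K_{a,i}}{(1 - \rho_{a,i})\, \tau}.
\end{align*}
```

Under these assumptions the bias of the epoch-averaged reward decays like $1/\tau$ regardless of the starting distribution, which is consistent with the benefit of letting the epoch length grow over time.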

Figure 2: Synthetic Experiments. Comparison of MG-EUCB(+) and H-EUCB(+) to their respective offline solutions (G-optimal and H-optimal, respectively) and to C-UCB (classical UCB). In this setup, each state transition matrix associated with an arm was selected uniformly at random within the class of aperiodic and irreducible stochastic matrices, and the reward for each (arm, state) pair is drawn i.i.d. from a Bernoulli, Uniform, or Beta distribution; see Supplement D for more extensive experiments.

4.2 MG-EUCB Algorithm and Analysis

In the initialization phase, the algorithm computes and plays a sequence of matchings $M_1, \ldots, M_p$ for a total of $p$ epochs. The initial matchings ensure that every edge in $\mathcal{P}$ is selected at least once; the computation of these initial matchings relies on a greedy covering algorithm that is described in Section C.1 of the supplementary material. Following this, our algorithm maintains the cumulative empirical reward $\bar{r}_{a,i}$ for every $(a,i) \in \mathcal{P}$, together with the number of epochs $T_{a,i}$ in which the edge has been selected so far. At the beginning of (say) epoch $k$, the algorithm computes a greedy matching for the instance $\{\mathcal{P}, \mathcal{C}, b, (w(a,i))_{(a,i) \in \mathcal{P}}\}$ where $w(a,i) = \bar{r}_{a,i}/T_{a,i} + c_{a,i}$, i.e., the average empirical reward for the edge added to a suitably chosen confidence window $c_{a,i}$. The incent function (Algorithm 4, described in the supplementary material since it is a trivial function) plays each edge in the greedy matching for $\tau_k$ iterations, where $\tau_k$ increases linearly with $k$. This process is repeated for the remaining $n - p$ epochs. Prior to theoretically analyzing MG-EUCB, we return to Example 2.2 in order to provide intuition for how the algorithm overcomes correlated convergence of rewards.

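procedure MG-EUCB($\{\mathcal{P}, \mathcal{C}, b\}$, $n$)
    $\bar{r}_{a,i} \leftarrow 0$, $T_{a,i} \leftarrow 0$ for all $(a,i) \in \mathcal{P}$ & $k \leftarrow 1$
    compute $M_1, \ldots, M_p$ s.t. $\bigcup_j M_j = \mathcal{P}$    (see Supplement C.1 for details)
    for $k = 1, \ldots, p$:    (play each arm once)
        incent($M_k$) for $\tau_k$ iterations; update $\bar{r}$, $T$    (see Alg. 4 in Supplement C)
    end for
    while $k \le n$:
        $w(a,i) \leftarrow \bar{r}_{a,i}/T_{a,i} + c_{a,i}$ for all $(a,i) \in \mathcal{P}$
        $G \leftarrow$ MG($w$, $\{\mathcal{P}, \mathcal{C}, b\}$)
        incent($G$) for $\tau_k$ iterations; update $\bar{r}$, $T$
        $k \leftarrow k + 1$
    end while
end procedure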
Algorithm 2 MatchGreedy-EpochUCB
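The overall loop can be illustrated with a compact, hedged sketch. Everything named here is an assumption made for illustration: the covering rule used during initialization, the confidence window `sqrt(coef * log(k) / count)`, the `ToyEnvironment` class, and the two-agent instance (loosely modeled on Example 2.2, with invented transition matrices and rewards). It is not the authors' implementation.

```python
import math
import random

def greedy(weights, edge_class, capacity):
    """Capacitated greedy matching: scan edges by decreasing weight."""
    remaining, used_a, used_i, out = dict(capacity), set(), set(), []
    for e in sorted(weights, key=lambda e: (-weights[e], e)):
        a, i = e
        if a not in used_a and i not in used_i and remaining[edge_class[e]] > 0:
            out.append(e)
            used_a.add(a)
            used_i.add(i)
            remaining[edge_class[e]] -= 1
    return out

def mg_eucb(edges, edge_class, capacity, play_epoch, n, tau0=1, zeta=1, coef=6.0):
    total = {e: 0.0 for e in edges}  # cumulative epoch-averaged reward per edge
    count = {e: 0 for e in edges}    # epochs in which each edge was played
    uncovered = set(edges)
    for k in range(1, n + 1):
        if uncovered:
            # Initialization (assumed covering rule): favor edges not yet played.
            w = {e: 1.0 if e in uncovered else 0.0 for e in edges}
        else:
            # Average empirical reward plus an assumed UCB-style confidence window.
            w = {e: total[e] / count[e] + math.sqrt(coef * math.log(k) / count[e])
                 for e in edges}
        matching = greedy(w, edge_class, capacity)
        rewards = play_epoch(matching, tau0 + zeta * k)  # epoch length grows linearly
        for e in matching:
            total[e] += rewards[e]
            count[e] += 1
            uncovered.discard(e)
    return count

class ToyEnvironment:
    """Two-state agents; a matched agent evolves under the chain of its edge."""
    def __init__(self, transitions, rewards, seed=0):
        self.P, self.r = transitions, rewards
        self.state = {a: 0 for a, _ in transitions}
        self.rng = random.Random(seed)

    def play_epoch(self, matching, tau):
        averages = {}
        for a, i in matching:
            s, acc = self.state[a], 0.0
            for _ in range(tau):
                acc += self.r[(a, i)][s]
                s = self.rng.choices((0, 1), weights=self.P[(a, i)][s])[0]
            self.state[a] = s
            averages[(a, i)] = acc / tau  # only the epoch average is observed
        return averages

good = [[0.0, 1.0], [0.05, 0.95]]     # drifts to the rewarding state 1
reset = [[0.95, 0.05], [0.95, 0.05]]  # pushes the agent back to state 0
edges = [("a1", "i1"), ("a1", "i2"), ("a2", "i1"), ("a2", "i2")]
transitions = {edges[0]: good, edges[1]: reset, edges[2]: reset, edges[3]: good}
rewards = {edges[0]: [0.0, 1.0], edges[1]: [0.4, 0.4],
           edges[2]: [0.4, 0.4], edges[3]: [0.0, 1.0]}
edge_class = {e: e[1] for e in edges}  # hypothetical classes: one per incentive
capacity = {"i1": 1, "i2": 1}

env = ToyEnvironment(transitions, rewards)
n = 300
count = mg_eucb(edges, edge_class, capacity, env.play_epoch, n)
```

In this toy instance the edges `("a1", "i1")` and `("a2", "i2")` pay nothing in the state that the other matching resets agents to, so single-iteration plays would undervalue them. Because each epoch lasts several iterations, the agents have time to reach the rewarding state within the epoch, and the sketch ends up playing that matching in most epochs.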

Revisiting Example 2.2: Why does MG-EUCB work? In Example 2.2, the algorithm initially estimates the empirical reward of both edges of the optimal matching to be zero. However, during the UCB exploration phase, the optimal matching is played again for an epoch length greater than one, and the agents’ states move from $\theta_1$ to $\theta_2$ during the epoch. Therefore, the algorithm estimates the average reward of each edge within the epoch to be positive, and the empirical reward increases. This continues as the epoch length increases, so that eventually the empirical reward for the edges of the optimal matching exceeds that of the edges of the sub-optimal matching, and the algorithm correctly identifies the optimal matching as we move from exploration to exploitation.

In order to characterize the regret of the MG-EUCB algorithm, Proposition 4.1 implies that it is sufficient to bound the expected number of epochs in which our algorithm selects each sub-optimal edge. The following theorem presents an upper bound on this quantity.

Consider a finite set of agents and incentives with corresponding aperiodic, irreducible Markov chains for each . Let be the MG-EUCB algorithm with mixing time sequence where , , and . Then for every ,

where , and is a constant specific to edge .

The full proof of the theorem can be found in the supplementary material.


Proof (sketch). There are three key ingredients to the proof: (i) linearly increasing epoch lengths, (ii) overcoming cascading errors, and (iii) application of the Azuma-Hoeffding concentration inequality.

By increasing the epoch length linearly, MG-EUCB ensures that as the algorithm converges to the optimal matching, it also plays each arm for a longer duration within an epoch. This helps the algorithm to progressively discard sub-optimal arms without selecting them too many times when the epoch length is still small. At the same time, the epoch length is long enough to allow for sufficient mixing and separation between multiple near-optimal matchings. If we fix the epoch length as a constant, the resulting regret bounds are considerably worse because the agent states may never converge to the steady-state distributions.

To address cascading errors, we provide a useful characterization. For a given , suppose that refers to the average empirical reward obtained from edge up to epoch plus the upper confidence bound parameter, given that edge has been selected for exactly times in epochs to . For any given epoch where the algorithm selects a sub-optimal matching, i.e., , we can apply Lemma 3.2 and get that at least one of the following conditions must be true:


Figure 3: Bike-share Experiments. Panels (a) Static Demand and (b) Random Demand compare the efficiency (percentage of demand satisfied) of the bike-share system under the two demand models when the incentive matchings are selected by MG-EUCB+; the upper and lower bounds are given by the system performance when the incentives are computed via the benchmark greedy matching that uses the state information and when no incentives are offered, respectively. Panel (c) plots the mean reward of the MG-EUCB+ algorithm with static and random demand, which gives the expected number of agents who accept an incentive within each epoch.

This is a particularly useful characterization because it maps the selection of each sub-optimal edge to a familiar condition that compares the empirical rewards to the stationary rewards. Therefore, once each arm is selected for epochs, the empirical rewards approach the ‘true’ rewards and our algorithm discards sub-optimal edges. Mathematically, this can be written as

where is some carefully chosen constant, and .

With this characterization, for each , we find an upper bound on the probability of the event . However, this is a non-trivial task since the reward obtained in any given epoch is not independent of the previous actions. Specifically, the underlying Markov process that generates the rewards is common across the edges connected to any given agent, in the sense that the initial distribution for each Markov chain that results from pulling an edge is the distribution at the end of the preceding pull. Therefore, we employ the Azuma-Hoeffding inequality (Azuma, 1967; Hoeffding, 1963), a concentration inequality that does not require independence in the arm-based observed rewards. Moreover, unlike the classical UCB analysis, the empirical reward can differ from the expected stationary reward due to the distributions and . To account for this additional error term, we use bounds on the convergence rates of Markov chains to guide the choice of the confidence parameter in Algorithm 2. Applying the Azuma-Hoeffding inequality, we can show that with high probability, the difference between the empirical reward and the stationary reward of edge is no larger than . ∎
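For reference, a standard statement of the Azuma-Hoeffding inequality is sketched below. The symbols $X_k$, $c_k$, $N$, and $\epsilon$ are generic and are not the paper's notation: $(X_k)_{k \ge 0}$ is assumed to be a martingale with bounded increments.

```latex
\[
\Pr\bigl(\lvert X_N - X_0 \rvert \ge \epsilon\bigr)
\;\le\; 2\exp\!\left(-\frac{\epsilon^{2}}{2\sum_{k=1}^{N} c_k^{2}}\right),
\qquad \text{whenever } \lvert X_k - X_{k-1} \rvert \le c_k \text{ almost surely for all } k .
\]
```

If every increment is bounded by the same constant $c$, setting the right-hand side equal to $\delta$ gives $\epsilon = c\sqrt{2N\ln(2/\delta)}$, so the averaged deviation $\epsilon/N$ decays at rate $\sqrt{\ln(2/\delta)/N}$. This is the usual source of the square-root-of-log-over-count shape of UCB confidence parameters.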

As a direct consequence of Proposition 4.1 and Theorem 4.2, we get that for a fixed instance, the regret only increases logarithmically with the number of epochs.

5 Experiments

In this section, we present a set of illustrative experiments with our algorithm (MG-EUCB) on synthetic and real data. We observe much faster convergence with the greedy matching than with the Hungarian algorithm. Moreover, as is typical in the bandit literature (e.g., Auer et al., 2002), we show that a tuned version of our algorithm (MG-EUCB+), in which we reduce the coefficient on the term in the UCB ‘confidence parameter’ from six to three, further improves convergence. Finally, we show that our algorithm can be used effectively as an incentive design scheme to improve the performance of a bike-share system.
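The effect of the tuned coefficient can be sketched as follows. This assumes the confidence parameter has the familiar UCB shape sqrt(c · ln(t) / T), where t is the number of rounds so far and T the number of times the edge has been selected; the exact expression used in Algorithm 2 is not reproduced here, and the function names are illustrative only.

```python
import math


def confidence_radius(total_rounds, pulls, coefficient=6.0):
    """Assumed UCB-style radius sqrt(coefficient * ln(total_rounds) / pulls)."""
    if pulls <= 0:
        # An edge that has never been selected gets an unbounded bonus.
        return math.inf
    return math.sqrt(coefficient * math.log(total_rounds) / pulls)


def ucb_index(mean_reward, total_rounds, pulls, coefficient=6.0):
    """Empirical mean plus the confidence radius."""
    return mean_reward + confidence_radius(total_rounds, pulls, coefficient)


if __name__ == "__main__":
    for c in (6.0, 3.0):
        print(c, ucb_index(0.5, total_rounds=100, pulls=10, coefficient=c))
```

Halving the coefficient shrinks the radius by a factor of sqrt(2) for every edge, so the index leans more on the empirical mean and exploration tapers off sooner. This is a common practical tuning, though the analysis typically relies on the larger constant.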

5.1 Synthetic Experiments

We first highlight the failure of classical UCB approaches (C-UCB), e.g., as in Gai et al. (2011), for problems with correlated reward evolution. In Figure (a), we demonstrate that C-UCB converges almost immediately to a suboptimal solution, while this is not the case for our algorithm (MG-EUCB+). In Figure (b), we compare MG-EUCB and MG-EUCB+ with a variant of Algorithm 2 that uses the Hungarian method (H-EUCB) for matchings. While H-EUCB does have a ‘marginally’ higher mean reward, Figure (c) reveals that the MG-EUCB and MG-EUCB+ algorithms converge much faster to the optimum solution of the greedy matching than the Hungarian alternatives.
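The trade-off between the greedy and the optimal (Hungarian-style) matching can be seen on a toy instance. The sketch below is a simplified one-to-one greedy matching without the class capacity constraints handled in the paper, and the optimal matching is found by brute force rather than by the Hungarian method; the weights and names are made up for illustration.

```python
from itertools import permutations


def greedy_matching(weights):
    """Scan edges by decreasing weight; keep an edge if both endpoints are free."""
    matching, used_incentives = {}, set()
    for (agent, incentive), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        if agent not in matching and incentive not in used_incentives:
            matching[agent] = incentive
            used_incentives.add(incentive)
    return matching


def matching_value(matching, weights):
    """Total weight of a matching given as an agent -> incentive mapping."""
    return sum(weights.get((a, i), 0.0) for a, i in matching.items())


def optimal_matching(weights):
    """Brute-force maximum-weight one-to-one matching (tiny instances only)."""
    agents = sorted({a for a, _ in weights})
    incentives = sorted({i for _, i in weights})
    best = max(
        permutations(incentives, len(agents)),
        key=lambda perm: sum(weights.get((a, i), 0.0) for a, i in zip(agents, perm)),
    )
    return dict(zip(agents, best))


# Hypothetical stationary rewards for two agents and two incentives.
WEIGHTS = {("A", "i1"): 0.9, ("A", "i2"): 0.8, ("B", "i1"): 0.85, ("B", "i2"): 0.7}

if __name__ == "__main__":
    g, o = greedy_matching(WEIGHTS), optimal_matching(WEIGHTS)
    print(g, matching_value(g, WEIGHTS))
    print(o, matching_value(o, WEIGHTS))
```

On this instance the greedy matching commits to the single heaviest edge and ends slightly below the optimum, which mirrors the small reward gap reported for H-EUCB. In exchange, greedy only needs the ordering of the edge weights to be right, which is plausibly why it stabilizes faster under noisy estimates.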

5.2 Bike-Share Experiments

In this problem, we seek to incentivize participants in a bike-sharing system; our goal is to alter their intended destination in order to balance the spatial supply of available bikes appropriately and meet future user demand. We use data from the Boston-based bike-sharing service Hubway (Hubway) to construct the example. Formally, we consider matching each agent to an incentive , meaning the algorithm proposes that agent travel to station as opposed to its intended destination (potentially, for some monetary benefit). The agent’s state controls the probability of accepting the incentive by means of a distance threshold parameter and a parameter of a Bernoulli distribution, both of which are drawn uniformly at random. More details on the data and problem setup can be found in Section D of the supplementary material.
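One way to read the acceptance model is sketched below. It is an interpretation under stated assumptions, not the setup from the supplementary material: the agent rejects any proposed station farther than its distance threshold and otherwise accepts with its Bernoulli parameter. The class name, the `max_distance` range, and the use of a detour distance are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class AgentState:
    distance_threshold: float  # largest detour the agent will consider
    accept_prob: float         # Bernoulli parameter applied within the threshold


def sample_state(rng, max_distance=2.0):
    """Draw both parameters uniformly at random (the range is an assumption)."""
    return AgentState(rng.uniform(0.0, max_distance), rng.uniform(0.0, 1.0))


def accepts(state, detour_distance, rng):
    """Reject beyond the threshold; otherwise accept with probability accept_prob."""
    if detour_distance > state.distance_threshold:
        return False
    return rng.random() < state.accept_prob


if __name__ == "__main__":
    rng = random.Random(0)
    state = sample_state(rng)
    offers = [accepts(state, detour_distance=0.5, rng=rng) for _ in range(1000)]
    print(state, sum(offers) / len(offers))
```

Under this reading, the expected reward of an edge is the acceptance probability averaged over the agent's state distribution, which is the quantity the bandit algorithm has to learn.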

Our bike-share simulations presented in Figure 3 show approximately a % improvement in system performance when compared to an environment without incentives and convergence towards an upper bound on system performance. Moreover, our algorithm achieves this significant performance increase while on average matching less than % of users in the system to an incentive.

6 Conclusion

We combine ideas from greedy matching, the UCB multi-armed bandit strategy, and the theory of Markov chain mixing times to propose a bandit algorithm for matching incentives to users, whose preferences are unknown a priori and evolving dynamically in time, in a resource constrained environment. For this algorithm, we derive logarithmic gap-dependent regret bounds despite the additional technical challenges of cascading sub-optimality and correlated convergence. Finally, we demonstrate the empirical performance via examples.


This work is supported by NSF Awards CNS-1736582 and CNS-1656689. T. Fiez was also supported in part by an NDSEG Fellowship.


  • Hubway. Hubway: Metro-Boston’s bikeshare program. [available online:].
  • Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235–256, May 2002. doi: 10.1023/A:1013689704352.
  • Azar et al. (2013) M. G. Azar, A. Lazaric, and E. Brunskill. Regret bounds for reinforcement learning with policy advice. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 97–112, 2013.
  • Azuma (1967) K. Azuma. Weighted sums of certain dependent random variables. Tohoku Math. J., 19(3):357–367, 1967. doi: 10.2748/tmj/1178243286.
  • Badanidiyuru et al. (2013) A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. In Proc. 54th Annual IEEE Symp. Foundations of Computer Science, pages 207–216, 2013.
  • Chen et al. (2016) W. Chen, Y. Wang, Y. Yuan, and Q. Wang. Combinatorial multi-armed bandit and its extension to probabilistically triggered arms. J. Machine Learning Research, 17:50:1–50:33, 2016. URL
  • Fill (1991) J. Fill. Eigenvalue bounds on convergence to stationarity for nonreversible Markov chains, with an application to the exclusion process. Ann. Appl. Probab., 1(1):62–87, 1991.
  • Folland (2007) G. Folland. Real Analysis. Wiley, 2nd edition, 2007.
  • Gai et al. (2011) Y. Gai, B. Krishnamachari, and M. Liu. On the combinatorial multi-armed bandit problem with Markovian rewards. In Proc. Global Communications Conf., pages 1–6, 2011. doi: 10.1109/GLOCOM.2011.6134244.
  • Garey and Johnson (1979) M. R. Garey and David S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. ISBN 0-7167-1044-7.
  • Ghosh and Hummel (2013) A. Ghosh and P. Hummel. Learning and incentives in user-generated content: multi-armed bandits with endogenous arms. In Proc. of ITCS 2013, pages 233–246, 2013.
  • Hoeffding (1963) W. Hoeffding. Probability inequalities for sums of bounded random variables. J. American Statistical Association, 58(301):13–30, 1963. doi: 10.2307/2282952.
  • Immorlica et al. (2015) Nicole Immorlica, Gregory Stoddard, and Vasilis Syrgkanis. Social status and badge design. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 473–483, 2015.
  • Jain et al. (2014) S. Jain, B. Narayanaswamy, and Y. Narahari. A multiarmed bandit incentive mechanism for crowdsourcing demand response in smart grids. In Proc. of AAAI 2014, pages 721–727, 2014.
  • Jaksch et al. (2010) T. Jaksch, R. Ortner, and P. Auer. Near-optimal Regret Bounds for Reinforcement Learning. J. Machine Learning Research, 11:1563–1600, 2010.
  • Kuhn (1955) H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics, 2(1-2):83–97, 1955.
  • Kveton et al. (2014) B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: Fast combinatorial optimization with learning. In Proc. of UAI 2014, pages 420–429, 2014.
  • Kveton et al. (2015) Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. In Artificial Intelligence and Statistics, pages 535–543, 2015.
  • Levin et al. (2009) D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2009.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proc. 19th Intern. Conf. World Wide Web, pages 661–670, 2010.
  • Lu et al. (2010) T. Lu, D. Pál, and M. Pál. Contextual multi-armed bandits. In Proc. of AISTATS 2010, pages 485–492, 2010.
  • Mazumdar et al. (2017) E. Mazumdar, R. Dong, V. Rúbies Royo, C. Tomlin, and S. S. Sastry. A Multi-Armed Bandit Approach for Online Expert Selection in Markov Decision Processes. arXiv:1707.05714, 2017.
  • Mehta and Mirrokni (2011) A. Mehta and V. Mirrokni. Online ad serving: Theory and practice, 2011.
  • Ratliff et al. (2018) L. J. Ratliff, S. Sekar, L. Zheng, and T. Fiez. Incentives in the dark: Multi-armed bandits for evolving users with unknown type. arXiv, 2018.
  • Sani et al. (2012) Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Proc. of NIPS 2012, pages 3284–3292, 2012.
  • Scott (2015) S. L. Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
  • Singla et al. (2015) A. Singla, M. Santoni, G. Bartók, P. Mukerji, M. Meenen, and Andreas Krause. Incentivizing users for balancing bike sharing systems. In Proc. of AAAI 2015, pages 723–729, 2015.
  • Tekin and Liu (2010) Cem Tekin and Mingyan Liu. Online algorithms for the multi-armed bandit problem with Markovian rewards. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pages 1675–1682. IEEE, 2010.
  • Tekin and Liu (2012) Cem Tekin and Mingyan Liu. Online Learning of Rested and Restless Bandits. IEEE Transactions on Information Theory, 58(8):5588–5611, 2012.
  • Tran-Thanh et al. (2014) L. Tran-Thanh, S. Stein, A. Rogers, and N. R. Jennings. Efficient crowdsourcing of unknown experts using bounded multi-armed bandits. Artif. Intell., 214:89–111, 2014.
  • Wen et al. (2015) Zheng Wen, Branislav Kveton, and Azin Ashkan. Efficient learning in large-scale combinatorial semi-bandits. In International Conference on Machine Learning, pages 1113–1122, 2015.

Appendix A Notational Table

notation meaning
set of agents
set of incentives
allowed agent-incentive pairs
state (type) space of agent
transition probability kernel
agent ’s type distribution at epoch
stationary distribution of
expected reward from
number of iterations matching offered in epoch , ,
random reward
agent ’s reward distribution
time-averaged reward during epoch
maximum number of edges of class
greedy matching on weights
the edge having the –th largest weight in
incentive agent is matched to in
set of that become infeasible when is added to matching but not before that
set of edges such that
number of agents & incentives
the total number of epochs
state of agent at the beginning of epoch
constants specific to each edge
regret of given matching policy at the end of epochs
number of times edge selected in first epochs
reward on edge when selected for the –th time given
average reward on first times is selected, i.e.,
agent ’s state at the beginning of epoch