Decentralized Competing Bandits in Non-Stationary Matching Markets

by   Avishek Ghosh, et al.

Understanding complex dynamics of two-sided online matching markets, where the demand-side agents compete to match with the supply-side (arms), has recently received substantial interest. To that end, in this paper, we introduce the framework of decentralized two-sided matching markets under non-stationary (dynamic) environments. We adhere to the serial dictatorship setting, where the demand-side agents have unknown and different preferences over the supply-side (arms), but the arms have fixed and known preferences over the agents. We propose and analyze a decentralized and asynchronous learning algorithm, namely Decentralized Non-stationary Competing Bandits (DNCB), where the agents play (restrictive) successive elimination type learning algorithms to learn their preferences over the arms. The complexity in understanding such a system stems from the fact that the competing bandits choose their actions in an asynchronous fashion, and the lower ranked agents only get to learn from a set of arms not dominated by the higher ranked agents, which leads to forced exploration. With carefully defined complexity parameters, we characterize this forced exploration and obtain sub-linear (logarithmic) regret of . Furthermore, we validate our theoretical findings via experiments.


1 Introduction

Repeated decision making by multiple agents in a competitive and uncertain environment is a key characteristic of modern day, two sided markets, e.g., TaskRabbit, UpWork, DoorDash, etc. Agents often act in a decentralized fashion on these platforms, and understanding the induced dynamics is an important step before designing policies around how to operate such platforms to maximize various system objectives such as revenue, efficiency and equity of allocations (Johari et al., 2021; Liu et al., 2020). A body of recent work is aimed at understanding the decentralized learning dynamics in such matching markets Sankararaman et al. (2021); Liu et al. (2020, 2021); Dai and Jordan (2021a, b); Basu et al. (2021). This line of work studies the matching markets introduced first by the seminal work of Gale and Shapley (1962), under the assumption where the participants are not aware of their preference and learn it over time by participating in the market. A key assumption made in these studies is that the true preferences of the market participants are static over time, and thus can be learnt with repeated interactions.

Markets, however, are seldom stationary and are continuously evolving. Indeed, an active area of research in management sciences and operations research revolves around understanding the equilibrium properties of such evolving markets Damiano and Lam (2005); Akbarpour et al. (2020); Kurino (2020); Johari et al. (2021). However, a central premise in this line of work is that the participants have exact knowledge of their preferences, and only need to optimize over other agents’ competitive behaviour and future changes that may occur. In this work, we take a step towards bridging the two aforementioned lines of work. To be precise, we study the learning dynamics in markets where the participants do not know their exact preferences and the unknown preferences themselves vary smoothly over time.

Conceptually, the seemingly simple addition of varying preferences invalidates the core premise of learning algorithms in a stationary environment (such as those in Liu et al. (2021); Sankararaman et al. (2021)), where learning is guaranteed to get better with time as more samples can potentially be collected. In a dynamic environment, agents additionally need to trade off collecting more samples (by competing with other agents) to refine an estimate against the possibility that the quantity being estimated is stale and thus no longer meaningful.

Model Overview: The model we study consists of agents and resources or arms, where the agents repeatedly decide which arm to match with over a time horizon of . The agents are globally ranked from through . The agents are initially assumed to not know their rank. In each round, every agent chooses one of the arms to match with. Every arm that has one or more agents requesting a match allocates itself to the highest ranking agent requesting a match (footnote 1: if , then agent with rank is said to be higher ranked than agent ), while blocking all other requesting agents. If at time , agent is matched to arm , then agent sees a random reward, independent of everything else, with mean . The agents that are blocked are notified of being blocked and receive 0 reward. Moreover, the agents are decentralized, i.e., each agent's decision on which arm to match with is a function only of the history of the arms chosen, arms matched and rewards obtained at that agent.
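The allocation rule of one round can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `resolve_round`, the 0-based indexing, and the use of `None` to mark a blocked agent are hypothetical choices, not notation from the paper.

```python
from typing import Dict, List, Optional


def resolve_round(requests: List[int]) -> List[Optional[int]]:
    """Resolve one round of requests under serial dictatorship.

    requests[i] is the arm requested by the agent at index i, where index 0
    is the highest-ranked agent (an indexing assumption of this sketch).
    Returns, for each agent, the matched arm, or None if it was blocked.
    """
    winner: Dict[int, int] = {}
    for agent, arm in enumerate(requests):
        # Agents are visited from highest to lowest rank, so the first agent
        # seen for an arm is the one the arm allocates itself to.
        winner.setdefault(arm, agent)
    return [arm if winner[arm] == agent else None
            for agent, arm in enumerate(requests)]


# The two top-ranked agents both request arm 2; the lower-ranked one is blocked.
print(resolve_round([2, 2, 0]))  # [2, None, 0]
```

A blocked agent (a `None` entry) would then receive 0 reward for that round, while each matched agent draws a reward from its own time-varying distribution.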

The key departure from the prior works of Liu et al. (2020, 2021); Sankararaman et al. (2021) is that the unknown arm-mean between any agent and arm is time-varying, i.e., the mean is dependent on time . We call our model smoothly varying because we impose the constraint that for all agents and arms , and time , , for some known parameter . However, we make no assumptions on the synchronicity of the markets, i.e., the environments of different agents can change arbitrarily, with the only constraint being that the mean of any arm-agent pair does not change by more than in one time-step.
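To make the smoothness constraint concrete, the sketch below generates one arm-mean trajectory whose one-step change is bounded, and then checks that bound. The helper names, the random-walk dynamics, and the assumption that means lie in [0, 1] are illustrative choices for this sketch only; the model itself only requires the per-step bound.

```python
import random
from typing import List, Sequence


def drifting_means(horizon: int, delta: float, start: float = 0.5,
                   seed: int = 0) -> List[float]:
    """Random-walk trajectory whose one-step change is at most delta."""
    rng = random.Random(seed)
    means = [start]
    for _ in range(horizon - 1):
        step = rng.uniform(-delta, delta)
        # Clipping to [0, 1] can only shrink a step, never enlarge it.
        means.append(min(1.0, max(0.0, means[-1] + step)))
    return means


def max_one_step_drift(means: Sequence[float]) -> float:
    """Largest absolute change between consecutive time steps."""
    return max((abs(b - a) for a, b in zip(means, means[1:])), default=0.0)


def is_smoothly_varying(means: Sequence[float], delta: float,
                        tol: float = 1e-12) -> bool:
    """Check the per-step drift bound, with a small floating-point tolerance."""
    return max_one_step_drift(means) <= delta + tol


trajectory = drifting_means(horizon=1000, delta=0.01)
print(is_smoothly_varying(trajectory, 0.01))   # True
print(is_smoothly_varying([0.2, 0.5], 0.01))   # False
```

Because no synchronicity is assumed, each agent-arm pair could be given its own independently generated trajectory of this kind.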

Why is this model challenging? Even in the single agent case without competition, algorithms such as UCB Auer et al. (2002) perform poorly in smoothly varying environments compared to algorithms such as SnoozeIT Krishnamurthy and Gopalan (2021) that adapt to the varying arm means (cf. Figure 9(a)). The reason is that stationary algorithms such as UCB weigh all the samples collected thus far equally when identifying which arm to pull, while adaptive algorithms such as SnoozeIT weigh recent samples more than older samples in order to estimate the arm-mean at the current time point. This is exacerbated in a multi-agent competitive setup, where agents need to decide whether to pull an arm that yielded good results in the past but is facing higher competition at present.
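The effect of weighing all samples equally can be seen in a noise-free toy calculation. The sketch below is neither UCB nor SnoozeIT: it only compares a full-history average with a fixed-length recent-window average when the mean drifts upward by a constant amount per step. The function name and the linear drift are assumptions made for illustration.

```python
from typing import Tuple


def tracking_errors(delta: float, horizon: int, window: int) -> Tuple[float, float]:
    """Error of two estimators of the current mean under linear drift.

    The mean at step t is delta * t and samples are noise-free, so any
    error is purely due to staleness of old samples.
    """
    samples = [delta * t for t in range(horizon)]
    current_mean = samples[-1]
    full_average = sum(samples) / len(samples)      # all samples weighted equally
    recent = samples[-window:]
    window_average = sum(recent) / len(recent)      # only the last `window` samples
    return abs(current_mean - full_average), abs(current_mean - window_average)


full_err, window_err = tracking_errors(delta=0.001, horizon=1001, window=101)
# The full-history error is about 0.5, the windowed error about 0.05.
print(f"full history: {full_err:.3f}, recent window: {window_err:.3f}")
```

With noisy rewards a shorter window reduces staleness but increases variance, which is the trade-off that adaptive-window methods are designed to balance.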

We circumvent this problem by introducing the idea of forced exploration. Since the environments across agents are time-varying, possibly asynchronously, a lower ranked agent may be forced to explore and incur linear regret if any of the higher ranked agents is exploring. To build intuition, consider a 2 agent system in which the higher ranked agent is called Agent 1, and the other agent is Agent 2. Suppose Agent 1's environment (i.e., its arm-means) is volatile, with a small gap between the best and second best arm, while Agent 2 has a more benign environment, where all arm-means are well separated and do not vary with time. In this case, Agent 1 will be forced to explore arms a lot, as its environment is fluctuating with no clear best arm emerging. Since any collision implies that Agent 2 will not receive a reward, Agent 2 is also forced to explore and play sub-optimal arms to evade collision, even when it knows its own best arm. This phenomenon also occurs in the stationary setting; there, however, every agent knows that after an initial exploration time, all agents will “settle” down and find their best arm. This is the concept of freezing time in Sankararaman et al. (2021); Basu et al. (2021). In the dynamic setting, however, the forced explorations can keep occurring repeatedly over time, as the agents' environments change.

1.1 Our Contributions

1.1.1 Algorithms

We introduce a learning algorithm, DNCB, in which agents proceed in phases with asynchronous start and end-points. In each phase, agents explore among those arms that are not currently preferred by higher-ranked agents, and subsequently exploit a good arm for a dynamic duration of time during which the estimated best arm remains optimal. The main algorithmic innovation is to identify that the static synchronous arm-deletion strategy of UCB-D3 Sankararaman et al. (2021) can be coupled with SnoozeIT to yield a dynamic, asynchronous explore-exploit type algorithm for non-stationary bandits.

1.1.2 Technical novelty

In order to analyze and prove that DNCB yields good regret guarantees, we introduce the notion of forced exploration. Roughly speaking, this is the regret incurred due to exploration of an agent when the higher ranked agents are exploring. This extra regret is a consequence of the serial dictatorship (which we define in Section 3), whereby agents can incur collisions and not get any reward. Although agents in the stationary setting also incur forced exploration, its effect is bounded since every agent can eventually guarantee that the best arm can be learnt. However, in an asynchronously varying environment, bounding this term is non-trivial. We circumvent this by decomposing the forced exploration of an agent recursively; an agent ranked effectively explores if either its own environment is fluctuating, so that its best arm is hard to identify, or if the agent ranked is effectively exploring. We leverage this to recursively bound the regret of agent ranked as a function of agent ranked . Unravelling this recursion yields the final regret.

1.1.3 Experiments

We empirically validate our algorithm to demonstrate that (i) it is simple to implement and (ii) its results match the theoretical insights, such as agents incurring additional regret due to forced explorations.

One criticism of our model is that the agents are aware of the drift parameter , which is used in the algorithm. We argue, however, that even in the presence of this known parameter, designing decentralized algorithms is challenging and requires several technical novelties. Parameter free algorithms that do not require any knowledge of this parameter are unknown even for the single agent bandit problem Krishnamurthy and Gopalan (2021). Designing parameter free algorithms in the multi-agent case is more challenging and is left to future work.

2 Related work

Bandits and Matching Markets

Bandits and matching markets have received a lot of attention lately, owing to both their mathematical non-triviality and the enormous practical impact they hold. Regret minimization in matching markets was first introduced in Liu et al. (2020), which studied the much simpler problem of stationary markets under a centralized scheme, where a central entity matches agents with arms at each time. They showed that under this policy, a learning algorithm can get per-agent regret scaling as . Subsequently, Sankararaman et al. (2021) studied the decentralized version of the problem under the serial dictatorship and proposed the UCB-D3 algorithm that achieved per-agent regret. Subsequently, Liu et al. (2021) proposed CA-UCB, a fully decentralized algorithm that could achieve per-agent regret in the general decentralized stationary markets. Matching markets have been an active area of study in combinatorics and theoretical computer science due to the algebraic structures they present Pittel (1989); Roth and Vate (1990); Knuth (1997). However, these works consider the equilibrium structure and not the learning dynamics induced when participants do not know their preferences.

Non-Stationary Bandits

The framework of non-stationary bandits was introduced in Whittle (1988) with restless bandits. There has been a line of interesting work in this domain. For example, in Garivier and Moulines (2011); Auer et al. (2019); Liu et al. (2018), the abruptly changing setup is analyzed, and change point based detection methods are employed. Furthermore, in Besbes et al. (2014), a total variation budgeted setting is considered, where the total amount of (temporal) variation is known. On the other hand, Wei and Srivastava (2018); Krishnamurthy and Gopalan (2021) focus on the smoothly varying non-stationary environment. Note that Wei and Srivastava (2018) modify the sliding window UCB algorithm of Garivier and Moulines (2011) and employ windows of growing size. On the other hand, very recently Krishnamurthy and Gopalan (2021) analyzed the smoothly varying framework by designing windows of dynamic length and testing for optimality within a sliding window.


Notation: For a positive integer , we denote the set by . Moreover, for integers , the notation denotes the remainder (modulo) operation.

3 Problem Setup

We consider the standard setup with agents and arms, with . At time , every agent has a ranking of the arms, which is dictated by the arm means . On the other hand, it is assumed that the agents are ranked homogeneously for all the arms, and the ranking is known to the arms. This is called the serial dictatorship model, a well studied model in market economics (see Abdulkadiroğlu and Sönmez (1998); Sankararaman et al. (2021)). Without loss of generality, it is assumed that the rank of agent is . We say agent is matched to arm at time , if agent pulls and receives (non zero) reward from arm . Our goal here is to find the unique stable matching (uniqueness ensured by the serial dictatorship model) between the agents and the arm side in a non-stationary (dynamic) environment. We consider the smoothly varying framework of Wei and Srivastava (2018); Krishnamurthy and Gopalan (2021) to model the non-stationarity, which assumes for all , and the maximum drift is .

We write as the arm preferred by the Agent ranked at time , i.e., . Similarly, for Agent ranked , the preferred arm is given by . So, we see that forms a stable match, and so does for . Let be the arm played by an algorithm . The regret of agent playing algorithm up to time is given by where indicates whether arm is matched.

4 Warm-up: DNCB with 2 agents

  Input: Horizon , drift limit
  Initialize set of tuples , , Initialize episode index ,
  for  do
     Pull-Arm by Agent 1:
     if , pull arm (round robin) (Explore); if s.t. , play arm (Exploit)
     Test by Agent 1:if arm and s.t. for then, ,
      if , , Updates black-board with
      else , ,  
     Release arm by Agent 1:
      if then , release arm
     Pull Arm by Agent 2:
     Case I: if Agent 1 is not committed, pull arm (round robin on ) (Explore ALL)
     Case II: if Agent 1 is committed to arm and , pull -th smallest arm id in (round robin on ) (Explore-j)
     Case III: if Agent 1 is committed to arm , and s.t. , play arm (Exploit) if then, ,
     Test by Agent 2:
     for  s.t. Agent 2 is in Explore-j or Explore ALL do
        if  arm and s.t. for  then
           , ; define
            if , Update
        end if
     end for
     Release arms for Agent 2:
     for  do
         if then,
     end for
  end for
Algorithm 1 DNCB with 2 agents

We now propose and analyze the algorithm Decentralized Non-stationary Competing Bandits (DNCB), which handles the competitive nature of a market framework under a smoothly varying non-stationary environment. To understand the algorithm better, we first present the setup with 2 agents and arms, and then in Section 5, we generalize this to agents.

We consider 2 agents, since this is the simplest non-trivial setup in which to gain intuition about the competitive nature of the DNCB algorithm. Without loss of generality, assume that agent has rank , where . So, in the above setup, Agent 1 is the highest ranked agent.

Black Board model:

Moreover, to begin with and for simplicity, we assume a black-board model, and later, in Section 6, remove the necessity of this black board. We emphasize that the black-board model of communication is quite standard in centralized multi-agent systems, with applications in game theory, distributed learning and auctions Awerbuch and Kleinberg (2008); Buccapatnam et al. (2015); Agarwal et al. (2012). Through this black-board, the agents can communicate with one another. This is equivalent to broadcasting in the centralized framework.

The learning algorithm is presented in Algorithm 1. The algorithm runs over several episodes indexed by and for Agent 1 and 2 respectively.

Rank Estimation ():

We let both agents pull arm 1 in the first time slot. Agent 1 will see a (non-zero) reward, and hence estimates its rank to be 1. The other agent will see a 0 reward, so it estimates its rank as 2.

Agent 1:

Since Agent 1 is highest ranked agent, it does not face any collision. It plays the well-known and standard Successive Elimination (SE) type algorithm (see Slivkins (2019)). As mentioned in Section 1, we use a variation of SnoozeIT algorithm of Krishnamurthy and Gopalan (2021) with arms. Specifically, it (a) first explores to identify if there is a best arm and (b) if it finds a best arm, it commits to that for some amount of time. Note that with non-stationary environment, Agent 1 needs to repeat this procedure over time. In Figure 3, we consider one episode of Agent 1, where the yellow segments indicate the exploration time, and at the end of that, the purple segment indicates the commit (exploitation) (to say arm ) time. Furthermore, when Agent 1 commits, it writes the arm on which it is committing and the duration of the commit to the black-board, so that Agent 2 can accordingly choose actions from a restricted set of arms to avoid collision. Note that, there is no competition here, and the (interesting) market aspect is absent.

We now define an optimality test via which Agent 1 (and 2) decides to commit. Let denote the empirical reward mean of arm at time , based on its last pulls.

Definition 1

(-optimality) At time , an arm is said to be optimal with respect to set , if , for all , where , and .

Since Agent 1 faces no competition, (the set of all arms), but will be different for Agent 2, as we will see shortly. In Algorithm 1, we denote as the duration of the exploration period before the test succeeds (with at episode ), and we use to denote the starting of epochs. After the test, the agent exploits the best arm for time, and then releases it. We define the set to determine whether Agent 1 should commit or continue exploring.

Figure 3: Actions of Agents 1 and 2 in a matching market

Agent 2:

The actions of Agent 2 bear out the competition (market) aspect of the problem, as seen in Figure 3. We now explain its different phases:

Explore ALL: Here, Agent 2 explores all the arms, i.e., plays in a round robin fashion within the arm-set . This is shown in light green in Figure 3. This happens when Agent 1 is also exploring and has not committed yet.

Explore-j: This is shown in dark-green in Figure 3. Here, Agent 2 explores within the set . In the figure, . This is done to avoid collision for Agent 2, since we know that when Agent 1 commits to arm , Agent 2 will get 0 reward while pulling , and hence it is in its best interest to explore all but .

Forced Exploration: Consider Scenario 3 of Figure 3(b). Here, Agent 2 has decided to commit to an arm before Agent 1. However, it cannot start to exploit since Agent 1 is still exploring. Otherwise, it will periodically face collisions (and get 0 reward, hence incurring linear regret in this duration). This is the additional exploration faced by Agent 2, which we term forced exploration (shown in blue in Figure 3(b)). In Theorem 1, we characterize the regret that stems from this forced exploration.

Exploit: Observe that Agent 2 only gets to commit when Agent 1 has committed already, on the arm set . There is another caveat here. We also restrict Agent 2 to end its exploitation as soon as the exploitation of Agent 1 ends. The reasoning is the same: Agent 1 starts exploring right after its exploitation, and Agent 2 must release the arm it was exploiting to avoid collision. Note that this also results in higher regret for Agent 2, as it does not get to fully exploit the arm it was committed to.

In Algorithm 1, we denote as the time instances where an epoch starts for Agent 2. We denote by the set of arms from which agent plays in phase . Observe from Figure 3 that in any given phase of Agent , the set of arms it plays from is fixed. Moreover, we use the notation to denote the state of Agent 2, and as explained in Algorithm 1, , where the terms are explained above. Furthermore, we define as the duration of the exploration period before the test succeeds with in epoch . We introduce to ensure that the exploitation of Agent 2 expires as soon as the exploitation ends for Agent 1.

Saving extra exploration: Note that Agent 2 continues to test for an optimal arm even when Agent 1 is exploring. It might seem to be wasteful at first since it cannot commit immediately. However, Agent 2 constructs the sets , which denote the exploitation period of Agent 2, without arm in the system. This is useful because, as soon as Agent 1 commits to arm and is non-empty, Agent 2 gets to commit leveraging this test. This saves extra exploration for Agent 2 and hence reduces regret.
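The phase logic of Agent 2 described above can be summarized in a short sketch. This is a simplified illustration, not Algorithm 1 itself: the statistical test and the exploitation durations are abstracted away, and the function name `agent2_action`, the dictionary `candidates`, and the round-robin offset are all hypothetical choices made here.

```python
from typing import Dict, Optional, Tuple


def agent2_action(t: int, num_arms: int, agent1_arm: Optional[int],
                  candidates: Dict[int, int]) -> Tuple[str, int]:
    """Return (phase, arm) for Agent 2 at round t.

    agent1_arm: arm Agent 1 is committed to, or None while it explores.
    candidates: maps an excluded arm j to the arm that already passed
                Agent 2's test on the arm set without j (test abstracted away).
    """
    all_arms = list(range(num_arms))
    if agent1_arm is None:
        # Agent 1 is exploring, so Agent 2 cannot commit even if a test passed.
        phase = "forced-exploration" if candidates else "explore-all"
        # Assumption of this sketch: offset the round robin by one step.
        return phase, all_arms[(t + 1) % num_arms]
    if agent1_arm in candidates:
        # A test without arm j already succeeded: commit immediately.
        return "exploit", candidates[agent1_arm]
    allowed = [a for a in all_arms if a != agent1_arm]
    return "explore-j", allowed[t % len(allowed)]


print(agent2_action(0, 3, None, {}))       # ('explore-all', 1)
print(agent2_action(0, 3, None, {2: 0}))   # ('forced-exploration', 1)
print(agent2_action(5, 3, 2, {2: 0}))      # ('exploit', 0)
print(agent2_action(5, 3, 2, {}))          # ('explore-j', 1)
```

The second and third calls illustrate the saving of extra exploration: a test that succeeded while Agent 1 was still exploring lets Agent 2 exploit as soon as Agent 1 commits to the excluded arm.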

4.1 Problem Complexity—Dynamic Gap

We define the (dynamic) gap, denoted by for agent , which determines how complex the problem is. This is expressed as an average gap over a local window.

Definition 2

For , we define the dynamic gap on a dominated set as,

and if such a does not exist, we set . Here, , and . For shorthand, if , we denote . Here and are universal constants.

Remark 1

The dominated dynamic gap is a strict generalization of the usual window based average gap used in non-stationary bandits. We introduce a dominated set , for the competitive market setting, since the actions of lower ranked agents are dominated by that of higher ranked ones.

4.2 Regret Guarantee

Theorem 1 (2 Agent DNCB)

Suppose we run Algorithm 1 with Agents up to horizon with drift . Then the expected regret for Agent 1 is and for Agent 2 is

where the horizon is divided into blocks, each having length at most . Here and denote the dynamic gap of the problem over an entire -th block.


Regret of Agent 1 matches Krishnamurthy and Gopalan (2021): Observe that the regret of Agent 1 exactly matches that of SnoozeIT. Since Agent 1 faces no collision, we are able to recover the regret of SnoozeIT.

Regret of Agent 2: The regret of Agent 2 has 3 components. The first term, , comes from the Explore ALL phase. In this phase, Agent 2 explores all arms and the regret is similar to that of Agent 1.

The second term in regret, originates from the Forced Exploration of Agent 2. Note that this depends on the complexity (gap) of Agent 1. This validates our intuition, because, when Agent 1’s environment is complex, it takes more exploration for Agent 1, and as a result Agent 2 faces additional forced exploration. This is a manifestation of the market structure, since the regret of Agent 2 is influenced by that of higher ranked agent.

The third term in the regret expression comes from Explore-j phase, where Agent 1 is committed on arm . Observe that here, the dominated gap naturally comes into the picture. The pre-factor of appears for the following reason. We design the blocks in such a way that each block contains at most 2 phases of Agent 1. Moreover, we show that the number of epochs for Agent 2 in one exploitation phase of Agent 1 is at most .

Regret matches UCB-D3 of Sankararaman et al. (2021) in the stationary setup: We compare the regret of DNCB with that of UCB-D3 of Sankararaman et al. (2021), which is designed for stationary markets. In the stationary environment (), the definition of gap is invariant with time. For Agent 2, from (Sankararaman et al., 2021, Corollary 2), we obtain the regret to be , where is the stationary dominated gap. Note that this is exactly the same as Theorem 1 (except for a mildly worse dependence on ). Hence, we recover the order-wise optimal regret in the stationary setting.

5 DNCB Algorithm with competing agents

In this section, we extend DNCB to agents. We stick to the setup where the -th Agent is ranked and focus on the learning algorithm of the -th agent. Let us fix some notation. We denote as the arm committed by Agent at time . For Agent , we (sequentially) define as the set of committed (dominated) arms by agents ranked higher than . The learning scheme is presented in Algorithm 2.

  Input: Horizon , drift limit
  Initialize for all , and , Initialize ,
  for  do
     Update State : if then, else if s.t. , else if then
     Pull-Arm by Agent r:
     Case I: if , pull smallest arm in (play round robin with )
     Case II: if , then play arm .
     Test by Agent :
     for  s.t. and Agent is in Explore-  do
        if  and s.t. for  then
           , , define
            if , then , else
        end if
     end for
     Update Black Board:
     if s.t. , write on the board
  end for
Algorithm 2 DNCB for -th Agent

RANK ESTIMATION(): We start with the rank estimation, which takes time steps. At , all agents pull arm . Subsequently, for , agents never matched to any arm play arm , and the agents who were matched to an arm continue to play the matched arm. By inductive reasoning, one can observe that this collision routine ensures that all agents know their own rank.
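The collision routine can be simulated directly. The sketch below assumes that at step t every still-unmatched agent requests arm t (arms numbered from 1), that the routine runs for one step per agent, and that there are at least as many arms as agents; these specifics and the function name `estimate_ranks` are assumptions of the illustration.

```python
from typing import List, Optional


def estimate_ranks(true_ranks: List[int]) -> List[Optional[int]]:
    """Simulate the rank-estimation routine.

    true_ranks[i] is the (hidden) rank of agent i, with rank 1 the highest.
    Each agent estimates its rank as the step at which it is first matched.
    """
    n = len(true_ranks)
    matched_arm: List[Optional[int]] = [None] * n
    estimated: List[Optional[int]] = [None] * n
    for t in range(1, n + 1):
        # Matched agents keep their arm; all others request arm t.
        requests = [matched_arm[i] if matched_arm[i] is not None else t
                    for i in range(n)]
        for arm in set(requests):
            contenders = [i for i in range(n) if requests[i] == arm]
            # The arm allocates itself to the highest-ranked requester.
            winner = min(contenders, key=lambda i: true_ranks[i])
            if matched_arm[winner] is None:
                matched_arm[winner] = arm
                estimated[winner] = t
    return estimated


print(estimate_ranks([2, 3, 1]))  # [2, 3, 1]
```

In the simulation exactly one new agent is matched per step, namely the highest-ranked agent that is still unmatched, which is the inductive argument in code form.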

We denote as the start epochs for Agent . To identify the state of Agent we define , where in , the -th agent plays in a round robin fashion on the set of arms, and in it pulls the arm indexed by .

At any time , Agent first looks at the black-board, and using the information, it constructs a dominated set , which contains all the committed arms from Agents 1 to . Based on , Agent updates to reflect whether it is in the Explore- phase or in the exploit phase. In particular, Agent gets to commit to an arm in if all the higher ranked agents have already committed, i.e., . A new phase is spawned for Agent if either the dominated set changes or its own phase ends. Both these cases are captured by , and hence, based on whether changes or not, Agent decides to start a new phase.
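The black-board read step can be sketched as follows. The board representation (a dictionary from rank to committed arm, with `None` for an exploring agent) and the helper names are hypothetical; the sketch only covers how the dominated set and the permission to commit are derived, not the test or phase bookkeeping.

```python
from typing import Dict, List, Optional, Set


def dominated_set(board: Dict[int, Optional[int]], rank: int) -> Set[int]:
    """Arms currently committed by agents ranked higher than `rank`."""
    return {board[r] for r in range(1, rank) if board.get(r) is not None}


def may_commit(board: Dict[int, Optional[int]], rank: int) -> bool:
    """An agent may exploit only if every higher-ranked agent has committed."""
    return all(board.get(r) is not None for r in range(1, rank))


def exploration_arms(num_arms: int, board: Dict[int, Optional[int]],
                     rank: int) -> List[int]:
    """Arms the agent round-robins over: everything outside the dominated set."""
    blocked = dominated_set(board, rank)
    return [a for a in range(num_arms) if a not in blocked]


board = {1: 3, 2: None}                  # Agent 1 committed to arm 3, Agent 2 exploring
print(dominated_set(board, 3))           # {3}
print(may_commit(board, 3))              # False
print(exploration_arms(4, board, 3))     # [0, 1, 2]
```

A change in the returned dominated set between two rounds is the signal for the agent to start a new phase.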

The test procedure of Agent is similar to that of Agent 2, with a difference that Agent tests in the arms in the subset , and hence the buffer length is accordingly designed. We also need to ensure that Agent ends its exploitation phase when any higher ranked agent starts exploring. This is ensured by defining .

Saving extra exploration: Furthermore, Agent constructs the sets for all with . As explained in the 2 agent case, this saves extra exploration for Agent , because if the statistical test succeeds on an arm , and there exists , with such that is non-empty, Agent immediately commits to arm .

5.1 Regret Guarantee

We characterize the regret of the -th agent, with . Note that the regret of Agent 1 will be identical to that in Theorem 1, since it faces no competition and hence no collision.

The regret of the -th Agent will depend on the dynamic gap of Agents 1 to , and hence, to ease notation, we define

Note that the definitions of and are generalizations of , with further restrictions on the dominated set . With this, we have the following result:

Theorem 2 ( agent DNCB)

Suppose we run Algorithm 2 for agents with drift. The regret for the -th ranked agent is given by

where we divide the horizon into blocks, each having length at most . Here and denote the gap for the -th block.

Discussion: The performance of Agent depends crucially on Agent , and based on whether Agent is exploring or exploiting, the regret depends on the higher ranked agents. Hence, the dynamic gap depending on both and sneaks in the regret expression via and .

Special case, : When in Theorem 2, we exactly get back the regret of Agent 2 in Theorem 1. So, for Agent 2, there is no additional cost for extending DNCB to agents.

Different terms: Similar to the 2 agent case, the first term presents the regret from exploration of Agent , when Agent is exploring. Hence, the size of the dominated set is at most . Similarly, the second term corresponds to the forced exploration of Agent . Note that this depends on how complex the system of Agent is. Furthermore, the third term corresponds to the regret when Agent has committed, and hence the size of dominated set is at most . These are characterized by and .

Note that we characterize the regret of Agent by focusing on one phase of Agent (similar to the 2 agent case), and we show that the number of epochs of Agent in one block is at most , which causes the multiplicative pre-factor. Note that with , the factor is absent, since the blocks are designed to contain at most 2 epochs of Agent 1.

Matches UCB-D3 of Sankararaman et al. (2021) in the stationary setup: Note that, in the stationary setup (), the regret expression in Theorem 2 matches that of UCB-D3 (except for a mildly worse dependence on ), which is shown to be order optimal. So, DNCB recovers the optimal regret in the stationary case.

Figure 9: In (a), we compare SnoozeIT and UCB with . In (b) and (c), DNCB on a system with 3 agents and 4 arms is simulated, and the same systems are compared with UCB-D3 in (d) and (e).

6 DNCB without Black Board

Up to now, we have presented DNCB with a black board, via which the agents communicate among themselves. In this section, we remove this, and obtain the same information via collisions. We emphasize that without the black board, the learning algorithm is completely decentralized.

6.0.1 Special case: Black board removal with

In the presence of the black board, Agent 2 knows whether Agent 1 is exploring (on all arms) or committed to a particular arm. The same information can be gathered from a collision. Agent 2 maintains a latent variable , which indicates whether Agent 1 is in the Explore or Exploit phase. At the beginning, .

If at round , Agent 2 faces a collision on arm , one of two things can happen: (a) Agent 1 has (ended exploring and) committed to arm , or (b) Agent 1 (has ended its exploitation and) is exploring. This is true from the design of DNCB. After a collision, Agent 2 looks at . If , then case (a) has happened and if , then case (b) has happened. So, just toggling the variable is enough for Agent 2 to keep track of Agent 1. It is easy to see from the round robin structure of exploration that, after Agent 1's phase changes, it may take up to time steps for a collision to take place.
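The toggling rule amounts to a two-state tracker, sketched below. The class and attribute names are hypothetical, and the initial state (Agent 1 exploring) is an assumption of this sketch, chosen because Agent 1 begins by exploring.

```python
from typing import Optional


class Agent1Tracker:
    """Agent 2's belief about Agent 1's phase, updated only on collisions."""

    def __init__(self) -> None:
        self.agent1_exploiting = False           # assumed initial state: Explore
        self.committed_arm: Optional[int] = None

    def on_collision(self, arm: int) -> None:
        if not self.agent1_exploiting:
            # Case (a): Agent 1 has just committed to the arm of the collision.
            self.agent1_exploiting = True
            self.committed_arm = arm
        else:
            # Case (b): Agent 1 has released its arm and is exploring again.
            self.agent1_exploiting = False
            self.committed_arm = None


tracker = Agent1Tracker()
tracker.on_collision(arm=2)
print(tracker.agent1_exploiting, tracker.committed_arm)  # True 2
tracker.on_collision(arm=0)
print(tracker.agent1_exploiting, tracker.committed_arm)  # False None
```

The tracker is only correct if every phase change of Agent 1 produces exactly one observed collision, which is why the detection delay enters the condition on the drift in the regret guarantee.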

Lemma 1 (Regret Guarantee)

Suppose , for a constant . Then, for a shifted system, DNCB without blackboard satisfies the regret guarantees identical to that of Theorem 1 (with ).

For a shifted system, up to time , the maximum total shift is , and hence with , we ensure that the system remains stationary in these time steps. We emphasize that DNCB is an asynchronous algorithm, and hence, without the black board, we require an even slower varying system to maintain stationarity.

6.0.2 Black board removal with agents

DNCB is an asynchronous algorithm, and hence establishing coordination between agents is quite non-trivial. In previous works, such as Sankararaman et al. (2021), the learning includes a fixed set of time slots for communication among agents. This coordinated communication can not be done for DNCB, since the phases start and end at random times. Hence, to handle this problem, we consider a slightly stronger reward model.

Reward model: To ease communication across agents, we assume that in case of a collision, the reward is given to the highest ranked agent, and all the remaining agents get zero reward, as well as the index of the agent who gets the (non-zero) reward. We remark that this side information is not impractical in applications like college admissions, job markets, etc., and this exact reward model is also studied in Liu et al. (2021).

Under this new reward model, Agent maintains a set of latent variables, for all , where . If at time , Agent experiences a collision, and the reward goes to an Agent , with , then Agent toggles . In this way, after a collision on arm , Agent knows that either Agent has committed to arm or it is exploring a set of arms including ; based on , Agent knows exactly which event has happened. From the round-robin nature of exploration, detecting this may take at most steps.
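Under the stronger reward model, the bookkeeping of a lower-ranked agent can be sketched as follows. This is only an illustration: the class name `HigherRankTracker`, the use of booleans for the latent variables, and the initial value (all higher-ranked agents exploring) are assumptions of the sketch.

```python
class HigherRankTracker:
    """Latent phase flags that an agent keeps for every higher-ranked agent."""

    def __init__(self, rank: int):
        self.rank = rank
        # One flag per higher-ranked agent 1..rank-1.
        # False = exploring (assumed initial state), True = committed.
        self.committed = {i: False for i in range(1, rank)}

    def on_collision(self, winner: int) -> None:
        """Toggle the flag of the higher-ranked agent that received the reward."""
        if winner < self.rank:
            self.committed[winner] = not self.committed[winner]

    def all_higher_committed(self) -> bool:
        """True when every higher-ranked agent is believed to be exploiting."""
        return all(self.committed.values())
```

The winner index revealed after a collision is what lets the agent attribute the phase change to a specific higher-ranked agent; without it, a single collision would be ambiguous when more than one higher-ranked agent exists.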

Lemma 2 (Regret Guarantee)

Suppose , for a constant . For a shifted system, DNCB without the blackboard satisfies regret guarantees identical to those of Theorem 2 (with ).

The above remark holds under the modified and stronger reward model. Design of an efficient coordination protocol in an asynchronous system is left to future work.

7 Simulations

In Figure 9, we demonstrate the effectiveness of DNCB on synthetic data. In Figure 9(a), we observe that when the environment is varying, Snooze-IT outperforms the vanilla UCB algorithm. In Figures 9(b) and 9(c), we simulate DNCB on two instances with different arm-means and dynamics. We can observe from these plots that the exploitation time of agent is strictly within that of agent , and similarly that of agent is strictly within that of agent . This visually captures the notion of forced exploration, where an agent can only exploit an arm when all higher-ranked agents are themselves exploiting arms.

In Figures 9(d) and 9(e), we compare the performance of DNCB with that of UCB-D3 in a dynamic environment. We find that although the performance of agent is similar in the two systems, the performance of the lower-ranked agents is far superior under DNCB compared to UCB-D3. This shows that DNCB is sensitive to the potential variations in arm-means and helps all agents adapt faster than UCB-D3, which is designed assuming that the environment is stationary. The exact details of the experimental setup are given in Section D of the Supplementary Material.

8 Conclusion and open problems

We introduced the problem of decentralized online learning in two-sided markets where the underlying preferences vary smoothly over time. This paper, however, leaves several intriguing open problems: (a) to understand whether the assumption of known can be relaxed; (b) to extend the dynamic framework to general markets beyond serial dictatorship; and (c) to consider other forms of non-stationarity, such as piecewise-constant markets or variations with a total budget constraint.

References

  • Abdulkadiroğlu and Sönmez (1998) A. Abdulkadiroğlu and T. Sönmez. Random serial dictatorship and the core from random endowments in house allocation problems. Econometrica, 66(3):689–701, 1998.
  • Agarwal et al. (2012) A. Agarwal, P. L. Bartlett, P. Ravikumar, and M. J. Wainwright. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Transactions on Information Theory, 58(5):3235–3249, 2012.
  • Akbarpour et al. (2020) M. Akbarpour, S. Li, and S. O. Gharan. Thickness and information in dynamic matching markets. Journal of Political Economy, 128(3):783–815, 2020.
  • Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
  • Auer et al. (2019) P. Auer, P. Gajane, and R. Ortner. Adaptively tracking the best bandit arm with an unknown number of distribution changes. In A. Beygelzimer and D. Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 138–158. PMLR, 25–28 Jun 2019.
  • Awerbuch and Kleinberg (2008) B. Awerbuch and R. Kleinberg. Competitive collaborative learning. Journal of Computer and System Sciences, 74(8):1271–1288, 2008.
  • Basu et al. (2021) S. Basu, K. A. Sankararaman, and A. Sankararaman. Beyond log^2(T) regret for decentralized bandits in matching markets. In M. Meila and T. Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 705–715. PMLR, 18–24 Jul 2021.
  • Besbes et al. (2014) O. Besbes, Y. Gur, and A. Zeevi. Stochastic multi-armed-bandit problem with non-stationary rewards. Advances in neural information processing systems, 27:199–207, 2014.
  • Buccapatnam et al. (2015) S. Buccapatnam, J. Tan, and L. Zhang. Information sharing in distributed stochastic bandits. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 2605–2613. IEEE, 2015.
  • Dai and Jordan (2021a) X. Dai and M. I. Jordan. Learning strategies in decentralized matching markets under uncertain preferences. Journal of Machine Learning Research, 22(260):1–50, 2021a.
  • Dai and Jordan (2021b) X. Dai and M. I. Jordan. Multi-stage decentralized matching markets: Uncertain preferences and strategic behaviors. arXiv preprint arXiv:2102.06988, 2021b.
  • Damiano and Lam (2005) E. Damiano and R. Lam. Stability in dynamic matching markets. Games and Economic Behavior, 52(1):34–53, 2005.
  • Gale and Shapley (1962) D. Gale and L. S. Shapley. College admissions and the stability of marriage. The American Mathematical Monthly, 69(1):9–15, 1962.
  • Garivier and Moulines (2011) A. Garivier and E. Moulines. On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory, pages 174–188. Springer, 2011.
  • Johari et al. (2021) R. Johari, V. Kamble, and Y. Kanoria. Matching while learning. Operations Research, 69(2):655–681, 2021.
  • Karnin and Anava (2016) Z. S. Karnin and O. Anava. Multi-armed bandits: Competing with optimal sequences. Advances in Neural Information Processing Systems, 29:199–207, 2016.
  • Knuth (1997) D. E. Knuth. Stable marriage and its relation to other combinatorial problems: An introduction to the mathematical analysis of algorithms, volume 10. American Mathematical Soc., 1997.
  • Krishnamurthy and Gopalan (2021) R. Krishnamurthy and A. Gopalan. On slowly-varying non-stationary bandits. arXiv preprint arXiv:2110.12916, 2021.
  • Kurino (2020) M. Kurino. Credibility, efficiency, and stability: A theory of dynamic matching markets. The Japanese Economic Review, 71(1):135–165, 2020.
  • Lattimore and Szepesvári (2020) T. Lattimore and C. Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Liu et al. (2018) F. Liu, J. Lee, and N. Shroff. A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Liu et al. (2020) L. T. Liu, H. Mania, and M. Jordan. Competing bandits in matching markets. In International Conference on Artificial Intelligence and Statistics, pages 1618–1628. PMLR, 2020.
  • Liu et al. (2021) L. T. Liu, F. Ruan, H. Mania, and M. I. Jordan. Bandit learning in decentralized matching markets. Journal of Machine Learning Research, 22(211):1–34, 2021.
  • Luo et al. (2018) H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford. Efficient contextual bandits in non-stationary worlds. In Conference On Learning Theory, pages 1739–1776. PMLR, 2018.
  • Pittel (1989) B. Pittel. The average number of stable matchings. SIAM Journal on Discrete Mathematics, 2(4):530–549, 1989.
  • Roth and Vate (1990) A. E. Roth and J. H. V. Vate. Random paths to stability in two-sided matching. Econometrica: Journal of the Econometric Society, pages 1475–1480, 1990.
  • Sankararaman et al. (2021) A. Sankararaman, S. Basu, and K. A. Sankararaman. Dominate or delete: Decentralized competing bandits in serial dictatorship. In International Conference on Artificial Intelligence and Statistics, pages 1252–1260. PMLR, 2021.
  • Slivkins (2019) A. Slivkins. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272, 2019.
  • Slivkins and Upfal (2008) A. Slivkins and E. Upfal. Adapting to a changing environment: the Brownian restless bandits. 2008.
  • Wei and Srivastava (2018) L. Wei and V. Srivastava. On abruptly-changing and slowly-varying multiarmed bandit problems. In 2018 Annual American Control Conference (ACC), pages 6291–6296. IEEE, 2018.
  • Whittle (1988) P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 25(A):287–298, 1988.

Appendix A Related Works on Non-Stationary Bandits

The framework of non-stationary bandits was introduced by Whittle (1988) in the context of restless bandits, and later improved by Slivkins and Upfal (2008). There has been an interesting line of work in this domain. For example, Garivier and Moulines (2011); Auer et al. (2019); Liu et al. (2018) analyze the abruptly changing or switching setup, where the arm distributions are piecewise stationary and an abrupt change may happen from time to time. In particular, Liu et al. (2018) propose a change-point detection algorithm to identify whether or not an arm distribution has changed in a piecewise stationary environment. Furthermore, Besbes et al. (2014) consider a total-variation-budgeted setting, where the total amount of (temporal) variation is known, but the change may happen either smoothly or abruptly.

Moreover, in the above-mentioned total-variation-budget-based non-stationary framework, an adaptive algorithm that does not require knowledge of the drift parameter is obtained by Karnin and Anava (2016) for the standard bandit problem, and later extended by Luo et al. (2018) to the contextual bandit setup.

On the other hand, a different line of research focuses on the smoothly varying non-stationary environment, in contrast to the abrupt or total-budget setups mentioned above; see, for example, Wei and Srivastava (2018); Krishnamurthy and Gopalan (2021). Wei and Srivastava (2018) modify the sliding-window UCB algorithm of Garivier and Moulines (2011) and employ windows of growing size. More recently, Krishnamurthy and Gopalan (2021) analyzed the smoothly varying framework by designing windows of dynamic length and testing for optimality within a sliding window. Their algorithm, Snooze-IT, is an asynchronous algorithm that works on a repeated Explore-Then-Commit (ETC) type principle, where the explore and commit times are random.

In this paper, we work with the smoothly varying non-stationary framework of Krishnamurthy and Gopalan (2021). We choose this algorithm because of its simplicity, and because the dynamics and competition that arise in a market framework are better understood with such a sliding-window-based Explore-Then-Commit type algorithm. In general, we believe that our basic principle can be adapted to any sliding-window-based algorithm in a non-stationary environment.
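To make the sliding-window idea concrete, the following Python sketch shows a windowed mean estimate that forgets old samples. It is a generic illustration, not the estimator of Snooze-IT or of sliding-window UCB; the class name `SlidingWindowMean` and the fixed window length are assumptions of the sketch.

```python
from collections import deque


class SlidingWindowMean:
    """Mean of the most recent `window` rewards of a single arm."""

    def __init__(self, window: int):
        if window <= 0:
            raise ValueError("window must be positive")
        # A bounded deque silently drops the oldest sample once it is full.
        self.samples = deque(maxlen=window)

    def update(self, reward: float) -> float:
        """Add a reward and return the current windowed mean."""
        self.samples.append(reward)
        return sum(self.samples) / len(self.samples)
```

When the arm-mean moves, the windowed estimate follows it after at most `window` new samples, whereas a cumulative average keeps weighting stale observations; the price is a higher variance, since fewer samples are averaged.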

Appendix B Proof of Theorem 1

B.1 Technical Preliminaries

As is standard in formalizing bandit processes (Lattimore and Szepesvári, 2020), we assume that the random process lies in a probability space endowed with a collection of independent and identically distributed random variables . For each and , and , the random variable is distributed as a mean, unit-variance Gaussian random variable. (Our analysis can be extended verbatim to any sub-Gaussian distribution.) With this description, the realized reward of agent , when it matches with arm for the time at time-index , is given by . In this description, the arm-means are fixed non-random parameters.

Definition 3 (Good Event)


here .

In words, the event is the one in which every contiguous sequence of i.i.d. random variables is ‘well-behaved’. The event is identical to the good event specified for the single-agent case in Krishnamurthy and Gopalan (2021). Standard concentration inequalities show that this event occurs with high probability, which we record in the proposition below.
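As a rough numerical illustration of this kind of concentration statement, the Python sketch below checks the standard sub-Gaussian tail bound for the average of n unit-variance Gaussians, namely that the average deviates from its mean by more than sqrt(2 log(2/delta) / n) with probability at most delta. The function names, the sample sizes, and the specific form of the radius are choices made for this sketch; they are not the constants used in the definition of the good event.

```python
import math
import random


def confidence_radius(n: int, delta: float) -> float:
    """Radius r with P(|average of n unit-variance Gaussians - mean| > r) <= delta."""
    return math.sqrt(2.0 * math.log(2.0 / delta) / n)


def empirical_failure_rate(n: int, delta: float, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which a zero-mean sample average leaves the radius."""
    rng = random.Random(seed)
    radius = confidence_radius(n, delta)
    failures = 0
    for _ in range(trials):
        average = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if abs(average) > radius:
            failures += 1
    return failures / trials


def per_event_delta(total_delta: float, num_events: int) -> float:
    """Union bound: failure budget per event so that all events hold jointly."""
    return total_delta / num_events
```

The union bound step is the reason the confidence radius only grows logarithmically in the number of events being controlled: dividing the failure budget by the number of events enters the radius through the logarithm.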

Proposition 1
Proof 1

Fix a and . The classical sub-Gaussian tail inequality gives that

Now, taking a union bound over and gives that

The definition of the good event is useful due to the following result.

Lemma 3 (No regret in the exploit-phase)

If the good event in Definition 3 holds, then every agent in every exploit phase will incur zero regret.

Proof 2

We first prove the result for agent ranked . For any phase of Agent , denote by time to be the time-instant at which an arm and is identified that satisfies for all . In words, time is the time when the statistical test by Agent succeeds. Recall from the notations in the algorithm that .

Suppose in a phase , agent exploits an arm for one or more rounds. Notationally, this is from times . We will show that (i) there exists a minimum gap , such that at time , for all arms , the mean of arm exceeds by a certain margin, and (ii) the duration is set such that the chosen arm continues to be optimal in the entire EXPLOIT phase. The first claim is formalized below.

Claim 1

Under the good event , there exists a time , such that for all arms , .

Proof 3

The statistical test succeeded at time , i.e., there exists a such that , for all . By Definition 1, the window size . Since the test succeeds at time , clearly .

In order to describe the proof, we set up some notation. For every arm , denote by the set of times at which arm was played in the time-interval . These times are random variables; however, conditioned on , they are deterministic, since in the Explore phase of Algorithms 1 and 2, agents explore the arms in a round-robin fashion from the smallest index to the largest. Denote by . For every arm , denote by the random index the number of times arm has been played in the past, before time .

Since the statistical test succeeds at time , we have from Definition 1

Re-arranging and using the definition of the Good event, we have

where the second inequality stems from the definition of the good event. Now, since the drift is bounded by , we have that

Combining the preceding two displays, we get that

The second inequality follows from the fact that the window size is smaller than the explore duration of phase . Now, since the average gap exceeds a bound, there exists at least one such that .

Now, since the drift at each time-step in each arm is at most , arm will remain optimal compared to arm at least in the time-interval , i.e., arm is optimal compared to arm in the duration . Since , and from Algorithms 1 and 2 the definition of Buffer is , arm is superior to arm in the exploit duration of phase . Now, since arm was arbitrary, this implies that Agent will incur no regret during the exploit phase of .
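The drift argument can be checked with a small worked computation. If one arm leads another by a gap at some time and each arm-mean moves by at most a fixed drift per step, then in the worst case the gap shrinks by twice the drift per step. The Python sketch below computes how many further steps the lead is guaranteed to survive. The function names are hypothetical, and the sketch does not reproduce the exact Buffer definition or constants of Algorithms 1 and 2.

```python
import math


def worst_case_gap(gap: float, drift: float, steps: int) -> float:
    """Smallest possible gap after `steps` steps when both means drift adversarially."""
    return gap - 2.0 * drift * steps


def guaranteed_lead_steps(gap: float, drift: float) -> int:
    """Largest h such that the leading arm is still strictly better after h steps."""
    if drift <= 0:
        raise ValueError("drift must be positive")
    if gap <= 0:
        return 0
    # Largest integer h with gap - 2 * drift * h > 0.
    return max(math.ceil(gap / (2.0 * drift)) - 1, 0)
```

For example, with a gap of 0.5 and a per-step drift of 0.0625, the lead is guaranteed for 3 further steps; after 4 steps the worst-case gap is exactly zero, so strict optimality can no longer be certified.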

For the general case, we proceed by induction. Suppose, as the induction hypothesis, that all agents ranked through are incurring zero regret in an exploit phase. Notice from the description of Algorithm 2 that agent ranked can potentially go into an exploit phase if and only if all agents ranked through are in an exploit phase. Additionally, the base case of the induction is what we established in the preceding paragraph, where agent ranked incurs zero regret in the exploit phase. Under this induction hypothesis, we will now argue that agent ranked will also incur zero regret in the corresponding exploit phase.

We make one observation based on the serial-dictatorship structure. If all agents ranked through are (i) in the Exploit phase and (ii) incurring zero regret, then the stable-match optimal arm for agent ranked is the arm with the highest mean among those arms not being exploited by agents ranked through . This is a simple consequence of the definition of stable match (cf. Section 3). Thus, it suffices to argue that when agent ranked commits, it commits to the optimal arm. We use arguments identical to those for agent ranked to show that.
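The observation about the stable match under serial dictatorship can be sketched in a few lines of Python: agents are processed in rank order, and each takes its best arm among those not already taken by higher-ranked agents. The function name and the list-of-lists input format are assumptions of this sketch, and the arm-means are treated as known, which is of course not the case in the learning problem.

```python
from typing import List


def serial_dictatorship_match(means: List[List[float]]) -> List[int]:
    """Stable match when means[j][k] is the mean of arm k for the agent ranked j + 1.

    Assumes there are at least as many arms as agents.
    """
    taken = set()
    match = []
    for agent_means in means:  # agents in rank order, highest rank first
        free_arms = [k for k in range(len(agent_means)) if k not in taken]
        best = max(free_arms, key=lambda k: agent_means[k])
        taken.add(best)
        match.append(best)
    return match
```

For instance, with means `[[0.9, 0.5, 0.1], [0.8, 0.7, 0.2], [0.6, 0.9, 0.3]]`, the top-ranked agent takes arm 0, the second agent takes arm 1 (its best among the remaining arms), and the third agent is left with arm 2 even though it prefers arm 1.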

Claim 2

If at time , for a given with , the statistical test succeeds with arm , then there exists a time , such that for all arms , .

The proof follows identical arguments as that of Claim 1 by using the observation that is a decreasing function of .

This result was shown for the special case of and in Krishnamurthy and Gopalan (2021) Lemma . Lemma 3 generalizes that to the multi-agent and multi-arm setting.

B.2 Other Notations used in the proof

In order to improve readability, we collect all the notations used in the course of the proof.

We now prove the regret of both Agents 1 and 2 for Algorithm 1. Note that, Agent 1 just plays the Snooze-IT algorithm of Krishnamurthy and Gopalan (2021), and hence we borrow the techniques developed there to obtain the regret of Agent 1.

More interestingly, in this section, we provide a full characterization of the regret of Agent 2. Note that since Agent 2 plays on a restricted or dominated set of arms, dictated by Agent 1, it encounters additional regret. In the description of Algorithm 1, we pointed out the scenarios where Agent 2 is forced either (a) to explore or (b) to stop exploiting. Here, we obtain a regret upper bound arising from this forced exploration and exploitation.

To better understand the algorithm, let us focus on a particular phase of Agent 1, say the -th epoch. We use the same notation defined in Algorithm 1. So, denotes the start time of epoch and denotes the end of epoch . The exploration duration before committing to an arm is , and so the exploitation phase starts at . Similarly, the length of exploitation is . Let us also assume that the committed arm of Agent 1 in this phase is .

Since Agent 1 plays Snooze-IT, it incurs no regret during the exploitation phase by Lemma 3, and from Krishnamurthy and Gopalan (2021), the (expected) regret of Agent 1 in the -th phase is

Technically, the lemmas of Krishnamurthy and Gopalan (2021) hold under a good event, which is identical to the good event in Definition 3. We now look at the behavior of Agent 2 while Agent 1 is in phase . As shown in Figure 3, there can be multiple phases of Agent 2 inside one phase of Agent 1, and hence let us assume that at the beginning of epoch , the phase number of Agent 2 is given by , and by the end of phase , we have .

B.3 Regret of Agent 2 during the exploration period of the th phase of Agent 1

In this phase, which lasts for rounds, we characterize the regret of Agent 2. For this, let us define as the duration, starting from it takes for Agent 2 to commit to an arm unconditionally. This means that in the absence of competition, starting from , Agent 2 would take to commit to an arm by exploring all the arms. We have 2 cases:

Case I (): In this case, since Agent 1 commits first, the regret of Agent 2, is given by . In this case, Agent 2 is not forced to explore.

Case II (): In this case, Agent 2 incurs a regret of plus some additional regret owing to forced exploration. The forced exploration comes from the fact that in this case, although Agent 2 has enough information to commit, it still explores because Agent 1 has not committed yet, and the commitment of Agent 2 would cause periodic collisions for Agent 2.

B.3.1 Forced Exploration

We now characterize the regret of Agent 2 from forced exploration. Note that Agent 2 is forced to explore at time if:

  1. Agent 1 is exploring, and

  2. At time , is non-empty, where is the arm played by Agent 1.

Let us understand this in a bit more detail. If is non-empty, it implies that without the presence of competition, Agent 2 would have played arm . This comes from the definition of . Now, when Agent 1 is playing that arm, it implies a forced exploration on Agent 2. We can write down the above forced exploration term as the following