1 Introduction
Table 1: Asymptotic regret upper bounds (up to constant factors) for different MPMAB models.

Model                            | Algorithm (Reference)
Centralized multiplayer          | Komiyama et al. (2015)
Decentralized, collision sensing | SIC-MMAB (Boursier and Perchet, 2019)
Decentralized, no sensing        | Lugosi and Mehrabian (2018)
Decentralized, no sensing        | Lugosi and Mehrabian (2018)
Decentralized, no sensing        | ADAPTED SIC-MMAB (Boursier and Perchet, 2019)
Decentralized, no sensing        | SIC-MMAB2 (Boursier and Perchet, 2019)
Decentralized, no sensing        | EC-SIC (this paper)

$K$: number of arms; $M$: number of players; $\mu_{(k)}$: the $k$-th order statistic of $\{\mu_1, \dots, \mu_K\}$; $E_r$: the random coding error exponent.
Recent years have witnessed an increased interest in the multiplayer multi-armed bandits (MPMAB) problem, in which multiple players simultaneously play the bandit game and interact with each other through arm collisions. In particular, motivated by the practical application of cognitive radio (Anandkumar et al., 2011), decentralized stochastic MPMAB problems have been widely studied. See Section 7 for a review of the related work.
Since the stochastic MAB problem for a single player is well understood, a predominant approach in decentralized MPMAB is to let each player play the single-player MAB game while avoiding collisions as much as possible (Liu and Zhao, 2010; Avner and Mannor, 2014; Rosenski et al., 2016; Besson and Kaufmann, 2018). Intuitively, this allows the algorithm to behave as in the single-player MAB. Compared to centralized MPMAB, however, this introduces a multiplicative factor of $M$ (the number of players) in the regret coefficient. This gap has long been considered fundamental due to the lack of communication among players.
Recently, a surprising and inspiring approach, called SIC-MMAB, was proposed in Boursier and Perchet (2019) for the collision-sensing MPMAB problem. Instead of viewing collisions as detrimental, Boursier and Perchet (2019) purposely instigates collisions as a way to communicate between players. With a careful design, an internal rank can be assigned to each player, and arm statistics can be completely shared among players at a communication cost that does not dominate the arm exploration regret, which leads to an overall regret approaching the centralized setting. The proposed communication phase transmits the total reward by using collision/no collision to represent bit 1/0. The theoretical analysis shows, for the first time, that the regret of a decentralized MPMAB algorithm can approach that of its centralized counterpart, which represents significant progress in decentralized MPMAB.
The no-sensing problem, on the other hand, represents arguably the most difficult setting in MPMAB, and there has been little progress in the literature. Boursier and Perchet (2019) makes two attempts to generalize the forced collision idea to this setting. Directly applying SIC-MMAB leads to a regret with an additional multiplicative coefficient due to the lack of sensing. In other words, a straightforward application of SIC-MMAB results in the communication loss dominating the total regret. The authors then propose a different approach: use communication only to exchange the accepted and rejected arms, thus reducing the regret caused by communication. However, this approach, philosophically speaking, deviates from the core idea of SIC-MMAB and does not fully utilize the communication benefit of collisions (arm statistics are not shared among players). This is also the reason why the multiplicative factor of $M$ reappears in the regret formula, which was eliminated in the collision-sensing case by SIC-MMAB. It remains an open problem whether a decentralized MPMAB algorithm without collision information can approach the performance of its centralized counterpart.
In this work, we return to the original idea of utilizing collisions to communicate sampled arm rewards. By modelling the absence of collision information as a Z-channel communication problem from information theory, we propose to incorporate optimal error correction coding in the communication phase to control the error rate of decoding the message. With this approach, we are able to transmit a quantized sample reward mean of fixed length for each player without having the communication loss dominate the total regret. The resulting asymptotic regret improves the coefficients over SIC-MMAB2 and represents, to the best of the authors' knowledge, the best known regret in the no-sensing MPMAB model. Table 1 compares the asymptotic regret upper bounds for different algorithms. We also propose two practical enhancements that significantly improve the algorithm's empirical performance. Numerical experiments on both synthetic and real-world datasets corroborate the analysis and offer interesting insights into EC-SIC.
2 The No-Sensing MPMAB Problem
In the standard (single-player) stochastic MAB setting, there are $K$ arms, with rewards of arm $k$ sampled independently from a distribution $\nu_k$ on $[0,1]$ with mean $\mu_k$. At each time $t$, a player chooses an arm, and the goal is to receive the highest expected cumulative reward over $T$ rounds.
In this section, we introduce the no-sensing multiplayer MAB model with a known number of arms $K$ but an unknown number of players $M \le K$. The horizon $T$ is known to the players. At each time step $t$, all $M$ players simultaneously pull arms; player $j$, pulling arm $\pi^j(t)$, receives the reward
$$ r^j(t) = X_{\pi^j(t)}(t)\,\big(1 - \eta_{\pi^j(t)}(t)\big), $$
where $\eta_k(t)$ is the collision indicator, defined by $\eta_k(t) = 1$ if more than one player pulls arm $k$ at time $t$, and $\eta_k(t) = 0$ otherwise.
If players can observe both $r^j(t)$ and $\eta_{\pi^j(t)}(t)$, we have the collision-sensing problem. On the other hand, in the no-sensing case considered in this paper, players can only access $r^j(t)$; i.e., a reward of $0$ can indistinguishably come from a collision with another player or from $X_{\pi^j(t)}(t) = 0$. Note that if $M = 1$, the no-sensing and collision-sensing models are equivalent.^1
^1 We further note that the no-sensing model can be generalized to an arbitrary but bounded reward support, where a collision results in the lowest value in the support.
The performance in the standard single-player MAB setting is usually measured by the regret:
$$ R_T = T \mu^* - \mathbb{E}\left[\sum_{t=1}^{T} X_{\pi(t)}(t)\right], $$
where $\mu^* = \max_k \mu_k$ is the expected reward of the arm with the highest expected reward. As shown in the lower bound by Lai and Robbins (1985), the optimal order of the regret cannot be better than $\Omega(\log T)$.
In the multiplayer setting, the notion of regret can be generalized and defined with respect to the best allocation of players to arms, as follows:
$$ R_T = T \sum_{k=1}^{M} \mu_{(k)} - \mathbb{E}\left[\sum_{t=1}^{T} \sum_{j=1}^{M} r^j(t)\right], $$
where $\mu_{(k)}$ is the $k$-th order statistic of $\{\mu_1, \dots, \mu_K\}$, i.e., $\mu_{(1)} \ge \mu_{(2)} \ge \dots \ge \mu_{(K)}$.
Two technical assumptions are made in this paper, both widely used in the literature. The first is a strictly positive lower bound on the arm means, which has been used by Lugosi and Mehrabian (2018) and Boursier and Perchet (2019) for the no-sensing model. The second is a finite gap between the optimal and suboptimal (group of) arms; see Avner and Mannor (2014); Kalathil et al. (2014); Rosenski et al. (2016); Nayyar et al. (2016).
Assumption 1.
1. A positive lower bound on the arm means is known to all players: $0 < \mu_{\min} \le \mu_{(K)}$.
2. There exists a positive gap $\Delta = \mu_{(M)} - \mu_{(M+1)} > 0$, and it is known to all players.
Assumption 1.1 is equivalent to $\mu_k \ge \mu_{\min}$ for all $k$. This also bounds the worst-case communication error probability used in Section 3.3. Note that although $\mu_{\min}$ provides a lower bound for $\mu_{(K)}$, Assumption 1.1 does not require the exact value of $\mu_{(K)}$. The gap $\Delta$ in Assumption 1.2 measures the difficulty of the bandit game and ensures the existence of only one optimal choice of arms.
3 The EC-SIC Algorithm
The proposed Error Correction Synchronization Involving Communication (EC-SIC) algorithm is compactly described in Algorithm 1. Similar to SIC-MMAB, the overall algorithm can be structurally divided into four phases: initialization, exploration, communication, and exploitation. It is important to note that all players are synchronized in running EC-SIC, i.e., they enter each phase at the same time (or at least with high probability in some cases), except for the exploitation phase. Until a player fixates on a specific arm and enters the exploitation phase, the algorithm keeps iterating between the exploration and communication phases. Players that have (not yet) entered the exploitation phase are called inactive (active). We denote the set of active players during the $p$-th phase by $\mathcal{M}_p$ and its cardinality by $M_p$. Similarly, arms that have not been decided to be optimal or suboptimal are called active. The set of active arms during the $p$-th phase is denoted by $\mathcal{K}_p$, with cardinality $K_p$.

3.1 Initialization phase
EC-SIC uses the same initialization phase as Boursier and Perchet (2019), which outputs an internal rank for each player as well as an estimate of the number of players $M$. It starts with a "Musical Chairs" phase and is followed by a so-called Sequential Hopping protocol. The full procedure is described in Appendix B.1 for completeness.

3.2 Exploration phase
During the $p$-th exploration phase, active players sequentially hop among the active arms, so that every active arm is pulled the same number of times by each active player. Since the hopping is based on each player's internal rank, the exploration phase is collision-free.
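As a minimal illustration of why rank-based hopping is collision-free (the modular-offset rule below is an assumed form of the protocol, not necessarily the paper's exact indexing):

```python
def hopped_arm(rank, t, active_arms):
    """Arm pulled at step t by the player with the given internal rank.

    Each player starts from the arm indexed by her rank and hops one
    position per step, so distinct ranks always map to distinct arms.
    """
    return active_arms[(rank + t) % len(active_arms)]

# With 3 active players (ranks 0..2) and 5 active arms, no step has a collision.
active_arms = [10, 11, 12, 13, 14]
for t in range(20):
    pulls = [hopped_arm(rank, t, active_arms) for rank in range(3)]
    assert len(set(pulls)) == 3  # pairwise distinct: exploration is collision-free
```

Because all offsets advance in lockstep, the distance between any two players' indices is constant, so they never land on the same arm as long as there are at least as many active arms as active players.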
We note that the length of an exploration phase differs from Boursier and Perchet (2019), which is a key component of the performance improvement. This difference in phase length, in fact, results in a number of exploration and communication rounds that grows with the horizon in the ADAPTED SIC-MMAB algorithm of Boursier and Perchet (2019), which directly leads to a dominating communication loss that breaks order-optimality. With the expanded exploration length in EC-SIC, the overall number of rounds becomes a constant, and the communication regret can be better controlled, as shown in Section 4.
3.3 Communication phase
In the communication phase, all players attempt to exchange their sampled reward information in a synchronized and distributed manner. The communication takes place via a careful collision design. All players enter this phase synchronously and, by default, keep pulling different arms based on their internal ranks. Then, when it is player $j$'s turn to communicate with player $l$, she purposely pulls (or does not pull) player $l$'s arm as a way to communicate bit 1 (or 0). If player $l$ could fully access the collision information, i.e., know whether a collision happens at each time step, she would be able to receive the bit sequence successfully, which conveys player $j$'s sample reward statistics. However, in the no-sensing model, such error-free communication becomes impossible.
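A minimal sketch of the channel this protocol induces, assuming Bernoulli rewards for concreteness (`transmit_bit` is a hypothetical helper for illustration, not part of EC-SIC's pseudocode):

```python
import random

def transmit_bit(bit, mu, rng):
    """Send one bit via collision/no-collision; the receiver only sees her reward.

    bit 1: sender collides with the receiver's arm, so the observed reward is 0.
    bit 0: no collision; the receiver observes her own Bernoulli(mu) reward.
    The receiver decodes 1 iff she observes reward 0.
    """
    if bit == 1:
        observed = 0                       # collision forces a zero reward
    else:
        observed = int(rng.random() < mu)  # ordinary Bernoulli(mu) sample
    return 1 if observed == 0 else 0       # reward 0 is indistinguishable from collision

rng = random.Random(0)
mu = 0.8
# Bit 1 is always received correctly ...
assert all(transmit_bit(1, mu, rng) == 1 for _ in range(1000))
# ... while bit 0 is flipped to 1 with probability roughly 1 - mu.
flips = sum(transmit_bit(0, mu, rng) for _ in range(10000)) / 10000
assert abs(flips - (1 - mu)) < 0.05
```

This asymmetry, where only one bit value can be corrupted, is exactly what motivates the Z-channel view developed next.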
Three new ideas are used in the communication phase of EC-SIC. The first is the introduction of Z-channel coding. In the no-sensing scenario, players cannot directly identify collisions. If the same communication protocol as in Boursier and Perchet (2019) (representing bit 1 or 0 by collision or no collision) were used, the receiver may be misled into believing that a collision has occurred (bit 1) when she has actually observed a null reward from ordinary sampling (bit 0). This error has a catastrophic consequence in that it breaks the essential synchronization between players. We are thus facing the challenge of communicating the reward statistics to other players while controlling the error rate so that the overall communication loss does not dominate the regret.
Luckily, this is the well-known problem of reliable communication over a noisy channel, one of the foundations of information theory. In particular, our communication channel is asymmetric: bit 1 (collision) is always received correctly, while bit 0 (no collision) may be received incorrectly with a certain probability. This corresponds to the Z-channel model (see Fig. 1) in information theory (Tallini et al., 2002), which represents a broad class of asymmetric channels. The Z-channel has a crossover probability $p$ corresponding to the probability of observing a zero reward in the absence of a collision.^2
^2 Since the crossover probability is unknown and varies for different arms $k$, the worst-case value $p = 1 - \mu_{\min}$ is used.
The Z-channel capacity is derived in Tallini et al. (2002) as follows.
Theorem 1.
The capacity of a Z-channel with crossover probability $p$ is:
$$ C(p) = \log_2\left(1 + (1-p)\, p^{p/(1-p)}\right). \quad (1) $$
Shannon theory guarantees that as long as the coding rate is below the capacity in Eqn. (1), there exists at least one code that allows for an arbitrarily low error rate asymptotically. This means that, theoretically, it is possible to transmit information over this Z-channel nearly error-free at rates close to $C(p)$ bits per channel use. In reality, however, different finite-blocklength channel codes have different performances; we thus evaluate several practical codes both theoretically (in Section 4) and experimentally (in Section 6). For simplicity, the functions Send(), Receive(), Encoder() and Decoder() are used in the algorithm as the sending and receiving protocols and the encoder and corresponding decoder, respectively.
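The capacity can be evaluated numerically; the closed form below is the standard Z-channel capacity expression from the information-theory literature, which we assume matches Eqn. (1):

```python
import math

def z_channel_capacity(p):
    """Capacity (bits per channel use) of a Z-channel whose crossover
    probability is p, i.e., one input symbol is flipped with probability p
    while the other is always received correctly."""
    if p == 0:
        return 1.0   # noiseless binary channel
    if p == 1:
        return 0.0   # the output is constant, so nothing can be conveyed
    return math.log2(1 + (1 - p) * p ** (p / (1 - p)))

assert z_channel_capacity(0.0) == 1.0
assert abs(z_channel_capacity(0.5) - math.log2(1.25)) < 1e-12
# Capacity shrinks as the channel gets noisier.
assert z_channel_capacity(0.1) > z_channel_capacity(0.5) > z_channel_capacity(0.9) > 0
```

Notably, the Z-channel capacity stays strictly positive for every $p < 1$, which is what makes reliable communication possible even when zero rewards are common.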
The second enhancement is to transmit each arm's quantized sample reward mean with a fixed length. The reason not to use the total reward, as in Boursier and Perchet (2019), is that the gradually increasing total reward leads to a message length that grows with $T$, which cannot be transmitted efficiently in the no-sensing case. However, with a finite gap, less precise statistics sharing is tolerable as long as it does not affect the choice of the optimal arms. For a quantized sample mean of length $L$ bits, the quantization error is at most $2^{-L}$. We thus control the length so that this error is at most a predefined constant fraction of the gap $\Delta$. By incorporating the quantization error into the confidence bounds, the analysis in Section 4 shows that acceptance and rejection maintain a high probability of success.
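The fixed-length quantization step can be sketched as follows; the helper names and the choice of one quarter of the gap as the error budget are illustrative assumptions, not the paper's exact constants:

```python
import math

def quantize_mean(sample_mean, num_bits):
    """Quantize a sample mean in [0, 1] to a fixed-length bit string."""
    levels = 2 ** num_bits
    index = min(int(sample_mean * levels), levels - 1)
    return format(index, f"0{num_bits}b")

def dequantize(bits):
    """Recover the quantized value from its bit string."""
    return int(bits, 2) / 2 ** len(bits)

def bits_for_gap(delta, eps=0.25):
    """Smallest length L with quantization error 2^-L <= eps * delta."""
    return math.ceil(math.log2(1.0 / (eps * delta)))

L = bits_for_gap(delta=0.05)   # gap 0.05, tolerate a quarter of it
mu_hat = 0.7312
err = abs(dequantize(quantize_mean(mu_hat, L)) - mu_hat)
assert err <= 2 ** -L <= 0.25 * 0.05   # fixed-length message, bounded error
```

The key point is that $L$ depends only on $\Delta$, not on $T$, so every message has the same short length regardless of how long the game runs.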
Lastly, compared to the mesh-structured communication in Boursier and Perchet (2019), it is more efficient to form a tree structure in which one player (the "leader") gathers all statistics and makes decisions for the others (the "followers"). The player holding the first internal rank becomes the leader and the rest become followers (Kaufmann and Mehrabian, 2019). Statistics of arms are transmitted from the followers to the leader. The leader decides the sets of arms to be accepted or rejected by comparing their upper and lower confidence bounds with each other, and sends the decisions back to the followers. Upon reception, active players either enter another iteration of exploration and communication, or begin exploitation. This process utilizes reward statistics from all players and has better communication efficiency. Procedures for the leader and followers are given in Algorithms 2 and 3, respectively.
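A hypothetical sketch of the leader's accept/reject rule via confidence-bound comparisons (the exact thresholds and tie-breaking in EC-SIC may differ; this only illustrates the comparison logic described above):

```python
def accept_reject(ucb, lcb, num_active_players):
    """Illustrative leader rule: accept an arm when its lower confidence
    bound beats the upper confidence bounds of all arms outside the
    top-M set; reject it in the symmetric case."""
    arms = list(ucb)
    accepted, rejected = set(), set()
    for k in arms:
        # confidently among the best num_active_players arms
        if sum(lcb[k] > ucb[j] for j in arms if j != k) >= len(arms) - num_active_players:
            accepted.add(k)
        # confidently outside the best num_active_players arms
        elif sum(ucb[k] < lcb[j] for j in arms if j != k) >= num_active_players:
            rejected.add(k)
    return accepted, rejected

# Four active arms, two active players: arms 0 and 1 are clearly separated
# from arms 2 and 3, so the leader can settle all of them at once.
ucb = {0: 0.95, 1: 0.90, 2: 0.40, 3: 0.35}
lcb = {0: 0.85, 1: 0.80, 2: 0.30, 3: 0.25}
acc, rej = accept_reject(ucb, lcb, num_active_players=2)
assert acc == {0, 1} and rej == {2, 3}
```

Arms that are neither accepted nor rejected simply remain active for the next exploration round.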
4 Theoretical Analysis
The overall regret of EC-SIC can be decomposed as $R_T = R_{\text{init}} + R_{\text{explore}} + R_{\text{comm}}$, where the three terms refer to the regret caused by the initialization, exploration, and communication phases, respectively. The main result is presented in Theorem 2, and each component of the regret is subsequently analyzed. Detailed proofs can be found in Appendix B.
Theorem 2.
With an optimal coding technique that achieves Gallager's error exponent for the corresponding Z-channel with crossover probability $p$, we have
(2)  
where the remaining quantities in (2) are constants whose exact values are given in the proof in Appendix B.
Theorem 2 involves an information-theoretic concept called the error exponent, which is explained in Theorem 3 in Section 4.2; more details can be found in Gallager (1968).
An asymptotic upper bound can be obtained from (2) in the limit of large $T$:
(3) 
Compared to SIC-MMAB2, we have successfully removed the multiplicative factor of $M$ in the first term. This is due to the efficient communication phase that transmits the full reward statistics. In addition, the factor appearing in the second term improves upon its counterpart in SIC-MMAB2. We also note that Eqn. (2) and Eqn. (3) continue to hold when the gap parameter is suitably replaced.
To prove Theorem 2, we first define the "typical event" as the success of initialization, communication and exploration throughout the entire horizon $T$. More specifically, we define three events: (i) each player has a correct estimation of $M$ and an orthogonal internal rank after initialization; (ii) messages are decoded correctly in all communication phases; and (iii) the sample mean of every active arm stays within its confidence bound. The probability that the typical event happens is characterized in Lemma 2 below. The regret caused by the "atypical event" can simply be bounded by a linear-in-$T$ regret. The result of (2) is then proved by balancing the contributions of both events.
4.1 Initialization phase
Similar to Lemma 11 in Boursier and Perchet (2019), we can bound the regret of the initialization phase as follows.
Lemma 1.
With high probability, the initialization event happens. Furthermore, the regret of the initialization phase satisfies:
4.2 Exploration phase
The regret due to exploration is bounded in Lemma 2.
Lemma 2.
With high probability, the typical event happens, and the exploration regret conditioned on the typical event satisfies:
We first present a fundamental result of channel coding for communication over a noisy channel, known as the error exponent (Gallager, 1968).
Theorem 3.
For a discrete memoryless channel, if the rate $R$ is below the channel capacity, there exists a code of block length $n$ without feedback such that the error probability is bounded by
$$ P_e \le e^{-n E_r(R)}, $$
where $E_r(R)$ is the random coding error exponent at rate $R$.
We note that the error exponent used in Theorem 2 corresponds to the rate adopted by our communication scheme.
Theorem 3 suggests that, to transmit a fixed-length message over a Z-channel, there exists an optimal coding scheme whose block length grows only logarithmically in the inverse of the target error rate. Several existing coding techniques, although not optimal, can achieve the same error rate with a block length that is larger only by a multiplicative constant, which does not change the regret order; examples include the repetition code, the flip code, and the modified Hamming code (see Appendix A for a detailed analysis of these practical codes). The remaining analysis is based on the optimal channel coding, with the caveat that a "good" Z-channel code should be applied in practice.
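The relationship between the target error rate and the block length implied by Theorem 3 can be sketched as follows; the numeric exponent below is purely illustrative, since the true $E_r$ depends on the channel and the rate:

```python
import math

def required_blocklength(error_exponent, target_error):
    """Smallest block length n with e^{-n * E_r} <= target_error,
    inverting the random coding bound of Theorem 3."""
    return math.ceil(math.log(1.0 / target_error) / error_exponent)

# Illustrative numbers: with an assumed exponent of 0.1, driving the
# error below 1e-4 needs a block length of about log(1e4) / 0.1.
n = required_blocklength(error_exponent=0.1, target_error=1e-4)
assert math.exp(-n * 0.1) <= 1e-4
assert n == math.ceil(math.log(1e4) / 0.1)
```

Since the target error rate in the regret analysis is polynomial in $1/T$, the resulting block length is logarithmic in $T$, which is what keeps the communication loss from dominating.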
With at most a constant number of exploration and communication phases and $K$ arms to be accepted or rejected, the numbers of communication instances on arm statistics, on the number of accepted/rejected arms, and on the indices of accepted/rejected arms are all bounded. A simple union bound analysis leads to the following result.
Lemma 3.
Denote the probability that all communications are decoded correctly. With an optimal Z-channel code, we have
Lemma 3 guarantees that all communications are correct. To bound the probability that all arm means are correctly estimated, we have the following result.
Lemma 4.
In phase $p$, for any active arm $k$,
With at most a constant number of exploration-communication phases, the estimation event happens with probability:
(4) 
A union bound argument combining Lemmas 1, 3 and 4 gives the probability for the typical event to happen, as stated in Lemma 2. Finally, for the exploration phases, the number of times an arm is pulled before being accepted or rejected is well controlled.
Lemma 5.
Under the typical event, every optimal arm is accepted, and every suboptimal arm is rejected, after a bounded number of pulls (quantified in Appendix B).
Consider the overall time spent in the exploration and exploitation phases, and the number of time steps in which the $k$-th best arm is pulled during these two phases. With no collisions in exploration and exploitation, the exploration regret can be decomposed as (Anantharam et al., 1987)
(5)
Both components in (5) can be upper bounded by Lemma 7 in Appendix B.2.4, which proves Lemma 2.
4.3 Communication phase
Thanks to the expanded length of each exploration phase and the fixed-length quantization of arm statistics, the communication regret does not dominate the overall regret, as stated in Lemma 6.
Lemma 6.
In the typical event,
We note that the relevant coefficient becomes a constant when $T$ is sufficiently large, and hence the communication loss has the same order as that of the other phases.
4.4 Overall regret
5 Algorithm Enhancement
EC-SIC has nice theoretical performance guarantees, but we have noticed that in practice some minor enhancements improve its performance significantly, as shown in the next section. First, after each exploration and communication phase, a player can use the active arm whose empirical mean rank (sent by the leader) matches her internal rank as her communication arm for the next round, instead of simply the arm indexed by her rank. Since players keep receiving rewards from their communication arms while waiting to communicate or while receiving bit 0, using an arm with a higher empirical mean leads to a lower loss during these time steps.
Second, we have observed in practice that the first one or two exploration phases do not lead to effective acceptance or rejection even when the game is easy, which means all the communication losses during these phases are incurred with no benefit (and this loss is much larger than the exploration loss). Thus, the phase index can be initialized to a larger integer, which leads to a longer exploration phase to start with and less ineffective communication.
Lastly, if $\mu_{\min}$ and $\Delta$ in Assumption 1 are not available, adaptive estimation with confidence intervals can be used to replace the true values in EC-SIC. The influence of mismatched $\mu_{\min}$ and $\Delta$ is evaluated in the experiments and reported in Appendix C.2.

6 Experiments
Numerical experiments have been carried out to verify the analysis of EC-SIC and to compare its empirical performance with other methods. All rewards follow Bernoulli distributions; the detailed parameter settings are given in Appendix C. Results are obtained by averaging over 500 experiments. More detailed discussions and additional results can also be found in Appendix C.

We compare state-of-the-art algorithms under both easy and difficult bandit game settings. EC-SIC (with repetition code), ADAPTED SIC-MMAB, SIC-MMAB2, and the algorithm proposed by Lugosi and Mehrabian (2018) (labeled "no-sensing-MC") are first compared in a relatively easy game. Fig. 7 shows that even in an easy game, no-sensing-MC could not finish exploration within the horizon, and ADAPTED SIC-MMAB performs poorly compared to the other two. Both EC-SIC and SIC-MMAB2 converge to the optimal arm set quickly, but the overall regret of EC-SIC is smaller. For a hard game, Fig. 7 shows that EC-SIC is superior to SIC-MMAB2.
A detailed comparison of EC-SIC with SIC-MMAB2 is given by plotting their regrets as a function of the gap $\Delta$ in Fig. 7. We see that when the game is not extremely difficult, EC-SIC has better performance since players benefit from sharing statistics. When $\Delta$ becomes extremely small, the required communication length increases significantly, leading to a dominating communication regret in EC-SIC that cannot be offset by the benefits of sharing statistics.
Fig. 7 reports the performance of different Z-channel codes used in communication. We observe that the modified Hamming code has the best performance, owing to its superior error correction capability. This observation also implies that with a near-optimal code specifically designed for the Z-channel, the performance of EC-SIC can be further improved.
We also evaluate the impact of the codeword length on the regret. For our simulation setting, the theoretical analysis requires a certain repetition code length to transmit one bit in order to achieve the target error rate. We are interested in whether this theoretically required code length can be shortened in practice. Under the easy game setting of Fig. 7, Fig. 7 shows that as the repetition length decreases, the regret decreases; more importantly, the convergence of EC-SIC does not change. When the length is reduced further, the regret curve trends upward at large $T$, which represents a non-negligible loss due to unsuccessful communications. With an even shorter length, the regret increases rapidly, indicating that players suffer from an increased error rate. It is thus essential to strike a balance between the error rate and the communication loss.
Lastly, we evaluate EC-SIC on a real-world dataset: the movie watching dataset (ml-20m) from MovieLens (Harper and Konstan, 2015). It consists of the watching data of a large collection of movies from a large population of users between January 09, 1995 and March 31, 2015. In preprocessing, we group these movies into categories by their total number of views, from high to low. The binary reward at time $t$ (hour) is defined as whether there are users watching films in a group, and we replicate it to obtain the final reward sequence. Compared to the synthetic datasets, this setting poses a larger and more difficult game. For each experiment, the reward sequence is randomly shuffled. We report the cumulative regret of EC-SIC and SIC-MMAB2, averaged over the experiments, in Fig. 7. One can see that the advantage of EC-SIC over SIC-MMAB2 is significant on this real-world dataset. Intuitively, this is because the game is hard (the gap is small),^3 and $K$ and $M$ are also large.
^3 However, the game is also not so hard that communication becomes ineffective, as in the case of Fig. 7.
7 Related Work
Depending on how information is shared and actions are determined, the existing literature can be categorized into centralized and decentralized (distributed) MPMAB problems. The centralized scenario can be viewed as an application of the multiple-play bandit (Anantharam et al., 1987; Komiyama et al., 2015). A more interesting and challenging problem, introduced by Liu and Zhao (2010) and Anandkumar et al. (2011), lies in the decentralized scenario, where explicit communications between players are not allowed and thus collisions may happen. For the collision-sensing MPMAB problem, earlier works attempt to let each player play the single-player MAB game while avoiding collisions as much as possible; see (Liu and Zhao, 2010; Avner and Mannor, 2014; Rosenski et al., 2016) for some representative approaches.
The SIC-MMAB algorithm in Boursier and Perchet (2019) is closely related to our work; it proposes to exploit collisions as opposed to avoiding them. Proutiere and Wang (2019) further refines this idea and decreases the communication regret, so that the lower bound of the centralized setting can be approached asymptotically for Bernoulli-distributed rewards. The forced-collision principle has also been applied to other multiplayer settings. For example, Kalathil et al. (2014) considers an extended multiplayer model where the reward distribution varies for each player. Bistritz and Leshem (2018) propose a Game of Thrones algorithm with a provable regret guarantee, which is further improved by Kaufmann and Mehrabian (2019).
The no-sensing model, on the other hand, is very challenging, and limited progress has been made so far. In Lugosi and Mehrabian (2018), sample means are rectified by the probability of collision and then the same Musical Chairs approach is adopted. As discussed in Section 1, Boursier and Perchet (2019) touches upon the no-sensing model with ADAPTED SIC-MMAB and SIC-MMAB2. However, the former has a communication loss that dominates the total regret, while the latter drifts away from the communication of full statistics and is thus fundamentally incapable of approaching the centralized performance.
8 Conclusion
In this work, we have proposed the EC-SIC algorithm for the no-sensing MPMAB problem with forced collisions. We proved that it is possible for a decentralized MPMAB algorithm without collision information to approach the performance of its centralized counterpart. Recognizing that communication under the no-sensing setting corresponds to the Z-channel model in information theory, optimal error correction codes are applied for reliable communication via collisions. With this tool, we return to the original idea of utilizing forced collisions to share complete arm statistics among players. By expanding the exploration phases and fixing the message length, an order-optimal communication loss is achieved. Simulation results with several practical Z-channel codes have demonstrated the superiority of the EC-SIC algorithm under different bandit game settings, on both synthetic and real-world datasets.
Acknowledgements
JY acknowledges the support of the U.S. National Science Foundation under Grant ECCS-1650299.
References
Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications 29(4), pp. 731-745.
Asymptotically efficient allocation rules for the multi-armed bandit problem with multiple plays, Part I: IID rewards. IEEE Transactions on Automatic Control 32(11), pp. 968-976.
Concurrent bandits and cognitive radio networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 66-81.
Maximum likelihood decoding of codes on the Z-channel. In IEEE International Conference on Communications, Vol. 3, pp. 1200-1205.
Multiplayer bandits revisited. In Proceedings of Algorithmic Learning Theory, pp. 56-92.
Distributed multi-player bandits: a Game of Thrones approach. In Advances in Neural Information Processing Systems, pp. 7222-7232.
SIC-MMAB: synchronisation involves communication in multiplayer multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 12071-12080.
Optimal ultrasmall block-codes for binary discrete memoryless channels. IEEE Transactions on Information Theory 59(11), pp. 7346-7378.
Information Theory and Reliable Communication. John Wiley & Sons, USA.
The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems 5(4).
Decentralized learning for multi-player multi-armed bandits. IEEE Transactions on Information Theory 60(4), pp. 2331-2345.
New algorithms for multiplayer bandits when arm means vary among players. arXiv preprint arXiv:1902.01239.
Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 1152-1161.
Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6(1), pp. 4-22.
Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing 58(11), pp. 5667-5681.
Multiplayer bandits without observing collision information. arXiv preprint arXiv:1808.08416.
On regret-optimal learning in decentralized multi-player multi-armed bandits. IEEE Transactions on Control of Network Systems 5(1), pp. 597-606.
An optimal algorithm in multiplayer multi-armed bandits. arXiv preprint arXiv:1909.13079.
Multi-player bandits: a musical chairs approach. In Proceedings of the 33rd International Conference on Machine Learning, pp. 155-163.
On the capacity and codes for the Z-channel. In Proceedings of the IEEE International Symposium on Information Theory, p. 422.
Supplementary Material: Decentralized Multi-player Multi-armed Bandits with No Collision Information
Chengshuai Shi, Wei Xiong, Cong Shen, Jing Yang
University of Virginia; University of Virginia; University of Virginia; Pennsylvania State University
References
 Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications 29 (4), pp. 731–745. Cited by: §1, §7.
 Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays  part I: IID rewards. IEEE Trans. Autom. Control 32 (11), pp. 968–976. Cited by: §4.2, §7.

Concurrent bandits and cognitive radio networks.
In
Joint European Conference on Machine Learning and Knowledge Discovery in Databases
, pp. 66–81. Cited by: §1, §2, §7.  Maximum likelihood decoding of codes on the Zchannel. In IEEE International Conference on Communications, Vol. 3, pp. 1200–1205. Cited by: §A.3.
 Multiplayer bandits revisited. In Proceedings of Algorithmic Learning Theory, pp. 56–92. Cited by: §1.
 Distributed multiplayer bandits  a game of thrones approach. In Advances in Neural Information Processing Systems, pp. 7222–7232. Cited by: §7.
 SICMMAB: synchronisation involves communication in multiplayer multiarmed bandits. In Advances in Neural Information Processing Systems, pp. 12071–12080. Cited by: §B.1, §B.2.3, §C.1, Document, Table 1, §1, §1, §2, §3.1, §3.2, §3.3, §3.3, §3.3, §4.1, §7, §7.
 Optimal ultrasmall blockcodes for binary discrete memoryless channels. IEEE Transactions on Information Theory 59 (11), pp. 7346–7378. Cited by: §A.1, §A.2.
 Information theory and reliable communication. John Wiley & Sons, USA. Cited by: §4.2, §4.
 The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems 5 (4). Cited by: §6.
 Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory 60 (4), pp. 2331–2345. Cited by: §2, §7.
 New algorithms for multiplayer bandits when arm means vary among players. arXiv preprint arXiv:1902.01239. Cited by: §3.3, §7.

 Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, pp. 1152–1161. Cited by: Table 1, §7.
 Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics 6 (1), pp. 4–22. Cited by: §2.
 Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing 58 (11), pp. 5667–5681. Cited by: §1, §7.
 Multiplayer bandits without observing collision information. arXiv preprint arXiv:1808.08416. Cited by: Table 1, §2, §6, §7.
 On regret-optimal learning in decentralized multiplayer multiarmed bandits. IEEE Transactions on Control of Network Systems 5 (1), pp. 597–606. Cited by: §2.
 An optimal algorithm in multiplayer multiarmed bandits. arXiv preprint arXiv:1909.13079. Cited by: §7.
 Multi-player bandits – a musical chairs approach. In Proceedings of The 33rd International Conference on Machine Learning, pp. 155–163. Cited by: §1, §2, §7.
 On the capacity and codes for the Z-channel. In Proceedings of the IEEE International Symposium on Information Theory, pp. 422. Cited by: §3.3, §3.3.
Appendix A Error Correction Codes for Communication over the Z-channel
More details about the representative coding techniques for the Z-channel are provided in this section.
A.1 Repetition code
The repetition code seems simple but is surprisingly powerful over the Z-channel: Chen et al. (2013) proved that it is optimal when only two codewords (i.e., a single message bit) are needed. The encoding and decoding processes are described as follows.

Encoding. Repeat each bit $b \in \{0, 1\}$ of the message $N$ times to generate the codeword $x = (b, \dots, b)$ of length $N$.

Decoding. For channel output $y = (y_1, \dots, y_N)$, if there exists an index $i$ such that $y_i = 0$, then the decoder outputs $0$. Otherwise, we have $y_i = 1$ for all $i$, and the decoder outputs $1$.
With a crossover probability no larger than $p$ (over the Z-channel, a transmitted $0$ may be flipped to $1$, while a transmitted $1$ is always received correctly), the bit error probability is:
$$\epsilon_b \leq p^N.$$
For a message length of $B$ bits, the error probability is:
$$\epsilon \leq 1 - \left(1 - p^N\right)^B \leq B p^N.$$
With the choice of $N = \lceil \log T / \log(1/p) \rceil$, we have $p^N \leq 1/T$ and thus $\epsilon \leq B/T$. The total code length for a $B$-bit message is:
$$NB = O(B \log T).$$
With this choice of $N$, the regret remains order-optimal.
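As a concrete illustration, the repetition scheme above can be sketched in a few lines of Python. The Z-channel simulator, function names, and bit conventions below are our own illustrative assumptions, not code from the paper:

```python
import random

def z_channel(bits, p, rng=random.random):
    # Z-channel model assumed here: a transmitted 0 flips to 1 with
    # probability p, while a transmitted 1 is always received correctly.
    return [1 if (b == 0 and rng() < p) else b for b in bits]

def rep_encode(message, n):
    # Repeat each message bit n times.
    return [b for b in message for _ in range(n)]

def rep_decode(received, n):
    # A received 0 is unambiguous: only a transmitted 0 can produce it.
    # Each n-bit block decodes to 0 iff it contains at least one 0; an
    # error occurs only when an all-zero block is fully flipped (prob p**n).
    return [0 if 0 in received[i:i + n] else 1
            for i in range(0, len(received), n)]

message = [1, 0, 1, 1, 0]
noisy = z_channel(rep_encode(message, 8), p=0.3)
decoded = rep_decode(noisy, 8)
# each 0-block survives unless all 8 of its bits flip, prob 0.3**8
```

Note that the asymmetry of the channel is what makes the simple "any 0 means 0" rule work; a symmetric channel would require majority voting instead.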
A.2 Flip code
The flip code is designed by Chen et al. (2013) to better utilize the Z-channel property. The encoding and decoding processes are illustrated below for the case of 4 codewords.

Encoding. Every two message bits are encoded into an $N$-bit codeword consisting of two halves of length $N_1 = N_2 = N/2$. The encoding function maps $00 \to (0, \dots, 0)$, $01 \to (0, \dots, 0, 1, \dots, 1)$, $10 \to (1, \dots, 1, 0, \dots, 0)$, and $11 \to (1, \dots, 1)$.

Decoding. It is similar to the repetition code. A received codeword of length $N$ is divided into $y^{(1)}$ of length $N_1$ and $y^{(2)}$ of length $N_2$:

if all bits in $y^{(1)}$ and $y^{(2)}$ are $1$s, the decoder outputs $11$;

if all bits in $y^{(1)}$ are $1$s and $y^{(2)}$ contains a $0$, the decoder outputs $10$;

if $y^{(1)}$ contains a $0$ and all bits in $y^{(2)}$ are $1$s, the decoder outputs $01$;

for all other cases, the decoder outputs $00$.

With a crossover probability no larger than $p$, the error probability of one codeword (two message bits) is (Chen et al., 2013):
$$\epsilon_c \leq p^{N_1} + p^{N_2} = 2 p^{N/2}.$$
The even split $N_1 = N_2 = N/2$ is the best choice because the function $f(x) = p^{x} + p^{N - x}$ monotonically increases for $x \geq N/2$. For a message length of $B$ bits (we assume $B$ is even here; otherwise an additional bit
can always be padded to make it even), the error probability is:
$$\epsilon \leq \frac{B}{2} \cdot 2 p^{N/2} = B p^{N/2}.$$
With the choice of $N = 2 \lceil \log T / \log(1/p) \rceil$, we have $\epsilon \leq B/T$. Thus, the total codeword length for a message of length $B$ is:
$$\frac{B}{2} \cdot N = O(B \log T).$$
With this choice of $N$, the regret remains order-optimal.
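Under the even-split assumption $N_1 = N_2 = N/2$ described above, the flip code admits a minimal Python sketch (our own illustrative implementation, not the authors' code):

```python
def flip_encode(b1, b2, n):
    # Encode two bits into an n-bit codeword (n even): the first half
    # carries b1 repeated n//2 times, the second half carries b2.
    half = n // 2
    return [b1] * half + [b2] * half

def flip_decode(y, n):
    # Over the Z-channel only 0s can flip to 1s, so any 0 in a half
    # proves that half carried bit 0; an all-ones half decodes as bit 1.
    half = n // 2
    y1, y2 = y[:half], y[half:]
    return (0 if 0 in y1 else 1, 0 if 0 in y2 else 1)

# A half decodes wrongly only if every one of its n//2 zeros flips,
# which is where the p**(n//2) term in the error analysis comes from.
```

Compared with applying the repetition code to each bit separately, this packs two bits into one codeword while keeping the same "a received 0 is reliable" decoding logic.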
A.3 Modified Hamming code
As the number of codewords increases to $16$ ($4$ bits), a modified $(7N, 4)$ Hamming code can be designed. It is a concatenated code, with the standard $(7, 4)$ Hamming code as the outer code and an $N$-fold repetition code as the inner code facing the channel.

Encoding. The standard $(7, 4)$ Hamming encoding matrix $G$ is first used to encode a $4$-bit message into a $7$-bit codeword. Then we repeat each bit of the $7$-bit codeword $N$ times, leading to a $7N$-bit codeword;

Decoding. First, by using the repetition code's decoding rule, the $7N$-bit coded message is decoded into $7$ bits. These $7$ bits are then decoded with the standard $(7, 4)$ Hamming parity-check matrix $H$. The final output is a decoded $4$-bit message.
The inner repetition code reduces the crossover probability from $p$ to $q = p^{N}$. With this relatively small crossover probability and the error correction capability of the Hamming code, reliable performance can be achieved. As stated by Barbero et al. (2006), with $q$ as the crossover probability, the maximum-likelihood decoding error rate of the $(7, 4)$ Hamming code over a Z-channel is of order $q^2$: since any single error is corrected, a decoding failure requires at least two flipped bits.
We neglect the higher-order $O(q^3)$ terms in the following analysis. The error probability of transmitting a $B$-bit message (assuming $B$ is divisible by $4$) using the $(7N, 4)$ modified Hamming code is:
$$\epsilon = \frac{B}{4} \cdot O(q^2) = O\left(B p^{2N}\right). \qquad (6)$$
By choosing $N = \left\lceil \frac{\log T}{2 \log(1/p)} \right\rceil$, we have $\epsilon = O(B/T)$. Thus, the total codeword length for a message of length $B$ is:
$$\frac{B}{4} \cdot 7N = O(B \log T),$$
which is still $O(B \log T)$, but the bound in (6) indicates an improvement over the repetition code and the flip code, since the error probability decays as $p^{2N}$ rather than $p^{N}$.
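To make the concatenated construction concrete, here is a self-contained Python sketch of a $(7, 4)$ Hamming code wrapped with an $N$-fold repetition inner code. The generator/parity-check matrices and bit ordering are one standard textbook choice, not necessarily the exact matrices used in the paper:

```python
# Generator rows for message bits (d1, d2, d3, d4); the codeword layout is
# (p1, p2, d1, p3, d2, d3, d4), with parity bits at positions 1, 2, 4.
G = [
    [1, 1, 0, 1],  # p1 = d1 + d2 + d4
    [1, 0, 1, 1],  # p2 = d1 + d3 + d4
    [1, 0, 0, 0],  # d1
    [0, 1, 1, 1],  # p3 = d2 + d3 + d4
    [0, 1, 0, 0],  # d2
    [0, 0, 1, 0],  # d3
    [0, 0, 0, 1],  # d4
]
H = [  # parity checks; the syndrome spells the 1-based error position
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def hamming_encode(m):
    # 4-bit message -> 7-bit codeword (arithmetic over GF(2)).
    return [sum(row[i] * m[i] for i in range(4)) % 2 for row in G]

def hamming_decode(c):
    # Correct up to one flipped bit, then read out the data positions.
    s = [sum(row[i] * c[i] for i in range(7)) % 2 for row in H]
    pos = s[0] + 2 * s[1] + 4 * s[2]
    c = c[:]
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

def encode(m, n):
    # Outer (7,4) Hamming code, then inner n-fold repetition per bit.
    return [b for b in hamming_encode(m) for _ in range(n)]

def decode(y, n):
    # Inner decode first: over the Z-channel a received 0 is reliable, so
    # a block decodes to 0 iff it contains a 0 (effective crossover p**n).
    inner = [0 if 0 in y[i:i + n] else 1 for i in range(0, len(y), n)]
    return hamming_decode(inner)
```

The repetition layer drives the effective crossover probability down to $p^N$, and the Hamming layer then tolerates one residual flip among the seven inner-decoded bits, which is what yields the quadratic improvement in the error exponent.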