Introduction
Consider two slot machines. Both machines have individual reward probabilities P_A and P_B. At each trial, a player selects one of the machines and obtains some reward, for example a coin, with the corresponding probability. The player wants to maximize the total reward sum obtained after a certain number of selections. However, it is supposed that the player does not know these probabilities. The multi-armed bandit problem (MBP) is to determine the optimal strategy for selecting the machine which yields maximum rewards by referring to past experiences.

In our previous studies [1, 2, 3, 4, 5, 6], we have shown that our proposed algorithm, called the Tug-of-War (TOW) dynamics, is more efficient than other well-known algorithms such as the modified ε-greedy algorithm and the modified softmax algorithm, and is comparable to the upper confidence bound 1-tuned (UCB1-tuned) algorithm, which is known as the best algorithm among parameter-free algorithms [7]. Moreover, the TOW dynamics effectively adapts to a changing environment in which the reward probabilities dynamically switch. Algorithms for solving the MBP are useful for various applications, such as cognitive radio [8, 9], web advertising [10], and the Monte-Carlo tree search that is used for programming computers to play the game of Go [11, 12].
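To make the MBP concrete, a baseline strategy such as ε-greedy can be sketched in a few lines. This is our own illustration of the problem setting, not the modified algorithms compared in [1–6]; the trial count and ε value are arbitrary choices.

```python
import random

def epsilon_greedy_bandit(p, trials=1000, eps=0.1, seed=0):
    """Play a two-armed bandit with reward probabilities p = (p_a, p_b)
    using a simple epsilon-greedy strategy; return the total reward."""
    rng = random.Random(seed)
    counts = [0, 0]   # times each machine was played
    wins = [0, 0]     # rewards obtained from each machine
    total = 0
    for _ in range(trials):
        if rng.random() < eps or 0 in counts:
            k = rng.randrange(2)   # explore (also until both arms are sampled)
        else:
            # exploit: play the arm with the higher empirical reward rate
            k = 0 if wins[0] / counts[0] >= wins[1] / counts[1] else 1
        r = 1 if rng.random() < p[k] else 0   # draw a coin with probability p_k
        counts[k] += 1
        wins[k] += r
        total += r
    return total
```

The player's dilemma is visible in the two branches: exploring gathers information about the unknown probabilities, while exploiting cashes in on the current best estimate.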
Recently, the cognitive medium access problem has become one of the hottest topics in the field of mobile communications [8, 9]. The underlying idea is to allow unlicensed users (i.e., cognitive users) to access the available spectrum when the licensed users (i.e., primary users) are not active. Cognitive medium access is a new medium access paradigm in which the cognitive users should not interfere with the licensed users. To avoid interfering with the primary network, the cognitive users must first probe to determine whether there are primary activities in each channel before transmission.
Figure 1 shows the channel model proposed by Lai et al. [8, 9]. There is a primary network consisting of N channels, each with bandwidth B. The users in the primary network operate in a synchronous time-slotted fashion. It is assumed that, at each time slot, channel i is free with probability P_i. The cognitive users do not know the P_i a priori.
At each time slot, the cognitive users attempt to exploit the availability of channels in the primary network by sensing the activity in this channel model. In this setting, a single cognitive user can access only a single channel at any given time. The problem is to derive an optimal access strategy for choosing channels that maximizes the expected throughput obtained by the cognitive user. This situation can be interpreted as the multi-user competitive bandit problem (CMBP).
For simplicity, we consider the minimum CMBP, i.e., 2 cognitive (unlicensed) users (1 and 2) and 2 channels (A and B). Each channel k is not occupied by primary (licensed) users with probability P_k. In the MBP context, we assume that a user accessing a free channel can get some reward, for example a coin, with the probability P_k. Table 1 shows the payoff matrix for users 1 and 2.
            user 2: A           user 2: B
user 1: A   (P_A/2, P_A/2)      (P_A, P_B)
user 1: B   (P_B, P_A)          (P_B/2, P_B/2)
When the two cognitive users select the same channel, a collision occurs, and the reward is evenly split between the collided users.
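The expected payoffs in Table 1 follow directly from this collision rule; a minimal sketch (our own illustration) tabulates them for arbitrary channel probabilities:

```python
def payoff_matrix(p_a, p_b):
    """Expected rewards (user 1, user 2) for each pure channel choice;
    a collision splits the channel's reward rate evenly."""
    def cell(c1, c2):
        p = {"A": p_a, "B": p_b}
        if c1 == c2:                       # collision: share the channel
            return (p[c1] / 2, p[c2] / 2)
        return (p[c1], p[c2])              # no collision: full reward rates
    return {(c1, c2): cell(c1, c2) for c1 in "AB" for c2 in "AB"}
```

For example, if P_A > P_B/1 for both users acting selfishly, both tend toward channel A and each receives only P_A/2, which is the tension the next paragraph addresses.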
In order to develop a unified framework for the design of efficient, low-complexity cognitive medium access protocols, we have to seek an algorithm that can obtain the maximum total rewards (scores) in the CMBP context. To acquire the maximum total rewards, the algorithm must have a mechanism that can avoid the 'Nash equilibrium,' which is the natural consequence for a group of independent selfish users.
In this study, we demonstrate that overall optimization (the maximum total rewards) can be achieved by using a physical device consisting of two kinds of incompressible fluid in two or more cylinders. We call this analog computing device the 'Tug-of-War (TOW) Bombe' because it is analogous to the 'Turing Bombe,' the electromechanical decoding machine developed in Britain during World War II for breaking the 'Enigma' cipher of the German army [13]. If one tries to solve the CMBP for N users and M channels using a conventional digital computer, it is necessary to calculate M^N evaluation values at each iteration; the computational cost for solving the CMBP grows as an exponential function of N. Nevertheless, the TOW Bombe solves the problem without paying this exponential computational cost. At each iteration, the TOW Bombe only requires up-and-down operations of the fluid interface levels in the corresponding cylinders.
1 The Tug-of-War Dynamics
Consider incompressible fluid in a cylinder, as shown in Fig. 2. Here, the variable X_k corresponds to the displacement of terminal k from an initial position, where k ∈ {A, B}. If X_k is greater than 0, we consider that the liquid selects machine k.
We used the following estimate Q_k(t) (k ∈ {A, B}):

Q_k(t) = N_k(t) − (1 + ω) L_k(t).   (1)

Here, N_k(t) is the number of times machine k has been played until time t, and L_k(t) is the number of non-rewarded (i.e., failed) events for k until time t, where ω is a weighting parameter.
The displacement X_A (= −X_B) is determined by the following difference equation:

X_A(t + 1) = Q_A(t) − Q_B(t) + δ(t).   (2)

Here, δ(t) is an arbitrary fluctuation to which the liquid is subjected. Consequently, the TOW dynamics evolve according to a particularly simple rule: in addition to the fluctuation, if machine k is played at time t, +1 or −ω is added to Q_k(t) when the play is rewarded or non-rewarded, respectively (Fig. 2). The authors have shown that this simple dynamics gains more rewards (coins or packet transmissions) than those obtained by other popular algorithms for solving the MBP [1, 2].
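A minimal simulation of this rule can be sketched as follows. This is our own illustration: the fluctuation δ(t) is modeled as a square wave (one of many admissible choices, since the paper leaves δ arbitrary), and ω is set from the sum γ = P_A + P_B as in the derivation of the next subsection.

```python
import random

def tow_play(p, trials=1000, omega=None, seed=0):
    """Two-armed Tug-of-War dynamics: Q_k = N_k - (1 + omega) * L_k,
    X_A(t+1) = Q_A(t) - Q_B(t) + delta(t); select A if X_A > 0."""
    rng = random.Random(seed)
    p_a, p_b = p
    if omega is None:
        gamma = p_a + p_b               # assumed known, as in the derivation
        omega = gamma / (2.0 - gamma)   # nearly optimal weighting parameter
    n = [0, 0]   # N_k: plays of each machine
    l = [0, 0]   # L_k: non-rewarded plays of each machine
    total = 0
    for t in range(trials):
        q = [n[k] - (1 + omega) * l[k] for k in range(2)]
        delta = 1.5 if t % 2 == 0 else -1.5   # square-wave fluctuation
        k = 0 if q[0] - q[1] + delta > 0 else 1
        rewarded = rng.random() < p[k]
        n[k] += 1
        if not rewarded:
            l[k] += 1
        total += rewarded
    return total
```

Note how the update rule appears implicitly: a rewarded play raises Q_k by 1, a non-rewarded play lowers it by ω, and the fluctuation keeps the interface from freezing before the estimates separate.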
1.1 The Tug-of-War Principle
In this subsection, we derive the learning rules of the TOW dynamics from a thought experiment, so that we can obtain the nearly optimal weighting parameter ω. In many popular algorithms, such as the ε-greedy algorithm, an estimate of the reward probability is updated only for the selected arm. In contrast, we consider the case in which the sum of the reward probabilities, γ = P_A + P_B, is given in advance. Then, we can update both estimates simultaneously as follows:

A: (N_A − L_A)/N_A,            B: γ − (N_A − L_A)/N_A;
A: γ − (N_B − L_B)/N_B,        B: (N_B − L_B)/N_B.

Here, the top and bottom rows give the estimates based on the N_A plays of A and the N_B plays of B, respectively.
The expected reward for each machine, based on the N_A plays of A and the N_B plays of B, is given as follows:

Q′_j = N_j · (N_j − L_j)/N_j + N_{j′} · (γ − (N_{j′} − L_{j′})/N_{j′}).   (3)

Here, j′ is B if j is A, or A if j is B. These expected rewards Q′_j are not the same as the learning rules of the TOW, the Q_j in Eq. (1). However, the following difference is directly used in the TOW:

Q_A − Q_B = (N_A − N_B) − (1 + ω)(L_A − L_B).   (4)
When we transform the expected rewards Q′_j into

Q′′_j = Q′_j / (2 − γ),   (5)

we can obtain the difference

Q′′_A − Q′′_B = (N_A − N_B) − (2/(2 − γ))(L_A − L_B).   (6)

Comparing the coefficients of Eqs. (4) and (6), the two differences are always equal when ω satisfies

ω_0 = γ / (2 − γ).   (7)

Eventually, we obtain the nearly optimal weighting parameter ω_0 in terms of γ.
This derivation means that the TOW has a learning rule equivalent to that of a system able to update both estimates simultaneously. The TOW can imitate a system that determines its next move at time t + 1 by referring to the estimates of both machines, including the machine that was not selected at time t, as if the two machines had been selected simultaneously at time t. This unique feature of the learning rule is one of the origins of the high performance of the TOW.
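The coefficient comparison can be checked numerically. Assuming the forms above, namely the TOW difference (N_A − N_B) − (1 + ω)(L_A − L_B) with ω = γ/(2 − γ), and the expected rewards built from the two rows of simultaneous estimates, the following sketch (our own illustration) verifies that the two differences coincide for arbitrary play counts:

```python
def check_tow_equivalence(p_a, p_b, n_a, l_a, n_b, l_b):
    """Check that the TOW difference Q_A - Q_B, with omega = gamma/(2-gamma),
    equals the expected-reward difference (Q'_A - Q'_B) scaled by 1/(2-gamma)."""
    gamma = p_a + p_b
    omega = gamma / (2.0 - gamma)
    # TOW learning-rule difference
    lhs = (n_a - n_b) - (1 + omega) * (l_a - l_b)
    # Expected rewards combining the direct estimate for one machine with
    # the complementary estimate (gamma minus the other machine's rate)
    q_a = (n_a - l_a) + n_b * (gamma - (n_b - l_b) / n_b)
    q_b = (n_b - l_b) + n_a * (gamma - (n_a - l_a) / n_a)
    rhs = (q_a - q_b) / (2.0 - gamma)
    return abs(lhs - rhs) < 1e-9
```

The equality holds identically in the counts N_k and L_k, which is why the single scalar ω_0 suffices: no per-arm tuning is required once γ is known.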
We carried out Monte Carlo simulations and confirmed that the performance of the TOW with ω = ω_0 is comparable to its best performance, i.e., that of the TOW with an optimally tuned ω. Detailed descriptions of these results will be presented elsewhere [14]. In addition, the essence of the process described here can be generalized to M-machine and N-player cases. All we need is the following γ′:

γ′ = P_(N) + P_(N+1),   (8)

ω_0 = γ′ / (2 − γ′).   (9)

Here, P_(k) denotes the k-th highest reward probability. In fact, for M-machine and N-player cases, we have designed a physical decision-making device that achieves the overall optimal state quickly and accurately [15].
2 The Tug-of-War Bombe
The decision-making device called the 'Tug-of-War (TOW) Bombe' for three users (1, 2, and 3) and five channels (A, B, C, D, and E) is illustrated in Figure 3.
Two kinds of incompressible fluid (red and blue) fill the coupled cylinders. The red (bottom) fluid handles the 'decision making of a user,' while the blue (upper) one handles the 'interaction among users.' The channel selection of each user at each iteration is determined by the heights of the green adjusters (the fluid interface levels): the channel with the highest level is chosen. When the movements of the red and blue adjusters stabilize and reach equilibrium, the 'tug-of-war principle' holds in the red fluid for each user. In other words, when one interface goes up, the other four interfaces go down, and efficient channel selections are attained. Simultaneously, the 'action-reaction law' is maintained by the blue fluid (i.e., if an interface level of user 1 goes up, the corresponding interface levels of users 2 and 3 go down), which contributes to avoiding collisions, so the TOW Bombe is able to search for an overall optimization solution accurately and quickly.
The dynamics of the TOW Bombe are expressed as follows:

X_{i,j}(t + 1) = X_{i,j}(t) + ΔX_{i,j}(t),   (10)

ΔX_{i,j}(t) = ΔQ_{i,j}(t) − (1/(M − 1)) Σ_{k≠j} ΔQ_{i,k}(t) − (1/(N − 1)) Σ_{i′≠i} ΔQ_{i′,j}(t).   (11)

Here, X_{i,j}(t) denotes the height of the interface of user i and channel j at iteration step t. If channel j is chosen for user i at time t, ΔQ_{i,j}(t) is +1 or −ω according to the result (rewarded or not). Otherwise, it is 0.
In addition to the above-mentioned dynamics, oscillations are added to the X_{i,j}. These oscillations are supplied externally by controlling the blue and red adjusters appropriately. In this paper, we show the cases in which completely synchronized oscillations are added to all the users:

osc_j(t) = A sin(2π t / T + 2π (j − 1)/M).   (12)

Here, A is the amplitude, T is the period, and j ∈ {1, ⋯, M} indexes the channels; the oscillation is the same for every user.
Thus, the TOW Bombe operates only through an operation that moves an interface level up or down (+1 or −ω) according to the result (success or failure of packet transmission), one operation per user at every time step. After these operations, the interface levels move according to the volume conservation law, and the device computes the next selection for each user. In each user's selection, an efficient search is realized thanks to the 'TOW principle,' which obtains a solution accurately and quickly in trial-and-error tasks. Moreover, through the interaction between users via the blue fluid, the 'Nash equilibrium' can be avoided, and the device achieves the overall optimization called the 'social maximum' [16].
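Under the description above, a software analogue of the TOW Bombe can be sketched as follows. This is our own minimal illustration, not the authors' physical implementation: the volume-conservation and action-reaction couplings are modeled as simple averaged redistributions, the oscillation is a square wave, and γ′ is taken as the sum of the N-th and (N+1)-th largest probabilities (an assumption).

```python
import random

def tow_bombe(p, n_users=3, trials=2000, seed=0):
    """Sketch of TOW-Bombe-style dynamics for n_users users and len(p)
    channels. Interface heights x[i][j] are raised/lowered by +1 / -omega
    on the chosen channel; volume conservation pushes the user's other
    channels the opposite way (red fluid), and an action-reaction term
    pushes the other users' levels on the same channel down (blue fluid)."""
    rng = random.Random(seed)
    m = len(p)
    top = sorted(p, reverse=True)
    gamma = top[n_users - 1] + top[n_users]   # assumed gamma' for N players
    omega = gamma / (2.0 - gamma)
    x = [[0.0] * m for _ in range(n_users)]
    for t in range(trials):
        # synchronized oscillation, phase-shifted across channels only
        osc = [1.5 * (1 if (t + j) % m < m // 2 else -1) for j in range(m)]
        choices = [max(range(m), key=lambda j: x[i][j] + osc[j])
                   for i in range(n_users)]
        for i, j in enumerate(choices):
            crowd = choices.count(j)              # collision: reward is shared
            rewarded = rng.random() < p[j] / crowd
            dq = 1.0 if rewarded else -omega
            x[i][j] += dq
            for k in range(m):                    # volume conservation (red)
                if k != j:
                    x[i][k] -= dq / (m - 1)
            for i2 in range(n_users):             # action-reaction (blue)
                if i2 != i:
                    x[i2][j] -= dq / (n_users - 1)
    return choices   # final channel selections, one per user
```

The point of the sketch is the cost profile: each iteration touches only O(N·M) levels, rather than evaluating all M^N joint allocations.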
3 Results
In order to show that the TOW Bombe certainly avoids the Nash equilibrium and regularly achieves overall optimization, we consider a case with a fixed set of reward probabilities (P_A, P_B, P_C, P_D, P_E), in which channels C, D, and E have the top three probabilities, as a typical example. A part of the payoff tensor, which has 5^3 (= 125) elements, is described as follows for simplicity; only the matrix elements in which no user chooses the low-ranking channels A and B are shown (Tables 2, 3, and 4). For each matrix element, the reward probabilities are given in the order of users 1, 2, and 3 (the numerical entries, determined by the P_k and the collision rule, are omitted here).

Table 2 (user 3: C):

            2: C            2: D            2: E
1: C   (·, ·, ·)       (·, ·, ·)       (·, ·, ·)
1: D   (·, ·, ·)       (·, ·, ·)       (·, ·, ·) SM
1: E   (·, ·, ·)       (·, ·, ·) SM    (·, ·, ·)

Table 3 (user 3: D):

            2: C            2: D            2: E
1: C   (·, ·, ·)       (·, ·, ·)       (·, ·, ·) SM
1: D   (·, ·, ·)       (·, ·, ·)       (·, ·, ·)
1: E   (·, ·, ·) SM    (·, ·, ·)       (·, ·, ·)

Table 4 (user 3: E):

            2: C            2: D            2: E
1: C   (·, ·, ·)       (·, ·, ·) SM    (·, ·, ·)
1: D   (·, ·, ·) SM    (·, ·, ·)       (·, ·, ·)
1: E   (·, ·, ·)       (·, ·, ·)       (·, ·, ·) NE
'Social maximum (SM)' is a state in which the maximum total reward sum is obtained by all the users. In this problem, the social maximum corresponds to a 'segregation state' in which the users choose the top three channels (C, D, and E), one channel each; there exist six such segregation states, indicated by SM in the tables. On the other hand, the Nash equilibrium (NE) is the state in which all the users choose channel E independently of the others' decisions; channel E gives the reward with the highest probability when each user behaves in a selfish manner.
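The social maximum and Nash equilibrium of such a payoff tensor can be found by brute force over all 5^3 = 125 joint choices. The sketch below is our own illustration; the probability values in the test are hypothetical (chosen so that sharing the top channel still beats any free lower channel, which makes the all-E state a pure Nash equilibrium).

```python
from itertools import product

def analyze(p, n_users=3):
    """Enumerate all len(p)**n_users joint channel choices, compute each
    user's expected reward (collisions split the reward evenly), and
    return the social-maximum states and the pure Nash equilibria."""
    m = len(p)
    states = {}
    for choice in product(range(m), repeat=n_users):
        # each user's expected reward: channel rate divided by crowd size
        states[choice] = tuple(p[c] / choice.count(c) for c in choice)
    best_total = max(sum(v) for v in states.values())
    sm = [s for s, v in states.items() if abs(sum(v) - best_total) < 1e-12]
    ne = []
    for s, v in states.items():
        stable = True
        for i in range(n_users):
            for alt in range(m):
                if alt == s[i]:
                    continue
                s2 = s[:i] + (alt,) + s[i + 1:]
                if states[s2][i] > v[i] + 1e-12:   # profitable deviation
                    stable = False
        if stable:
            ne.append(s)
    return sm, ne
```

This exhaustive check is exactly the exponential (M^N) computation that the TOW Bombe avoids: the device finds a segregation state without ever enumerating the tensor.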
The performance of the TOW Bombe was evaluated by a score: the number of rewards (coins) a user obtained in a fixed number of plays. In cognitive radio, the score corresponds to the number of packets successfully transmitted. Figure 4 shows the scores of the TOW Bombe in the typical example described above; one circle is plotted per sample, indicating the score obtained by user 1 (horizontal axis) and user 2 (vertical axis) in that sample.
There exist six clusters in Figure 4. These clusters correspond to the two-dimensional projections of the six segregation states, implying the overall optimization: the social maximum points are the six permutations that assign the top three channels C, D, and E to the three users. The TOW Bombe did not reach the Nash equilibrium state in which all users choose channel E.
Figure 5 shows sample averages of the scores as a function of the number of plays; we show the average score of each user and the average total score of all the users.
4 Conclusion and Discussion
We demonstrated that an analog decision-making device, called the TOW Bombe, can be implemented physically by using two kinds of incompressible fluid in coupled cylinders and achieves overall optimization in the channel allocation problem in cognitive radio. The TOW Bombe solves the allocation problem for N users and M channels by repeating up-and-down operations of the fluid interface levels in the cylinders at each iteration; it does not require the calculation of the exponentially many (M^N) evaluation values that would be needed on a conventional digital computer. This suggests that advantages of analog computation exist even in today's digital age.
The TOW Bombe can also be implemented on the basis of quantum physics. In fact, the authors have exploited optical energy transfer dynamics between quantum dots to construct decision-making devices [17, 18]. Our method may be applicable not only to the class of problems derived from cognitive radio but also to a broader variety of game payoff matrices, implying that wider applications can be expected. We will report these observations and results elsewhere in the future.
Acknowledgement
This work was partially undertaken when the authors belonged to the RIKEN Advanced Science Institute, which was reorganized and integrated into RIKEN as of the end of March 2013. We thank Prof. Masahiko Hara and Dr. Etsushi Nameda for valuable discussions and advice. We are grateful to Dr. Makoto Naruse at the National Institute of Information and Communications Technology and Prof. Hirokazu Hori at the University of Yamanashi for useful discussions about the TOW Bombe and its quantum extension.
References
 [1] S. J. Kim, M. Aono, and M. Hara, “Tug-of-war model for multi-armed bandit problem”, in Unconventional Computation, Lecture Notes in Computer Science, edited by C. Calude, et al. (Springer, 2010), Vol. 6079, pp. 69–80.
 [2] S. J. Kim, M. Aono, and M. Hara, “Tug-of-war model for the two-bandit problem: Nonlocally-correlated parallel exploration via resource conservation”, BioSystems Vol. 101, pp. 29–36, 2010.
 [3] S. J. Kim, E. Nameda, M. Aono, and M. Hara, “Adaptive tug-of-war model for two-armed bandit problem”, Proc. of NOLTA 2011, pp. 176–179, 2011.
 [4] S. J. Kim, M. Aono, E. Nameda, and M. Hara, “Amoeba-inspired tug-of-war model: Toward a physical implementation of an accurate and speedy parallel search algorithm”, Technical Report of IEICE (CCS-2011-025), pp. 36–41 [in Japanese], 2011.
 [5] M. Aono, S. J. Kim, M. Hara, and T. Munakata, “Amoeba-inspired tug-of-war algorithms for exploration–exploitation dilemma in extended bandit problem”, BioSystems Vol. 117, pp. 1–9, 2014.
 [6] S. J. Kim and M. Aono, “Amoeba-inspired algorithm for cognitive medium access”, NOLTA, IEICE, Vol. 5, No. 2, pp. 198–209, 2014.
 [7] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multi-armed bandit problem”, Machine Learning Vol. 47, pp. 235–256, 2002.
 [8] L. Lai, H. Jiang, and H. V. Poor, “Medium access in cognitive radio networks: a competitive multi-armed bandit framework”, Proc. of IEEE 42nd Asilomar Conference on Signals, Systems and Computers, pp. 98–102, 2008.
 [9] L. Lai, H. E. Gamal, H. Jiang, and H. V. Poor, “Cognitive medium access: exploration, exploitation, and competition”, IEEE Trans. on Mobile Computing Vol. 10, No. 2, pp. 239–253, 2011.
 [10] D. Agarwal, B. C. Chen, and P. Elango, “Explore/exploit schemes for web content optimization”, Proc. of ICDM2009, http://dx.doi.org/10.1109/ICDM.2009.52, 2009.

 [11] L. Kocsis and C. Szepesvári, “Bandit based Monte-Carlo planning”, in 17th European Conference on Machine Learning, Lecture Notes in Artificial Intelligence, edited by J. G. Carbonell, et al. (Springer, 2006), Vol. 4212, pp. 282–293.
 [12] S. Gelly, Y. Wang, R. Munos, and O. Teytaud, “Modification of UCT with patterns in Monte-Carlo Go”, INRIA Technical Report RR-6062, pp. 1–19, 2006.
 [13] D. Davies, “The Bombe – a remarkable logic machine”, Cryptologia Vol. 23 No. 2, pp. 108–138, 1999, doi:10.1080/0161119991887793.
 [14] S. J. Kim, M. Aono, and E. Nameda, “Efficient decision-making by volume-conserving physical object”, http://arxiv.org/abs/1412.6141.
 [15] S. J. Kim, M. Naruse, and M. Aono, “Tug-of-War Bombe” (submitted).
 [16] T. Roughgarden, Selfish routing and the price of anarchy, The MIT Press, Cambridge, 2005.
 [17] S. J. Kim, M. Naruse, M. Aono, M. Ohtsu, and M. Hara, “Decision maker based on nanoscale photoexcitation transfer”, Scientific Reports Vol. 3, 2370, 2013.
 [18] M. Naruse, W. Nomura, M. Aono, M. Ohtsu, Y. Sonnefraud, A. Drezet, S. Huant, and S. J. Kim, “Decision making based on optical excitation transfer via near-field interactions between quantum dots”, J. Appl. Phys. Vol. 116, 154303, 2014.