Consider two slot machines. The machines have individual reward probabilities $P_A$ and $P_B$. At each trial, a player selects one of the machines and obtains some reward, for example, a coin, with the corresponding probability. The player wants to maximize the total reward obtained after a certain number of selections. However, the player is assumed not to know these probabilities. The multi-armed bandit problem (MBP) is to determine the optimal strategy for selecting the machine that yields the maximum reward by referring to past experience.
In our previous studies [1, 2, 3, 4, 5, 6], we have shown that our proposed algorithm, called the Tug-of-War (TOW) dynamics, is more efficient than other well-known algorithms such as the modified $\epsilon$-greedy algorithm and the modified softmax algorithm, and is comparable to the ‘upper confidence bound 1-tuned (UCB1T) algorithm’, which is known as the best among parameter-free algorithms [7]. Moreover, the TOW dynamics effectively adapts to a changing environment in which the reward probabilities switch dynamically. Algorithms for solving the MBP are useful for various applications, such as cognitive radio [8, 9], web advertising [10], and the Monte-Carlo tree search that is used for programming computers to play the game of Go [11, 12].
Recently, the cognitive medium access problem has become one of the hottest topics in the field of mobile communications [8, 9]. The underlying idea is to allow unlicensed users (i.e., cognitive users) to access the available spectrum when the licensed users (i.e., primary users) are not active. Cognitive medium access is a new medium access paradigm in which the cognitive users must not interfere with the licensed users. To avoid interfering with the primary network, the cognitive users must first probe each channel to determine whether there is primary activity in it before transmitting.
Figure 1 shows the channel model proposed by Lai et al. [8, 9]. There is a primary network consisting of multiple channels, each with bandwidth B. The users in the primary network operate in a synchronous time-slotted fashion. It is assumed that, at each time slot, each channel is free with its own fixed probability. The cognitive users do not know these probabilities a priori.
At each time slot, the cognitive users attempt to exploit the availability of channels in the primary network by sensing the activity in this channel model. In this setting, a single cognitive user can access only a single channel at any given time. The problem is to derive an optimal accessing strategy for choosing channels that maximizes the expected throughput obtained by the cognitive user. This situation can be interpreted as the multi-user competitive bandit problem (CMBP).
For simplicity, we consider the minimum CMBP, i.e., 2 cognitive (unlicensed) users (1 and 2) and 2 channels ($A$ and $B$). Each channel $k$ is not occupied by primary (licensed) users with probability $P_k$. In the MBP context, we assume that a user accessing a free channel $k$ gets some reward, for example a coin, with probability $P_k$. Table 1 shows the payoff matrix for users 1 and 2; each entry lists the reward probabilities of user 1 and user 2, in that order.
| |user 2: A|user 2: B|
|user 1: A|($P_A/2$, $P_A/2$)|($P_A$, $P_B$)|
|user 1: B|($P_B$, $P_A$)|($P_B/2$, $P_B/2$)|
When the two cognitive users select the same channel, a collision occurs, and the reward is split evenly between the colliding users.
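The collision rule can be made concrete with a small sketch. It assumes that each channel has a fixed probability of being free and that users on the same channel share its expected reward equally; the function name `expected_payoffs` and the numerical probabilities are hypothetical and chosen only for illustration.

```python
from collections import Counter


def expected_payoffs(choices, free_prob):
    """Expected reward of each user for one joint channel selection.

    choices   -- tuple of channel labels, one entry per user
    free_prob -- dict mapping channel label to the probability it is free
    Users that pick the same channel split its expected reward evenly.
    """
    load = Counter(choices)  # number of users on each channel
    return tuple(free_prob[c] / load[c] for c in choices)


# Hypothetical probabilities, not taken from the paper.
free_prob = {"A": 0.4, "B": 0.8}

# Print the 2 x 2 payoff matrix for users 1 and 2.
for u1 in "AB":
    for u2 in "AB":
        print(f"user 1: {u1}, user 2: {u2} ->", expected_payoffs((u1, u2), free_prob))
```

The same function works unchanged for more users and channels, because the split depends only on how many users share a channel.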
To develop a unified framework for the design of efficient, low-complexity cognitive medium access protocols, we have to seek an algorithm that can obtain the maximum total reward (score) in the CMBP context. To acquire the maximum total reward, the algorithm must have a mechanism that can avoid the ‘Nash equilibrium’, which is the natural consequence for a group of independent selfish users.
In this study, we demonstrate that overall optimization (the maximum total reward) can be achieved by using a physical device consisting of two kinds of incompressible fluid in two or more cylinders. We call this analog computing device the ‘Tug-of-War (TOW) Bombe’ because it is analogous to the ‘Turing Bombe’, an electromechanical machine developed by British codebreakers during World War II for decoding the ‘Enigma code’ of the German military [13]. If one tries to solve the CMBP for many users and channels using a conventional digital computer, it is necessary to calculate exponentially many evaluation values at each iteration; the computational cost of solving the CMBP grows as an exponential function of the numbers of users and channels. Nevertheless, the TOW Bombe makes it possible to solve the problem without paying the exponential computational cost. At each iteration, the TOW Bombe requires only up-and-down operations for controlling the fluid interface levels in the corresponding cylinders.
1 The Tug-of-War Dynamics
Consider an incompressible fluid in a cylinder, as shown in Fig. 2. Here, the variable $X_k$ corresponds to the displacement of terminal $k$ from an initial position, where $k \in \{A, B\}$. If $X_k$ is greater than $0$, we consider that the liquid selects machine $k$.
We used the following estimate $Q_k$ ($k \in \{A, B\}$):
$$Q_k(t) = N_k(t) - (1 + \omega)\, L_k(t). \qquad (1)$$
Here, $N_k(t)$ is the number of times machine $k$ has been played until time $t$, and $L_k(t)$ is the number of non-rewarded (i.e., failed) events in $k$ until time $t$, where $\omega$ is a weighting parameter.
The displacement $X_A$ ($= -X_B$) is determined by the following difference equation:
$$X_A(t+1) = Q_A(t) - Q_B(t) + \delta(t). \qquad (2)$$
Here, $\delta(t)$ is an arbitrary fluctuation to which the liquid is subjected. Consequently, the TOW dynamics evolve according to a particularly simple rule: in addition to the fluctuation, if machine $k$ is played at time $t$, $+1$ and $-\omega$ are added to $X_k(t)$ when rewarded and non-rewarded, respectively (Fig. 2). The authors have shown that these simple dynamics gain more rewards (coins or packet transmissions) than other popular algorithms for solving the MBP [1, 2].
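A minimal simulation sketch of this update rule for two machines follows. It assumes that the estimate of the played machine is increased by 1 on a reward and decreased by the weighting parameter on a failure, and that machine A is selected when the displacement toward A is positive. The uniform fluctuation, its amplitude `noise`, the function name `tow_two_armed`, and the reward probabilities are illustrative assumptions, not values from the paper.

```python
import random


def tow_two_armed(p_a, p_b, omega, steps, noise=0.5, seed=0):
    """Simulate a two-machine tug-of-war player (illustrative sketch).

    Returns the total reward and the number of plays of each machine.
    """
    rng = random.Random(seed)
    probs = {"A": p_a, "B": p_b}
    q = {"A": 0.0, "B": 0.0}   # estimates Q_A, Q_B
    plays = {"A": 0, "B": 0}
    x_a = 0.0                  # displacement toward A (X_B = -X_A)
    total = 0
    for _ in range(steps):
        arm = "A" if x_a > 0 else "B"
        plays[arm] += 1
        if rng.random() < probs[arm]:
            q[arm] += 1.0      # rewarded: add +1
            total += 1
        else:
            q[arm] -= omega    # not rewarded: add -omega
        # Assumed fluctuation: uniform noise in [-noise, noise].
        x_a = q["A"] - q["B"] + rng.uniform(-noise, noise)
    return total, plays


# Hypothetical setting: P_A = 0.7, P_B = 0.3, so omega = 1 when the
# weighting parameter is set from the sum of the two probabilities.
total, plays = tow_two_armed(0.7, 0.3, omega=1.0, steps=1000)
print(total, plays)
```

Only one estimate changes per step, yet the displacement depends on the difference of the two estimates, so evidence about one machine immediately shifts the preference for the other.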
1.1 The Tug-of-War Principle
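In this subsection, we derive the learning rules of the TOW dynamics from a thought experiment, so that we can obtain the nearly optimal weighting parameter $\omega_0$. In many popular algorithms, such as the $\epsilon$-greedy algorithm, the estimate of the reward probability is updated only for the selected arm. In contrast, we consider the case in which the sum of the reward probabilities, $\gamma = P_A + P_B$, is given in advance. Then, we can update both estimates simultaneously as follows,
$$\begin{array}{ll} A:\ \dfrac{N_A - L_A}{N_A}, & B:\ \gamma - \dfrac{N_A - L_A}{N_A}, \\[2ex] A:\ \gamma - \dfrac{N_B - L_B}{N_B}, & B:\ \dfrac{N_B - L_B}{N_B}. \end{array}$$
Here, the top and bottom rows give the estimates based on $N_A$ times selecting A and $N_B$ times selecting B, respectively.
Each expected reward $Q'_j$ ($j \in \{A, B\}$) based on $N_A$ times selecting A and $N_B$ times selecting B is given as follows,
$$Q'_j = N_j \frac{N_j - L_j}{N_j} + N_{j'} \left( \gamma - \frac{N_{j'} - L_{j'}}{N_{j'}} \right) = N_j - L_j + (\gamma - 1) N_{j'} + L_{j'}.$$
Here, $j'$ is $B$ if $j$ is $A$, or $A$ if $j$ is $B$. These expected rewards $Q'_j$ are not the same as the learning rules of the TOW, i.e., the estimates $Q_j = N_j - (1 + \omega) L_j$ of Eq. (1). However, only the following difference is directly used in the TOW,
$$Q_A - Q_B = (N_A - N_B) - (1 + \omega)(L_A - L_B).$$
When we transform the expected rewards $Q'_j$ into
$$Q''_j = \frac{Q'_j}{2 - \gamma},$$
we can obtain the difference
$$Q''_A - Q''_B = (N_A - N_B) - \frac{2}{2 - \gamma}(L_A - L_B).$$
Comparing the coefficients of $(L_A - L_B)$ in the two differences, we eventually obtain the nearly optimal weighting parameter $\omega_0$ in terms of $\gamma$:
$$\omega_0 = \frac{\gamma}{2 - \gamma}.$$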
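The result of this derivation can be checked numerically. The sketch below assumes the forms used in the thought experiment: the TOW estimate difference is $(N_A - N_B) - (1+\omega)(L_A - L_B)$, the sum of the reward probabilities is known, and the weighting parameter is $\gamma/(2-\gamma)$. The counts and the value of the sum are hypothetical, and exact rational arithmetic is used so that the equality test is not affected by rounding.

```python
from fractions import Fraction


def omega0(gamma):
    """Weighting parameter expressed through the known sum gamma."""
    return gamma / (2 - gamma)


def expected_reward_diff(n_a, l_a, n_b, l_b, gamma):
    """Q'_A - Q'_B when both estimates are updated using gamma."""
    q_a = (n_a - l_a) + n_b * gamma - (n_b - l_b)
    q_b = (n_b - l_b) + n_a * gamma - (n_a - l_a)
    return q_a - q_b


def tow_diff(n_a, l_a, n_b, l_b, omega):
    """Q_A - Q_B for the assumed TOW estimate N_k - (1 + omega) * L_k."""
    return (n_a - n_b) - (1 + omega) * (l_a - l_b)


gamma = Fraction(6, 5)         # hypothetical P_A + P_B = 1.2
counts = (10, 3, 7, 5)         # hypothetical N_A, L_A, N_B, L_B
scaled = expected_reward_diff(*counts, gamma) / (2 - gamma)
print(omega0(gamma), scaled == tow_diff(*counts, omega0(gamma)))
# prints: 3/2 True
```

Dividing by $2-\gamma$ only rescales both expected rewards, so it does not change which machine has the larger value.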
This derivation means that the TOW has a learning rule equivalent to that of a system that is able to update both estimates simultaneously. The TOW can imitate a system that determines its next move at time $t+1$ by referring to the estimate of each machine, even if that machine was not selected at time $t$, as if the two machines had been selected simultaneously at time $t$. This unique feature of the learning rule is one of the origins of the high performance of the TOW.
We carried out Monte Carlo simulations and confirmed that the performance of the TOW with $\omega_0$ is comparable to its best performance, i.e., that of the TOW with the optimal weighting parameter $\omega_{\mathrm{opt}}$. Detailed descriptions of these results will be presented elsewhere [14]. In addition, the essence of the process described here can be generalized to $M$-machine and $N$-player cases. All we need is the following $\gamma'$:
$$\gamma' = P_{(N)} + P_{(N+1)}.$$
Here, $P_{(N)}$ denotes the top $N$-th reward probability. In fact, for $M$-machine and $N$-player cases, we have designed a physical decision-making device that achieves the overall optimal state quickly and accurately [15].
2 The Tug-of-War Bombe
The decision-making device called the ‘Tug-of-War (TOW) Bombe’ for 3 users (1, 2, and 3) and 5 channels ($A$, $B$, $C$, $D$, and $E$) is illustrated in Figure 3.
Two kinds of incompressible fluid (red and blue) fill the coupled cylinders. The red (bottom) fluid handles the ‘decision-making of a user’, while the blue (upper) fluid handles the ‘interaction among users’. The channel selection of each user at each iteration is determined by the heights of the green adjusters (fluid interface levels): the channel with the highest level is chosen. When the movements of the red and blue adjusters stabilize and reach equilibrium, the ‘tug-of-war principle’ holds in the red fluid for each user. In other words, when one interface rises, the other four interfaces fall, and efficient channel selections are attained. Simultaneously, the ‘action-reaction law’ holds in the blue fluid (i.e., if the interface level of user 1 rises, the interface levels of users 2 and 3 fall), which helps to avoid collisions, so that the TOW Bombe is able to search for an overall optimal solution accurately and quickly.
The dynamics of the TOW Bombe are expressed as follows:
Here, denotes the height of the interface of user and channel at iteration step . If channel is chosen for user at time , is or according to the result (rewarded or not). Otherwise, it is .
In addition to the above-mentioned dynamics, oscillations are added to the interface heights. These oscillations are applied externally by controlling the blue and red adjusters appropriately. In this paper, we show the case in which completely synchronized oscillations are applied to all the users,
Here, , , .
Thus, the TOW Bombe operates only by raising or lowering the interface level according to the result (success or failure of packet transmission) for each user, i.e., one operation per user at every time step. After these operations, the interface levels move according to the volume conservation law, which computes the next selection for each user. In each user’s selection, an efficient search is realized by the ‘TOW principle’, which can obtain a solution accurately and quickly in trial-and-error tasks. Moreover, through the interaction between users via the blue fluid, the ‘Nash equilibrium’ can be avoided, and the device achieves the overall optimization called the ‘social maximum’ [16].
In order to show that the TOW Bombe indeed avoids the Nash equilibrium and regularly achieves overall optimization, we consider, as a typical example, a fixed set of reward probabilities ($P_A$, $P_B$, $P_C$, $P_D$, $P_E$) in which $A$ and $B$ are the low-ranking channels. Part of the payoff tensor, which has $5^3$ ($=125$) elements, is shown for simplicity; only the matrix elements for which no user chooses the low-ranking $A$ or $B$ are shown (Tables 2, 3, and 4). For each matrix element, the reward probabilities are given in the order of users 1, 2, and 3.
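Table 2 (user 3: C). Colliding users split the reward probability of the shared channel evenly.
| |2: C|2: D|2: E|
|1: C|$P_C/3$, $P_C/3$, $P_C/3$|$P_C/2$, $P_D$, $P_C/2$|$P_C/2$, $P_E$, $P_C/2$|
|1: D|$P_D$, $P_C/2$, $P_C/2$|$P_D/2$, $P_D/2$, $P_C$|$P_D$, $P_E$, $P_C$ SM|
|1: E|$P_E$, $P_C/2$, $P_C/2$|$P_E$, $P_D$, $P_C$ SM|$P_E/2$, $P_E/2$, $P_C$|
Table 3 (user 3: D).
| |2: C|2: D|2: E|
|1: C|$P_C/2$, $P_C/2$, $P_D$|$P_C$, $P_D/2$, $P_D/2$|$P_C$, $P_E$, $P_D$ SM|
|1: D|$P_D/2$, $P_C$, $P_D/2$|$P_D/3$, $P_D/3$, $P_D/3$|$P_D/2$, $P_E$, $P_D/2$|
|1: E|$P_E$, $P_C$, $P_D$ SM|$P_E$, $P_D/2$, $P_D/2$|$P_E/2$, $P_E/2$, $P_D$|
Table 4 (user 3: E).
| |2: C|2: D|2: E|
|1: C|$P_C/2$, $P_C/2$, $P_E$|$P_C$, $P_D$, $P_E$ SM|$P_C$, $P_E/2$, $P_E/2$|
|1: D|$P_D$, $P_C$, $P_E$ SM|$P_D/2$, $P_D/2$, $P_E$|$P_D$, $P_E/2$, $P_E/2$|
|1: E|$P_E/2$, $P_C$, $P_E/2$|$P_E/2$, $P_D$, $P_E/2$|$P_E/3$, $P_E/3$, $P_E/3$ NE|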
‘Social maximum (SM)’ is a state in which the maximum total reward is obtained by all the users together. In this problem, the social maximum corresponds to a ‘segregation state’ in which the users choose the top three machines ($C$, $D$, and $E$), each a different one; there exist six segregation states, which are indicated by SM in the tables. On the other hand, the Nash equilibrium (NE) is the state in which all the users choose machine $E$ independently of the others’ decisions; machine $E$ gives the reward with the highest probability when each user behaves in a selfish manner.
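The distinction between the social maximum and the Nash equilibrium can be illustrated by brute-force enumeration of all 125 joint selections. The sketch below assumes that colliding users split a channel's reward probability evenly and uses hypothetical reward probabilities, chosen so that the best channel is still attractive when shared by three users; they are not the values used in the paper, and the helper names are illustrative.

```python
from collections import Counter
from itertools import product


def payoffs(choice, prob):
    """Reward probability of each user; colliding users split evenly."""
    load = Counter(choice)
    return tuple(prob[c] / load[c] for c in choice)


def social_maxima(prob, n_users):
    """All joint selections that maximize the total reward probability."""
    states = list(product(sorted(prob), repeat=n_users))
    best = max(sum(payoffs(s, prob)) for s in states)
    return [s for s in states if abs(sum(payoffs(s, prob)) - best) < 1e-12]


def is_nash(choice, prob):
    """True if no single user gains by unilaterally switching channels."""
    base = payoffs(choice, prob)
    for i in range(len(choice)):
        for c in prob:
            alt = choice[:i] + (c,) + choice[i + 1:]
            if payoffs(alt, prob)[i] > base[i] + 1e-12:
                return False
    return True


# Hypothetical reward probabilities for channels A to E.
prob = {"A": 0.05, "B": 0.1, "C": 0.15, "D": 0.2, "E": 0.9}
sm = social_maxima(prob, 3)
ne = [s for s in product(sorted(prob), repeat=3) if is_nash(s, prob)]
print(len(sm), sm)  # six segregation states over C, D, E
print(ne)           # the selfish outcome
```

With these assumed values, every segregation state fails the Nash test because the user on the weakest of the three channels would gain by moving to E, which is why selfish users end up colliding. The enumeration itself visits every joint selection, which is the exponential cost that the TOW Bombe is designed to avoid.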
The performance of the TOW Bombe was evaluated by a score: the number of rewards (coins) a user obtained in his or her plays. In cognitive radio, the score corresponds to the number of packets that have been successfully transmitted. Figure 4 shows the scores of the TOW Bombe in the typical example. One circle is plotted per sample; each circle indicates the scores obtained by one user (horizontal axis) and another user (vertical axis) in that sample.
There are six clusters in Figure 4. These clusters correspond to the two-dimensional projections of the six segregation states, implying overall optimization. The social maximum points are the six score triples (score of user 1, score of user 2, score of user 3) that correspond to the six segregation states. The TOW Bombe did not reach the Nash equilibrium state.
Figure 5 shows sample averages of the scores until plays, where we showed the average of each user’s score and that of the total score of all the users.
4 Conclusion and Discussion
We demonstrated that an analog decision-making device, called the TOW Bombe, is implemented physically by using two kinds of incompressible fluid in coupled cylinders and achieves overall optimization in the channel allocation problem in cognitive radio. The TOW Bombe makes it possible to solve the allocation problem for multiple users and channels by repeating up-and-down operations of the fluid interface levels in the cylinders at each iteration; it does not require the calculation of the exponentially many evaluation values that are required when using a conventional digital computer. This suggests that an advantage of analog computation does exist even in today’s digital age.
The TOW Bombe can also be implemented on the basis of quantum physics. In fact, the authors have exploited optical energy transfer dynamics between quantum dots to construct the decision-making device [17, 18]. Our method may be applicable not only to the class of problems derived from cognitive radio but also to a broader variety of game payoff matrices, implying that wider applications can be expected. We will report these observations and results elsewhere in the future.
This work was partially undertaken when the authors belonged to the RIKEN Advanced Science Institute, which was reorganized and integrated into RIKEN as of the end of March, 2013. We thank Prof. Masahiko Hara and Dr. Etsushi Nameda for valuable discussions and advice. We are grateful to Dr. Makoto Naruse at the National Institute of Information and Communications Technology and Prof. Hirokazu Hori at the University of Yamanashi for useful discussions about the TOW Bombe and its quantum extension.
- [1] S.-J. Kim, M. Aono, and M. Hara, “Tug-of-war model for multi-armed bandit problem”, in Unconventional Computation, Lecture Notes in Computer Science, edited by C. Calude, et al. (Springer, 2010), Vol. 6079, pp. 69–80.
- [2] S.-J. Kim, M. Aono, and M. Hara, “Tug-of-war model for the two-bandit problem: Nonlocally-correlated parallel exploration via resource conservation”, BioSystems Vol. 101, pp. 29–36, 2010.
- [3] S.-J. Kim, E. Nameda, M. Aono, and M. Hara, “Adaptive tug-of-war model for two-armed bandit problem”, Proc. of NOLTA2011, pp. 176–179, 2011.
- [4] S.-J. Kim, M. Aono, E. Nameda, and M. Hara, “Amoeba-inspired tug-of-war model: Toward a physical implementation of an accurate and speedy parallel search algorithm”, Technical Report of IEICE (CCS-2011-025), pp. 36–41 [in Japanese], 2011.
- [5] M. Aono, S.-J. Kim, M. Hara, and T. Munakata, “Amoeba-inspired tug-of-war algorithms for exploration–exploitation dilemma in extended bandit problem”, BioSystems Vol. 117, pp. 1–9, 2014.
- [6] S.-J. Kim and M. Aono, “Amoeba-inspired algorithm for cognitive medium access”, NOLTA, IEICE, Vol. 5, No. 2, pp. 198–209, 2014.
- [7] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem”, Machine Learning Vol. 47, pp. 235–256, 2002.
- [8] L. Lai, H. Jiang, and H. V. Poor, “Medium access in cognitive radio networks: a competitive multi-armed bandit framework”, Proc. of IEEE 42nd Asilomar Conference on Signals, System and Computers, pp. 98–102, 2008.
- [9] L. Lai, H. E. Gamal, H. Jiang, and H. V. Poor, “Cognitive medium access: exploration, exploitation, and competition”, IEEE Trans. on Mobile Computing Vol. 10, No. 2, pp. 239–253, 2011.
- [10] D. Agarwal, B.-C. Chen, and P. Elango, “Explore/exploit schemes for web content optimization”, Proc. of ICDM2009, http://dx.doi.org/10.1109/ICDM.2009.52, 2009.
- [11] L. Kocsis and C. Szepesvári, “Bandit based monte-carlo planning”, in 17th European Conference on Machine Learning, Lecture Notes in Artificial Intelligence, edited by J. G. Carbonell, et al. (Springer, 2006), Vol. 4212, pp. 282–293.
- [12] S. Gelly, Y. Wang, R. Munos, and O. Teytaud, “Modification of UCT with patterns in Monte-Carlo Go”, RR-6062-INRIA, pp. 1–19, 2006.
- [13] D. Davies, “The Bombe – a remarkable logic machine”, Cryptologia Vol. 23, No. 2, pp. 108–138, 1999, doi:10.1080/0161-119991887793.
- [14] S.-J. Kim, M. Aono, and E. Nameda, “Efficient decision-making by volume-conserving physical object”, http://arxiv.org/abs/1412.6141.
- [15] S.-J. Kim, M. Naruse, and M. Aono, “Tug-of-War Bombe” (submitted).
- [16] T. Roughgarden, Selfish routing and the price of anarchy, The MIT Press, Cambridge, 2005.
- [17] S.-J. Kim, M. Naruse, M. Aono, M. Ohtsu, and M. Hara, “Decision maker based on nanoscale photo-excitation transfer”, Scientific Reports Vol. 3, 2370, 2013.
- [18] M. Naruse, W. Nomura, M. Aono, M. Ohtsu, Y. Sonnefraud, A. Drezet, S. Huant, and S.-J. Kim, “Decision making based on optical excitation transfer via near-field interactions between quantum dots”, J. Appl. Phys. Vol. 116, 154303, 2014.